Slides from the Live Webcast on Apr. 25, 2012
Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget?
Robin Bloor, Ph.D., Chief Analyst of The Bloor Group, and Mark Madsen of Third Nature, Inc., will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.
For more information visit: http://www.databaserevolution.com
Watch this and the entire series at: http://www.youtube.com/playlist?list=PLE1A2D56295866394
Fit For Purpose: The New Database Revolution Findings Webcast
1. One Size Doesn’t Fit All
The database revolution
April 25, 2012
Mark R. Madsen
http://ThirdNature.net
Robin Bloor
http://Bloorgroup.com
Wednesday, April 25, 12
2. Your Host
Eric.kavanagh@bloorgroup.com
4. Introduction
Significant and revolutionary changes are taking place in database technology.
In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project.
This is the final webinar in a series of webinars and research activities that have comprised part of the project.
All published research will be made available through our web site: Databaserevolution.com
6. General Webinar Structure
Market Changes, Database Changes
(Some Of The Findings)
Let’s Talk About Performance
How to Select A Database
8. Database Performance Bottlenecks
CPU saturation
Memory saturation
Disk I/O channel saturation
Locking
Network saturation
Parallelism – inefficient load balancing
9. Multiple Database Roles
[Diagram: two groups of systems, labeled Transactional Systems and BI and Analytics Systems. Box labels: App; BI App; Unstructured Data; Structured Data; File or DBMS; Content DBMS; Operational Data Store; Staging Area; Data Warehouse; Data Marts; OLAP Cubes; Personal Data Stores.]
Now there are more...
10. The Origin of Big Data
Corporate
Databases
+ Unstructured Data
+ Personal Data
+ Supply Chain & Cust. Data
+ Web Data
+ Social Network Data
+ Embedded Systems Data
12. Big Data = Scale Out
The query is decomposed into a sub-query for each node
The columnar database scales up and out by adding more servers
Data is compressed and partitioned on disk by column and by range
[Diagram labels: Query; Database Table; Sub Query 1; Sub Query 2; Server 1; Server 2; CPU; Common Memory; Cache; Data.]
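The decomposition described on this slide can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `NODES` data, the `sub_query` function and the use of threads to stand in for servers are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: one salary column, range-partitioned across three "nodes".
NODES = [
    [36_000, 120_000],
    [150_000],
    [166_000],
]


def sub_query(partition):
    """Sub-query run on one node: a partial (sum, count) over local data only."""
    return sum(partition), len(partition)


def scatter_gather_avg(nodes):
    """Send the sub-query to every node in parallel, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(sub_query, nodes))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(scatter_gather_avg(NODES))
```

Note that the average is merged from per-node sums and counts rather than from per-node averages, which would give a wrong answer when partitions differ in size.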
13. Let’s Stop Using the Term NoSQL
As the graph indicates, it’s just not helpful. In fact it’s downright confusing.
[Graph labels: Data Volume; Single Table; Star Schema; Snow Flake Schema; TNF; OLAP; Nested Data; Graph Data; Complex Data; regions marked oldsql, newsql and nosql.]
15. NoSQL Directions
Some NDBMS do not attempt to provide all ACID properties.
(Atomicity, Consistency, Isolation, Durability)
Some NDBMS deploy a distributed scale-out architecture with data
redundancy.
XML DBMS using XQuery are NDBMS.
Some document stores are NDBMS (OrientDB, Terrastore, etc.)
Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.)
Key value stores = schema-less stores (Cassandra, MongoDB, Berkeley
DB, etc.)
Graph DBMS (DEX, OrientDB, etc.) are NDBMS
Large data pools (BigTable, HBase, Mnesia, etc.) are NDBMS
16. The Joys of SQL?
SQL: very good for set manipulation.
Works for OLTP and many query
environments.
Not good for nested data structures
(documents, web pages, etc.)
Not good for ordered data sets
Not good for data graphs (networks of
values)
18. The “Impedance Mismatch”
The RDBMS stores data organized
according to table structures
The OO programmer manipulates data
organized according to complex object
structures, which may have specific
methods associated with them.
The data does not simply map to the
structure it has within the database
Consequently a mapping activity is
necessary to get and put data
Basically: hierarchies, types, result sets,
crappy APIs, language bindings, tools
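The mapping activity described on this slide can be made concrete with a small sketch. It assumes a hypothetical `Order` object holding a nested list of order lines, and uses Python's built-in `sqlite3` module as a stand-in RDBMS; the table and column names are invented for the example.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    """The object the programmer works with: one order with nested lines."""
    order_id: int
    customer: str
    lines: list = field(default_factory=list)  # list of (sku, qty) tuples


def make_db():
    """The relational side: the same data needs two flat tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")
    return conn


def put(conn, order):
    """Object -> tables: flatten the nested list into child rows."""
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order.order_id, order.customer))
    conn.executemany(
        "INSERT INTO order_lines VALUES (?, ?, ?)",
        [(order.order_id, sku, qty) for sku, qty in order.lines],
    )


def get(conn, order_id):
    """Tables -> object: two queries, then reassemble the hierarchy by hand."""
    row = conn.execute(
        "SELECT order_id, customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    lines = conn.execute(
        "SELECT sku, qty FROM order_lines WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return Order(row[0], row[1], lines)
```

Even for this two-level object, `put` and `get` are code that exists only to translate between the two shapes, which is the overhead the slide refers to.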
20. The SQL Barrier
SQL has:
DDL (for data definition)
DML (for Select, Project and Join)
But it has no MML (Math) or TML
(Time)
Usually result sets are brought to
the client for further analytical
manipulation, but this creates
problems
Alternatively doing all analytical
manipulation in the database
creates problems
22. Hadoop/MapReduce
Hadoop is a parallel processing environment
Map/Reduce is a parallel processing framework
Hbase turns Hadoop into a database of a kind
Hive adds an SQL capability
Pig adds analytics
[Diagram labels: Scheduler; Map, Partition, Combine, Reduce; Node 1, Node i, Node i+1, Node j, Node k; HDFS; Mapping Process; Reducing Process; BackUp/Recov.]
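The Map, Partition, Combine and Reduce stages named in the diagram can be sketched as a single-process word count. This is an illustration of the framework's data flow only, not Hadoop's API: all function names are hypothetical, and the partitioner uses a simple byte sum so that results are deterministic.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def partition(pairs, n_reducers):
    """Partition: route each key to exactly one reducer."""
    buckets = defaultdict(list)
    for word, n in pairs:
        buckets[sum(word.encode()) % n_reducers].append((word, n))
    return buckets


def combine(pairs):
    """Combine: pre-aggregate on the mapping node to cut data sent to reducers."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())


def reduce_phase(pairs):
    """Reduce: final aggregation of every pair routed to this reducer."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


def word_count(splits, n_reducers=2):
    """Run all four stages over a list of input splits."""
    reducer_inputs = defaultdict(list)
    for split in splits:  # each split would be handled by a separate mapping node
        for r, pairs in partition(map_phase(split), n_reducers).items():
            reducer_inputs[r].extend(combine(pairs))
    result = {}
    for pairs in reducer_inputs.values():  # each bucket would go to a reducing node
        result.update(reduce_phase(pairs))
    return result
```

Because every occurrence of a key lands on the same reducer, the per-reducer results never overlap and can simply be merged.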
24. Market Forces
A new set of products appears
They include some fundamental innovations
A few are sufficiently popular to last
Fashion and marketing drive greater adoption
Product defects begin to be addressed
They eventually challenge the dominant products
35. “Linear Scalability”
This is the part of the chart most vendors show.
If you’re lucky they leave the bottom axis on so you know where their system flatlines.
42. Improving Query Performance: Columnar Databases
ID  Name            Salary    Position
1   Marge Inovera   $150,000  Statistician
2   Anita Bath      $120,000  Sewer inspector
3   Ivan Awfulitch  $166,000  Dermatologist
4   Nadia Geddit    $36,000   DBA
In a row-store model these rows would be stored in sequential order as shown here, packed into a block.
In a column store they would be divided into columns and stored in different blocks.
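The two layouts can be compared with a small sketch. It uses the slide's employee table as data and plain Python lists to stand in for disk blocks; the function names and the "values read" counter are assumptions made for the illustration, not a model of any particular engine.

```python
COLUMNS = ("id", "name", "salary", "position")
ROWS = [
    (1, "Marge Inovera", 150_000, "Statistician"),
    (2, "Anita Bath", 120_000, "Sewer inspector"),
    (3, "Ivan Awfulitch", 166_000, "Dermatologist"),
    (4, "Nadia Geddit", 36_000, "DBA"),
]


def to_row_store(rows):
    """Row store: whole rows packed one after another into a single block."""
    return [value for row in rows for value in row]


def to_column_store(rows):
    """Column store: one block per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def sum_salary_row_store(block):
    """The whole block is read, but only one value in four is used."""
    width = len(COLUMNS)
    offset = COLUMNS.index("salary")
    total = sum(block[i] for i in range(offset, len(block), width))
    return total, len(block)


def sum_salary_column_store(store):
    """Only the salary block is read."""
    column = store["salary"]
    return sum(column), len(column)
```

Both functions return the same total, but the second element of the result (values read) is 16 for the row store and 4 for the column store, which is the saving a columnar layout offers to queries that touch few columns.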
43. Inserting data into a columnar database
Each column is stored in its own set of blocks, written to disk separately.
Extra work for writes over rowstore, update complexity, delete complexity.
1   Marge Inovera   $150,000  Statistician
2   Anita Bath      $120,000  Sewer inspector
3   Ivan Awfulitch  $166,000  Dermatologist
4   Nadia Geddit    $36,000   DBA
52. MPP Database Architecture
[Diagram labels: Leader node(s), used by some; Worker nodes; High speed interconnect; Some use separate loader nodes.]
Some databases are symmetric (all nodes are the same).
Some allow mixed worker node sizes. Some are leaderless.
Some problems with leaders, loaders, e.g. less automated management of the environment, treating bottlenecks
Copyright Third Nature, Inc.
53. Key to MPP: data distribution
Single logical view of a table
Table data is evenly spread across all nodes.
The good: scalability to petabyte range, much faster filtering and selection on scans.
The bad: data skew (values, not rowcounts), aggregate function bottlenecks, concurrency challenges, complex multi-table joins with unlike distributions.
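One way to read "values, not rowcounts" is that rows can be spread evenly by a unique key and still pile onto one node when a join or aggregate redistributes them on a column with few distinct values. The sketch below illustrates that reading under stated assumptions: hash distribution with `zlib.crc32`, hypothetical function names, and made-up key values.

```python
import zlib
from collections import Counter


def node_for(key, n_nodes):
    """Hash-distribute a row to a node by its distribution key."""
    return zlib.crc32(str(key).encode()) % n_nodes


def distribute(keys, n_nodes):
    """Return the number of rows each node receives for the given key values."""
    counts = Counter(node_for(k, n_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(n_nodes)]


def skew(rows_per_node):
    """Busiest node's load relative to a perfectly even spread (1.0 = no skew)."""
    mean = sum(rows_per_node) / len(rows_per_node)
    return max(rows_per_node) / mean


if __name__ == "__main__":
    by_id = distribute(range(1000), 4)            # unique key: close to even
    by_country = distribute(["US"] * 1000, 4)     # one dominant value: one hot node
    print(by_id, skew(by_id))
    print(by_country, skew(by_country))
```

With a single dominant value every row hashes to the same node, so the skew equals the node count and adding nodes does not help that step of the query.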
68. Two useful concepts to characterize queries
Selectivity: the restrictiveness of a query when accessing data. A highly selective query filters out most rows. Low-selectivity queries read most of the rows.
High: SELECT SUM(salary) FROM emp WHERE ID = 1
Low:  SELECT SUM(salary) FROM emp
69. Two useful concepts to characterize queries
Retrieval: the restrictiveness of a query when returning data. High retrieval brings back most of the rows. Low retrieval brings back relatively few rows.
High: SELECT name, salary FROM emp
Low:  SELECT SUM(salary) FROM emp
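Both concepts can be measured for the example queries with a short sketch. It assumes a four-row `emp` table in an in-memory SQLite database and a hypothetical `profile` helper; counting qualifying rows is used here as a simple proxy for rows accessed, since the rows a real engine reads depend on its access path.

```python
import sqlite3


def make_emp():
    """Build a small emp table matching the columns used in the example queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?, ?)",
        [
            (1, "Marge Inovera", 150_000),
            (2, "Anita Bath", 120_000),
            (3, "Ivan Awfulitch", 166_000),
            (4, "Nadia Geddit", 36_000),
        ],
    )
    return conn


def profile(conn, select="SUM(salary)", where=None):
    """Return the fraction of rows that qualify and the fraction returned."""
    clause = f" WHERE {where}" if where else ""
    total = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
    qualifying = conn.execute(f"SELECT COUNT(*) FROM emp{clause}").fetchone()[0]
    returned = len(conn.execute(f"SELECT {select} FROM emp{clause}").fetchall())
    return {"qualifying": qualifying / total, "returned": returned / total}
```

A low `qualifying` fraction corresponds to high selectivity, and a high `returned` fraction corresponds to high retrieval. For example, `profile(conn, select="name, salary")` qualifies and returns every row, while `profile(conn, where="id = 1")` qualifies one row and returns one aggregate row.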
72. Characteristics of read-write workloads
Workload            Selectivity      Retrieval         Repetition  Complexity
Online OLTP         High             Low               High        Low
Batch OLTP          Moderate to low  Moderate to high  High        Moderate to high
Object persistence  High             Low               High        Low
Bulk ingest         Low (write)      n/a               High        Low
Realtime ingest     High (write)     n/a               High        Low
With ingest workloads we’re dealing with write-only, so selectivity and retrieval don’t apply in the same way; instead it’s write volume.
73. Workload parameters and DB types at data scale
Workload parameters (columns): Write-biased; Read-biased; Updateable data; Eventual consistency ok?; Unpredictable query path; Compute intensive
DB types (rows): Standard RDBMS; Parallel RDBMS; NoSQL (kv, dht, obj); Hadoop*; Streaming database
[Chart: the cell ratings are graphical and are not preserved in this text.]
You see the problem: it’s an intersection of multiple parameters, and this chart only includes the first tier of parameters. Plus, workload factors can completely invert these general rules of thumb.
74. Workload parameters and DB types at data scale
Workload parameters (columns): Complex queries; Selective queries; Low latency queries; High concurrency; High ingest rate
DB types (rows): Standard RDBMS; Parallel RDBMS; NoSQL (kv, dht, obj); Hadoop; Streaming database
[Chart: the cell ratings are graphical and are not preserved in this text.]
You have to look at the combination of workload factors: data scale, concurrency, latency & response time, then chart the parameters.
80. How To Select A Database - (1)
1. What are the data management requirements and policies (if any) in respect of:
- Data security (including regulatory requirements)?
- Data cleansing?
- Data governance?
- Deployment of solutions in the cloud?
- If a deployment environment is mandated, what are its technical characteristics and limitations?
Best of breed, no standards for anything, “polyglot persistence” = silos on steroids, data integration challenges, shifting data movement architectures
2. What kind of data will be stored and used?
- Is it structured or unstructured?
- Is it likely to be one big table or many tables?
81. How To Select A Database - (2)
3. What are the data volumes expected to be?
- What is the expected daily ingest rate?
- What will the data retention/archiving policy be?
- How big do we expect the database to grow to? (estimate a range).
4. What are the applications that will use the database?
- Estimate by user numbers and transaction numbers
- Roughly classify transactions as OLTP, short query, long query, long
query with analytics.
- What are the expectations in respect of growth of usage (per user) and
growth of user population?
5. What are the expected service levels?
- Classify according to availability service levels
- Classify according to response time service levels
- Classify on throughput where appropriate
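For question 3, a first-pass size range can be worked out from the daily ingest rate and the retention policy. The sketch below is a back-of-envelope model only: the formula, the `compression_ratio` and `overhead` parameters and the function names are assumptions for illustration, and real figures depend on the product and the data.

```python
def projected_size_gb(daily_ingest_gb, retention_days, compression_ratio=1.0, overhead=1.0):
    """Steady-state size: retained raw data, compressed, then scaled by an
    overhead factor for indexes, replicas and working space (all assumed)."""
    return daily_ingest_gb * retention_days / compression_ratio * overhead


def size_range_gb(daily_low_gb, daily_high_gb, retention_days, **kwargs):
    """Estimate a range, as the checklist asks, from low and high ingest rates."""
    return (
        projected_size_gb(daily_low_gb, retention_days, **kwargs),
        projected_size_gb(daily_high_gb, retention_days, **kwargs),
    )


if __name__ == "__main__":
    # Hypothetical inputs: 10 to 50 GB/day, one year retained, 2:1 compression.
    print(size_range_gb(10, 50, 365, compression_ratio=2.0))
```

Carrying a low and a high estimate through the same formula keeps the uncertainty visible when the figure is later compared against product limits and pricing.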
82. How To Select A Database - (3)
6. What is the budget for this project and what does that cover?
7. What is the outline project plan?
- Timescales
- Delivery of benefits
- When are costs incurred?
8. Who will make up the project team?
- Internal staff
- External consultants
- Vendor consultants
9. What is the policy in respect of external support, possibly including vendor
consultancy for the early stages of the project?
83. How To Select A Database - (4)
10. What are the business benefits?
- Which ones can be quantified financially?
- Which ones can only be guessed at (financially)?
- Are there opportunity costs?
84. A random selection of databases
Sybase IQ, ASE            EnterpriseDB  Algebraix
Teradata, Aster Data      LucidDB       Intersystems Caché
Oracle, RAC               Vectorwise    Streambase
Microsoft SQLServer, PDW  MonetDB       SQLStream
IBM DB2s, Netezza         Exasol        Coral8
Paraccel                  Illuminate    Ingres
Kognitio                  Vertica       Postgres
EMC/Greenplum             InfiniDB      Cassandra
Oracle Exadata            1010 Data     CouchDB
SAP HANA                  SAND          Mongo
Infobright                Endeca        Hbase
MySQL                     Xtreme Data   Redis
MarkLogic                 IMS           RainStor
Tokyo Cabinet             Hive          Scalaris
And a few hundred more…
86. Product Selection
Preliminary investigation
Short-list (usually arrived at by elimination)
Be sure to set the goals and control the process.
Evaluation by technical analysis and modeling
Evaluation by proof of concept.
Do not be afraid to change your mind
Negotiation
87. Conclusion
Wherein all is revealed, or ignorance exposed