Data lakes, data exhaust, web scale, data is the new oil. Vendors are throwing new terms and analogies at us to convince us to buy their products as the market around data technologies grows. We change data persistence and transaction layers because "databases don't scale" or because data is "unstructured". If data had no structure then it wouldn't be data, it would be noise. Schema on read, schema on write, schemaless databases; they imply structure underlying the data. All data has schema, but that word may not mean what you think it means.
This presentation will describe concepts of data storage and retrieval from technology prehistory (i.e. before the 1980s) and examine the design principles behind both old and new technology for managing data because sometimes post-relational is actually pre-relational. It is important to separate what is identical to things that were tried in the past from new twists on old topics that deliver new capabilities.
Directly related to these topics are performance, scalability and the realities of what organizations do with data over time. All of these topics should guide architecture decisions to avoid the trap of creating technical debts that must be paid later, after systems are in place and change is difficult.
Big Data and Bad Analogies
1. Big Data, Bad Analogies
GOTO Copenhagen
September, 2014
Mark Madsen
www.ThirdNature.net
@markmadsen
2. Copyright Third Nature, Inc.
The problem with bad framing
Leads to bad assumptions about use, inappropriate features,
poor understanding of substitutability and the impacts it will have.
10. Copyright Third Nature, Inc.
The Elizabethan Era
Commercial printing presses
Data management tech:
▪ Perfect copies
▪ Topical catalogs
▪ Font standardization
▪ Taxonomy ascends
Information explosion:
▪ 8M books in 1500
▪ 200M by 1600
▪ Commoditization
▪ Overload
15. Copyright Third Nature, Inc.
The Victorian Era
The powered printing
information explosion:
▪ Card catalogs, cross-referencing, random access metadata
▪ Universal classification
▪ Extended information management debates
▪ Trading effort and flexibility for storage and retrieval
▪ Stereotyping
16. Copyright Third Nature, Inc.
Melvil Dewey
Dewey Decimal System
Top down orientation
Static structure
Descriptive rather than
explanatory
Taxonomic classification
17. Copyright Third Nature, Inc.
Cutter Expansive
Classification System
(~1882)
Bottom up orientation
More flexible structure
Explanatory, descriptive
Faceted classification
Charles Ammi Cutter
18. Copyright Third Nature, Inc.
So why did Dewey beat Cutter?
Pragmatism
Good enough
wins the day
It wasn’t solving
the problem you
thought it was.
In every choice, something is lost when something is gained.
19. Copyright Third Nature, Inc.
What lessons does this history teach us?
1. Information requires organizing principles at
multiple levels from items to collections.
2. Differences in scale require different principles.
3. At a key point in the adoption cycle, emphasis
shifts from management of information to its
dissemination and consumption.
First we record, then we use and share.
Like transaction processing and analysis.
20. Copyright Third Nature, Inc.
What has this to do with data and persistence?
“schema” is a broad term, a way of organizing and
making something relatable and findable.
“Data” (or object) is to “Database”
as
“Books” are to “Library”
21. Copyright Third Nature, Inc.
The printed became
more important than
the printer.
The book outlived
generations of presses.
Just like data is now.
Which means we should
pay attention to the
broader organization
and use of data, and
persistence layers.
22. Copyright Third Nature, Inc.
[Diagram: Order Entry, Customer Service, Distribution, and Accounts Receivable applications; Order, Inventory, and Receivables databases; interface programs between them; a data warehouse; analysts and users]
Someone else always wants to use your data
23. Copyright Third Nature, Inc.
Context (one company)
"In an infinite universe, the one thing sentient life cannot afford to have is
a sense of proportion." – Douglas Adams
24. Copyright Third Nature, Inc.
[Diagram: map of information and material flows among farms, the cheese processor (cheese plant, packing plant, headquarters teams, national distribution centre), and the retailer (headquarters, RDC, 550 stores), covering forecasts, orders, call-offs, stock plans, and delivery confirmations]
Source: IGD Food Chain Centre, February 2008
Context (multiple company supply chain)
A value chain diagram, showing the data supply chain for cheese.
The side effects of a single bug can be massive.
25. Copyright Third Nature, Inc.
Aside: technical debt: what those diagrams show
tek-ni-kuhl det: the cost that accrues due to
decisions made in software design and coding.
Look at the choices and mistakes in development:
Intentional: purposeful choices to optimize schedule, budget, satisfaction
Unintentional: missed requirements, poor code quality, poor design
26. Copyright Third Nature, Inc.
It is a poor carpenter who blames his tools*
*but sometimes it is the tools
27. Copyright Third Nature, Inc.
Technical Debt
The cost of some choices can be dealt with in the
short term (e.g. the next sprint) and some only in
the long term (redesign, start over)
Short term: mostly about the application code
Long term: mostly about architecture, design, and infrastructure
28. Copyright Third Nature, Inc.
If you enter into decisions knowing the true nature
of your coding alternatives, you will be better off
              Intentional       Unintentional
Short term    Code choices      Code flaws (i.e. bugs)
Long term     Design choices    Design flaws
Green: these are deliberate, the tradeoffs known
Yellow: these are minor defects
Red: these are the things that kill a system
29. Copyright Third Nature, Inc.
Technical Debt can’t be avoided (but it can be managed)
Sometimes you think it’s intentional: incremental design
Long term debts can only be dealt with through planning
[Quadrant diagram: short term vs long term against intentional vs unintentional; labels: development methods; experience, education; Agile methods; redesign]
30. Copyright Third Nature, Inc.
Technical Debt can’t be avoided
Let’s move this
to the left
What you believe about the technology underlying your
system has a big influence on design choices, so the
focus of this part is on architecture and design with the
hope it will help reduce or avoid long term debt.
[Quadrant diagram: short term vs long term against intentional vs unintentional]
31. Copyright Third Nature, Inc.
How did we get here?
There’s a difference
between having no past
and actively rejecting it.
32. Copyright Third Nature, Inc.
MultiValue, Hierarchical
PICK
IMS
IDS
ADABAS
Relational
CODASYL
System R (SEQUEL)
SQL/DS
INGRES (QUEL)
Mimer
Oracle
RDBMS, SQL standard
DB2
Teradata
Informix
Sybase
Postgres
OODBMS, ORDBMS
Versant
Objectivity
Gemstone
Informix*
Oracle*
MPP Query, NoSQL
Netezza
Paraccel
Vertica
MongoDB
CouchBase
Riak
Cassandra
NewSQL
SciDB
MonetDB
NuoDB
CitusDB
1960s 1980s 2000s
1970s 1990s 2010s
Spanner
F1
33. Copyright Third Nature, Inc.
In the beginning: RMSs and pre-relational DBs
At first, common code libraries so there was
reusability for file ops.
Problems:
▪ Portability across languages, OSs
▪ Queries of more than one file
▪ No metadata, what’s in there? Who wrote it?
▪ Concurrency
The databases brought things like recoverability,
durability, ACID transactions. But they were rigid,
prone to breakage.
Operations:
First
Next
Prev
Last
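The record-at-a-time style behind First/Next/Prev/Last can be sketched as follows. This is a minimal illustration, not any particular product's API: the `RecordFile` class, its method names, and the sample data are all hypothetical. It shows that combining two files is the program's job, one record at a time.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecordFile:
    """Hypothetical record manager: an ordered file plus one cursor position."""
    records: list = field(default_factory=list)
    pos: int = -1  # -1 means "not positioned on any record"

    def _current(self) -> Optional[dict]:
        if 0 <= self.pos < len(self.records):
            return self.records[self.pos]
        return None

    def first(self) -> Optional[dict]:
        self.pos = 0 if self.records else -1
        return self._current()

    def last(self) -> Optional[dict]:
        self.pos = len(self.records) - 1
        return self._current()

    def next(self) -> Optional[dict]:
        # Stay put and report "no record" when already at the end.
        if self.pos < 0 or self.pos + 1 >= len(self.records):
            return None
        self.pos += 1
        return self._current()

    def prev(self) -> Optional[dict]:
        # Stay put and report "no record" when already at the start.
        if self.pos <= 0:
            return None
        self.pos -= 1
        return self._current()


def orders_for_customer(customers: RecordFile, orders: RecordFile, name: str) -> list:
    """Join two files by hand: the program, not the database, picks the access path."""
    result = []
    c = customers.first()
    while c is not None:
        if c["name"] == name:
            o = orders.first()
            while o is not None:
                if o["cid"] == c["cid"]:
                    result.append(o["num"])
                o = orders.next()
        c = customers.next()
    return result


customers = RecordFile([{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bo"}])
orders = RecordFile([{"num": 10, "cid": 1}, {"num": 11, "cid": 2}, {"num": 12, "cid": 1}])
print(orders_for_customer(customers, orders, "Ann"))
```

Every query of more than one file needs a routine like `orders_for_customer`, which is the "queries of more than one file" problem listed above.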
37. Copyright Third Nature, Inc.
The miracle of pre-relational DB: schema
Loose coupling – the physical model of data
structures and physical placement are no longer a
program’s responsibility
Reusability – More than one program can access the
same data, and no more custom coding for each
application or OS
Scalability – Constraints of schema and typing reduce
resource usage, have finer granularity for concurrent
access, multiple online users.
38. Copyright Third Nature, Inc.
Key schema flexibility tradeoff for data management
Global validation
vs
contextual
validation
=
Strict rules
vs
lenient rules
=
Write rules
vs
read rules
39. Copyright Third Nature, Inc.
Schema on write vs schema on read
Match the shape to the hole
or
Match the hole to the shape
Predicate schemas for write flexibility (agility) and speed
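A minimal sketch of the two approaches, assuming a hypothetical order record with three typed fields. The class names, `ORDER_SCHEMA`, and `validate` are invented for illustration. The point is where the rule runs: once at write time for everyone, or at read time for each reader.

```python
import json

# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"num": int, "cid": int, "status": str}


def validate(record: dict, schema: dict) -> dict:
    """Return the record, or raise ValueError if fields or types do not match."""
    if set(record) != set(schema):
        raise ValueError(f"fields {sorted(record)} do not match {sorted(schema)}")
    for name, typ in schema.items():
        if not isinstance(record[name], typ):
            raise ValueError(f"{name} must be {typ.__name__}")
    return record


class SchemaOnWriteStore:
    """Match the shape to the hole: bad records are rejected when written."""

    def __init__(self, schema: dict):
        self.schema = schema
        self.rows = []

    def write(self, record: dict) -> None:
        self.rows.append(validate(record, self.schema))

    def read(self) -> list:
        return list(self.rows)  # every reader sees globally validated rows


class SchemaOnReadStore:
    """Match the hole to the shape: anything is stored, each reader applies its own rules."""

    def __init__(self):
        self.blobs = []

    def write(self, record: dict) -> None:
        self.blobs.append(json.dumps(record))

    def read(self, schema: dict) -> tuple:
        good, bad = [], []
        for blob in self.blobs:
            record = json.loads(blob)
            try:
                good.append(validate(record, schema))
            except ValueError:
                bad.append(record)  # the reader decides what to do with misfits
        return good, bad


clean = {"num": 10, "cid": 1, "status": "X"}
dirty = {"num": "10", "cid": 1}  # wrong type, missing field

strict = SchemaOnWriteStore(ORDER_SCHEMA)
strict.write(clean)
try:
    strict.write(dirty)
except ValueError as err:
    print("rejected at write:", err)

lenient = SchemaOnReadStore()
lenient.write(clean)
lenient.write(dirty)
good, bad = lenient.read(ORDER_SCHEMA)
print("accepted at read:", good, "set aside at read:", bad)
```

The lenient store writes faster and never refuses data, but every reader now carries validation code and has to decide what a misfit record means.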
40. Copyright Third Nature, Inc.
“Flexibility” – a recent experience in query
The problem with many of these pre-relational
databases is tight coupling between a program
and data structures.
The physical model leaks into the logical with
potentially career-ending effects if the DB is used
for the wrong thing.
And it’s happening again.
It’s a poor carpenter who blames his tools. Or the users.
41. Copyright Third Nature, Inc.
Trading consistency for performance: CAP theorem
http://blog.nahurst.com/visual-guide-to-nosql-systems
42. Copyright Third Nature, Inc.
Tradeoffs: In NoSQL the DBMS is in your code
Services provided:
▪ Standard API/query layer*
▪ Transaction / consistency
▪ Query optimization
▪ Data navigation, joins
▪ Data access
▪ Storage management
[Diagram: the services above divided between application and database, for a SQL database and for a NoSQL database]
Anything not done by the DB becomes a developer’s task.
Welcome to 1985!
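As one illustration of a service moving into application code, here is a sketch of an atomic transfer done twice: once with a transaction and a constraint supplied by the database (SQLite via Python's standard `sqlite3` module), and once against a plain dictionary standing in for a store with no transaction support. The `accounts` table, the function names, and the dictionary store are assumptions made for the example.

```python
import sqlite3


def transfer_sql(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """The DBMS supplies atomicity: both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint fired; the credit above was rolled back


def transfer_kv(store: dict, src: str, dst: str, amount: int) -> bool:
    """No DBMS services: the application checks the rule and undoes its own work."""
    old_dst = store[dst]
    store[dst] = old_dst + amount      # step 1: credit
    if store[src] - amount < 0:        # the rule the CHECK constraint enforced
        store[dst] = old_dst           # hand-written compensation
        return False
    store[src] -= amount               # step 2: debit
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 50), ("b", 0)])
conn.commit()

ok_sql = transfer_sql(conn, "a", "b", 80)
balances_sql = dict(conn.execute("SELECT id, balance FROM accounts"))

store = {"a": 50, "b": 0}
ok_kv = transfer_kv(store, "a", "b", 80)

print(ok_sql, balances_sql)
print(ok_kv, store)
```

The dictionary version only handles the single failure it anticipates. It has no isolation from concurrent writers and no recovery if the process dies between the two steps; each of those would be more developer code.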
43. Copyright Third Nature, Inc.
Simplifying ACID vs BASE
Remember: it’s a poor carpenter who blames his tools.
44. Copyright Third Nature, Inc.
Google on eventual consistency:
“F1: A Distributed SQL Database That Scales”, Proceedings of the
VLDB Endowment, Vol. 6, No. 11, 2013
45. Copyright Third Nature, Inc.
Party like it’s 1985
Fastest TPC-B benchmark in 1985 was IMS running on
an IBM 370, 100 TPS, 400 iops, 30 iops/disk
The best relational vendors could muster was 10 TPS
25 years later, SQL Server on an Intel box ran the TPC-B
at 25,000 TPS, 100,000 iops, 300 iops/disk
It looks like you’re
running a benchmark...
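A quick arithmetic check on the figures quoted above. The inputs are the slide's numbers; the derived values (I/O per transaction, implied disk count) are simple divisions and assume the total I/O rate is spread evenly across disks, which the slide does not state.

```python
# Figures as quoted on the slide.
bench_1985 = {"tps": 100, "iops": 400, "iops_per_disk": 30}
bench_later = {"tps": 25_000, "iops": 100_000, "iops_per_disk": 300}


def derived(bench: dict) -> dict:
    """Values implied by the quoted figures (assumes I/O is spread evenly over disks)."""
    return {
        "io_per_txn": bench["iops"] / bench["tps"],
        "implied_disks": bench["iops"] / bench["iops_per_disk"],
    }


growth = {key: bench_later[key] / bench_1985[key] for key in bench_1985}

print("growth factors:", growth)
print("1985:", derived(bench_1985))
print("25 years later:", derived(bench_later))
```

On these numbers the I/O per transaction is the same in both runs, while per-disk I/O rates grew by a factor of 10 and throughput by a factor of 250. The remaining factor of 25 has to come from driving more disks at once.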
46. Copyright Third Nature, Inc.
Joins. Wait, there’s more than one?
1986: SQL, joins!
47. Copyright Third Nature, Inc.
Hipster bullshit
I can’t get MySQL to scale
therefore
Relational databases don’t scale
therefore
We must use NoSQL* for everything
*including Hadoop and related
48. Copyright Third Nature, Inc.
1. Enumerate logically equivalent plans by applying
equivalence rules
2. For each logically equivalent plan, enumerate all
alternative physical query plans
3. Estimate the cost of each of the alternative physical
query plans
4. Run the plan with lowest estimated overall cost
Logical vs physical is an important thing. The CBO turns a
SQL query into an optimal* execution plan for a parallel
pipelined dataflow engine.
Diagram: David J. DeWitt
The relational gift: declarative language + CBO
49. Copyright Third Nature, Inc.
A simple 3 table join: a programmer’s job?
SELECT C.name, O.num
FROM Orders O, Lines L, Customers C
WHERE C.City = 'Copenhagen' AND L.status = 'X'
AND O.num = L.num AND C.cid = O.cid
Number of logical plans: 9
Ways to join (hash, merge, nested): 3
For each plan, there are multiple physical plans: 36
That makes a total of 324 physical plans, the
efficiency of which changes based on cardinality.
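To see an optimizer make this choice, the query above can be run against SQLite from Python's standard `sqlite3` module. This is a sketch under assumptions: the table definitions and sample rows are invented, and SQLite's planner is far simpler than the parallel cost-based optimizers the talk has in mind. It is still enough to show that the program states what it wants and the database picks the join order and access paths.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table definitions and rows; only the query comes from the slide.
conn.executescript("""
CREATE TABLE Customers (cid INTEGER PRIMARY KEY, name TEXT, City TEXT);
CREATE TABLE Orders    (num INTEGER PRIMARY KEY, cid INTEGER);
CREATE TABLE Lines     (num INTEGER, status TEXT);
INSERT INTO Customers VALUES (1, 'Ann', 'Copenhagen'), (2, 'Bo', 'Aarhus');
INSERT INTO Orders    VALUES (10, 1), (11, 2), (12, 1);
INSERT INTO Lines     VALUES (10, 'X'), (11, 'X'), (12, 'Y');
""")

QUERY = """
SELECT C.name, O.num
FROM Orders O, Lines L, Customers C
WHERE C.City = 'Copenhagen' AND L.status = 'X'
  AND O.num = L.num AND C.cid = O.cid
"""

rows = sorted(conn.execute(QUERY).fetchall())
print(rows)

# Ask the database which plan it chose; the last column holds the description.
for step in conn.execute("EXPLAIN QUERY PLAN " + QUERY):
    print(step[-1])
```

The plan text depends on the SQLite version and on which indexes and statistics exist, so it is printed rather than asserted. Adding an index or changing table sizes can change the plan without touching the query.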
50. Copyright Third Nature, Inc.
Scalability? ZOMG just add nodes!
"The most amazing achievement of the computer
software industry is its continuing cancellation of
the steady and staggering gains made by the
computer hardware industry." – Henry Petroski
After all, your database is web scale, isn’t it?
51. Copyright Third Nature, Inc.
Just add hardware?
No amount of hardware will make incorrectly
coded software run in parallel.
Declarative languages make this easier by turning
the problem over to the computer to resolve.
Guess which runs in parallel:
Open cursor C
Loop (Fetch row C)
  Open cursor O
  Loop (Fetch row O)
    Open cursor L
    Output (Do-things)
  End loop
End loop ...

SELECT C.name, O.num
FROM Orders O, Lines L, Customers C
WHERE C.City = 'Aarhus' AND C.cid = O.cid
  AND O.num = L.num AND L.status = 'X'
52. Copyright Third Nature, Inc.
A more realistic example: TPC-H query #8
Select o_year,
sum(case
when nation = 'BRAZIL' then volume
else 0
end) / sum(volume)
from
(
select YEAR(O_ORDERDATE) as o_year,
L_EXTENDEDPRICE * (1 - L_DISCOUNT) as volume,
n2.N_NAME as nation
from PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION n1,
NATION n2, REGION
where
P_PARTKEY = L_PARTKEY and S_SUPPKEY = L_SUPPKEY
and L_ORDERKEY = O_ORDERKEY and O_CUSTKEY = C_CUSTKEY
and C_NATIONKEY = n1.N_NATIONKEY and n1.N_REGIONKEY = R_REGIONKEY
and R_NAME = 'AMERICA' and S_NATIONKEY = n2.N_NATIONKEY
and O_ORDERDATE between '1995-01-01' and '1996-12-31'
and P_TYPE = 'ECONOMY ANODIZED STEEL'
and S_ACCTBAL <= constant-1
and L_EXTENDEDPRICE <= constant-2
) as all_nations
group by o_year order by o_year
22 million possible execution plans, please find the
best one in 4 milliseconds
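A rough sense of why the plan space explodes can be had by counting join orders alone. The sketch below uses two textbook counts: n! left-deep orders and (2(n-1))!/(n-1)! ordered bushy join trees for n tables. These are assumptions about how to count, not the method behind the slide's figures. They ignore join algorithms, access paths, and pruning of cross products, so they do not reproduce the 9 or the 22 million quoted in this talk.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Join orders when each join adds one table to the running result: n!."""
    return factorial(n)


def bushy_orders(n: int) -> int:
    """Ordered binary join trees over n tables: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for tables in (3, 8):
    print(tables, "tables:", left_deep_orders(tables), "left-deep,",
          bushy_orders(tables), "bushy")
```

The query above joins eight table references, so even before choosing a join algorithm per join the optimizer faces tens of thousands to millions of orderings.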
53. Copyright Third Nature, Inc.
Just add hardware?
[Chart: speedup (0% to 1000%) versus number of processors (1 to 10), comparing linear speedup with a 10% overhead curve (Amdahl’s law)]
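The curve can be reproduced from Amdahl's law, speedup = 1 / (s + (1 - s) / n), where n is the number of processors and s is the fraction of work that cannot be parallelized. The sketch below assumes the chart's "10% overhead" means s = 0.10; that reading is an assumption, since the slide does not define the term.

```python
def amdahl_speedup(processors: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup on n processors when a fixed fraction stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


SERIAL_FRACTION = 0.10  # assumed meaning of "10% overhead"

print("processors  linear  with 10% serial")
for n in range(1, 11):
    print(f"{n:10d}  {n * 100:5d}%  {amdahl_speedup(n, SERIAL_FRACTION) * 100:6.0f}%")
```

With 10 processors the linear line reaches 1000% while the 10% serial case reaches roughly 526%, and no number of processors can push it past 1000% because the limit is 1 / s.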
54. Parallel Efficiency and Platform Costs
[Diagram: three platforms ranged from "Analyze & Report" to "Discover & Explore":
Data Warehouse: structured, SQL; enterprise-class system; production data warehousing; large concurrent user base
Singularity (Data Warehouse + Behavioral): semi-structured, SQL++; low end enterprise-class system; contextual-complex analytics; deep, seasonal, consumable data sets
Hadoop: unstructured, Java/C; commodity hardware system; structure the unstructured; detect patterns
Concurrent users shown: 150+, 500+, 5-10. Data volumes shown: 6+PB, 40+PB, 20+PB.]
56. Copyright Third Nature, Inc.
There are really three workload classes to consider
1. Operational: OLTP systems
2. Analytic: Query systems
3. Scientific: Computational systems
Unit of focus:
1. Transaction
2. Query
3. Computation
Different problems require different platforms
57. Copyright Third Nature, Inc.
Workloads
                OLTP                  BI             Analytics
Access          Read-write            Read-only      Read-mostly
Predictability  Fixed path            Unpredictable  All data
Selectivity     High                  Low            Low
Retrieval       Low                   Low            High
Latency         Milliseconds          < seconds      msecs to days
Concurrency     Huge                  Moderate       1 to huge
Model           3NF, nested object    Dim, denorm    BWT
Task size       Small                 Large          Small to huge
58. Copyright Third Nature, Inc.
A key point worth remembering:
Performance over size <> performance over complexity
OLTP performance is mostly related to transaction
coordination and writing under high concurrency.
BI performance is mostly related to data volume and
query complexity.
Analytics performance is about the intersection of
these with computational complexity.
59. Copyright Third Nature, Inc.
A history of databases in No notation
1970s: NoSQL = We have no SQL
1980s: NoSQL = Know SQL
2000s: NoSQL = No SQL!
2005: NoSQL = Not only SQL
2010s: NoSQL = No, SQL!
(R)DB(MS)
60. Copyright Third Nature, Inc.
TANSTAAFL
Technologies are not
perfect replacements for
one another. Often not
better, only different.
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.
62. Copyright Third Nature, Inc.
Away from “one throat to choke”, back to best of breed
Tight coupling leads to slow
changes. The market is not
in the tight coupling phase
In a rapidly evolving
market, componentized
architectures, modularity
and loose coupling are
favorable over monolithic
stacks, single-vendor
architectures and tight
coupling.
63. Copyright Third Nature, Inc.
Don’t follow the market
Some people can’t resist
getting the next new thing
because it’s new and new is
always better.
Many IT organizations are like
this, promoting a solution and
hunting for the problem that
matches it.
Better to ask “What is the
problem for which this
technology is the answer?”
64. Copyright Third Nature, Inc.
How we develop best practices: survival bias
We don’t need best practices, we need worst failures.
65. Copyright Third Nature, Inc.
Summarizing some key points
1. Data lives longer than code. It pays to focus on
data when choosing and using persistence
layers.
2. Different persistence layer approaches have
different tradeoffs (like performance vs app
complexity in BASE models).
3. Don’t throw away 30 years of engineering
unless you know why it was put there.
4. Decoupling, reusability (of data) and scalability
are nice things to have.
66. Copyright Third Nature, Inc.
References (things worth reading on the way home)
A relational model for large shared data banks, Communications of the ACM, June, 1970,
http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
Column-Oriented Database Systems, Stavros Harizopoulos, Daniel Abadi, Peter Boncz, VLDB 2009 Tutorial
http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
Nobody ever got fired for using Hadoop on a cluster, 1st International Workshop on Hot Topics in Cloud Data
Processing, April 10, 2012, Bern, Switzerland.
A co-Relational Model of Data for Large Shared Data Banks, ACM Queue, 2012,
http://queue.acm.org/detail.cfm?id=1961297
A query language for multidimensional arrays: design, implementation and optimization techniques, SIGMOD,
1996
Probabilistically Bounded Staleness for Practical Partial Quorums, Proceedings of the VLDB Endowment, Vol. 5,
No. 8, http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
“Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al
MapReduce: Simplified Data Processing on Large Clusters,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
Dremel: Interactive Analysis of Web-Scale Datasets, Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
Spanner: Google’s Globally-Distributed Database, OSDI, October, 2012,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/es//archive/spanner-osdi2012.pdf
F1: A Distributed SQL Database That Scales, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf
67. Copyright Third Nature, Inc.
About the Presenter
Mark Madsen is president of Third
Nature, a technology research and
consulting firm focused on business use
of data and analytics. Mark is an award-
winning author, architect and CTO
whose work has been featured in
numerous industry publications. Over
the past ten years Mark received awards
for his work from the American
Productivity & Quality Center, TDWI, and
the Smithsonian Institution. He is an
international speaker, a contributor to
Forbes Online and on the O’Reilly Strata
program committee. For more
information or to contact Mark, follow
@markmadsen on Twitter or visit
http://ThirdNature.net
68. Copyright Third Nature, Inc.
About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in business intelligence, analytics and
performance management. If your question is related to BI, analytics,
information strategy and data then you’re at the right place.
Our goal is to help companies take advantage of information-driven
management practices and applications. We offer education, consulting
and research services to support business and IT organizations as well as
technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating technology and how it is
applied rather than vendor market positions.
69. Copyright Third Nature, Inc.
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
bookshelf by spectrum.jpg - http://flickr.com/photos/santos/1704875109/
Vatican library - http://www.flickr.com/photos/paullew/1550844955
round hole square peg - https://www.flickr.com/photos/epublicist/3546059144
text composition - http://flickr.com/photos/candiedwomanire/60224567/
twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
CAP diagram - http://blog.nahurst.com/visual-guide-to-nosql-systems
refinery-hdr.jpg - http://www.flickr.com/photos/vermininc/2477872191/
refinery-night.jpg - http://www.flickr.com/photos/vermininc/2485448766/