Big Data, Bad Analogies
GOTO Copenhagen
September, 2014
Mark Madsen
www.ThirdNature.net
@markmadsen
Copyright Third Nature, Inc.
The problem with bad framing
Leads to bad assumptions about use, inappropriate features,
poor understanding of substitutability and the impacts it will have.
The data lake
The data lake after a little while
Data Exhaust
Data is the new oil
Reality: data is a choice
“There is nothing new under the sun
but there are lots of old things we
don't know.”
Ambrose Bierce
Looking at past ways of organizing data
The Elizabethan Era
Commercial printing presses
Data management tech:
▪ Perfect copies
▪ Topical catalogs
▪ Font standardization
▪ Taxonomy ascends
Information explosion:
▪ 8M books in 1500
▪ 200M by 1600
▪ Commoditization
▪ Overload
Elizabethan Era Storage and Retrieval
The Georgian Era: The Explosion of Natural Philosophy
The Victorian Era
The powered printing
information explosion:
▪ Card catalogs, cross-
referencing, random
access metadata
▪ Universal classification
▪ Extended information
management debates
▪ Trading effort and
flexibility for storage
and retrieval
▪ Stereotyping
Melvil Dewey
Dewey Decimal System
Top down orientation
Static structure
Descriptive rather than
explanatory
Taxonomic classification
Cutter Expansive
Classification System
(~1882)
Bottom up orientation
More flexible structure
Explanatory, descriptive
Faceted classification
Charles Ammi Cutter
So why did Dewey beat Cutter?
Pragmatism
Good enough
wins the day
It wasn’t solving
the problem you
thought it was.
X
In every choice, something is lost when something is gained.
What lessons does this history teach us?
1. Information requires organizing principles at
multiple levels from items to collections.
2. Differences in scale require different principles.
3. At a key point in the adoption cycle, emphasis
shifts from management of information to its
dissemination and consumption.
First we record, then we use and share.
Like transaction processing and analysis.
What has this to do with data and persistence?
“schema” is a broad term, a way of organizing and
making something relatable and findable.
“Data” (or object) is to “Database”
as
“Books” are to “Library”
The printed became
more important than
the printer.
The book outlived
generations of presses.
Just like data is now.
Which means we should
pay attention to the
broader organization
and use of data, and
persistence layers.
[Diagram: Order Entry, Customer Service, Distribution, and Accounts Receivable applications; Order, Inventory, and Receivables databases; interface programs linking them; and a Data Warehouse serving analysts and users]
Someone else always wants to use your data
Context (one company)
"In an infinite universe, the one thing sentient life cannot afford to have is
a sense of proportion." – Douglas Adams
[Diagram: information flows in a cheese supply chain across three groups of participants (key: Farms, Cheese Processor, Retailer). It links farmers and milk intake, the cheese plant, in-house and contract cheese stores, the packing plant, the processor's HQ teams (sales, forecasting, bulk planning, milk purchasing) and National Distribution Centre, and the retailer's HQ, RDC and 550 stores, through annual, monthly, weekly and daily forecasts, plans, orders, call-offs and delivery confirmations.]
Source: IGD Food Chain Centre, February 2008
Context (multiple company supply chain)
A value chain diagram, showing the data supply chain for cheese.
The side effects of a single bug can be massive.
Aside: technical debt: what those diagrams show
tek-ni-kuhl det: the cost that accrues due to
decisions made in software design and coding.
Look at the choices and mistakes in development:
Intentional: purposeful choices to optimize schedule, budget, satisfaction
Unintentional: missed requirements, poor code quality, poor design
It is a poor carpenter who blames his tools*
*but sometimes it is the tools
Technical Debt
The cost of some choices can be dealt with in the
short term (e.g. the next sprint) and some only in
the long term (redesign, start over)
Short term: mostly about the application code
Long term: mostly about architecture, design, and infrastructure
If you enter into decisions knowing the true nature
of your coding alternatives, you will be better off
             Intentional      Unintentional
Short term   Code choices     Code flaws (i.e. bugs)
Long term    Design choices   Design flaws
Green: these are deliberate, the tradeoffs known
Yellow: these are minor defects
Red: these are the things that kill a system
Technical Debt can’t be avoided (but it can be managed)
Sometimes you think it’s intentional: incremental design
Long term debts can only be dealt with through planning
[Matrix: short term vs long term, intentional vs unintentional, annotated with development methods; experience, education; agile methods; and redesign]
Technical Debt can’t be avoided
Let’s move this
to the left
What you believe about the technology underlying your
system has a big influence on design choices, so the
focus of this part is on architecture and design with the
hope it will help reduce or avoid long term debt.
[Matrix axes: short term vs long term, intentional vs unintentional]
How did we get here?
There’s a difference
between having no past
and actively rejecting it.
[Timeline of database generations, 1960s to 2010s]
MultiValue, Hierarchical: PICK, IMS, IDS, ADABAS
Relational: CODASYL, System R (SEQUEL), SQL/DS, INGRES (QUEL), Mimer, Oracle
RDBMS, SQL standard: DB2, Teradata, Informix, Sybase, Postgres
OODBMS, ORDBMS: Versant, Objectivity, Gemstone, Informix*, Oracle*
MPP Query, NoSQL: Netezza, Paraccel, Vertica, MongoDB, CouchBase, Riak, Cassandra
NewSQL: SciDB, MonetDB, NuoDB, CitusDB, Spanner, F1
In the beginning: RMSs and pre-relational DBs
At first, common code libraries so there was
reusability for file ops.
Problems:
▪ Portability across languages, OSs
▪ Queries of more than one file
▪ No metadata, what’s in there? Who wrote it?
▪ Concurrency
The databases brought things like recoverability,
durability, ACID transactions. But they were rigid,
prone to breakage.
Operations: First, Next, Prev, Last
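The four navigation operations can be illustrated with a minimal sketch. It assumes a hypothetical in-memory record file; the `RecordFile` class, its method names and the sample records are illustrative and not taken from any real record manager. The point it tries to show is that a "query" at this level is a scan the program writes for itself.

```python
from typing import Optional


class RecordFile:
    """Hypothetical record manager exposing only positional navigation."""

    def __init__(self, records: list[dict]) -> None:
        self._records = records
        self._pos = -1  # no current record until first() or last() is called

    def _current(self) -> Optional[dict]:
        if 0 <= self._pos < len(self._records):
            return self._records[self._pos]
        return None

    def first(self) -> Optional[dict]:
        self._pos = 0
        return self._current()

    def last(self) -> Optional[dict]:
        self._pos = len(self._records) - 1
        return self._current()

    def next(self) -> Optional[dict]:
        # Clamp so that running off the end stays just past the last record.
        self._pos = min(self._pos + 1, len(self._records))
        return self._current()

    def prev(self) -> Optional[dict]:
        # Clamp so that running off the start stays just before the first record.
        self._pos = max(self._pos - 1, -1)
        return self._current()


def find_by_city(f: RecordFile, city: str) -> list[str]:
    """A 'query' is a hand-written scan: the program owns the access path."""
    names = []
    rec = f.first()
    while rec is not None:
        if rec["city"] == city:
            names.append(rec["name"])
        rec = f.next()
    return names


# Made-up sample data for illustration only.
customers = RecordFile([
    {"name": "Ane", "city": "Copenhagen"},
    {"name": "Bo", "city": "Aarhus"},
    {"name": "Cai", "city": "Copenhagen"},
])
print(find_by_city(customers, "Copenhagen"))
```

Every program that wants a different question answered has to write another loop like `find_by_city`, which is one reading of the "queries of more than one file" problem listed above.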
What is
best in life?
To crush the vendors,
see them driven
before you, and hear
the lamentation of
their salespeople.
Wrong!
Loose coupling.
Scalability.
Reusability.
The miracle of pre-relational DB: schema
Loose coupling – the physical model of data
structures and physical placement are no longer a
program’s responsibility
Reusability – More than one program can access the
same data, and no more custom coding for each
application or OS
Scalability – Constraints of schema and typing reduce
resource usage, have finer granularity for concurrent
access, multiple online users.
Key schema flexibility tradeoff for data management
Global validation
vs
contextual
validation
=
Strict rules
vs
lenient rules
=
Write rules
vs
read rules
Schema on write vs schema on read
Match the shape to the hole
or
Match the hole to the shape
Predicate schemas for write flexibility (agility) and speed
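The write-rules versus read-rules tradeoff can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular product: `ORDER_SCHEMA`, both store classes and the sample records are hypothetical names invented for the example.

```python
import json

# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"num": int, "cid": int, "status": str}


def validate(record: dict, schema: dict) -> dict:
    """Raise ValueError unless the record matches the schema exactly."""
    if set(record) != set(schema):
        raise ValueError(f"fields {sorted(record)} do not match {sorted(schema)}")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field!r} is not of type {expected.__name__}")
    return record


class SchemaOnWriteStore:
    """Global validation: bad records are rejected when written."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, record: dict) -> None:
        self.rows.append(validate(record, ORDER_SCHEMA))

    def read(self) -> list[dict]:
        return list(self.rows)  # every reader sees the same, already-checked shape


class SchemaOnReadStore:
    """Contextual validation: anything is stored; each reader applies its own rules."""

    def __init__(self) -> None:
        self.blobs: list[str] = []

    def write(self, record: dict) -> None:
        self.blobs.append(json.dumps(record))  # no checks at write time

    def read(self, schema: dict) -> list[dict]:
        good = []
        for blob in self.blobs:
            try:
                good.append(validate(json.loads(blob), schema))
            except ValueError:
                pass  # this reader chose to skip records that do not fit
        return good


ok = {"num": 1, "cid": 7, "status": "X"}
bad = {"num": "one", "cid": 7}

strict = SchemaOnWriteStore()
strict.write(ok)
try:
    strict.write(bad)
    rejected = False
except ValueError:
    rejected = True

lenient = SchemaOnReadStore()
lenient.write(ok)
lenient.write(bad)

print(rejected, len(strict.read()), len(lenient.blobs), len(lenient.read(ORDER_SCHEMA)))
```

The lenient store accepts writes faster and never blocks a producer, but every reader now carries its own copy of the rules and decides for itself what to do with records that do not fit.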
“Flexibility” – a recent experience in query
The problem with many of these pre-relational
databases is tight coupling between a program
and data structures.
The physical model leaks into the logical with
potentially career-ending effects if the DB is used
for the wrong thing.
And it’s happening again.
It’s a poor carpenter who blames his tools. Or the users.
Trading consistency for performance: CAP theorem
http://blog.nahurst.com/visual-guide-to-nosql-systems
Tradeoffs: In NoSQL the DBMS is in your code
Services provided:
▪ Standard API/query layer*
▪ Transaction / consistency
▪ Query optimization
▪ Data navigation, joins
▪ Data access
▪ Storage management
[Diagram: SQL database vs NoSQL database, showing how these services are divided between the Application and the Database in each]
Anything not done by the DB becomes a developer’s task.
Welcome to 1985!
Simplifying ACID vs BASE
Remember: it’s a poor carpenter who blames his tools.
Google on eventual consistency:
“F1: A Distributed SQL Database That Scales”, Proceedings of the
VLDB Endowment, Vol. 6, No. 11, 2013
Party like it’s 1985
Fastest TPC-B benchmark in 1985 was IMS running on
an IBM 370, 100 TPS, 400 iops, 30 iops/disk
The best relational vendors could muster was 10 TPS
25 years later, SQL Server on an Intel box ran the TPC-B
at 25,000 TPS, 100,000 iops, 300 iops/disk
It looks like you’re
running a benchmark...
Joins. Wait, there’s more than one?
1986: SQL, joins!
Hipster bullshit
I can’t get MySQL to scale
therefore
Relational databases don’t scale
therefore
We must use NoSQL* for everything
*including Hadoop and related
1. Enumerate logically equivalent plans by applying equivalence rules
2. For each logically equivalent plan, enumerate all alternative physical query plans
3. Estimate the cost of each of the alternative physical query plans
4. Run the plan with lowest estimated overall cost
Logical vs physical is an important thing. The CBO turns a
SQL query into an optimal* execution plan for a parallel
pipelined dataflow engine.
Diagram: David J. DeWitt
The relational gift: declarative language + CBO
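The four optimizer steps can be sketched as a toy program. Everything numeric here is an assumption made for illustration: the row counts, the selectivities and the cost formulas are invented, and the search space is restricted to left-deep join orders, so the plan count will not match the figures quoted elsewhere in this deck. Real cost-based optimizers use much richer statistics and pruning.

```python
from itertools import permutations, product

# Hypothetical statistics: row counts after filtering (not from the slides).
ROWS = {"Customers": 100, "Orders": 10_000, "Lines": 50_000}
# Hypothetical fraction of row pairs that survive each join predicate.
SELECTIVITY = {
    frozenset(["Customers", "Orders"]): 1 / 100,   # C.cid = O.cid
    frozenset(["Orders", "Lines"]): 1 / 10_000,    # O.num = L.num
}
JOIN_METHODS = ("hash", "merge", "nested")


def join_cost(method: str, left_rows: float, right_rows: float) -> float:
    """Toy cost formulas, chosen only to make the methods differ."""
    if method == "nested":
        return left_rows * right_rows
    if method == "merge":
        return 3 * (left_rows + right_rows)  # pretend sorting costs three passes
    return left_rows + 2 * right_rows  # hash: build on the right, probe with the left


def plan_cost(order: tuple, methods: tuple) -> float:
    """Estimate the total cost of a left-deep plan."""
    joined = {order[0]}
    rows = float(ROWS[order[0]])
    total = 0.0
    for table, method in zip(order[1:], methods):
        total += join_cost(method, rows, ROWS[table])
        # Apply the selectivity of any predicate linking the new table;
        # with no predicate the join degenerates to a cross product.
        sel = 1.0
        for other in joined:
            sel *= SELECTIVITY.get(frozenset([other, table]), 1.0)
        rows = rows * ROWS[table] * sel
        joined.add(table)
    return total


def best_plan():
    candidates = []
    for order in permutations(ROWS):                     # step 1: logical plans
        for methods in product(JOIN_METHODS, repeat=2):  # step 2: physical plans
            cost = plan_cost(order, methods)             # step 3: estimate cost
            candidates.append((cost, order, methods))
    return min(candidates), len(candidates)              # step 4: pick the cheapest


(cost, order, methods), n_plans = best_plan()
print(n_plans, order, methods, cost)
```

Changing the numbers in `ROWS` changes which plan wins, which is the reason for keeping the logical query separate from the physical plan: the same query can deserve a different plan as the data changes.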
A simple 3 table join: a programmer’s job?
SELECT C.name, O.num
FROM Orders O, Lines L, Customers C
WHERE C.City = 'Copenhagen' AND L.status = 'X'
AND O.num = L.num AND C.cid = O.cid
Number of logical plans: 9
Ways to join (hash, merge, nested): 3
For each plan, there are multiple physical plans: 36
That makes a total of 324 physical plans, the
efficiency of which changes based on cardinality.
Scalability? ZOMG just add nodes!
"The most amazing achievement of the computer
software industry is its continuing cancellation of
the steady and staggering gains made by the
computer hardware industry." – Henry Petroski
After all, your database is web scale, isn’t it?
Just add hardware?
No amount of hardware will make incorrectly
coded software run in parallel.
Declarative languages make this easier by turning
the problem over to the computer to resolve.
Guess which runs in parallel:
Open cursor C
Loop (Fetch row C)
  Open cursor O
  Loop (Fetch row O)
    Open cursor L
    Output (Do-things)
  End loop
End loop ...

SELECT C.name, O.num
FROM Orders O, Lines L, Customers C
WHERE C.City = 'Aarhus' AND C.cid = O.cid
  AND O.num = L.num AND L.status = 'X'
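The two versions above can be put side by side in a small runnable sketch. It uses Python's built-in `sqlite3` module only because it is convenient; SQLite does not parallelize queries itself, so this shows just the structural difference. The table contents are made up for the example, and the innermost loop over `Lines` is an assumed completion of the elided pseudocode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (cid INTEGER PRIMARY KEY, name TEXT, City TEXT);
    CREATE TABLE Orders    (num INTEGER PRIMARY KEY, cid INTEGER);
    CREATE TABLE Lines     (num INTEGER, status TEXT);
    INSERT INTO Customers VALUES (1, 'Ane', 'Copenhagen'), (2, 'Bo', 'Aarhus');
    INSERT INTO Orders    VALUES (10, 1), (11, 2), (12, 2);
    INSERT INTO Lines     VALUES (10, 'X'), (11, 'X'), (12, 'Y');
""")


def cursor_loops(conn: sqlite3.Connection) -> list:
    """Procedural version: join order and join method are frozen in the code."""
    out = []
    for cid, name in conn.execute(
        "SELECT cid, name FROM Customers WHERE City = 'Aarhus'"
    ).fetchall():
        for (num,) in conn.execute(
            "SELECT num FROM Orders WHERE cid = ?", (cid,)
        ).fetchall():
            for _ in conn.execute(
                "SELECT 1 FROM Lines WHERE num = ? AND status = 'X'", (num,)
            ).fetchall():
                out.append((name, num))
    return out


DECLARATIVE = """
    SELECT C.name, O.num
    FROM Orders O, Lines L, Customers C
    WHERE C.City = 'Aarhus' AND C.cid = O.cid
      AND O.num = L.num AND L.status = 'X'
"""

loop_rows = cursor_loops(conn)
sql_rows = conn.execute(DECLARATIVE).fetchall()
# The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + DECLARATIVE)]

print(sorted(loop_rows) == sorted(sql_rows), sql_rows)
print(plan)  # the engine, not the program, picked this access path
```

Both versions return the same rows, but only the declarative one leaves the engine free to reorder, re-plan or, on an engine that supports it, parallelize the work without a code change.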
A more realistic example: TPC-H query #8
Select o_year,
sum(case
when nation = 'BRAZIL' then volume
else 0
end) / sum(volume)
from
(
select YEAR(O_ORDERDATE) as o_year,
L_EXTENDEDPRICE * (1 - L_DISCOUNT) as volume,
n2.N_NAME as nation
from PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION n1,
NATION n2, REGION
where
P_PARTKEY = L_PARTKEY and S_SUPPKEY = L_SUPPKEY
and L_ORDERKEY = O_ORDERKEY and O_CUSTKEY = C_CUSTKEY
and C_NATIONKEY = n1.N_NATIONKEY and n1.N_REGIONKEY = R_REGIONKEY
and R_NAME = 'AMERICA' and S_NATIONKEY = n2.N_NATIONKEY
and O_ORDERDATE between '1995-01-01' and '1996-12-31'
and P_TYPE = 'ECONOMY ANODIZED STEEL'
and S_ACCTBAL <= constant-1
and L_EXTENDEDPRICE <= constant-2
) as all_nations
group by o_year order by o_year
22 million possible execution plans, please find the
best one in 4 milliseconds
Just add hardware?
[Chart: speedup (0% to 1000%) versus number of processors (1 to 10), comparing linear scaling with scaling under 10% overhead (Amdahl's law)]
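The gap between the two curves can be reproduced with a short calculation. This sketch assumes that "10% overhead" means 10% of the work is serial and cannot be spread across processors; that reading, and the function name `amdahl_speedup`, are assumptions rather than something the slide states.

```python
def amdahl_speedup(processors: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    if processors < 1 or not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("need processors >= 1 and 0 <= serial_fraction <= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


for n in range(1, 11):
    linear = amdahl_speedup(n, 0.0)     # ideal case: speedup equals n
    overhead = amdahl_speedup(n, 0.10)  # assumed 10% serial work
    print(f"{n:2d} processors: linear {linear:5.0%}, 10% overhead {overhead:5.0%}")
```

Under that assumption ten processors give roughly a 5.3x speedup rather than 10x, and the shortfall widens with every node added.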
[Diagram: three platform tiers, spanning "Analyze & Report" to "Discover & Explore": Data Warehouse (structured, SQL; production data warehousing, large concurrent user base; enterprise-class system), Singularity (Data Warehouse + Behavioral; semi-structured, SQL++; contextual-complex analytics, deep, seasonal, consumable data sets; low end enterprise-class system), and Hadoop (unstructured, Java/C; structure the unstructured, detect patterns; commodity hardware system). Labels include 6+PB, 40+PB, 20+PB and 500+, 150+, 5-10 concurrent users.]
Parallel Efficiency and Platform Costs
Platform Metrics for Table Scan and Sum, Hadoop vs Teradata
There are really three workload classes to consider
1. Operational: OLTP systems
2. Analytic: Query systems
3. Scientific: Computational systems
Unit of focus:
1. Transaction
2. Query
3. Computation
Different problems require different platforms
Workloads
                OLTP                 BI              Analytics
Access          Read-Write           Read-only       Read-mostly
Predictability  Fixed path           Unpredictable   All data
Selectivity     High                 Low             Low
Retrieval       Low                  Low             High
Latency         Milliseconds         <seconds        msecs to days
Concurrency     Huge                 Moderate        1 to huge
Model           3NF, nested object   Dim, denorm     BWT
Task size       Small                Large           Small to huge
A key point worth remembering:
Performance over size <> performance over complexity
OLTP performance is mostly related to transaction
coordination and writing under high concurrency.
BI performance is mostly related to data volume and
query complexity.
Analytics performance is about the intersection of
these with computational complexity.
A history of databases in No notation
1970s: NoSQL = We have no SQL
1980s: NoSQL = Know SQL
2000s: NoSQL = No SQL!
2005: NoSQL = Not only SQL
2010s: NoSQL = No, SQL!
(R)DB(MS)
TANSTAAFL
Technologies are not
perfect replacements for
one another. Often not
better, only different.
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.
Unintended consequences
Away from “one throat to choke”, back to best of breed
Tight coupling leads to slow
changes. The market is not
in the tight coupling phase
In a rapidly evolving
market, componentized
architectures, modularity
and loose coupling are
favorable over monolithic
stacks, single-vendor
architectures and tight
coupling.
Don’t follow the market
Some people can’t resist
getting the next new thing
because it’s new and new is
always better.
Many IT organizations are like
this, promoting a solution and
hunting for the problem that
matches it.
Better to ask “What is the
problem for which this
technology is the answer?”
How we develop best practices: survival bias
We don’t need best practices, we need worst failures.
Summarizing some key points
1. Data lives longer than code. It pays to focus on
data when choosing and using persistence
layers.
2. Different persistence layer approaches have
different tradeoffs (like performance vs app
complexity in BASE models).
3. Don’t throw away 30 years of engineering
unless you know why it was put there.
4. Decoupling, reusability (of data) and scalability
are nice things to have.
References (things worth reading on the way home)
A relational model for large shared data banks, Communications of the ACM, June, 1970,
http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
Column-Oriented Database Systems, Stavros Harizopoulos, Daniel Abadi, Peter Boncz, VLDB 2009 Tutorial
http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
Nobody ever got fired for using Hadoop on a cluster, 1st International Workshop on Hot Topics in Cloud Data
Processing, April 10, 2012, Bern, Switzerland.
A co-Relational Model of Data for Large Shared Data Banks, ACM Queue, 2012,
http://queue.acm.org/detail.cfm?id=1961297
A query language for multidimensional arrays: design, implementation and optimization techniques, SIGMOD,
1996
Probabilistically Bounded Staleness for Practical Partial Quorums, Proceedings of the VLDB Endowment, Vol. 5,
No. 8, http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
“Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al
MapReduce: Simplified Data Processing on Large Clusters,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
Dremel: Interactive Analysis of Web-Scale Datasets, Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
Spanner: Google’s Globally-Distributed Database, SIGMOD, May, 2012,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/es//archive/spanner-osdi2012.pdf
F1: A Distributed SQL Database That Scales, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013,
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf
About the Presenter
Mark Madsen is president of Third
Nature, a technology research and
consulting firm focused on business use
of data and analytics. Mark is an award-
winning author, architect and CTO
whose work has been featured in
numerous industry publications. Over
the past ten years Mark received awards
for his work from the American
Productivity & Quality Center, TDWI, and
the Smithsonian Institute. He is an
international speaker, a contributor to
Forbes Online and on the O’Reilly Strata
program committee. For more
information or to contact Mark, follow
@markmadsen on Twitter or visit
http://ThirdNature.net
About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in business intelligence, analytics and
performance management. If your question is related to BI, analytics,
information strategy and data, then you’re at the right place.
Our goal is to help companies take advantage of information-driven
management practices and applications. We offer education, consulting
and research services to support business and IT organizations as well as
technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating technology and how it is
applied rather than vendor market positions.
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
bookshelf by spectrum.jpg - http://flickr.com/photos/santos/1704875109/
Vatican library - http://www.flickr.com/photos/paullew/1550844955
round hole square peg - https://www.flickr.com/photos/epublicist/3546059144
text composition - http://flickr.com/photos/candiedwomanire/60224567/
twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
CAP diagram - http://blog.nahurst.com/visual-guide-to-nosql-systems
refinery-hdr.jpg - http://www.flickr.com/photos/vermininc/2477872191/
refinery-night.jpg - http://www.flickr.com/photos/vermininc/2485448766/

Weitere ähnliche Inhalte

Was ist angesagt?

Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
mark madsen
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
mark madsen
 
2015 04 bio it world
2015 04 bio it world2015 04 bio it world
2015 04 bio it world
Chris Dwan
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
mark madsen
 

Was ist angesagt? (20)

2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the Trenches
 
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
 
BioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the TrenchesBioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the Trenches
 
2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ
 
Innovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerInnovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringer
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Taming Big Science Data Growth with Converged Infrastructure
Taming Big Science Data Growth with Converged InfrastructureTaming Big Science Data Growth with Converged Infrastructure
Taming Big Science Data Growth with Converged Infrastructure
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 
BioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology ExchangeBioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology Exchange
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
The Evolving Role of the Data Engineer - Whitepaper | Qubole
The Evolving Role of the Data Engineer - Whitepaper | QuboleThe Evolving Role of the Data Engineer - Whitepaper | Qubole
The Evolving Role of the Data Engineer - Whitepaper | Qubole
 
How to understand trends in the data & software market
How to understand trends in the data & software marketHow to understand trends in the data & software market
How to understand trends in the data & software market
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
 
2015 Bio-IT Trends From the Trenches
2015 Bio-IT Trends From the Trenches2015 Bio-IT Trends From the Trenches
2015 Bio-IT Trends From the Trenches
 
2015 04 bio it world
2015 04 bio it world2015 04 bio it world
2015 04 bio it world
 
Cloud Security for Life Science R&D
Cloud Security for Life Science R&DCloud Security for Life Science R&D
Cloud Security for Life Science R&D
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 

Andere mochten auch

001 intro psycho cognitive
001 intro psycho cognitive001 intro psycho cognitive
001 intro psycho cognitive
François Dehan
 
Predictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advicePredictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advice
The Marketing Distillery
 

Andere mochten auch (8)

001 intro psycho cognitive
001 intro psycho cognitive001 intro psycho cognitive
001 intro psycho cognitive
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Data
 
Music video analysis
Music video analysisMusic video analysis
Music video analysis
 
Avicii
AviciiAvicii
Avicii
 
Les retours accélérés : ce que l’Intelligence Artificielle va changer pour vo...
Les retours accélérés : ce que l’Intelligence Artificielle va changer pour vo...Les retours accélérés : ce que l’Intelligence Artificielle va changer pour vo...
Les retours accélérés : ce que l’Intelligence Artificielle va changer pour vo...
 
Predictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advicePredictive analytics in action real-world examples and advice
Predictive analytics in action real-world examples and advice
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use Cases
 
Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An Overview
 

Ähnlich wie Big Data and Bad Analogies

Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.doc
butest
 

Ähnlich wie Big Data and Bad Analogies (20)

Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014
 
Following Google: Don’t Follow the Followers, Follow the Leaders
Following Google: Don’t Follow the Followers, Follow the LeadersFollowing Google: Don’t Follow the Followers, Follow the Leaders
Following Google: Don’t Follow the Followers, Follow the Leaders
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12
 
Database Revolution - Exploratory Webcast
Database Revolution - Exploratory WebcastDatabase Revolution - Exploratory Webcast
Database Revolution - Exploratory Webcast
 
2007 Mark Logic User Conference Keynote
2007 Mark Logic User Conference Keynote2007 Mark Logic User Conference Keynote
2007 Mark Logic User Conference Keynote
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Bigdata notes
Bigdata notesBigdata notes
Bigdata notes
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Gerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and InvestmentGerenral insurance Accounts IT and Investment
Gerenral insurance Accounts IT and Investment
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.doc
 
Teradata Aster Discovery Platform
Teradata Aster Discovery PlatformTeradata Aster Discovery Platform
Teradata Aster Discovery Platform
 
Benefits of big data
Benefits of big dataBenefits of big data
Benefits of big data
 
Big Data
Big DataBig Data
Big Data
 
CERT Data Science in Cybersecurity Symposium
CERT Data Science in Cybersecurity SymposiumCERT Data Science in Cybersecurity Symposium
CERT Data Science in Cybersecurity Symposium
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altin
 

Big Data and Bad Analogies

  • 1. Big Data, Bad Analogies GOTO Copenhagen September, 2014 Mark Madsen www.ThirdNature.net @markmadsen
  • 2. Copyright Third Nature, Inc. The problem with bad framing: it leads to bad assumptions about use, inappropriate features, and poor understanding of substitutability and the impacts it will have.
  • 3. The data lake
  • 4. The data lake after a little while
  • 5. Data Exhaust
  • 6. Data is the new oil
  • 7. Reality: data is a choice
  • 8. “There is nothing new under the sun but there are lots of old things we don't know.” Ambrose Bierce
  • 9. Looking at past ways of organizing data
  • 10. The Elizabethan Era: commercial printing presses. Data management tech: perfect copies, topical catalogs, font standardization, taxonomy ascends. Information explosion: 8M books in 1500, 200M by 1600, commoditization, overload.
  • 11. Elizabethan Era Storage and Retrieval
  • 12. Elizabethan Era Storage and Retrieval
  • 13. Elizabethan Era Storage and Retrieval
  • 14. The Georgian Era: The Explosion of Natural Philosophy
  • 15. The Victorian Era. The powered printing information explosion: card catalogs, cross-referencing, random access metadata; universal classification; extended information management debates; trading effort and flexibility for storage and retrieval; stereotyping.
  • 16. Melvil Dewey: Dewey Decimal System. Top down orientation. Static structure. Descriptive rather than explanatory. Taxonomic classification.
  • 17. Charles Ammi Cutter: Cutter Expansive Classification System (~1882). Bottom up orientation. More flexible structure. Explanatory, descriptive. Faceted classification.
  • 18. So why did Dewey beat Cutter? Pragmatism. Good enough wins the day. It wasn’t solving the problem you thought it was. In every choice, something is lost when something is gained.
  • 19. What lessons does this history teach us? 1. Information requires organizing principles at multiple levels, from items to collections. 2. Differences in scale require different principles. 3. At a key point in the adoption cycle, emphasis shifts from management of information to its dissemination and consumption. First we record, then we use and share. Like transaction processing and analysis.
  • 20. What has this to do with data and persistence? “Schema” is a broad term, a way of organizing and making something relatable and findable. “Data” (or object) is to “Database” as “Books” are to “Library”.
  • 21. The printed became more important than the printer. The book outlived generations of presses. Just like data is now. Which means we should pay attention to the broader organization and use of data, and persistence layers.
  • 22. Someone else always wants to use your data. [Diagram labels: Order Entry, Order Database, Customer Service, Interface Program, Inventory Database, Distribution, Interface Program, Receivables Database, Accounts Receivable, Data Warehouse, Analysts & users.]
  • 23. Context (one company). “In an infinite universe, the one thing sentient life cannot afford to have is a sense of proportion.” – Douglas Adams
  • 24. Context (multiple company supply chain). [Diagram: a value chain diagram showing the data supply chain for cheese, spanning farms, the cheese processor, and the retailer. Source: IGD Food Chain Centre, February 2008.] The side effects of a single bug can be massive.
  • 25. Aside: technical debt: what those diagrams show. Technical debt (tek-ni-kuhl det): the cost that accrues due to decisions made in software design and coding. Look at the choices and mistakes in development. Intentional: purposeful choices to optimize schedule, budget, satisfaction. Unintentional: missed requirements, poor code quality, poor design.
  • 26. It is a poor carpenter who blames his tools* (*but sometimes it is the tools)
  • 27. Technical Debt. The cost of some choices can be dealt with in the short term (e.g. the next sprint) and some only in the long term (redesign, start over). Short term: mostly about the application code. Long term: mostly about architecture, design, and infrastructure.
  • 28. If you enter into decisions knowing the true nature of your coding alternatives, you will be better off. [Quadrant diagram: short term (code choices) / long term (design choices) against intentional / unintentional; cells labeled code flaws (i.e. bugs) and design flaws.] Green: these are deliberate, the tradeoffs known. Yellow: these are minor defects. Red: these are the things that kill a system.
  • 29. Technical Debt can’t be avoided (but it can be managed). Development methods. Experience, education. Sometimes you think it’s intentional: incremental design. Long term debts can only be dealt with through planning. [Quadrant diagram: short term / long term against intentional / unintentional; labels: Agile methods, Redesign.]
  • 30. Technical Debt can’t be avoided. Let’s move this to the left. What you believe about the technology underlying your system has a big influence on design choices, so the focus of this part is on architecture and design, with the hope it will help reduce or avoid long term debt. [Quadrant diagram: short term / long term against intentional / unintentional.]
  • 31. How did we get here? There’s a difference between having no past and actively rejecting it.
  • 32. [Timeline of database generations, 1960s to 2010s.] MultiValue, Hierarchical: PICK, IMS, IDS, ADABAS. Relational: CODASYL, System R (SEQUEL), SQL/DS, INGRES (QUEL), Mimer, Oracle. RDBMS, SQL standard: DB2, Teradata, Informix, Sybase, Postgres. OODBMS, ORDBMS: Versant, Objectivity, Gemstone, Informix*, Oracle*. MPP Query, NoSQL: Netezza, Paraccel, Vertica, MongoDB, CouchBase, Riak, Cassandra. NewSQL: SciDB, MonetDB, NuoDB, CitusDB, Spanner, F1.
  • 33. In the beginning: RMSs and pre-relational DBs. At first, common code libraries, so there was reusability for file ops. Problems: portability across languages and OSs; queries of more than one file; no metadata (what’s in there? who wrote it?); concurrency. The databases brought things like recoverability, durability, ACID transactions. But they were rigid, prone to breakage. Operations: First, Next, Prev, Last.
  • 34. What is best in life?
  • 35. To crush the vendors, see them driven before you, and hear the lamentation of their salespeople.
  • 36. Wrong! Loose coupling. Scalability. Reusability.
  • 37. The miracle of pre-relational DB: schema. Loose coupling: the physical model of data structures and physical placement are no longer a program’s responsibility. Reusability: more than one program can access the same data, and no more custom coding for each application or OS. Scalability: constraints of schema and typing reduce resource usage, and give finer granularity for concurrent access and multiple online users.
  • 38. Key schema flexibility tradeoff for data management: global validation vs contextual validation = strict rules vs lenient rules = write rules vs read rules.
  • 39. Schema on write vs schema on read: match the shape to the hole, or match the hole to the shape. Predicate schemas for write flexibility (agility) and speed.
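The write-rules versus read-rules tradeoff on slides 38 and 39 can be sketched in a few lines. This is a minimal illustration, not anything from the talk: `ORDER_SCHEMA`, `SchemaOnWriteStore`, and `SchemaOnReadStore` are hypothetical names, and the "schema" is just a mapping of field names to Python types.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "city": str, "amount": float}


def validate(record, schema):
    """Return the record, or raise ValueError if a field is missing or mistyped."""
    for field, expected in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"bad type for field: {field}")
    return record


class SchemaOnWriteStore:
    """Global validation: one strict rule set, enforced when data is written."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        self.rows.append(validate(record, self.schema))

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Contextual validation: anything is stored; each reader applies its own rules."""

    def __init__(self):
        self.blobs = []

    def write(self, record):
        self.blobs.append(json.dumps(record))

    def read(self, schema):
        good, bad = [], []
        for blob in self.blobs:
            record = json.loads(blob)
            try:
                good.append(validate(record, schema))
            except ValueError:
                bad.append(record)
        return good, bad
```

In this sketch the write-side store rejects a malformed record immediately, while the read-side store accepts it and leaves every consumer to discover and handle it later, which is the cost hidden behind the write-time flexibility.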
  • 40. “Flexibility”: a recent experience in query. The problem with many of these pre-relational databases is tight coupling between a program and data structures. The physical model leaks into the logical, with potentially career-ending effects if the DB is used for the wrong thing. And it’s happening again. It’s a poor carpenter who blames his tools. Or the users.
  • 41. Trading consistency for performance: CAP theorem. http://blog.nahurst.com/visual-guide-to-nosql-systems
  • 42. Tradeoffs: in NoSQL the DBMS is in your code. Services provided: standard API/query layer*, transaction / consistency, query optimization, data navigation and joins, data access, storage management. [Diagram: where each service sits, database or application, for a SQL database and a NoSQL database.] Anything not done by the DB becomes a developer’s task. Welcome to 1985!
  • 43. Simplifying ACID vs BASE. Remember: it’s a poor carpenter who blames his tools.
  • 44. Google on eventual consistency: “F1: A Distributed SQL Database That Scales”, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013
  • 45. Party like it’s 1985. Fastest TPC-B benchmark in 1985 was IMS running on an IBM 370: 100 TPS, 400 iops, 30 iops/disk. The best relational vendors could muster was 10 TPS. 25 years later, SQLServer on an Intel box ran the TPC-B at 25,000 TPS, 100,000 iops, 300 iops/disk. It looks like you’re running a benchmark...
  • 46. Joins. Wait, there’s more than one? 1986: SQL, joins!
  • 47. Hipster bullshit: I can’t get MySQL to scale, therefore relational databases don’t scale, therefore we must use NoSQL* for everything. (*including Hadoop and related)
  • 48. The relational gift: declarative language + CBO. 1. Enumerate logically equivalent plans by applying equivalence rules. 2. For each logically equivalent plan, enumerate all alternative physical query plans. 3. Estimate the cost of each of the alternative physical query plans. 4. Run the plan with the lowest estimated overall cost. Logical vs physical is an important thing. The CBO turns a SQL query into an optimal* execution plan for a parallel pipelined dataflow engine. Diagram: David J. DeWitt
  • 49. A simple 3 table join: a programmer’s job? SELECT C.name, O.num FROM Orders O, Lines L, Customers C WHERE C.City = 'Copenhagen' AND L.status = 'X' AND O.num = L.num AND C.cid = O.cid. Number of logical plans: 9. Ways to join (hash, merge, nested): 3. For each logical plan, there are multiple physical plans: 36. That makes a total of 9 × 36 = 324 physical plans, the efficiency of which changes based on cardinality.
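The enumerate, estimate, and pick-the-cheapest loop from slides 48 and 49 can be sketched as a toy optimizer. Everything here is an assumption made for illustration: the cost formulas are rough textbook-style estimates, the `selectivity` value and cardinalities are made up, only left-deep join orders are considered, and missing join predicates are ignored. It yields 6 join orders × 9 method combinations = 54 plans, so it does not reproduce the slide's count of 324.

```python
from itertools import permutations, product
from math import log2

JOIN_METHODS = ("hash", "merge", "nested")


def join_cost(method, left_rows, right_rows):
    """Rough per-join cost estimate; real optimizers use far richer models."""
    if method == "hash":
        return left_rows + right_rows
    if method == "merge":
        # Sort both inputs, then merge.
        return left_rows * log2(left_rows + 1) + right_rows * log2(right_rows + 1)
    return left_rows * right_rows  # nested loop


def enumerate_plans(cardinality, selectivity=0.01):
    """Yield (estimated_cost, join_order, join_methods) for every left-deep plan."""
    for order in permutations(cardinality):
        for methods in product(JOIN_METHODS, repeat=len(order) - 1):
            rows = cardinality[order[0]]
            cost = 0.0
            for table, method in zip(order[1:], methods):
                cost += join_cost(method, rows, cardinality[table])
                # Assumed size of the intermediate result after this join.
                rows = max(1.0, rows * cardinality[table] * selectivity)
            yield cost, order, methods


def best_plan(cardinality):
    """Pick the plan with the lowest estimated cost."""
    return min(enumerate_plans(cardinality))


if __name__ == "__main__":
    # Hypothetical table sizes; change them to see how the chosen plan responds.
    sizes = {"Customers": 1_000, "Orders": 10_000, "Lines": 50_000}
    print(best_plan(sizes))
```

Changing the numbers in `sizes` is a quick way to explore the slide's point that plan efficiency depends on cardinality, which is why this search is left to the optimizer rather than hand-coded.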
  • 50. Scalability? ZOMG just add nodes! “The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.” – Henry Petroski. After all, your database is web scale, isn’t it?
  • 51. Just add hardware? No amount of hardware will make incorrectly coded software run in parallel. Declarative languages make this easier by turning the problem over to the computer to resolve. Guess which runs in parallel. Procedural: Open cursor C; Loop (Fetch row C); Open cursor O; Loop (Fetch row O); Open cursor L; Output (Do-things); End loop; End loop; ... Declarative: SELECT C.name, O.num FROM Orders O, Lines L, Customers C WHERE C.City = 'Aarhus' AND C.cid = O.cid AND O.num = L.num AND L.status = 'X'
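The contrast on slide 51 can be made runnable with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table layouts and sample rows are invented for illustration, and SQLite is used only because it ships with Python, not because it parallelizes queries. The point is that both forms return the same rows, but only the declarative one leaves the engine free to choose join order, join method, and any parallelism.

```python
import sqlite3


def build_db():
    """Create an in-memory database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Customers (cid INTEGER, name TEXT, City TEXT);
        CREATE TABLE Orders (num INTEGER, cid INTEGER);
        CREATE TABLE Lines (num INTEGER, status TEXT);
        INSERT INTO Customers VALUES (1, 'Anna', 'Aarhus'), (2, 'Bo', 'Copenhagen');
        INSERT INTO Orders VALUES (10, 1), (11, 1), (20, 2);
        INSERT INTO Lines VALUES (10, 'X'), (11, 'Y'), (20, 'X');
    """)
    return db


def procedural(db):
    """Cursor-style nested loops: the join order is fixed by the programmer."""
    out = []
    customers = db.execute(
        "SELECT cid, name FROM Customers WHERE City = 'Aarhus'").fetchall()
    for cid, name in customers:
        orders = db.execute(
            "SELECT num FROM Orders WHERE cid = ?", (cid,)).fetchall()
        for (num,) in orders:
            lines = db.execute(
                "SELECT 1 FROM Lines WHERE num = ? AND status = 'X'", (num,)).fetchall()
            for _ in lines:
                out.append((name, num))
    return out


def declarative(db):
    """One statement: the engine decides how to execute the join."""
    return db.execute("""
        SELECT C.name, O.num
        FROM Orders O, Lines L, Customers C
        WHERE C.City = 'Aarhus' AND C.cid = O.cid
          AND O.num = L.num AND L.status = 'X'
    """).fetchall()
```

Calling `procedural(build_db())` and `declarative(build_db())` gives the same result set; the difference is who owns the execution strategy.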
  • 52. A more realistic example: TPC-H query #8. select o_year, sum(case when nation = 'BRAZIL' then volume else 0 end) / sum(volume) from ( select YEAR(O_ORDERDATE) as o_year, L_EXTENDEDPRICE * (1 - L_DISCOUNT) as volume, n2.N_NAME as nation from PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION n1, NATION n2, REGION where P_PARTKEY = L_PARTKEY and S_SUPPKEY = L_SUPPKEY and L_ORDERKEY = O_ORDERKEY and O_CUSTKEY = C_CUSTKEY and C_NATIONKEY = n1.N_NATIONKEY and n1.N_REGIONKEY = R_REGIONKEY and R_NAME = 'AMERICA' and S_NATIONKEY = n2.N_NATIONKEY and O_ORDERDATE between '1995-01-01' and '1996-12-31' and P_TYPE = 'ECONOMY ANODIZED STEEL' and S_ACCTBAL <= constant-1 and L_EXTENDEDPRICE <= constant-2 ) as all_nations group by o_year order by o_year. 22 million possible execution plans, please find the best one in 4 milliseconds.
  • 53. Just add hardware? [Chart: speedup (Amdahl’s law), 0% to 1000%, against number of processors, 1 to 10; two series: linear and 10% overhead.]
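The curve on slide 53 follows from Amdahl's law, speedup = 1 / (s + (1 - s) / n), where s is the fraction of work that cannot be parallelized and n is the number of processors. The sketch below assumes the slide's "10% overhead" means s = 0.1; that reading is an interpretation, not something the transcript states.

```python
def amdahl_speedup(processors, serial_fraction):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


if __name__ == "__main__":
    # Compare ideal linear scaling with an assumed 10% serial fraction.
    for n in range(1, 11):
        linear = amdahl_speedup(n, 0.0)
        capped = amdahl_speedup(n, 0.1)
        print(f"{n:2d} processors: linear {linear:5.0%}, 10% serial {capped:5.0%}")
```

Under that assumption ten processors deliver a little over a 5x speedup rather than 10x, and the gap widens as more hardware is added.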
  • 54. Parallel Efficiency and Platform Costs. [Diagram comparing three platforms: Data Warehouse (structured, SQL), Singularity (semi-structured, SQL++), and Hadoop (unstructured, Java/C). Other labels: Data Warehouse + Behavioral; Enterprise-class System; Low End Enterprise-class System; Commodity Hardware System; Production Data Warehousing; Large Concurrent User-base; Contextual-Complex Analytics; Deep, Seasonal, Consumable Data Sets; Structure the Unstructured; Detect Patterns; Analyze & Report; Discover & Explore; 500+, 150+, and 5-10 concurrent users; 6+PB, 40+PB, 20+PB.]
  • 55. Platform Metrics for Table Scan and Sum, Hadoop vs Teradata
  • 56. There are really three workload classes to consider: 1. Operational: OLTP systems. 2. Analytic: Query systems. 3. Scientific: Computational systems. Unit of focus: 1. Transaction. 2. Query. 3. Computation. Different problems require different platforms.
  • 57. Workloads (values given as OLTP / BI / Analytics):
      ▪ Access: Read-Write / Read-only / Read-mostly
      ▪ Predictability: Fixed path / Unpredictable / All data
      ▪ Selectivity: High / Low / Low
      ▪ Retrieval: Low / Low / High
      ▪ Latency: Milliseconds / <seconds / msecs to days
      ▪ Concurrency: Huge / Moderate / 1 to huge
      ▪ Model: 3NF, nested object / Dim, denorm / BWT
      ▪ Task size: Small / Large / Small to huge
  • 58. A key point worth remembering: performance over size <> performance over complexity. OLTP performance is mostly related to transaction coordination and writing under high concurrency. BI performance is mostly related to data volume and query complexity. Analytics performance is about the intersection of these with computational complexity.
  • 59. A history of databases in No notation. 1970s: NoSQL = We have no SQL. 1980s: NoSQL = Know SQL. 2000s: NoSQL = No SQL! 2005s: NoSQL = Not only SQL. 2010s: NoSQL = No, SQL! (R)DB(MS)
  • 60. TANSTAAFL. Technologies are not perfect replacements for one another. Often not better, only different. When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.
  • 61. Unintended consequences
  • 62. Away from “one throat to choke”, back to best of breed. Tight coupling leads to slow changes. The market is not in the tight coupling phase. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.
  • 63. Don’t follow the market. Some people can’t resist getting the next new thing because it’s new and new is always better. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it. Better to ask “What is the problem for which this technology is the answer?”
  • 64. How we develop best practices: survival bias. We don’t need best practices, we need worst failures.
  • 65. Summarizing some key points: 1. Data lives longer than code. It pays to focus on data when choosing and using persistence layers. 2. Different persistence layer approaches have different tradeoffs (like performance vs app complexity in BASE models). 3. Don’t throw away 30 years of engineering unless you know why it was put there. 4. Decoupling, reusability (of data) and scalability are nice things to have.
  • 66. References (things worth reading on the way home)
      ▪ A relational model for large shared data banks, Communications of the ACM, June 1970, http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
      ▪ Column-Oriented Database Systems, Stavros Harizopoulos, Daniel Abadi, Peter Boncz, VLDB 2009 Tutorial, http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
      ▪ Nobody ever got fired for using Hadoop on a cluster, 1st International Workshop on Hot Topics in Cloud Data Processing, April 10, 2012, Bern, Switzerland
      ▪ A co-Relational Model of Data for Large Shared Data Banks, ACM Queue, 2012, http://queue.acm.org/detail.cfm?id=1961297
      ▪ A query language for multidimensional arrays: design, implementation and optimization techniques, SIGMOD, 1996
      ▪ Probabilistically Bounded Staleness for Practical Partial Quorums, Proceedings of the VLDB Endowment, Vol. 5, No. 8, http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
      ▪ Amorphous Data-parallelism in Irregular Algorithms, Keshav Pingali et al.
      ▪ MapReduce: Simplified Data Processing on Large Clusters, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
      ▪ Dremel: Interactive Analysis of Web-Scale Datasets, Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
      ▪ Spanner: Google’s Globally-Distributed Database, SIGMOD, May 2012, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/es//archive/spanner-osdi2012.pdf
      ▪ F1: A Distributed SQL Database That Scales, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf
  • 67. About the Presenter. Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business use of data and analytics. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 68. About Third Nature. Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, analytics and performance management. If your question is related to BI, analytics, information strategy and data then you’re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
  • 69. CC Image Attributions. Thanks to the people who supplied the creative commons licensed images used in this presentation:
      ▪ bookshelf by spectrum.jpg - http://flickr.com/photos/santos/1704875109/
      ▪ Vatican library - http://www.flickr.com/photos/paullew/1550844955
      ▪ round hole square peg - https://www.flickr.com/photos/epublicist/3546059144
      ▪ text composition - http://flickr.com/photos/candiedwomanire/60224567/
      ▪ twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/
      ▪ glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
      ▪ CAP diagram - http://blog.nahurst.com/visual-guide-to-nosql-systems
      ▪ refinery-hdr.jpg - http://www.flickr.com/photos/vermininc/2477872191/
      ▪ refinery-night.jpg - http://www.flickr.com/photos/vermininc/2485448766/