SlideShare ist ein Scribd-Unternehmen logo
1 von 74
Downloaden Sie, um offline zu lesen
SQL..
.
SQL!
SQL?
SQL
Hadoop
BI Isn’t Big Data, Big Data Isn’t BI
June, 2014
Mark Madsen
www.ThirdNature.net
@markmadsen
© Third Nature Inc.
Summary
Common uses and commodity technology
lead to
Novel practices
lead to
Different data and different technology needs
lead to
New architectures
Lead to
Common uses and commodity technology
© Third Nature Inc.
Our ideas about
information and
how it’s used are
outdated.
© Third Nature Inc.
How We Think of Users
Our design point is the
passive consumer of
information.
Proof: methodology
▪ IT role is requirements,
design, build, deploy,
administer
▪ User role is run reports
Self-serve BI is not like
picking the right doughnut
from a box.
Slide 4
© Third Nature Inc.
How We Think of Users
Our design point is the
passive consumer of
information.
Proof: methodology
▪ IT role is requirements,
design, build, deploy,
administer
▪ User role is run reports
Self-serve BI is not like
picking the right doughnut
from a box.
How We Want Users to
Think of Us
© Third Nature Inc.
How We Think of Users What Users Really Think
© Third Nature Inc.
We think of BI as publishing, an old metaphor.
Publishing has value, but may not be actionable.
© Third Nature Inc.
When you first give people access to information
that was unavailable…
OH GOD
I can see into forever
© Third Nature Inc.
After a while it becomes the new normal
© Third Nature Inc.
Planning data strategy means understanding the
context of data use so we can build infrastructure
Monitor
Analyze
Exceptions
Analyze
Causes
Decide Act
No problem No idea Do nothing
We need to focus on what people do with information
as the primary task, not on the data or the technology.
© Third Nature Inc.
General model for organizational use of data
Monitor
Analyze
Exceptions
Analyze
Causes
Decide Act
No problem No idea Do nothing
Act within the process
Usually real-time to daily
© Third Nature Inc.
Origin of BI and data warehouse concepts
The general concept of a
separate architecture for BI
has been around longer, but
this paper by Devlin and
Murphy is the first formal
data warehouse architecture
and definition published.
12
“An architecture for a business and
information system”, B. A. Devlin,
P. T. Murphy, IBM Systems Journal,
Vol.27, No. 1, (1988)
Slide 12Copyright Third Nature, Inc.
© Third Nature Inc.
Origins: in 1988 there was only big hair.
▪ No real commercial email, public internet barely started
▪ Storage state of the art: 100MB, cost $10,000/GB
▪ Oracle Applications v1 GL released; SAP goes public,
enters US market
▪ Unix is mostly run by long-haired freaks
▪ Mobile was this
This is the context: scarcity of data, of system resources, of automated
systems outside core financials, of money to pay for storage.
© Third Nature Inc.
General model for organizational use of data
Collect
new data
Monitor
Analyze
Exceptions
Analyze
Causes
Decide Act
No problem No idea Do nothing
Act on the process
Usually days/longer timeframe
Copyright Third Nature, Inc.
© Third Nature Inc.
Straight line vs closed loop causality
BI is suited for stability, analytics for adaptation
© Third Nature Inc.
You need to be able to support both paths
Collect
new data
Monitor
Analyze
Exceptions
Analyze
Causes
Decide Act
Act on the process
Act within the process
Conventional BI, addition of EDM
Causal analysis, “data science”
Copyright Third Nature, Inc.
© Third Nature Inc.
The usage models for conventional BI
Collect
new data
Monitor
Analyze
Exceptions
Analyze
Causes
Decide Act
No problem No idea Do nothing
Act on the process
Usually days/longer timeframe
Act within the process
Usually real-time to daily
This is what we’ve been
doing with BI so far: static
reporting, dashboards,
ad-hoc query, OLAP
Copyright Third Nature, Inc.
© Third Nature Inc.
The usage models for analytics and “big data”
Collect
new data
Monitor
Analyze
Exceptions
Analyze
Causes
Decide Act
No problem No idea Do nothing
Act on the process
Usually days/longer timeframe
Act within the process
Usually real-time to daily
Analytics and big data is
focused on new use
cases: deeper analysis,
causes, prediction,
optimizing decisions
This isn’t ad-hoc,
reporting, or OLAP.
Copyright Third Nature, Inc.
© Third Nature Inc.
Simple Complicated Complex
Assumption: Order Assumption: Unorder Assumption: Disorder
Cause and effect is repeatable
& predictable
Cause and effect is separated
in time & space, repeatable,
learnable
Cause and effect is coherent
in retrospect only, modelable
but changing
Known Knowable Unpredictable
Standard processes, clear
metrics, best practice
Analytical techniques to
determine options, effects
Experiment to create possible
options
Sense, categorize, respond Sense, analyze, respond Test, sense, respond
Like a bike Like a car Like economic systems
Copyright Third Nature, Inc.
Over time, processes are (usually) better understood,
so there is a movement of decisions from right to left,
where support is more automated.
© Third Nature Inc.
As practices evolve based on new capabilities…
A new level of
complexity
develops over
top of the
older, now
better
understood
processes,
leading to new
data and
analysis needs.
© Third Nature Inc.
Technology captures observations. These change our
understanding. New understanding changes practices.
Practices drive changes to technology, needing more data
Complexity will continue to increase
© Third Nature Inc.
I never said the
“E” in EDW meant
“everything”…
What do you
mean, “Just
doughnuts?”
© Third Nature Inc.
It’s going to get a lot worse
Not E
E
Conclusion: any methodology built on the premise that you
must know and model all the data first is untenable
© Third Nature Inc.
“Big data is unprecedented.”
- Anyone involved with big data in even the
most barely perceptible way
© Third Nature Inc.
We’ve been here before
Source: Bill Schmarzo, EMC
© Third Nature Inc.
“Big” is well supported by databases now
Source:Noumenal,Inc.
© Third Nature Inc.
Orders of magnitude: 20 years ago TB, today PB
Shifts in data availability by orders of magnitude
necessitate new means of managing and using it.
© Third Nature Inc.
Analytics embiggens the data volume problem
Many of the processing problems are O(n2) or worse, so
moderate data can be a problem for DB-based platforms
© Third Nature Inc.
Much of the big data value comes from analytics
BI is a retrieval problem, not a computational problem.
Five basic things you can do with analytics
▪Prediction – what is most likely to happen?
▪Estimation – what’s the future value of a variable?
▪Description – what relationships exist in the data?
▪Simulation – what could happen?
▪Prescription – what should you do?
Slide 29
Copyright Third Nature, Inc.
Copyright Third Nature, Inc.
© Third Nature Inc.
Most people do not need special technologyNumberofpeople
The distribution of data
size is about normal, yet
these guys set the tone of
the market today.
Bigness of data
Copyright Third Nature, Inc.
© Third Nature Inc.
Analytics: This is really raw data under storageNumberofjobs
Microsoft study of 174,000 analytic
jobs in their cluster: median size ???
Bigness of data
Copyright Third Nature, Inc.
© Third Nature Inc.
Working data for analytics most often not bigNumberofjobs
14 GB
Smallness of data
Copyright Third Nature, Inc.
© Third Nature Inc.
A Simple Division of the Problem SpaceComputation
LittleLots
Data volume
Little Lots
Big analytics, little data
Specialized computing,
modeling problems:
supercomputing, GPUs
Big analytics, big data
Complex math over large
data volumes requires
shared nothing architectures
Little analytics, little data
The entry point; SAS, SMP
databases, even OLAP
can work
Little analytics, big data
The BI/DW space, for the
most part
© Third Nature Inc.© Third Nature Inc.
What makes the actual data “big”?
Very large amounts
Hierarchical structures
Nested structures
Encoded values
Non-standard (for a database)
types
Deep structure
Human authored text
“big” is better off being defined as “complex” or “hard to manage”
Copyright Third Nature, Inc.
© Third Nature Inc.
Three kinds of measurement data we collect
The convenient data is
transactional data.
▪ Goes in the DW and is used, even
if it isn’t the right measurement.
The inconvenient data is
observational data.
▪ It’s not neat, clean, or designed
into most systems of operation.
The difficult and misleading data
is declarative data.
▪ What people say and what they
do require ground truth.
We need to make use of all three.
Copyright Third Nature, Inc.
© Third Nature Inc.
“big data” vs. transactions
The classic example of “structured data”
Transaction data includes:
▪ quantification details (date, value, count)
▪ reference data for explanation (product,
customer, account)
▪ Lots of meaningful information
Reference data is usually shared across the
organization, hence its importance. There
are two parts:
▪ identifier to uniquely identify the subject
▪ descriptive attributes with common or
standardized value domains
Transaction details
Reference data
© Third Nature Inc.
Today it’s different data: observations, not transactions
Sensor data doesn’t fit well with current methods of collection and
storage, or with the technology to process and analyze it.
Copyright Third Nature, Inc.
© Third Nature Inc.
Big data as a type of data: Transactions vs. Events
Transactions:
▪ Each one is valuable
▪ Mutable
▪ The elements of a transaction can be aggregated easily
▪ A set of transactions does not usually have important ordering
or dependency
Events:
▪ A single event often has no value, e.g. what is the value of one
click in a series? Some events are extremely valuable, but this
is only detectable within the context of other events.
▪ Elements of events are often not easily aggregated
▪ A set of events usually has a natural order and dependencies
▪ Immutable
© Third Nature Inc.
Example “big data”: Web tracking data
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL
https://www.phisherking.com/gifts/store/LogonForm?mmc=
link-src-email-_-m100109-_-44IOJ1-_-shop&langId=-
1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Google.com
REFERRAL_URL
http://www.google.com/search?sourceid=navclient&aq=0h&
oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italia
n+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP-BY-HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME
MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS
NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
© Third Nature Inc.
Web tracking data has a nested structure
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL
https://www.phisherking.com/gifts/store/LogonForm?mmc=
link-src-email-_-m100109-_-44IOJ1-_-shop&langId=-
1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Direct
REFERRAL_URL -
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP-BY-HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME
MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS
NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
“unstructured” data
embedded in the
logged message:
complex strings
© Third Nature Inc.
The missing ingredient from most big data
© Third Nature Inc.
The creation and flow of data is different for
observations, interactions and declarations
Data entry Extract Cleanse Load Use
Data
Generation
Store
Store
Use
Use
The process for most human-entered data
The process for much machine-generated data
Cleanse
Copyright Third Nature, Inc.
We collect large volumes of text, a rare practice
ten years ago. Today we can turn text into data.
Categories,
taxonomies
Topics, genres,
relationships,
abstracts
Sentiment, tone,
opinion
Words & counts,
keywords, tags
Entities
people, places,
things, events, IDs
Copyright Third Nature, Inc.
© Third Nature Inc.
You can store this data in an RDBMS, but…
Example data: Twitter Message API Payload
Looks like:
This is really just a record format
much like a DB row.
Datetime, userID, name, location,
description, message, message
metadata, etc.
But it’s In json or xml.
© Third Nature Inc.
@markmadsen Check out: From #MongoDB to #Cassandra:
Why The Atlas Platform Is Migrating http://owl.li/cvxFK
A tweet has lots of fields, but one important one
The payload is free text but has other elements:
From these things you likely want to generate or link to
reference data.
‘To’ username Hashtag HashtagURL
© Third Nature Inc.© Third Nature Inc.
Internal payload elements form a new graph
The @elements point to
other records and create a
deeply linked structure.
You have to assemble the
linked structure to see
what’s really there, which
means repeated scanning
some/all of the data.
The derived pattern is
interesting data,
sometimes more than the
individual messages.
© Third Nature Inc.© Third Nature Inc.
There are many patterns in the data
Follower / following networks are easy – they are explicit
and independent of the events.
Community detection requires looking at patterns of @
communication in addition to follow relationships.
What do you do with these after discovery?
Follower network Conversational communities
© Third Nature Inc.
More data: patterns emerge from lots of event data
Patterns emerge from
the underlying structure
of the entire dataset.
The patterns are more
interesting than sums
and counts of the events.
Web paths: clicks in a
session as network node
traversal.
Email: traffic analysis
producing a network
The event stream is a source for analysis, generating
another set of data that is the source for different analysis.
© Third Nature Inc.
Big changes for data warehousing workloads
The results of analytic
processing can, often do,
feed back into the
system from which they
originate.
Much of the data is being
read, written and
processed in real time.
Our design point was not
changing tables and
ephemeral patterns.
Unstructured is Not Really Unstructured
Slide 51
Unstructured data isn’t
really unstructured:
language has structure.
Text can contain traditional
structured data elements.
The problem is that the
content is unmodeled.
© Third Nature Inc. Slide 52
THE BIG CHANGE ISN’T
TECHNOLOGY, IT’S ARCHITECTURE
© Third Nature Inc.
There are really three workloads to consider, not two
1. Operational: OLTP systems
2. Analytic: OLAP systems
3. Scientific: Computational systems
Unit of focus:
1. Transaction
2. Query
3. Computation
Different problems require different platforms
© Third Nature Inc.
The geography has been redefined
The box we created:
• not any data, rigidly typed data
• not any form, tabular rows and
columns of typed data
• not any latency, persist what the
DB can keep up with
• not any process, only queries
The digital world was diminished
to only what’s inside the box until
we forgot the box was there.
© Third Nature Inc.
Layered data architecture
The DW assumed a single flat
model of data, DB in the center.
New technology enables new
ways to organize data:
▪ Raw – straight from the source
▪ Enhanced –cleaned, standardized
▪ Integrated – modeled,
augmented, ~semi-persistent
▪ Derived – analytic output,
pattern based sets, ephemeral
Implies a new technology architecture
and data modeling approaches.
© Third Nature Inc.
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
© Third Nature Inc.
Data infrastructure is a platform
▪ Any data – structures, forms
▪ Any latency –in motion, at rest
▪ Any process – query, algorithm, transformation
▪ Any access – SQL, API, queue, file movement
© Third Nature Inc.
The evolution of BI is to a data platform, which means
separating application from infrastructure.
Derived data
Raw data
Infrastructure layer:
Process and analyze
Store and manage
Application layer:
Deliver and use
The new model also encompasses data at rest and data in motion
Multiple access methods
Enhanced
data
Multiple ingest methods
BI, data extracts,
analytics, applications
The platform has to do more than serve queries; it has to be read-write.
© Third Nature Inc.© Third Nature Inc.
IT reality is multiple data stores and systems
Separate, purpose-built databases and processing systems for
different types of data and query / computing workloads, plus
any access method, is the new norm for information delivery.
BI, Reporting,
Dashboards, apps
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
Data
Warehouse
Databases Documents Flat Files XML Queues ERP Applications
Source Environments
Data
processing Stream
processing
© Third Nature Inc.
Away from “one throat to choke”, back to best of breed
“The extremely specialized
nature of mass production
raises the costs of product
change and therefore slows
down innovation.”
- Abernathy, 1978
Tight coupling leads to slow
changes.
In a rapidly evolving market
componentized architectures,
modularity and loose coupling
are favorable over monolithic
stacks, single-vendor
architectures and tight
coupling.
© Third Nature Inc.
Staff and skills are a problem in a build market
@BigDataBorat: Give man Hadoop
cluster he gain insight for a day. Teach
man build Hadoop cluster he soon
leave for better job #bigdata
© Third Nature Inc.
Technology Adoption
Some people can’t resist
getting the next new thing
because it’s new and new is
always better.
Many IT organizations are like
this, promoting a solution and
hunting for the problem that
matches it.
Better to ask “What is the
problem for which this
technology is the answer?”
Copyright Third Nature, Inc.
© Third Nature Inc.
Four core capabilities big data technologies add
1. Unlimited scale of storage, processing
▪ Agility, faster turnaround for new data requests (but not a
replacement for BI)
▪ Fewer staff to accomplish same goals
2. New data accessibility
▪ More data retained for longer period
▪ Access to data unused due to cost or processing limits
▪ Any digital information becomes usable data
3. Scalable realtime processing
▪ Brings ability to monitor and act on data as events occur
4. Arbitrary analytics
▪ Faster analysis
▪ Deeper analysis
▪ More broadly accessible analytics
© Third Nature Inc.
An important Hadoop + cloud computing benefit
Scalability is free – if your task requires 10 units of
work, you can decide when you want results:
10 servers, 1 unit of time
Cost is the same. Not true of the conventional IT model
Time
1 server, 10 units of time
X X
© Third Nature Inc.
Hadoop: a summary of the magic
1. Provides both storage and complex processing as part
of the same platform
2. Makes parallel programming more accessible
3. Schemaless and file-based, therefore flexible
4. Inexpensive, reliable scale-out
5. Potential for fast, scalable ingest
6. Low-cost
© Third Nature Inc.
As a technology moves from emerging to commodity the
nature of acquiring, using and managing it changes
Generate
options
Innovation
Novel practice
Maximize value
Maturation
Constrain
choices
Adaptation
Good practice
Optimize
Standardize /
minimize choice
Acquisition
Best practice
Minimize costs
SaturationInnovation
Copyright Third Nature, Inc.
Agile & open
source* methods
6 Sigma & process
methods
© Third Nature Inc.
Today: repeating the experience of the 80s & 90s
This is the turbulent
phase of the market
as it goes through
rapid development,
then product and
service changes.
Copyright Third Nature, Inc.
The Internet combined with commodity computing is forcing a new
business and IT structural evolution, already underway.
Maturation SaturationInnovation
© Third Nature Inc.
How we develop best practices: survival bias
We don’t need best practices, we need worst failures.Copyright Third Nature, Inc.
© Third Nature Inc.
TANSTAAFL
Technologies are not
perfect replacements for
one another.
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t
see them for a long time.
© Third Nature Inc.
Questions?“When a new technology rolls over you, you're either part of
the steamroller or part of the road.” – Stewart Brand
© Third Nature Inc.
Welcome to the big data revolution, more of an evolution
Be pragmatic, not dogmatic
© Third Nature Inc.
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
acorn_blue.jpg - http://www.flickr.com/photos/rogersmith/314324893/
wheat_field.jpg - http://www.flickr.com/photos/ecstaticist/1120119742/
Phone dump - Richard Barnes
ponies in field.jpg - http://www.flickr.com/photos/bulle_de/352732514/
straw men.jpg - http://www.flickr.com/photos/robinellis/6034919721/
text composition - http://flickr.com/photos/candiedwomanire/60224567/
girl on cell tokyo .jpg - http://flickr.com/photos/8024992@N06/986538717/
hamadan people mosaic.jpg - http://flickr.com/photos/hamed/225868856/
twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/
klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
subway dc metro - http://flickr.com/photos/musaeum/509899161/
About the Presenter
Mark Madsen is president of Third
Nature, a technology research and
consulting firm focused on business
intelligence, data integration and data
management. Mark is an award-winning
author, architect and CTO whose work
has been featured in numerous industry
publications. Over the past ten years
Mark received awards for his work from
the American Productivity & Quality
Center, TDWI, and the Smithsonian
Institute. He is an international speaker,
a contributor to Forbes Online and on
the O’Reilly Strata program committee.
For more information or to contact
Mark, follow @markmadsen on Twitter
or visit http://ThirdNature.net
© Third Nature Inc.
About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in analytics, business intelligence,
information strategy and data management. If your question is related to
data, analytics, information strategy and technology infrastructure then
you‘re at the right place.
Our goal is to help organizations solve problems using data. We offer
education, consulting and research services to support business and IT
organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating technology and hw it is
applied rather than vendor market positions.

Weitere ähnliche Inhalte

Andere mochten auch

Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Evolution Of The Computers
Evolution Of The ComputersEvolution Of The Computers
Evolution Of The Computerspanitiaict
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Software Architecture and Design - An Overview
Software Architecture and Design - An OverviewSoftware Architecture and Design - An Overview
Software Architecture and Design - An OverviewOliver Stadie
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBernard Marr
 
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Svetlin Nakov
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingBig Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingHealth Catalyst
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
Big Data Analytics in Energy & Utilities
Big Data Analytics in Energy & UtilitiesBig Data Analytics in Energy & Utilities
Big Data Analytics in Energy & UtilitiesAnders Quitzau
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Andere mochten auch (18)

Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Evolution Of The Computers
Evolution Of The ComputersEvolution Of The Computers
Evolution Of The Computers
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
Layered Software Architecture
Layered Software ArchitectureLayered Software Architecture
Layered Software Architecture
 
Software Architecture and Design - An Overview
Software Architecture and Design - An OverviewSoftware Architecture and Design - An Overview
Software Architecture and Design - An Overview
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must Know
 
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingBig Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big Data Analytics in Energy & Utilities
Big Data Analytics in Energy & UtilitiesBig Data Analytics in Energy & Utilities
Big Data Analytics in Energy & Utilities
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Mehr von mark madsen

Data Architecture: OMG It’s Made of People
Data Architecture: OMG It’s Made of PeopleData Architecture: OMG It’s Made of People
Data Architecture: OMG It’s Made of Peoplemark madsen
 
Solve User Problems: Data Architecture for Humans
Solve User Problems: Data Architecture for HumansSolve User Problems: Data Architecture for Humans
Solve User Problems: Data Architecture for Humansmark madsen
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Managementmark madsen
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprisemark madsen
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019mark madsen
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)mark madsen
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018mark madsen
 
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Rangemark madsen
 
How to understand trends in the data & software market
How to understand trends in the data & software marketHow to understand trends in the data & software market
How to understand trends in the data & software marketmark madsen
 
Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...mark madsen
 
Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesmark madsen
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehousemark madsen
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customersmark madsen
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?mark madsen
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionmark madsen
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecturemark madsen
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsmark madsen
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except usmark madsen
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)mark madsen
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)mark madsen
 

Mehr von mark madsen (20)

Data Architecture: OMG It’s Made of People
Data Architecture: OMG It’s Made of PeopleData Architecture: OMG It’s Made of People
Data Architecture: OMG It’s Made of People
 
Solve User Problems: Data Architecture for Humans
Solve User Problems: Data Architecture for HumansSolve User Problems: Data Architecture for Humans
Solve User Problems: Data Architecture for Humans
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprise
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
 
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
 
How to understand trends in the data & software market
How to understand trends in the data & software marketHow to understand trends in the data & software market
How to understand trends in the data & software market
 
Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...
 
Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehouse
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customers
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collection
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analytics
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)
 

Kürzlich hochgeladen

The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 

Kürzlich hochgeladen (17)

The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 

Bi isn't big data and big data isn't BI

  • 1. SQL.. . SQL! SQL? SQL Hadoop BI Isn’t Big Data, Big Data Isn’t BI June, 2014 Mark Madsen www.ThirdNature.net @markmadsen
  • 2. © Third Nature Inc. Summary Common uses and commodity technology lead to Novel practices lead to Different data and different technology needs lead to New architectures Lead to Common uses and commodity technology
  • 3. © Third Nature Inc. Our ideas about information and how it’s used are outdated.
  • 4. © Third Nature Inc. How We Think of Users Our design point is the passive consumer of information. Proof: methodology ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports Self-serve BI is not like picking the right doughnut from a box. Slide 4
  • 5. © Third Nature Inc. How We Think of Users Our design point is the passive consumer of information. Proof: methodology ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports Self-serve BI is not like picking the right doughnut from a box. How We Want Users to Think of Us
  • 6. © Third Nature Inc. How We Think of Users What Users Really Think
  • 7. © Third Nature Inc. We think of BI as publishing, an old metaphor. Publishing has value, but may not be actionable.
  • 8. © Third Nature Inc. When you first give people access to information that was unavailable… OH GOD I can see into forever
  • 9. © Third Nature Inc. After a while it becomes the new normal
  • 10. © Third Nature Inc. Planning data strategy means understanding the context of data use so we can build infrastructure Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing We need to focus on what people do with information as the primary task, not on the data or the technology.
  • 11. © Third Nature Inc. General model for organizational use of data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act within the process Usually real-time to daily
  • 12. © Third Nature Inc. Origin of BI and data warehouse concepts The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published. 12 “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol.27, No. 1, (1988) Slide 12Copyright Third Nature, Inc.
  • 13. © Third Nature Inc. Origins: in 1988 there was only big hair. ▪ No real commercial email, public internet barely started ▪ Storage state of the art: 100MB, cost $10,000/GB ▪ Oracle Applications v1 GL released; SAP goes public, enters US market ▪ Unix is mostly run by long-haired freaks ▪ Mobile was this This is the context: scarcity of data, of system resources, of automated systems outside core financials, of money to pay for storage.
  • 14. © Third Nature Inc. General model for organizational use of data Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act on the process Usually days/longer timeframe Copyright Third Nature, Inc.
  • 15. © Third Nature Inc. Straight line vs closed loop causality BI is suited for stability, analytics for adaptation
  • 16. © Third Nature Inc. You need to be able to support both paths Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act Act on the process Act within the process Conventional BI, addition of EDM Causal analysis, “data science” Copyright Third Nature, Inc.
  • 17. © Third Nature Inc. The usage models for conventional BI Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act on the process Usually days/longer timeframe Act within the process Usually real-time to daily This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP Copyright Third Nature, Inc.
  • 18. © Third Nature Inc. The usage models for analytics and “big data” Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act on the process Usually days/longer timeframe Act within the process Usually real-time to daily Analytics and big data is focused on new use cases: deeper analysis, causes, prediction, optimizing decisions This isn’t ad-hoc, reporting, or OLAP. Copyright Third Nature, Inc.
  • 19. © Third Nature Inc. Simple Complicated Complex Assumption: Order Assumption: Unorder Assumption: Disorder Cause and effect is repeatable & predictable Cause and effect is separated in time & space, repeatable, learnable Cause and effect is coherent in retrospect only, modelable but changing Known Knowable Unpredictable Standard processes, clear metrics, best practice Analytical techniques to determine options, effects Experiment to create possible options Sense, categorize, respond Sense, analyze, respond Test, sense, respond Like a bike Like a car Like economic systems Copyright Third Nature, Inc. Over time, processes are (usually) better understood, so there is a movement of decisions from right to left, where support is more automated.
  • 20. © Third Nature Inc. As practices evolve based on new capabilities… A new level of complexity develops over top of the older, now better understood processes, leading to new data and analysis needs.
  • 21. © Third Nature Inc. Technology captures observations. These change our understanding. New understanding changes practices. Practices drive changes to technology, needing more data Complexity will continue to increase
  • 22. © Third Nature Inc. I never said the “E” in EDW meant “everything”… What do you mean, “Just doughnuts?”
  • 23. © Third Nature Inc. It’s going to get a lot worse Not E E Conclusion: any methodology built on the premise that you must know and model all the data first is untenable
  • 24. © Third Nature Inc. “Big data is unprecedented.” - Anyone involved with big data in even the most barely perceptible way
  • 25. © Third Nature Inc. We’ve been here before Source: Bill Schmarzo, EMC
  • 26. © Third Nature Inc. “Big” is well supported by databases now Source:Noumenal,Inc.
  • 27. © Third Nature Inc. Orders of magnitude: 20 years ago TB, today PB Shifts in data availability by orders of magnitude necessitate new means of managing and using it.
  • 28. © Third Nature Inc. Analytics embiggens the data volume problem Many of the processing problems are O(n2) or worse, so moderate data can be a problem for DB-based platforms
  • 29. © Third Nature Inc. Much of the big data value comes from analytics BI is a retrieval problem, not a computational problem. Five basic things you can do with analytics ▪Prediction – what is most likely to happen? ▪Estimation – what’s the future value of a variable? ▪Description – what relationships exist in the data? ▪Simulation – what could happen? ▪Prescription – what should you do? Slide 29 Copyright Third Nature, Inc. Copyright Third Nature, Inc.
  • 30. © Third Nature Inc. Most people do not need special technologyNumberofpeople The distribution of data size is about normal, yet these guys set the tone of the market today. Bigness of data Copyright Third Nature, Inc.
  • 31. © Third Nature Inc. Analytics: This is really raw data under storageNumberofjobs Microsoft study of 174,000 analytic jobs in their cluster: median size ??? Bigness of data Copyright Third Nature, Inc.
  • 32. © Third Nature Inc. Working data for analytics most often not bigNumberofjobs 14 GB Smallness of data Copyright Third Nature, Inc.
  • 33. © Third Nature Inc. A Simple Division of the Problem SpaceComputation LittleLots Data volume Little Lots Big analytics, little data Specialized computing, modeling problems: supercomputing, GPUs Big analytics, big data Complex math over large data volumes requires shared nothing architectures Little analytics, little data The entry point; SAS, SMP databases, even OLAP can work Little analytics, big data The BI/DW space, for the most part
  • 34. © Third Nature Inc.© Third Nature Inc. What makes the actual data “big”? Very large amounts Hierarchical structures Nested structures Encoded values Non-standard (for a database) types Deep structure Human authored text “big” is better off being defined as “complex” or “hard to manage” Copyright Third Nature, Inc.
  • 35. © Third Nature Inc. Three kinds of measurement data we collect The convenient data is transactional data. ▪ Goes in the DW and is used, even if it isn’t the right measurement. The inconvenient data is observational data. ▪ It’s not neat, clean, or designed into most systems of operation. The difficult and misleading data is declarative data. ▪ What people say and what they do require ground truth. We need to make use of all three. Copyright Third Nature, Inc.
  • 36. © Third Nature Inc. “big data” vs. transactions The classic example of “structured data” Transaction data includes: ▪ quantification details (date, value, count) ▪ reference data for explanation (product, customer, account) ▪ Lots of meaningful information Reference data is usually shared across the organization, hence its importance. There are two parts: ▪ identifier to uniquely identify the subject ▪ descriptive attributes with common or standardized value domains Transaction details Reference data
  • 37. © Third Nature Inc. Today it’s different data: observations, not transactions Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it. Copyright Third Nature, Inc.
  • 38. © Third Nature Inc. Big data as a type of data: Transactions vs. Events Transactions: ▪ Each one is valuable ▪ Mutable ▪ The elements of a transaction can be aggregated easily ▪ A set of transactions does not usually have important ordering or dependency Events: ▪ A single event often has no value, e.g. what is the value of one click in a series? Some events are extremely valuable, but this is only detectable within the context of other events. ▪ Elements of events are often not easily aggregated ▪ A set of events usually has a natural order and dependencies ▪ Immutable
  • 39. © Third Nature Inc. Example “big data”: Web tracking data USER_ID 301212631165031 SESSION_ID 590387153892659 VISIT_DATE 1/10/2010 0:00 SESSION_START_DATE 1:41:44 AM PAGE_VIEW_DATE 1/10/2010 9:59 DESTINATION_URL https://www.phisherking.com/gifts/store/LogonForm?mmc= link-src-email-_-m100109-_-44IOJ1-_-shop&langId=- 1&storeId=1055&URL=BECGiftListItemDisplay REFERRAL_NAME Google.com REFERRAL_URL http://www.google.com/search?sourceid=navclient&aq=0h& oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italia n+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s PAGE_ID PROD_24259_CARD REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE SITE_LOCATION_ID SHOP-BY-HOLIDAY VALENTINES DAY IP_ADDRESS 67.189.110.179 BROWSER_OS_NAME MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
  • 40. © Third Nature Inc. Web tracking data has a nested structure USER_ID 301212631165031 SESSION_ID 590387153892659 VISIT_DATE 1/10/2010 0:00 SESSION_START_DATE 1:41:44 AM PAGE_VIEW_DATE 1/10/2010 9:59 DESTINATION_URL https://www.phisherking.com/gifts/store/LogonForm?mmc= link-src-email-_-m100109-_-44IOJ1-_-shop&langId=- 1&storeId=1055&URL=BECGiftListItemDisplay REFERRAL_NAME Direct REFERRAL_URL - PAGE_ID PROD_24259_CARD REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE SITE_LOCATION_ID SHOP-BY-HOLIDAY VALENTINES DAY IP_ADDRESS 67.189.110.179 BROWSER_OS_NAME MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322) “unstructured” data embedded in the logged message: complex strings
  • 41. © Third Nature Inc. The missing ingredient from most big data
  • 42. © Third Nature Inc. The creation and flow of data is different for observations, interactions and declarations Data entry Extract Cleanse Load Use Data Generation Store Store Use Use The process for most human-entered data The process for much machine-generated data Cleanse Copyright Third Nature, Inc.
  • 43. We collect large volumes of text, a rare practice ten years ago. Today we can turn text into data. Categories, taxonomies Topics, genres, relationships, abstracts Sentiment, tone, opinion Words & counts, keywords, tags Entities people, places, things, events, IDs Copyright Third Nature, Inc.
  • 44. © Third Nature Inc. You can store this data in an RDBMS, but…
  • 45. Example data: Twitter Message API Payload Looks like: This is really just a record format much like a DB row. Datetime, userID, name, location, description, message, message metadata, etc. But it’s In json or xml.
  • 46. © Third Nature Inc. @markmadsen Check out: From #MongoDB to #Cassandra: Why The Atlas Platform Is Migrating http://owl.li/cvxFK A tweet has lots of fields, but one important one The payload is free text but has other elements: From these things you likely want to generate or link to reference data. ‘To’ username Hashtag HashtagURL
  • 47. © Third Nature Inc.© Third Nature Inc. Internal payload elements form a new graph The @elements point to other records and create a deeply linked structure. You have to assemble the linked structure to see what’s really there, which means repeated scanning some/all of the data. The derived pattern is interesting data, sometimes more than the individual messages.
  • 48. © Third Nature Inc.© Third Nature Inc. There are many patterns in the data Follower / following networks are easy – they are explicit and independent of the events. Community detection requires looking at patterns of @ communication in addition to follow relationships. What do you do with these after discovery? Follower network Conversational communities
  • 49. © Third Nature Inc. More data: patterns emerge from lots of event data Patterns emerge from the underlying structure of the entire dataset. The patterns are more interesting than sums and counts of the events. Web paths: clicks in a session as network node traversal. Email: traffic analysis producing a network The event stream is a source for analysis, generating another set of data that is the source for different analysis.
  • 50. © Third Nature Inc. Big changes for data warehousing workloads The results of analytic processing can, often do, feed back into the system from which they originate. Much of the data is being read, written and processed in real time. Our design point was not changing tables and ephemeral patterns.
  • 51. Unstructured is Not Really Unstructured Slide 51 Unstructured data isn’t really unstructured: language has structure. Text can contain traditional structured data elements. The problem is that the content is unmodeled.
  • 52. © Third Nature Inc. Slide 52 THE BIG CHANGE ISN’T TECHNOLOGY, IT’S ARCHITECTURE
  • 53. © Third Nature Inc. There are really three workloads to consider, not two 1. Operational: OLTP systems 2. Analytic: OLAP systems 3. Scientific: Computational systems Unit of focus: 1. Transaction 2. Query 3. Computation Different problems require different platforms
  • 54. © Third Nature Inc. The geography has been redefined The box we created: • not any data, rigidly typed data • not any form, tabular rows and columns of typed data • not any latency, persist what the DB can keep up with • not any process, only queries The digital world was diminished to only what’s inside the box until we forgot the box was there.
  • 55. © Third Nature Inc. Layered data architecture The DW assumed a single flat model of data, DB in the center. New technology enables new ways to organize data: ▪ Raw – straight from the source ▪ Enhanced –cleaned, standardized ▪ Integrated – modeled, augmented, ~semi-persistent ▪ Derived – analytic output, pattern based sets, ephemeral Implies a new technology architecture and data modeling approaches.
  • 56. © Third Nature Inc. Food supply chain: an analogy for data Multiple contexts of use, differing quality levels
  • 57. © Third Nature Inc. Data infrastructure is a platform ▪ Any data – structures, forms ▪ Any latency –in motion, at rest ▪ Any process – query, algorithm, transformation ▪ Any access – SQL, API, queue, file movement
  • 58. © Third Nature Inc. The evolution of BI is to a data platform, which means separating application from infrastructure. Derived data Raw data Infrastructure layer: Process and analyze Store and manage Application layer: Deliver and use The new model also encompasses data at rest and data in motion Multiple access methods Enhanced data Multiple ingest methods BI, data extracts, analytics, applications The platform has to do more than serve queries; it has to be read-write.
  • 59. © Third Nature Inc.© Third Nature Inc. IT reality is multiple data stores and systems Separate, purpose-built databases and processing systems for different types of data and query / computing workloads, plus any access method, is the new norm for information delivery. BI, Reporting, Dashboards, apps 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA 1 MargeInovera $150,000 Statistician 2 AnitaBath $120,000 Sewerinspector 3 IvanAwfulitch $160,000 Dermatologist 4 NadiaGeddit $36,000 DBA Data Warehouse Databases Documents Flat Files XML Queues ERP Applications Source Environments Data processing Stream processing
  • 60. © Third Nature Inc. Away from “one throat to choke”, back to best of breed “The extremely specialized nature of mass production raises the costs of product change and therefore slows down innovation.” - Abernathy, 1978 Tight coupling leads to slow changes. In a rapidly evolving market componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.
  • 61. © Third Nature Inc. Staff and skills are a problem in a build market @BigDataBorat: Give man Hadoop cluster he gain insight for a day. Teach man build Hadoop cluster he soon leave for better job #bigdata
  • 62. © Third Nature Inc. Technology Adoption Some people can’t resist getting the next new thing because it’s new and new is always better. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it. Better to ask “What is the problem for which this technology is the answer?” Copyright Third Nature, Inc.
  • 63. © Third Nature Inc. Four core capabilities big data technologies add 1. Unlimited scale of storage, processing ▪ Agility, faster turnaround for new data requests (but not a replacement for BI) ▪ Fewer staff to accomplish same goals 2. New data accessibility ▪ More data retained for longer period ▪ Access to data unused due to cost or processing limits ▪ Any digital information becomes usable data 3. Scalable realtime processing ▪ Brings ability to monitor and act on data as events occur 4. Arbitrary analytics ▪ Faster analysis ▪ Deeper analysis ▪ More broadly accessible analytics
  • 64. © Third Nature Inc. An important Hadoop + cloud computing benefit Scalability is free – if your task requires 10 units of work, you can decide when you want results: 10 servers, 1 unit of time Cost is the same. Not true of the conventional IT model Time 1 server, 10 units of time X X
  • 65. © Third Nature Inc. Hadoop: a summary of the magic 1. Provides both storage and complex processing as part of the same platform 2. Makes parallel programming more accessible 3. Schemaless and file-based, therefore flexible 4. Inexpensive, reliable scale-out 5. Potential for fast, scalable ingest 6. Low-cost
  • 66. © Third Nature Inc. As a technology moves from emerging to commodity the nature of acquiring, using and managing it changes Generate options Innovation Novel practice Maximize value Maturation Constrain choices Adaptation Good practice Optimize Standardize / minimize choice Acquisition Best practice Minimize costs SaturationInnovation Copyright Third Nature, Inc. Agile & open source* methods 6 Sigma & process methods
  • 67. © Third Nature Inc. Today: repeating the experience of the 80s & 90s This is the turbulent phase of the market as it goes through rapid development, then product and service changes. Copyright Third Nature, Inc. The Internet combined with commodity computing is forcing a new business and IT structural evolution, already underway. Maturation SaturationInnovation
  • 68. © Third Nature Inc. How we develop best practices: survival bias We don’t need best practices, we need worst failures.Copyright Third Nature, Inc.
  • 69. © Third Nature Inc. TANSTAAFL Technologies are not perfect replacements for one another. When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.
  • 70. © Third Nature Inc. Questions?“When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand
  • 71. © Third Nature Inc. Welcome to the big data revolution, more of an evolution Be pragmatic, not dogmatic
  • 72. © Third Nature Inc. CC Image Attributions Thanks to the people who supplied the creative commons licensed images used in this presentation: acorn_blue.jpg - http://www.flickr.com/photos/rogersmith/314324893/ wheat_field.jpg - http://www.flickr.com/photos/ecstaticist/1120119742/ Phone dump - Richard Barnes ponies in field.jpg - http://www.flickr.com/photos/bulle_de/352732514/ straw men.jpg - http://www.flickr.com/photos/robinellis/6034919721/ text composition - http://flickr.com/photos/candiedwomanire/60224567/ girl on cell tokyo .jpg - http://flickr.com/photos/8024992@N06/986538717/ hamadan people mosaic.jpg - http://flickr.com/photos/hamed/225868856/ twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/ klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/ donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/ subway dc metro - http://flickr.com/photos/musaeum/509899161/
  • 73. About the Presenter Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 74. © Third Nature Inc. About Third Nature Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you‘re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and hw it is applied rather than vendor market positions.