SlideShare ist ein Scribd-Unternehmen logo
1 von 121
Downloaden Sie, um offline zu lesen
Copyright Third Nature, Inc.
Architecting a Data
Platform For
Enterprise Use
September 2018
Mark Madsen
Todd Walter
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
What do we hear?
"I want to do analytics, beyond what I can do with BI tools"
"I need an analytics strategy"
"I need an analytics roadmap"
"I want to modernize my DW" aka "I want to speed up my DW"
"I have a specific analytics project of type..."
“We have a data lake. What should we do with it?”
“Technology Y is our replacement for technology X”
“Don’t need schema, curation, ETL, governance … anymore”
“How do I ensure the data is good enough for analytics?”
Good judgement is the result of experience.
Experience is the result of bad judgement.
—Fred Brooks
4 © 2017 Gartner, Inc. and/or its affiliates. All rights reserved.
Gartner statement in 2018:
only 15% are reported
to be successful
only
17%of Hadoop deployments
are in production in 2017
Survey Analysis: BI and Analytics
Spending Intentions, 2017
A McKinsey survey this
year asked executives if
their company had
achieved a positive ROI
with their big data projects:
7% answered “yes”
Gartner Finding:
in 2017
Copyright Third Nature, Inc.
The business complaints about data and analytics
You really
should
fire IT.
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Data
deficient
Takes
too long
Costs
too much
Function
deficient
IT root
causes
IT proximate
causes
What is said
in disputes
Lack of
agility
People & vendor
cost basis
Client-server
infrastructure
Lack of
“good enough”
competency
1980s-era
methods
Inappropriate
technology
Data hygiene
fetishes
Vendor lock
1990s-era
procurement
IT skills
deficit
Dysfunctional
OLTP portfolio
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Data
deficient
Takes
too long
Costs
too much
Function
deficient
IT root
causes
IT proximate
causes
What is said
in disputes
How business
responds
How IT
responsds
Lack of
agility
People & vendor
cost basis
Client-server
infrastructure
Lack of
“good enough”
competency
SaaS /
Cloud BI
Consultants
Self-service
BI
Shadow BI
Workarounds &
spreadmarts
Hidden costs
Loss of
control
Loss of
visibility
Loss of
knowledge
Security risks
Bad data risks
1980s-era
methods
Inappropriate
technology
Data hygiene
fetishes
Vendor lock
1990s-era
procurement
IT skills
deficit
Dysfunctional
OLTP portfolio
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Data
deficient
Takes
too long
Costs
too much
Function
deficient
IT root
causes
IT proximate
causes
What is said
in disputes
How business
responds
How IT
responds
Lack of
agility
People & vendor
cost basis
Client-server
infrastructure
Lack of
“good enough”
competency
SaaS /
Cloud BI
Consultants
Self-service
BI
Shadow BI
Workarounds &
spreadmarts
Hidden costs
Loss of
control
Loss of
visibility
Loss of
knowledge
Security risks
Bad data risks
1980s-era
methods
Inappropriate
technology
Data hygiene
fetishes
Vendor lock
1990s-era
procurement
IT skills
deficit
Dysfunctional
OLTP portfolio
This is not a technology problem –
it is an architecture problem.
Copyright Third Nature, Inc.
DW: Centralize, that solves all problems!
Creates bottlenecks
Causes scale problems
Availability?
Copyright Third Nature, Inc.
The data lake solution: no central authority
wtf, it was fully
operational!
Copyright Third Nature, Inc.
The data lake solution?
There’s a problem: as
the lake is envisioned,
it is still a centralized
data architecture, but
this time there is no
single global model.
Instead it’s files and
not modeled. It can be
operational while
under construction.
It’s still a death star.
Copyright Third Nature, Inc.
Eventually we run into the same problems
Seriously, wtf?
It was agile
and operational
Rising complexity and scale break centralized models
Copyright Third Nature, Inc.
“The problem is the tools. Standardize on one tech!”
It’s called stack think. Pick your vendor. Cede all architecture.
Copyright Third Nature, Inc.
The problem is methods and process: agile your way
into the Analytics Shantytown
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
“Infrastructure one PoC at a time”
• Use case driven
• Leverage (any) new
technology
• Re-use of the
technology stack
• Data is captive to the
application or to the
technology stack
• Developers tend to
think “function first”
• Analytics people also
think “function first”,
but generally require
integrated data
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
A key aspect of operational analytics is “end-to-end”
E2E does not allow you to take a random
walk through technologies and BYOT
because the complexity of integration leads
to a mountain of technical debt.
• Most data
applications are
organized around a
process, not a task or
function.
• Real-time analytics
systems pass their
data over the
network, but the
dependencies can
cross application
boundaries, as well
as requiring
persistence.
• Developers are
being forced to think
of data outside the
local context.
Copyright Third Nature, Inc.
“Start with the platform. The rest will follow.”
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Big data reality: What users seeBig data promise: What you see
Copyright Third Nature, Inc.
Persisting data
is not the end
of the line.
If you stop
here you win
the battle and
lose the war
Copyright Third Nature, Inc.
“Begin with the end in mind”
The starting point can’t be with
technology. That’s like starting
with bricks when designing a
house. You may get lucky but…
The goals and specific uses are
the place to start
▪ Use dictates need
▪ Need dictates capabilities
▪ Capabilities are solved with
technology
This is how you avoid spending $2M on a Hadoop and spark cluster in order to serve
data to analysts whose primary requirements are met with laptops.
Copyright Third Nature, Inc.
We don’t have an analytics problem, just like we
didn’t have a BI problem
The origin of analytics as “business
intelligence” was stated well in 1958:
…the ability to apprehend the
interrelationships of presented facts in
such a way as to guide action towards a
desired goal. ~ H. P. Luhn
“A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf
”
“
Our goal is analytics as a capability, not a technology
© Third Nature Inc.
B I
© Third Nature Inc.
The old problem was access, the new problem is analysis
Copyright Third Nature, Inc.
Three constituencies
Stakeholder Analyst Builder
aka the recipient aka the data scientist aka the engineer
Copyright Third Nature, Inc.
Starting points for analytics strategy
Many organizations choose to start with
the analysts. Create a data science team.
Turn them loose to find a problem.
Many more start with builders: technology
solutions looking for problems, e.g. 65% of
the IT driven Hadoop and Spark projects
over the last five years.
The right place to start? Stakeholders. The
goal to achieve, the problem to solve.
Copyright Third Nature, Inc.
NATURE OF THE PROBLEM FROM
THE STAKEHOLDER’S PERSPECTIVE
Each constituency has their own set of problems to deal with
The myth that still drives analytics – analytic gold
All we need is a fat
pipe and pans
working in parallel…
Analytic insights that result in no action are expensive trivia.
It’s not the insight, but what you do with it, that matters
Applying analytics is not an analytics problem
Applying analytics is not in the
analyst’s control.
It’s not in the engineer’s control.
It’s in the control of the people
involved in the process.
Failures are often in execution, not
in analytics development.
For example, we saw unexpectedly
poor performance in a number of
geographies. Was it the new
analytics we tried? Was it a data
problem? No, it was a simple
compliance problem.
BI and Analysis: Repeatability and Discoverability
Discover
Explore
Analyze
Model
Consume
Promote
Focus is on repeatability
Application cycle time
90% of users just want answers
Focus is on discoverability
Analyst cycle time
9% analysts, 1% data scientists
80% of data use
is repeatable
80% of analysis
is novel
NATURE OF THE PROBLEM FROM
THE ANALYST’S PERSPECTIVE
The analytics process at a high level
Diagram: Kate Matsudaira
The nature of analytics problems is researching the
unknown rather than accessing the known.
Repeat for each new problem (unlike BI)
Diagram: Kate Matsudaira
The main hurdle: just getting the data
Do you know where to find it? Because it’s
may not be in the data warehouse.
Do you have access to it?
Is access fast enough? Because DWs are for
QRD, not for moving huge piles of data. And
ERP systems and SaaS apps are right out.
These are why developers jump on “data lake”
as the solution. But that’s a small, easy part.
Do you have the right data?
Many machine learning
techniques require labeled
(known good) training data:
Supervised learning: a person
has to define the correct
output for some portion of
the data. Data is divided into
training sets used for model
building and test sets for
validating the results.
• What is spam and what isn’t?
• What does a fraudulent
transaction look like
40
Do you have enough of the right data?
ML needs a lot, you may be disappointed in your efforts
Where do analysts spend their time? mostly data work
Slide 42
Define the
business problem
Translate the
problem into an
analytic context
Select appropriate
data
Learn the data
Create a model set
Fix problems with
data
Transform data
Build models
Assess models
Deploy models
Assess results
% of time spent
70% 30%
Source: Michael Berry, Data Miners Inc.
Where do most of the analytics tools focus?
Slide 43
Define the
business problem
Translate the
problem into an
analytic context Select
appropriate data
Learn the data
Create a model
set
Fix problems
with data
Transform data
Build models
Assess models
Deploy models
Assess results
Source: Michael Berry, Data Miners Inc.
Where do most of the analytics aaS focus?
Slide 44
Define the
business problem
Translate the
problem into an
analytic context Select
appropriate data
Learn the data
Create a model
set
Fix problems
with data
Transform data
Build models
Assess models
Deploy models
Assess results
Source: Michael Berry, Data Miners Inc.
The analyst’s workspace in BI is relatively spare
The analyst’s workspace needs to be more like a
kitchen than like BI vending machines
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Do you treat this as a production workload? Storage control,
space limitations, resource limitations, production lockdown
Analytics work is a production workload, but…
Hi, we’re from
IT, and we’re
here to help!
NATURE OF THE PROBLEM FROM
THE BUILDER’S PERSPECTIVE
IT and Ops people want to know “what to build?”
Giant data platform? Self service tools?
Analytics has different processes and workloads
None of this analytics work
is the same as what IT
considered “analysis” to be,
which is usually equated
with BI or ad-hoc query.
Ad-hoc analysis =
Exploratory data analysis =
Batch analytics =
Real-time analytics
A real analytics production workflow
Hatch, CIKM ‘11 Slide 50
Things engineering and operations worry about
Engineering time and effort
▪ Introduction of new technology, complexity
▪ Integration - Deployment of models requirements linking different types of
environments, creating supportable workflows for the analysts
▪ Ability to develop and deploy at the required speed
Supportability
▪ Automation
▪ The environment requires additional monitoring, other technology and
processes, particularly for customer-facing work
▪ Support costs (time and money)
SLAs:
▪ Availability – if analytics are tied to production operations, particularly
customer facing, this becomes important and difficult because it’s not
standard application work
▪ Performance and scalability – have to manage unpredictable workloads,
resource conflicts between model development with model execution
The world changes, do the models?
In BI you maintain ETL and
schemas, in ML you maintain
models and maybe pipelines.
“Model decay” happens as the
assumptions around which a
model is built change, e.g. spam
techniques change.
When you adjust the model you
need to know it is normal again
▪ Better save the data used to
build the model
▪ Better save the model
▪ Baseline and measurements
THREE PERSPECTIVES, ONE SOLUTION?
There are requirements from all constituents. You need to put them
together to have a complete picture of what’s needed.
The missing stakeholder
There is another stakeholder:
analytics management - the
CAO, CDO, VP of analytics, aka
“your boss” if you’re a data
scientist.
This is the perspective and
problems of the person
responsible for oversight of
the team and efforts is across
the organization and across
multiple projects
Job #1 - Repeatability
Job # 2 - Operational predictability
Job #3 - Reproducibility
They need a system of record for analytics
There is an extensive list of requirements to support
Primary requirements needed by constituents S D E
Data catalog and ability to search it for datasets X X
Self-service access to curated data X
Self-service access to uncurated (unknown, new) data X X
Temporary storage for working with data X
Data integration, cleaning, transformation, preparation tools and environment X X
Persistent storage for source data used by production models X X
Persistent storage for training, testing, production data used by models X X
Storage and management of models X X
Deployment, monitoring, decommissioning models X
Lineage, traceability of changes made for data used by models X X
Lineage, traceability for model changes X X X
Managing baseline data / metrics for comparing model performance X X X
Managing ongoing data / metrics for tracking ongoing model performance X X X
S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
© Third Nature Inc.
How did we get to this state
with BI & analytics?
There’s a difference
between having no past
and actively rejecting it.
61
100
1,000
10,000
1 2 3 4 5 6 7 8 9 10
Customer Data Plateaus
Add users,
accumulate
history, add
attributes, add
dimensions
Entirely new data
set enabled by
new technology at
new price point –
value has to
exceed cost (ROI).
Earliest adopters
have vision ahead
of ROI
62
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
1 2 3 4 5 6 7 8 9 10 11 12 13
User Adoption of New Data
Customer Data
Users - Data Set 1
Users - Data Set 2
Users of Integrated Data
User adoption of new data sets
starts over. Very small number of
experts growing to wider
audience, sophisticated users
moving to business analysts, then
business users, B2B customers and
even to consumers
Greater value is derived when
data sets are linked – see bigger
picture (eg who buys what,
when). Comes after initial
extraction of easy value from
standalone data
63
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
10,000,000,000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
Data Size Plateaus over Time
Lather, Rinse,
Repeat. Trend
line is roughly
Moore’s Law.
Delay or skip a
generation if new
data set is two
orders of
magnitude
instead of one.
64
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
10,000,000,000
100,000,000,000
1,000,000,000,000
10,000,000,000,000
100,000,000,000,000
1,000,000,000,000,000
10,000,000,000,000,000
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
Customer Data Size vs. 2x per 12mo
Customer Data
Moore's Law - 2x/18mo
2x/12 mo
But all studies show
demand/creation of
data increasing
significantly faster
than Moore’s law
If we started only 1
OOM behind, we are
now 5 OOM away from
user demand for data
Data
Exhaust
65
• <Store, Item, Week>
• <Store, Item, Day>
– Simple aggregations
• Market Basket
– Affinity
– Link to person, demographics, HR
• Inventory by SKU by store
– Temporal, time series, forecasting
– Link to product, marketing, market basket
Retail Plateaus
2B records
total for 9
quarters
2B records
per day,
keep 9
quarters
66
• Web Logs and traffic
– Behavioral patterns – eg path linked to person, offers, other channels
– Operations of the web site
• Supply chain sensors – sampled at major event
– Activity Based Costing
– Link to customer, product, HR, planning
• Social Media
– Text analysis, Filtering, languages
– Link to customer, sales, other channel interactions
• Supply chain sensors – sampled at minutes or seconds
– Telematics
– Real time, Event detection, trending, static and dynamic rules
– Link to HR, thresholds, forecasts, routing, planning
Retail Plateaus
30B records per day
67
FINANCE
Revenue
Expenses
Customers
CUSTOMER
CARE
Customer
Products
Orders
Case History
SALES
Orders
Customers
Products
MARKETING
Customers
Orders
Campaign
History
OPERATIONS
Inventory
Returns
Manufacturing
Supply Chain
How many
batteries
are in
inventory
by plant?
What is the
trend of
warranty
costs?
How many
people
made a
warranty
claim last
week?
How many
sales have
been
made
quarter to
date?
Which
customers
should get a
communica
tion on
extended
warranties?
54 32 29 49 66
68
2954 32 49 41
69
Given the rise in warranty costs, isolate the problem to be a
plant, then to a battery lot.
Communicate with affected customers, who have not made
a warranty claim on batteries, through Marketing and
Customer Service channels to recall cars with affected
batteries.
2855
Inventory
Returns
Manufacturing
Supply Chain
Customer Service
Orders
Revenue
Expenses
Case History
Customers
Products
Pipeline
Customers
Campaign
History
FINANCE
SALESMARKETING
OPERATIONS
CUSTOMER
EXPERIENCE
2855
70
Manufacturing: Data Overlap Analysis
New Business Improvement Opportunities through Data Leverage
Sales Force Profitability Analysis 100% 80% 66% 24% 41% 66% 0% 24% 0% 24%
Transportation Planning 11% 100% 87% 28% 64% 56% 22% 34% 13% 45%
Production Planning 6% 57% 100% 19% 83% 35% 17% 28% 9% 40%
Vendor Managed Inventory 12% 100% 100% 100% 78% 100% 61% 100% 39% 100%
Global Pricing Rationalization 4% 50% 100% 17% 100% 27% 15% 29% 11% 41%
Fulfillment (Perfect Order) 16% 94% 90% 48% 58% 100% 35% 53% 19% 56%
Manufacturing Quality
Optimization
0% 76% 88% 60% 66% 71% 100% 79% 43% 73%
Preventative Maintenance Analysis 7% 75% 93% 61% 80% 68% 50% 100% 27% 82%
Warranty Claims Analysis 0% 94% 100% 83% 100% 83% 94% 93% 100% 98%
Quality Life Cycle Improvement 4% 57% 75% 35% 64% 41% 26% 47% 16% 100%
SalesForce
ProfitabilityAnalysis
ProductionPlanning
VendorManaged
Inventory
GlobalPricing
Rationalization
TransportationPlanning
Fulfillment
(PerfectOrder)
ManufacturingQuality
Optimization
Preventative
Maintenance
Analysis
WarrantyClaims
Analysis
QualityLifeCycle
Improvement
If
Then
72
PRODUCT
SENSOR
SOCIAL
MEDIA
CUSTOMER
CARE AUDIO
RECORDINGS
DIGITAL
ADVERTISING
CLICKSTREAM
65 41 32 19 28
How many
visitors did
we have to
our hybrid
cars
microsite
yesterday?
What are
the
temperature
readings for
batteries by
Manufactur
er?
What is the
sentiment
towards line
of hybrid
vehicles?
Which
customers
likely
expressed
anger with
customer
care?
Which ad
creative
generated
the most
clicks?
73
2855
High, Well
Known Quality
Directional
Quality
Unknown/Low
quality
Well Defined
Data Model
Curated JSON,
XML, DB
Extracted
Attributes
Curation Required
74
2855
SENSOR
DIGITAL
ADVERTISING
CLICKSTREAM
INTERACTIONS
RATINGS &
REVIEWS
CUSTOMER PORTAL
INTERACTIONS
EXTERNAL
INTERACTIONS
SOCIAL MEDIA
IVR Routing
RFID
ELECTRONIC
COMMERCE
FINANCE
SALESMARKETING
Inventory
Returns
Manufacturing
Supply Chain
Customer
Service
Orders
Revenue
Expenses
Case History
Customers
Products
Pipeline
Customers
Campaign History
OPERATIONS
CUSTOMER
CARE – AUDIO
RECORDINGS
Maps
Telemetry
SERVER LOGS
CUSTOMER
EXPERIENCE
Enterprise Data
75
2855
Schema on
Write
Evolving
Schema
Schema on
Read
Enterprise
Data
Evolving
Data
New Data
Sources
Schema?
76
2855
Out of Deviation
Sensor Readings
Fraud
Events
CUSTOMER EXPERIENCE
SALES
MARKETING
OPERATIONS
Abandoned
Carts
Online Price
Quotes
Social Media
Influencers
FINANCE
11234
Minimum Viable
Curation
Minimum Viable
Data Quality
77
2855
Out of Deviation
Sensor Readings
Fraud
Events
CUSTOMER
EXPERIENCE
SALES
MARKETING
OPERATIONS
Abandoned
Carts
Online Price
Quotes
Social Media
Influencers
FINANCE
11234
Sensor data
formatting
Unit
Normalization
Vehicle/version
normalization
Make/model/year
selection
Data Comprehension,
Pipelines
78
2855
Enterprise
Wide Use
10s to 100s
of Users
10s Users
Analysts /
engineers
Business
Analysts
Action list, Report,
Dashboard Users
Share across
Enterprise
Share in
department
Share over
cube wall
User Base and Sharing
79
Evolving
Consumption
2855
Enterprise
production
analytics
Departmental
analytics
Exploratory
analytics
Repeatable,
auditable
results
Ad-Hoc
Query, Self
Service
Data Labs
80
Evolving Consumption
Requirements
2855
Production tools,
curated data,
integrated across
business areas
Wide variety
of tools and
data forms
Targeted data access,
many applications,
response time SLAs
High CPU,
moderate to
low IO
Bulk scans, large
computation,
transformation, data access,
specific integration
Moderate CPU,
High IO, Resource
management
81
Technology,
Capacity
2855
Relational
store more
appropriate
Hybrid
File Store more
appropriate
100s TB -
PBs
10s PB
10 TB
82
MARKETING
FINANCE
SALES
OPERATIONS
CUSTOMER
EXPERIENCE
Given the rise in warranty
costs, isolate the
problem to be a plant
and the specific lot.
Exclude 2/3rd of the
batteries from the lot
that are fine.
Communicate with
affected customers, who
have not made a
warranty claim, through
Marketing and Customer
Service channels to
recall cars with affected
batteries.
2855
MANUFACTURING
CAMPAIGN
HISTORY
COSTS
PRODUCTS
CUSTOMERS
CASE HISTORY
SENSOR
Access Wide Variety of Data
to Answer a Question
Copyright Third Nature, Inc.
An event contains mainly IDs…that reference other data
2010.01.10.01:41:4468 67.189.110.179 Ser40489a07-7f6e4251801a
13ae51a6d092 301212631165031 590387153892659 DNIS5555685981-
UTF8&1T4ACGW_enUS386US387&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
m100109_44IOJ1-Id=105543CD1A7B8322
timestamp
user ID
device ID
game ID event details
IP adr
log ID
Log de-referencing and enrichment is difficult since you can’t
enforce integrity like you can in a DB.
What’s the glue that holds it together?
It’s just keys to other data.
e.g. remember that device ID 0 problem?
Copyright Third Nature, Inc.© Third Nature Inc.
It’s not just the log data, it’s big data + small data
Device
Event logs
Event logs
Event logs
A product has a complex set of master
data and metadata and this needs to
be added back to the bare events.
Copyright Third Nature, Inc.© Third Nature Inc.
Where does the reference data come from?
The keys come from somewhere. You don’t just make up
system-wide unique identifiers in your code.
The lack of local lookup data at event generation leads to
development practices that lead to inconsistencies.
Event logs
Event logs
Event logs
Problem we had: product
identifiers that didn’t match
any known products
Had to fix by analyzing each
log of bad-ID events.
Because developers used
config files they copied from
the PMS and put in the code
Break here and
you have nothing
Copyright Third Nature, Inc.
MDM again
If you want to link datasets then you must manage the keys
You need canonical forms for common data (in code too)
user ID
IP adr
2010.01.10.01:41:4468 67.189.110.179 Ser40489a07-7f6e4251801a
13ae51a6d092 301212631165031 590387153892659 DNIS5555685981-
UTF8&1T4ACGW_enUS386US387&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
m100109_44IOJ1-Id=105543CD1A7B8322
UID CID Email City State Country
590387153892659 10098213 barry.dylan@odin.com Paris Île-de-FranceFrance
2010.01.10.14:26:2468 67.189.110.179 10098213 5046876319474403 MOZILLA/4.0
(COMPATIBLE; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322) https://www.phisherking.com/
gifts/store/LogonForm?mmc=link-src-email_m100109 http://www.google.com/search?
sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&pid=1T4ACGW_13ae51a6d092&q=ita
lian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
game ID
date
IP adr
customer ID
Event
Click
Cust-
user
Copyright Third Nature, Inc.
Data hoarding is not a data management strategy
Copyright Third Nature, Inc.
The missing ingredient from most big data
Specifically,
metadata kept
separate from
the data.
Copyright Third Nature, Inc.
The solution to our problems isn’t
technology, it’s architecture.
Copyright Third Nature, Inc.
Architecture is an abstraction – it’s a purpose
Copyright Third Nature, Inc.
Blueprints are not architectures
Copyright Third Nature, Inc.
History: This is how BI was done through the 80s
First there were files and reporting programs.
Application files feed through a data processing pipeline to
generate an output file. The file is used by a report formatter
for print/screen. Files are largely single-purpose use.
Every report is a program written by a developer.
Data pipeline
code
Copyright Third Nature, Inc.
History: This is how BI ended the 80s
The inevitable situation was...
Data pipeline
code
Copyright Third Nature, Inc.
History: This is how we started the 90s
Collect data in a database. Queries replaced a LOT of
application code because much was just joins. We
learned about “dead code”
SQL
SQL
SQL
SQL
SQL
Copyright Third Nature, Inc.
Pragmatism and Data
Lessons learned during
the ad-hoc SQL era of the
DW market:
When the technology is
awkward for the users, the
users will stop trying to use
it.
Even “simple” schemas
weren’t enough for anyone
other than analysts and
their Brio…
Led to the evolution of
metadata-driven SQL-
generating BI tools, ETL
tools.
Copyright Third Nature, Inc.
BI evolved to hiding query generation for end users
With more regular schema models, in particular
dimensional models that didn’t contain cyclic join paths, it
was possible to automate SQL generation via semantic
mapping layers.
We developed data pipeline building tools (ETL).
Query via business terms made BI usable by non-technical
people.
ETL
SQL
Life got much easier…for a while
Copyright Third Nature, Inc.
Today’s model: Lake + data engineers, looks familiar…
The Lake with data pipelines to files or Hive tables
is exactly the same pattern as the COBOL batch..
Dataflow
code
We already know that people don’t scale…
Copyright Third Nature, Inc.
DATA ARCHITECTURE
We’re so focused on the light switch that we’re not
talking about the light
Copyright Third Nature, Inc.
The architecture from 1988 we SHOULD HAVE BEEN USING
The general concept of a
separate architecture for BI
has been around longer, but
this paper by Devlin and
Murphy is the first formal
data warehouse architecture
and definition published.
101
“An architecture for a business and
information system”, B. A. Devlin,
P. T. Murphy, IBM Systems Journal,
Vol.27, No. 1, (1988)
Slide 101Copyright Third Nature, Inc.
But 30 years ago we did not expect so many different
models of deployment, execution and use. Needs change
DeployETL
Data Data Storage
Alerts / Reports/
Decisioning
Deploy
f
Data Streams Intelligent Filter
/ Transform Model Execution
Analytics for eyeballs and analytics for machines are different
Copyright Third Nature, Inc.
Decouple the Architecture
The core of a data warehouse isn’t the database, it’s the
data architecture that the database and tools implement.
We need a new data architecture that is not limiting:
▪ Deals with change more easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from streaming
to one-time bulk
▪ Allows both reading and writing of data
▪ Makes data linkable, and provide governance where required
▪ Does not give up the gains of the last 25 years
Copyright Third Nature, Inc.
The goal is to decouple: solve the application and
infrastructure problems separately, independently
Platform Services
Data Access
Deliver & Use
Data storage
This separates uses of data from each other, allowing each type of
use to structure the data specific to its own requirements.
Data access is already
somewhat separate
today. Make the
separation of different
access methods a formal
part of the architecture.
Don’t force one model.
Copyright Third Nature, Inc.
The goal is to decouple: solve the application and
infrastructure problems separately, independently
Platform Services
Data Management
Process & Integrate
Data storage
Data management should not be subject to the constraints of a single use
Data management
has historically been
blended with both
data acquisition and
structuring data for
client tools. It should
be an independent
function.
Copyright Third Nature, Inc.
The goal is to decouple: solve the application and
infrastructure problems separately, independently
Data Acquisition
Collect & Store
Incremental
Batch
One-time copy
Real time
Platform Services
Data storage
Data arrives in many latencies, from real-time to one-time. Acquisition
can’t be limited by the management or consumption layers.
Data acquisition should not be
directly tied to the needs of
consumption. It must operate
independently of data use.
Copyright Third Nature, Inc.
The full analytic environment subsumes all the functions
of a data lake and a data warehouse, and extends them
Data Acquisition
Collect & Store
Incremental
Batch
One-time copy
Real time
Platform Services
Data Management
Process & Integrate
Data Access
Deliver & Use
Data storage
The platform has to do more than serve queries; it has to be read-write.
Copyright Third Nature, Inc.
Food supply chain: an analogy for analytic data
Multiple contexts of use, differing quality levels
You need to keep the original because just like baking,
you can’t unmake dough once it’s mixed.
Copyright Third Nature, Inc.
The data architecture must align with system components
because each of them addresses different data needs
Incremental
Collect
Batch
One-time copy
Real time
Manage & Integrate
Data Acquisition
Collect & Store
Data Management
Process & Integrate
Data Access
Deliver & Use
Separating concerns is part of the mechanism for change isolation
Copyright Third Nature, Inc.
The design focus is different in each area
Ingredients
Goal: available
User needs a recipe
in order to make
use of the data.
Pre-mixed
Goal: discoverable
and integrateable
User needs a menu
to choose from the
data available
Meals
Goal: usable
User needs utensils
but is given a
finished meal
Copyright Third Nature, Inc.
The data is in zones of management, not isolating layers
Raw data in an immutable
storage area
Standardized or
enhanced data
Common or
usage-
specific
data
Transient data
Relax control to enable self-service while avoiding a mess.
Do not constrain access to one zone or to a single tool.
Focus on visibility of data use, not control of data.
Copyright Third Nature, Inc.
This data architecture resolves rate of change problems
Raw data in an immutable
storage area
Standardized or
enhanced data
Common or
usage-
specific data
Transient data
New data of
unknown value,
simple requests for
new data can land
here first, with little
work by IT.
More effort applied to
management, slower.
Optimized for
specific uses /
workloads. Generally
the slowest change.
Not fast vs slow:
fast vs right
Not flexibility vs
control: flexibility
vs repeatability
Agile for
structure change
vs agile for
questions / use
Copyright Third Nature, Inc.
SQLServer
RDBMS
Example: data environment, mid-size retailer
Replication, batch ETL
to collect data
Web Application
Standard retail /
wholesale data
warehouse model.
Nightly batch model
from stage, contains
~ 1 TB of data
DW
New requirement:
Load all the online marketing and web activity for
use in analytic models (not simple web analytics)
Will add ~10 TB of data, continuous or hourly loads
Copyright Third Nature, Inc.
SQLServer
RDBMS
Example: data environment, mid-size retailer
Replication, batch ETL
to collect data
Hadoop
Web Application
Persistent store,
data warehouse on
same DB in
different schemas
DW
Log file fetch & load for clickstream,
summaries sent to reporting env
Discovery work usually
done in Hadoop,
sometimes in database.
Copyright Third Nature, Inc.
SQLServer
This data architecture uses the 3 zone pattern
Hadoop
Web Application
RDBMS
DW
Raw data in an immutable
storage area
Standardized or
enhanced data
Common or
usage-
specific data
Transient data
All reporting data
and analytics
output is sent to
the DW where it is
can be accessed.
Cleaned,
summarized or
derived data is
stored in DB
Raw data is stored
in two places:
relational and small
sets in DB, log
collection in Hadoop
Copyright Third Nature, Inc.
The concept of a zone is not a physical system. It’s data architecture
Apps
DW
The biggest decision is to separate all data collection from
the data integration from consumption.
Physical system/technology overlays are separate, depend
on the specific use cases and needs of the organization.
Dist FS
RDBMSs
Decompress &
process device logs
RDBMS
RDBMS
multiple
Discovery
platform
Copyright Third Nature, Inc.
Data has to be moved, standardized, tracked
There is a lot of data policy and governance to think about
Collect Manage & Integrate
Data Acquisition
Collect & Store
Data Management
Process & Integrate
Data Access
Deliver & Use
CRM
User reg
Orders
Leads
Cust
Cust
Cust
Cust
Master DW
Mktg
Campaign
analysis
Rec
engine
Copyright Third Nature, Inc.
What about the technology? Do I need an <X>?
Copyright Third Nature, Inc.
TANSTAAFL
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.
Technologies are not
perfect replacements for
one another. Often not
better, only different.
You need to see the full context in order to build a platform.
The analytics environment is more than just silos of activity
transactions
data
warehouse
applications
analytic
models
events
BI
discovery &
analysisdiscovery
platform
collection access discovery delivery
modeling
Data scientists
& Engineers
Stakeholders
Users,
Customers
Machines
There are multiple audiences involved.
It’s rare for anyone to see the full picture.
Data scientists
& Engineers
Stakeholders
Users,
Customers
Machines
Environment and workflows: Modeling / Analysis Loop
transactions
data
warehouse
applications
analytic
models
events
BI
discovery &
analysisdiscovery
platform
collection access discovery delivery
modeling
This is the view of the people who
build (and deploy) analytic models.
Data scientists
& Engineers
Stakeholders
Users,
Customers
Machines
Environment and workflows: Execution Loop
transactions
data
warehouse
applications
analytic
models
events
BI
discovery &
analysisdiscovery
platform
collection access discovery delivery
modeling
This is the view of the people who use
the results or interact with the models.
Data scientists
& Engineers
Stakeholders
Users,
Customers
Machines
Environment and workflows: Monitoring and Managing
transactions
data
warehouse
applications
analytic
models
events
BI
discovery &
analysisdiscovery
platform
collection access discovery delivery
modeling
This is the view of the people who manage the
business goals (and the models – as analytics
is operationalized, it becomes the same thing).
125
Blended Architectures Are a Requirement,
Not an Option
Data Warehouse + Data Lake
On Premise + Cloud
RDBMS + S3 + HDFS
Commercial + Open Source
You can’t just buy one thing platform from one vendor. We aren’t building a
death star.
Each of the zones is likely to have products specific to that zone’s usage. The
uses differ, the people using them differ, shouldn’t the tools should differ too?
© Third Nature Inc.
Manage your data
(or it will manage you)
Data management is where
developers are weakest.
Modern engineering
practices are where data
management is weakest.
You need to bridge these
groups and practices in the
organization if you want to
do meaningful work with
data. Remember Conway’s
Law when you build.
1.It’s not about storing data, it’s about using it
2.Use drives architecture. Understand the uses,
what you are designing for, to drive decisions.
3.Put data at the center, not technology. Don’t let
the tech define what you can do or how you do it.
4.The death star is not the answer. The data model
is not a flat earth. You are not building a monolith.
5.Know your history. Avoiding wheel reinvention
saves time, money, careers.
Conclusion
© Third Nature Inc.
“Now is not the end.
It is not even the
beginning of the end.
But it is, perhaps,
the end of the beginning.”
Winston Churchill
Mark Madsen is the global head of
architecture at Think Big Analytics,
Prior to that he was president of Third
Nature, a research and consulting firm
focused on analytics, data integration
and data management. Mark is an
award-winning author, architect and
CTO whose work has been featured in
numerous industry publications. Over
the past ten years Mark received
awards for his work from the American
Productivity & Quality Center, TDWI,
and the Smithsonian Institute. He is an
international speaker, chairs several
conferences, and is on the O’Reilly
Strata program committee. For more
information or to contact Mark, follow
@markmadsen on Twitter.
About the Presenter
Todd Walter
Chief Technologist - Teradata
• Chief Technologist for Teradata
• A pragmatic visionary, Walter helps business leaders,
analysts and technologists better understand all of the
astonishing possibilities of big data and analytics
• Works with organizations of all sizes and levels of
experience at the leading edge of adopting big data, data
warehouse and analytics technologies
• With Teradata for more than 30 years and served for more
than 10 years as CTO of Teradata Labs, contributing
significantly to Teradata’s unique design features and
functionality
• Holds more than a dozen Teradata patents and is a
Teradata Fellow in recognition of his long record of
technical innovation and contribution to the company

Weitere ähnliche Inhalte

Was ist angesagt?

Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesmark madsen
 
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the CloudStrata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the CloudJaipaul Agonus
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science TeamsGanes Kesari
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science TeamsEMC
 
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleO'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleVasu S
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataMicrosoft
 
Idiots guide to setting up a data science team
Idiots guide to setting up a data science teamIdiots guide to setting up a data science team
Idiots guide to setting up a data science teamAshish Bansal
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data scienceVipul Kalamkar
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approachjoshwills
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategyHimanshu Bari
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehousemark madsen
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
Applications of AI in Supply Chain Management: Hype versus Reality
Applications of AI in Supply Chain Management: Hype versus RealityApplications of AI in Supply Chain Management: Hype versus Reality
Applications of AI in Supply Chain Management: Hype versus RealityGanes Kesari
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchKlaas Bosteels
 
How to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organizationHow to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organizationYael Garten
 

Was ist angesagt? (20)

Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
 
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the CloudStrata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science Teams
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science Teams
 
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleO'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big data
 
Idiots guide to setting up a data science team
Idiots guide to setting up a data science teamIdiots guide to setting up a data science team
Idiots guide to setting up a data science team
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehouse
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
Keynote Dubai
Keynote DubaiKeynote Dubai
Keynote Dubai
 
Applications of AI in Supply Chain Management: Hype versus Reality
Applications of AI in Supply Chain Management: Hype versus RealityApplications of AI in Supply Chain Management: Hype versus Reality
Applications of AI in Supply Chain Management: Hype versus Reality
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from Scratch
 
How to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organizationHow to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organization
 

Ähnlich wie Architecting a Data Platform For Enterprise Use (Strata NY 2018)

Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPeculium Crypto
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data TipsQubole
 
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...Dario Mangano
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015Fiona Lew
 
Best Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationBest Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationChasity Gibson
 
Why Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A LieWhy Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A LieSunil Ranka
 
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfChallenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfvenkatakeerthi3
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterInside Analysis
 
Top reasons why big data projects are still a failure
Top reasons why big data projects are still a failureTop reasons why big data projects are still a failure
Top reasons why big data projects are still a failureArun Kapoor
 
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...Dana Gardner
 
Who is a data scientist
Who is a data scientist  Who is a data scientist
Who is a data scientist prateek kumar
 
Welcome to Data Science
Welcome to Data ScienceWelcome to Data Science
Welcome to Data ScienceNyraSehgal
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceAnnie Flippo
 
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Jennifer Walker
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with MicrosoftCaserta
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationIntellipaat
 
Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Gregg Barrett
 

Ähnlich wie Architecting a Data Platform For Enterprise Use (Strata NY 2018) (20)

Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedback
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data Tips
 
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
 
Best Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationBest Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the Organization
 
Why Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A LieWhy Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A Lie
 
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfChallenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
Top reasons why big data projects are still a failure
Top reasons why big data projects are still a failureTop reasons why big data projects are still a failure
Top reasons why big data projects are still a failure
 
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
 
Who is a data scientist
Who is a data scientist  Who is a data scientist
Who is a data scientist
 
Welcome to Data Science
Welcome to Data ScienceWelcome to Data Science
Welcome to Data Science
 
Challenges of Executing AI
Challenges of Executing AIChallenges of Executing AI
Challenges of Executing AI
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data Science
 
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
 
The REAL face of Big Data
The REAL face of Big DataThe REAL face of Big Data
The REAL face of Big Data
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with Microsoft
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview Preparation
 
Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...
 

Mehr von mark madsen

A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Rangemark madsen
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customersmark madsen
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?mark madsen
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionmark madsen
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecturemark madsen
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsmark madsen
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except usmark madsen
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)mark madsen
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)mark madsen
 
Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...mark madsen
 
Don't let data get in the way of a good story
Don't let data get in the way of a good storyDon't let data get in the way of a good story
Don't let data get in the way of a good storymark madsen
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogiesmark madsen
 
Don't follow the followers
Don't follow the followersDon't follow the followers
Don't follow the followersmark madsen
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousingmark madsen
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Datamark madsen
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousingmark madsen
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the datamark madsen
 
Big Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data RevolutionBig Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data Revolutionmark madsen
 
Using Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big DataUsing Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big Datamark madsen
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolutionmark madsen
 

Mehr von mark madsen (20)

A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customers
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collection
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analytics
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)
 
Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...
 
Don't let data get in the way of a good story
Don't let data get in the way of a good storyDon't let data get in the way of a good story
Don't let data get in the way of a good story
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogies
 
Don't follow the followers
Don't follow the followersDon't follow the followers
Don't follow the followers
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Data
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the data
 
Big Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data RevolutionBig Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data Revolution
 
Using Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big DataUsing Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big Data
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
 

Kürzlich hochgeladen

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 

Kürzlich hochgeladen (20)

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 

Architecting a Data Platform For Enterprise Use (Strata NY 2018)

  • 1. Copyright Third Nature, Inc. Architecting a Data Platform For Enterprise Use September 2018 Mark Madsen Todd Walter
  • 2. Copyright Third Nature, Inc.Copyright Third Nature, Inc. What do we hear? "I want to do analytics, beyond what I can do with BI tools" "I need an analytics strategy" "I need an analytics roadmap" "I want to modernize my DW" aka "I want to speed up my DW" "I have a specific analytics project of type..." “We have a data lake. What should we do with it?” “Technology Y is our replacement for technology X” “Don’t need schema, curation, ETL, governance … anymore” “How do I ensure the data is good enough for analytics?” Good judgement is the result of experience. Experience is the result of bad judgement. —Fred Brooks
  • 3. 4 © 2017 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner statement in 2018: only 15% are reported to be successful only 17%of Hadoop deployments are in production in 2017 Survey Analysis: BI and Analytics Spending Intentions, 2017 A McKinsey survey this year asked executives if their company had achieved a positive ROI with their big data projects: 7% answered “yes” Gartner Finding: in 2017
  • 4. Copyright Third Nature, Inc. The business complaints about data and analytics You really should fire IT.
  • 5. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Data deficient Takes too long Costs too much Function deficient IT root causes IT proximate causes What is said in disputes Lack of agility People & vendor cost basis Client-server infrastructure Lack of “good enough” competency 1980s-era methods Inappropriate technology Data hygiene fetishes Vendor lock 1990s-era procurement IT skills deficit Dysfunctional OLTP portfolio
  • 6. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Data deficient Takes too long Costs too much Function deficient IT root causes IT proximate causes What is said in disputes How business responds How IT responsds Lack of agility People & vendor cost basis Client-server infrastructure Lack of “good enough” competency SaaS / Cloud BI Consultants Self-service BI Shadow BI Workarounds & spreadmarts Hidden costs Loss of control Loss of visibility Loss of knowledge Security risks Bad data risks 1980s-era methods Inappropriate technology Data hygiene fetishes Vendor lock 1990s-era procurement IT skills deficit Dysfunctional OLTP portfolio
  • 7. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Data deficient Takes too long Costs too much Function deficient IT root causes IT proximate causes What is said in disputes How business responds How IT responds Lack of agility People & vendor cost basis Client-server infrastructure Lack of “good enough” competency SaaS / Cloud BI Consultants Self-service BI Shadow BI Workarounds & spreadmarts Hidden costs Loss of control Loss of visibility Loss of knowledge Security risks Bad data risks 1980s-era methods Inappropriate technology Data hygiene fetishes Vendor lock 1990s-era procurement IT skills deficit Dysfunctional OLTP portfolio This is not a technology problem – it is an architecture problem.
  • 8. Copyright Third Nature, Inc. DW: Centralize, that solves all problems! Creates bottlenecks Causes scale problems Availability?
  • 9. Copyright Third Nature, Inc. The data lake solution: no central authority wtf, it was fully operational!
  • 10. Copyright Third Nature, Inc. The data lake solution? There’s a problem: as the lake is envisioned, it is still a centralized data architecture, but this time there is no single global model. Instead it’s files and not modeled. It can be operational while under construction. It’s still a death star.
  • 11. Copyright Third Nature, Inc. Eventually we run into the same problems Seriously, wtf? It was agile and operational Rising complexity and scale break centralized models
  • 12. Copyright Third Nature, Inc. “The problem is the tools. Standardize on one tech!” It’s called stack think. Pick your vendor. Cede all architecture.
  • 13. Copyright Third Nature, Inc. The problem is methods and process: agile your way into the Analytics Shantytown
  • 14. Copyright Third Nature, Inc.Copyright Third Nature, Inc. “Infrastructure one PoC at a time” • Use case driven • Leverage (any) new technology • Re-use of the technology stack • Data is captive to the application or to the technology stack • Developers tend to think “function first” • Analytics people also think “function first”, but generally require integrated data
  • 15. Copyright Third Nature, Inc.Copyright Third Nature, Inc. A key aspect of operational analytics is “end-to-end” E2E does not allow you to take a random walk through technologies and BYOT because the complexity of integration leads to a mountain of technical debt. • Most data applications are organized around a process, not a task or function. • Real-time analytics systems pass their data over the network, but the dependencies can cross application boundaries, as well as requiring persistence. • Developers are being forced to think of data outside the local context.
  • 16. Copyright Third Nature, Inc. “Start with the platform. The rest will follow.”
  • 17. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Big data reality: What users seeBig data promise: What you see
  • 18. Copyright Third Nature, Inc. Persisting data is not the end of the line. If you stop here you win the battle and lose the war
  • 19. Copyright Third Nature, Inc. “Begin with the end in mind” The starting point can’t be with technology. That’s like starting with bricks when designing a house. You may get lucky but… The goals and specific uses are the place to start ▪ Use dictates need ▪ Need dictates capabilities ▪ Capabilities are solved with technology This is how you avoid spending $2M on a Hadoop and spark cluster in order to serve data to analysts whose primary requirements are met with laptops.
  • 20. Copyright Third Nature, Inc. We don’t have an analytics problem, just like we didn’t have a BI problem The origin of analytics as “business intelligence” was stated well in 1958: …the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal. ~ H. P. Luhn “A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf ” “ Our goal is analytics as a capability, not a technology
  • 21. © Third Nature Inc. B I
  • 22. © Third Nature Inc. The old problem was access, the new problem is analysis
  • 23. Copyright Third Nature, Inc. Three constituencies Stakeholder Analyst Builder aka the recipient aka the data scientist aka the engineer
  • 24. Copyright Third Nature, Inc. Starting points for analytics strategy Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem. Many more start with builders: technology solutions looking for problems, e.g. 65% of the IT driven Hadoop and Spark projects over the last five years. The right place to start? Stakeholders. The goal to achieve, the problem to solve.
  • 25. Copyright Third Nature, Inc. NATURE OF THE PROBLEM FROM THE STAKEHOLDER’S PERSPECTIVE Each constituency has their own set of problems to deal with
  • 26. The myth that still drives analytics – analytic gold All we need is a fat pipe and pans working in parallel…
  • 27. Analytic insights that result in no action are expensive trivia. It’s not the insight, but what you do with it, that matters
  • 28. Applying analytics is not an analytics problem Applying analytics is not in the analyst’s control. It’s not in the engineer’s control. It’s in the control of the people involved in the process. Failures are often in execution, not in analytics development. For example, we saw unexpectedly poor performance in a number of geographies. Was it the new analytics we tried? Was it a data problem? No, it was a simple compliance problem.
  • 29. BI and Analysis: Repeatability and Discoverability Discover Explore Analyze Model Consume Promote Focus is on repeatability Application cycle time 90% of users just want answers Focus is on discoverability Analyst cycle time 9% analysts, 1% data scientists 80% of data use is repeatable 80% of analysis is novel
  • 30. NATURE OF THE PROBLEM FROM THE ANALYST’S PERSPECTIVE
  • 31. The analytics process at a high level Diagram: Kate Matsudaira
  • 32. The nature of analytics problems is researching the unknown rather than accessing the known. Repeat for each new problem (unlike BI) Diagram: Kate Matsudaira
  • 33. The main hurdle: just getting the data Do you know where to find it? Because it’s may not be in the data warehouse. Do you have access to it? Is access fast enough? Because DWs are for QRD, not for moving huge piles of data. And ERP systems and SaaS apps are right out. These are why developers jump on “data lake” as the solution. But that’s a small, easy part.
  • 34. Do you have the right data? Many machine learning techniques require labeled (known good) training data: Supervised learning: a person has to define the correct output for some portion of the data. Data is divided into training sets used for model building and test sets for validating the results. • What is spam and what isn’t? • What does a fraudulent transaction look like 40
  • 35. Do you have enough of the right data? ML needs a lot, you may be disappointed in your efforts
  • 36. Where do analysts spend their time? mostly data work Slide 42 Define the business problem Translate the problem into an analytic context Select appropriate data Learn the data Create a model set Fix problems with data Transform data Build models Assess models Deploy models Assess results % of time spent 70% 30% Source: Michael Berry, Data Miners Inc.
  • 37. Where do most of the analytics tools focus? Slide 43 Define the business problem Translate the problem into an analytic context Select appropriate data Learn the data Create a model set Fix problems with data Transform data Build models Assess models Deploy models Assess results Source: Michael Berry, Data Miners Inc.
  • 38. Where do most of the analytics aaS focus? Slide 44 Define the business problem Translate the problem into an analytic context Select appropriate data Learn the data Create a model set Fix problems with data Transform data Build models Assess models Deploy models Assess results Source: Michael Berry, Data Miners Inc.
  • 39. The analyst’s workspace in BI is relatively spare
  • 40. The analyst’s workspace needs to be more like a kitchen than like BI vending machines
  • 41. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Do you treat this as a production workload? Storage control, space limitations, resource limitations, production lockdown Analytics work is a production workload, but… Hi, we’re from IT, and we’re here to help!
  • 42. NATURE OF THE PROBLEM FROM THE BUILDER’S PERSPECTIVE
  • 43. IT and Ops people want to know “what to build?” Giant data platform? Self service tools?
  • 44. Analytics has different processes and workloads None of this analytics work is the same as what IT considered “analysis” to be, which is usually equated with BI or ad-hoc query. Ad-hoc analysis = Exploratory data analysis = Batch analytics = Real-time analytics A real analytics production workflow Hatch, CIKM ‘11 Slide 50
  • 45. Things engineering and operations worry about Engineering time and effort ▪ Introduction of new technology, complexity ▪ Integration - Deployment of models requirements linking different types of environments, creating supportable workflows for the analysts ▪ Ability to develop and deploy at the required speed Supportability ▪ Automation ▪ The environment requires additional monitoring, other technology and processes, particularly for customer-facing work ▪ Support costs (time and money) SLAs: ▪ Availability – if analytics are tied to production operations, particularly customer facing, this becomes important and difficult because it’s not standard application work ▪ Performance and scalability – have to manage unpredictable workloads, resource conflicts between model development with model execution
  • 46. The world changes, do the models? In BI you maintain ETL and schemas, in ML you maintain models and maybe pipelines. “Model decay” happens as the assumptions around which a model is built change, e.g. spam techniques change. When you adjust the model you need to know it is normal again ▪ Better save the data used to build the model ▪ Better save the model ▪ Baseline and measurements
  • 47. THREE PERSPECTIVES, ONE SOLUTION? There are requirements from all constituents. You need to put them together to have a complete picture of what’s needed.
  • 48. The missing stakeholder There is another stakeholder: analytics management - the CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist. This is the perspective and problems of the person responsible for oversight of the team and efforts is across the organization and across multiple projects
  • 49. Job #1 - Repeatability
  • 50. Job # 2 - Operational predictability
  • 51. Job #3 - Reproducibility
  • 52. They need a system of record for analytics
  • 53. There is an extensive list of requirements to support Primary requirements needed by constituents S D E Data catalog and ability to search it for datasets X X Self-service access to curated data X Self-service access to uncurated (unknown, new) data X X Temporary storage for working with data X Data integration, cleaning, transformation, preparation tools and environment X X Persistent storage for source data used by production models X X Persistent storage for training, testing, production data used by models X X Storage and management of models X X Deployment, monitoring, decommissioning models X Lineage, traceability of changes made for data used by models X X Lineage, traceability for model changes X X X Managing baseline data / metrics for comparing model performance X X X Managing ongoing data / metrics for tracking ongoing model performance X X X S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
  • 54. © Third Nature Inc. How did we get to this state with BI & analytics? There’s a difference between having no past and actively rejecting it.
  • 55. 61 100 1,000 10,000 1 2 3 4 5 6 7 8 9 10 Customer Data Plateaus Add users, accumulate history, add attributes, add dimensions Entirely new data set enabled by new technology at new price point – value has to exceed cost (ROI). Earliest adopters have vision ahead of ROI
  • 56. 62 1 10 100 1,000 10,000 100,000 1,000,000 10,000,000 1 2 3 4 5 6 7 8 9 10 11 12 13 User Adoption of New Data Customer Data Users - Data Set 1 Users - Data Set 2 Users of Integrated Data User adoption of new data sets starts over. Very small number of experts growing to wider audience, sophisticated users moving to business analysts, then business users, B2B customers and even to consumers Greater value is derived when data sets are linked – see bigger picture (eg who buys what, when). Comes after initial extraction of easy value from standalone data
  • 57. 63 1 10 100 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000 1,000,000,000 10,000,000,000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 Data Size Plateaus over Time Lather, Rinse, Repeat. Trend line is roughly Moore’s Law. Delay or skip a generation if new data set is two orders of magnitude instead of one.
  • 58. 64 1 10 100 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000 1,000,000,000 10,000,000,000 100,000,000,000 1,000,000,000,000 10,000,000,000,000 100,000,000,000,000 1,000,000,000,000,000 10,000,000,000,000,000 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 Customer Data Size vs. 2x per 12mo Customer Data Moore's Law - 2x/18mo 2x/12 mo But all studies show demand/creation of data increasing significantly faster than Moore’s law If we started only 1 OOM behind, we are now 5 OOM away from user demand for data Data Exhaust
  • 59. 65 • <Store, Item, Week> • <Store, Item, Day> – Simple aggregations • Market Basket – Affinity – Link to person, demographics, HR • Inventory by SKU by store – Temporal, time series, forecasting – Link to product, marketing, market basket Retail Plateaus 2B records total for 9 quarters 2B records per day, keep 9 quarters
  • 60. 66 • Web Logs and traffic – Behavioral patterns – eg path linked to person, offers, other channels – Operations of the web site • Supply chain sensors – sampled at major event – Activity Based Costing – Link to customer, product, HR, planning • Social Media – Text analysis, Filtering, languages – Link to customer, sales, other channel interactions • Supply chain sensors – sampled at minutes or seconds – Telematics – Real time, Event detection, trending, static and dynamic rules – Link to HR, thresholds, forecasts, routing, planning Retail Plateaus 30B records per day
  • 61. 67 FINANCE Revenue Expenses Customers CUSTOMER CARE Customer Products Orders Case History SALES Orders Customers Products MARKETING Customers Orders Campaign History OPERATIONS Inventory Returns Manufacturing Supply Chain How many batteries are in inventory by plant? What is the trend of warranty costs? How many people made a warranty claim last week? How many sales have been made quarter to date? Which customers should get a communica tion on extended warranties? 54 32 29 49 66
  • 63. 69 Given the rise in warranty costs, isolate the problem to be a plant, then to a battery lot. Communicate with affected customers, who have not made a warranty claim on batteries, through Marketing and Customer Service channels to recall cars with affected batteries. 2855 Inventory Returns Manufacturing Supply Chain Customer Service Orders Revenue Expenses Case History Customers Products Pipeline Customers Campaign History FINANCE SALESMARKETING OPERATIONS CUSTOMER EXPERIENCE 2855
  • 64. 70 Manufacturing: Data Overlap Analysis New Business Improvement Opportunities through Data Leverage Sales Force Profitability Analysis 100% 80% 66% 24% 41% 66% 0% 24% 0% 24% Transportation Planning 11% 100% 87% 28% 64% 56% 22% 34% 13% 45% Production Planning 6% 57% 100% 19% 83% 35% 17% 28% 9% 40% Vendor Managed Inventory 12% 100% 100% 100% 78% 100% 61% 100% 39% 100% Global Pricing Rationalization 4% 50% 100% 17% 100% 27% 15% 29% 11% 41% Fulfillment (Perfect Order) 16% 94% 90% 48% 58% 100% 35% 53% 19% 56% Manufacturing Quality Optimization 0% 76% 88% 60% 66% 71% 100% 79% 43% 73% Preventative Maintenance Analysis 7% 75% 93% 61% 80% 68% 50% 100% 27% 82% Warranty Claims Analysis 0% 94% 100% 83% 100% 83% 94% 93% 100% 98% Quality Life Cycle Improvement 4% 57% 75% 35% 64% 41% 26% 47% 16% 100% SalesForce ProfitabilityAnalysis ProductionPlanning VendorManaged Inventory GlobalPricing Rationalization TransportationPlanning Fulfillment (PerfectOrder) ManufacturingQuality Optimization Preventative Maintenance Analysis WarrantyClaims Analysis QualityLifeCycle Improvement If Then
  • 65.
  • 66. 72 PRODUCT SENSOR SOCIAL MEDIA CUSTOMER CARE AUDIO RECORDINGS DIGITAL ADVERTISING CLICKSTREAM 65 41 32 19 28 How many visitors did we have to our hybrid cars microsite yesterday? What are the temperature readings for batteries by Manufactur er? What is the sentiment towards line of hybrid vehicles? Which customers likely expressed anger with customer care? Which ad creative generated the most clicks?
  • 67. 73 2855 High, Well Known Quality Directional Quality Unknown/Low quality Well Defined Data Model Curated JSON, XML, DB Extracted Attributes Curation Required
  • 68. 74 2855 SENSOR DIGITAL ADVERTISING CLICKSTREAM INTERACTIONS RATINGS & REVIEWS CUSTOMER PORTAL INTERACTIONS EXTERNAL INTERACTIONS SOCIAL MEDIA IVR Routing RFID ELECTRONIC COMMERCE FINANCE SALESMARKETING Inventory Returns Manufacturing Supply Chain Customer Service Orders Revenue Expenses Case History Customers Products Pipeline Customers Campaign History OPERATIONS CUSTOMER CARE – AUDIO RECORDINGS Maps Telemetry SERVER LOGS CUSTOMER EXPERIENCE Enterprise Data
  • 70. 76 2855 Out of Deviation Sensor Readings Fraud Events CUSTOMER EXPERIENCE SALES MARKETING OPERATIONS Abandoned Carts Online Price Quotes Social Media Influencers FINANCE 11234 Minimum Viable Curation Minimum Viable Data Quality
  • 71. 77 2855 Out of Deviation Sensor Readings Fraud Events CUSTOMER EXPERIENCE SALES MARKETING OPERATIONS Abandoned Carts Online Price Quotes Social Media Influencers FINANCE 11234 Sensor data formatting Unit Normalization Vehicle/version normalization Make/model/year selection Data Comprehension, Pipelines
  • 72. 78 2855 Enterprise Wide Use 10s to 100s of Users 10s Users Analysts / engineers Business Analysts Action list, Report, Dashboard Users Share across Enterprise Share in department Share over cube wall User Base and Sharing
  • 74. 80 Evolving Consumption Requirements 2855 Production tools, curated data, integrated across business areas Wide variety of tools and data forms Targeted data access, many applications, response time SLAs High CPU, moderate to low IO Bulk scans, large computation, transformation, data access, specific integration Moderate CPU, High IO, Resource management
  • 76. 82 MARKETING FINANCE SALES OPERATIONS CUSTOMER EXPERIENCE Given the rise in warranty costs, isolate the problem to be a plant and the specific lot. Exclude 2/3rd of the batteries from the lot that are fine. Communicate with affected customers, who have not made a warranty claim, through Marketing and Customer Service channels to recall cars with affected batteries. 2855 MANUFACTURING CAMPAIGN HISTORY COSTS PRODUCTS CUSTOMERS CASE HISTORY SENSOR Access Wide Variety of Data to Answer a Question
  • 77. Copyright Third Nature, Inc. An event contains mainly IDs…that reference other data 2010.01.10.01:41:4468 67.189.110.179 Ser40489a07-7f6e4251801a 13ae51a6d092 301212631165031 590387153892659 DNIS5555685981- UTF8&1T4ACGW_enUS386US387&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s m100109_44IOJ1-Id=105543CD1A7B8322 timestamp user ID device ID game ID event details IP adr log ID Log de-referencing and enrichment is difficult since you can’t enforce integrity like you can in a DB. What’s the glue that holds it together? It’s just keys to other data. e.g. remember that device ID 0 problem?
  • 78. Copyright Third Nature, Inc.© Third Nature Inc. It’s not just the log data, it’s big data + small data Device Event logs Event logs Event logs A product has a complex set of master data and metadata and this needs to be added back to the bare events.
  • 79. Copyright Third Nature, Inc.© Third Nature Inc. Where does the reference data come from? The keys come from somewhere. You don’t just make up system-wide unique identifiers in your code. The lack of local lookup data at event generation leads to development practices that lead to inconsistencies. Event logs Event logs Event logs Problem we had: product identifiers that didn’t match any known products Had to fix by analyzing each log of bad-ID events. Because developers used config files they copied from the PMS and put in the code Break here and you have nothing
  • 80. Copyright Third Nature, Inc. MDM again If you want to link datasets then you must manage the keys You need canonical forms for common data (in code too) user ID IP adr 2010.01.10.01:41:4468 67.189.110.179 Ser40489a07-7f6e4251801a 13ae51a6d092 301212631165031 590387153892659 DNIS5555685981- UTF8&1T4ACGW_enUS386US387&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s m100109_44IOJ1-Id=105543CD1A7B8322 UID CID Email City State Country 590387153892659 10098213 barry.dylan@odin.com Paris Île-de-FranceFrance 2010.01.10.14:26:2468 67.189.110.179 10098213 5046876319474403 MOZILLA/4.0 (COMPATIBLE; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322) https://www.phisherking.com/ gifts/store/LogonForm?mmc=link-src-email_m100109 http://www.google.com/search? sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&pid=1T4ACGW_13ae51a6d092&q=ita lian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s game ID date IP adr customer ID Event Click Cust- user
  • 81. Copyright Third Nature, Inc. Data hoarding is not a data management strategy
  • 82. Copyright Third Nature, Inc. The missing ingredient from most big data Specifically, metadata kept separate from the data.
  • 83. Copyright Third Nature, Inc. The solution to our problems isn’t technology, it’s architecture.
  • 84. Copyright Third Nature, Inc. Architecture is an abstraction – it’s a purpose
  • 85. Copyright Third Nature, Inc. Blueprints are not architectures
  • 86. Copyright Third Nature, Inc. History: This is how BI was done through the 80s First there were files and reporting programs. Application files feed through a data processing pipeline to generate an output file. The file is used by a report formatter for print/screen. Files are largely single-purpose use. Every report is a program written by a developer. Data pipeline code
  • 87. Copyright Third Nature, Inc. History: This is how BI ended the 80s The inevitable situation was... Data pipeline code
  • 88. Copyright Third Nature, Inc. History: This is how we started the 90s Collect data in a database. Queries replaced a LOT of application code because much was just joins. We learned about “dead code” SQL SQL SQL SQL SQL
  • 89. Copyright Third Nature, Inc. Pragmatism and Data Lessons learned during the ad-hoc SQL era of the DW market: When the technology is awkward for the users, the users will stop trying to use it. Even “simple” schemas weren’t enough for anyone other than analysts and their Brio… Led to the evolution of metadata-driven SQL- generating BI tools, ETL tools.
  • 90. Copyright Third Nature, Inc. BI evolved to hiding query generation for end users With more regular schema models, in particular dimensional models that didn’t contain cyclic join paths, it was possible to automate SQL generation via semantic mapping layers. We developed data pipeline building tools (ETL). Query via business terms made BI usable by non-technical people. ETL SQL Life got much easier…for a while
  • 91. Copyright Third Nature, Inc. Today’s model: Lake + data engineers, looks familiar… The Lake with data pipelines to files or Hive tables is exactly the same pattern as the COBOL batch.. Dataflow code We already know that people don’t scale…
  • 92. Copyright Third Nature, Inc. DATA ARCHITECTURE We’re so focused on the light switch that we’re not talking about the light
  • 93. Copyright Third Nature, Inc. The architecture from 1988 we SHOULD HAVE BEEN USING The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published. 101 “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol.27, No. 1, (1988) Slide 101Copyright Third Nature, Inc.
  • 94. But 30 years ago we did not expect so many different models of deployment, execution and use. Needs change DeployETL Data Data Storage Alerts / Reports/ Decisioning Deploy f Data Streams Intelligent Filter / Transform Model Execution Analytics for eyeballs and analytics for machines are different
  • 95. Copyright Third Nature, Inc. Decouple the Architecture The core of a data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a new data architecture that is not limiting: ▪ Deals with change more easily and at scale ▪ Does not enforce requirements and models up front ▪ Does not limit the format or structure of data ▪ Assumes the range of data latencies in and out, from streaming to one-time bulk ▪ Allows both reading and writing of data ▪ Makes data linkable, and provide governance where required ▪ Does not give up the gains of the last 25 years
  • 96. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Platform Services Data Access Deliver & Use Data storage This separates uses of data from each other, allowing each type of use to structure the data specific to its own requirements. Data access is already somewhat separate today. Make the separation of different access methods a formal part of the architecture. Don’t force one model.
  • 97. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Platform Services Data Management Process & Integrate Data storage Data management should not be subject to the constraints of a single use Data management has historically been blended with both data acquisition and structuring data for client tools. It should be an independent function.
  • 98. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Data Acquisition Collect & Store Incremental Batch One-time copy Real time Platform Services Data storage Data arrives in many latencies, from real-time to one-time. Acquisition can’t be limited by the management or consumption layers. Data acquisition should not be directly tied to the needs of consumption. It must operate independently of data use.
  • 99. Copyright Third Nature, Inc. The full analytic environment subsumes all the functions of a data lake and a data warehouse, and extends them Data Acquisition Collect & Store Incremental Batch One-time copy Real time Platform Services Data Management Process & Integrate Data Access Deliver & Use Data storage The platform has to do more than serve queries; it has to be read-write.
  • 100. Copyright Third Nature, Inc. Food supply chain: an analogy for analytic data Multiple contexts of use, differing quality levels You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.
  • 101. Copyright Third Nature, Inc. The data architecture must align with system components because each of them addresses different data needs Incremental Collect Batch One-time copy Real time Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use Separating concerns is part of the mechanism for change isolation
  • 102. Copyright Third Nature, Inc. The design focus is different in each area Ingredients Goal: available User needs a recipe in order to make use of the data. Pre-mixed Goal: discoverable and integrateable User needs a menu to choose from the data available Meals Goal: usable User needs utensils but is given a finished meal
  • 103. Copyright Third Nature, Inc. The data is in zones of management, not isolating layers Raw data in an immutable storage area Standardized or enhanced data Common or usage- specific data Transient data Relax control to enable self-service while avoiding a mess. Do not constrain access to one zone or to a single tool. Focus on visibility of data use, not control of data.
  • 104. Copyright Third Nature, Inc. This data architecture resolves rate of change problems Raw data in an immutable storage area Standardized or enhanced data Common or usage- specific data Transient data New data of unknown value, simple requests for new data can land here first, with little work by IT. More effort applied to management, slower. Optimized for specific uses / workloads. Generally the slowest change. Not fast vs slow: fast vs right Not flexibility vs control: flexibility vs repeatability Agile for structure change vs agile for questions / use
  • 105. Copyright Third Nature, Inc. SQLServer RDBMS Example: data environment, mid-size retailer Replication, batch ETL to collect data Web Application Standard retail / wholesale data warehouse model. Nightly batch model from stage, contains ~ 1 TB of data DW New requirement: Load all the online marketing and web activity for use in analytic models (not simple web analytics) Will add ~10 TB of data, continuous or hourly loads
  • 106. Copyright Third Nature, Inc. SQLServer RDBMS Example: data environment, mid-size retailer Replication, batch ETL to collect data Hadoop Web Application Persistent store, data warehouse on same DB in different schemas DW Log file fetch & load for clickstream, summaries sent to reporting env Discovery work usually done in Hadoop, sometimes in database.
  • 107. Copyright Third Nature, Inc. SQLServer This data architecture uses the 3 zone pattern Hadoop Web Application RDBMS DW Raw data in an immutable storage area Standardized or enhanced data Common or usage- specific data Transient data All reporting data and analytics output is sent to the DW where it is can be accessed. Cleaned, summarized or derived data is stored in DB Raw data is stored in two places: relational and small sets in DB, log collection in Hadoop
  • 108. Copyright Third Nature, Inc. The concept of a zone is not a physical system. It’s data architecture Apps DW The biggest decision is to separate all data collection from the data integration from consumption. Physical system/technology overlays are separate, depend on the specific use cases and needs of the organization. Dist FS RDBMSs Decompress & process device logs RDBMS RDBMS multiple Discovery platform
  • 109. Copyright Third Nature, Inc. Data has to be moved, standardized, tracked There is a lot of data policy and governance to think about Collect Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use CRM User reg Orders Leads Cust Cust Cust Cust Master DW Mktg Campaign analysis Rec engine
  • 110. Copyright Third Nature, Inc. What about the technology? Do I need an <X>?
  • 111. Copyright Third Nature, Inc. TANSTAAFL When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  • 112. You need to see the full context in order to build a platform. The analytics environment is more than just silos of activity transactions data warehouse applications analytic models events BI discovery & analysisdiscovery platform collection access discovery delivery modeling Data scientists & Engineers Stakeholders Users, Customers Machines There are multiple audiences involved. It’s rare for anyone to see the full picture.
  • 113. Data scientists & Engineers Stakeholders Users, Customers Machines Environment and workflows: Modeling / Analysis Loop transactions data warehouse applications analytic models events BI discovery & analysisdiscovery platform collection access discovery delivery modeling This is the view of the people who build (and deploy) analytic models.
  • 114. Data scientists & Engineers Stakeholders Users, Customers Machines Environment and workflows: Execution Loop transactions data warehouse applications analytic models events BI discovery & analysisdiscovery platform collection access discovery delivery modeling This is the view of the people who use the results or interact with the models.
  • 115. Data scientists & Engineers Stakeholders Users, Customers Machines Environment and workflows: Monitoring and Managing transactions data warehouse applications analytic models events BI discovery & analysisdiscovery platform collection access discovery delivery modeling This is the view of the people who manage the business goals (and the models – as analytics is operationalized, it becomes the same thing).
  • 116. 125 Blended Architectures Are a Requirement, Not an Option Data Warehouse + Data Lake On Premise + Cloud RDBMS + S3 + HDFS Commercial + Open Source You can’t just buy one thing platform from one vendor. We aren’t building a death star. Each of the zones is likely to have products specific to that zone’s usage. The uses differ, the people using them differ, shouldn’t the tools should differ too?
  • 117. © Third Nature Inc. Manage your data (or it will manage you) Data management is where developers are weakest. Modern engineering practices are where data management is weakest. You need to bridge these groups and practices in the organization if you want to do meaningful work with data. Remember Conway’s Law when you build.
  • 118. 1.It’s not about storing data, it’s about using it 2.Use drives architecture. Understand the uses, what you are designing for, to drive decisions. 3.Put data at the center, not technology. Don’t let the tech define what you can do or how you do it. 4.The death star is not the answer. The data model is not a flat earth. You are not building a monolith. 5.Know your history. Avoiding wheel reinvention saves time, money, careers. Conclusion
  • 119. © Third Nature Inc. “Now is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.” Winston Churchill
  • 120. Mark Madsen is the global head of architecture at Think Big Analytics, Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, chairs several conferences, and is on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter. About the Presenter
  • 121. Todd Walter Chief Technologist - Teradata • Chief Technologist for Teradata • A pragmatic visionary, Walter helps business leaders, analysts and technologists better understand all of the astonishing possibilities of big data and analytics • Works with organizations of all sizes and levels of experience at the leading edge of adopting big data, data warehouse and analytics technologies • With Teradata for more than 30 years and served for more than 10 years as CTO of Teradata Labs, contributing significantly to Teradata’s unique design features and functionality • Holds more than a dozen Teradata patents and is a Teradata Fellow in recognition of his long record of technical innovation and contribution to the company