Garbage in, garbage out. This truism is neither new nor unique to Big Data, but it has become significantly more impactful as organizations move away from traditional schema-based approaches toward more flexible and dynamic file-system approaches. Many companies are finding that early-stage successes rapidly become mid-term failures as data poured into Hadoop can no longer be managed and marshaled effectively, creating a cycle of reduced effectiveness and increased cost. This session covers how Big Data changes the way information needs to be managed, how traditional approaches like MDM need to change, and how new technologies and methods such as machine learning and business meta-data become crucial to the long-term viability of a heterogeneous Big Data infrastructure.
Presented at Informatica World 2016 by Steve Jones, Global VP Big Data, Capgemini Insights & Data
12. #INFA16
Business Meta-Data – recognizing how people work
[Diagram labels: Marketing; Data Lake; Debt Recovery; High Value Customers; Explore the Lake; Business Meta-Data; Purpose: Debt Recovery; Verify Purpose; Exclude; Purpose: Mktg Exclude]
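The idea behind this slide is that business meta-data records the purpose data may be used for, and that the purpose is verified before data reaches a consumer. One way to sketch it is shown below. This is a minimal illustration under stated assumptions, not the presenter's implementation: the `Dataset` class, its `allowed_purposes` and `excluded_purposes` fields, the `select_for_purpose` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data set in the lake plus its business meta-data (hypothetical model)."""
    name: str
    # Purposes the business has approved for this data set (assumed field).
    allowed_purposes: set = field(default_factory=set)
    # Purposes explicitly ruled out, in the spirit of "Purpose: Mktg Exclude".
    excluded_purposes: set = field(default_factory=set)


def select_for_purpose(datasets, purpose):
    """Return the names of data sets that may be used for the given purpose."""
    usable = []
    for ds in datasets:
        if purpose in ds.excluded_purposes:
            continue  # an explicit exclusion always wins
        if purpose in ds.allowed_purposes:
            usable.append(ds.name)
    return usable


# Invented sample data for illustration only.
lake = [
    Dataset("debt_recovery_cases", {"debt_recovery"}, {"marketing"}),
    Dataset("high_value_customers", {"marketing", "debt_recovery"}),
]

marketing_view = select_for_purpose(lake, "marketing")
print(marketing_view)
```

In this sketch a marketing user exploring the lake would see only `high_value_customers`; the debt recovery data is filtered out by its meta-data rather than by a schema.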
13. #INFA16
So what do we need?
Govern where it matters
• Focus on Meta-data, MDM and RDM
• Enforce only when sharing
• Treat Corporate as aggregation of Local
Encourage local requirements
• Let the business decide what they need
• Build from the bottom
• Enable traceability to source
• Disposable data views
Distill on demand
• Select only what you want
• Business friendly tooling
• Re-usable information maps
• Rapid change cycle
Store everything
• Store everything ‘as is’
• Include structured and unstructured data
• Store it cheaply
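"Store everything as is", "distill on demand", and "enable traceability to source" can be illustrated together with a small sketch. It assumes raw records are kept as JSON strings and that a view is a throwaway list of dictionaries. The `raw_store` data, the `distill` function, and the `_source_index` field are hypothetical names chosen for this example.

```python
import json

# Raw records are stored exactly as received (assumption: one JSON string each).
raw_store = [
    '{"id": 1, "region": "EU", "amount": "120.50", "note": "free text"}',
    '{"id": 2, "region": "US", "amount": "80.00"}',
]


def distill(raw_records, fields):
    """Build a disposable view containing only the requested fields.

    Each row keeps the index of its raw record so that it can be traced
    back to the source. The raw store itself is never modified.
    """
    view = []
    for index, line in enumerate(raw_records):
        record = json.loads(line)
        row = {name: record.get(name) for name in fields}
        row["_source_index"] = index
        view.append(row)
    return view


# Select only what you want; the view can be thrown away and rebuilt at will.
sales_view = distill(raw_store, ["id", "amount"])
print(sales_view)
```

Because the view is derived on demand, changing the requested fields is a cheap re-run rather than a schema migration.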
14. #INFA16
Why Data Lakes work
[Diagram labels: Archives; Operational systems (Case Mgmt, Operations, Scheduling); Data Explosion; Geographic Data; One Place for Everything; Supply Chain Optimization; Fraud Detection; Next Best Action; Marketing; R; MADLib; SAS]
18. #INFA16
Why the X-Ref matters
[Diagram labels: Local view; Corporate view; Corporate standards; Master data and reference data; Information governance; Customer x-ref; Customer MDM; BU1, BU2, BU3, each with Customer, Invoices, Orders; corporate Customer, Invoices, Orders]
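To make the role of the customer x-ref concrete, the sketch below links invoices held under each business unit's local customer key to a single corporate customer identifier. It is an illustration only: the `customer_xref` mapping, the identifiers, and the invoice amounts are invented, and a real MDM product would maintain the cross-reference rather than a hand-written dictionary.

```python
# Hypothetical cross-reference: (business unit, local customer id) -> corporate master id.
customer_xref = {
    ("BU1", "C-17"): "CUST-001",
    ("BU2", "9001"): "CUST-001",
    ("BU3", "ab-4"): "CUST-002",
}

# Invoices stay in each business unit's local format and keys (invented data).
local_invoices = [
    {"bu": "BU1", "customer": "C-17", "amount": 100.0},
    {"bu": "BU2", "customer": "9001", "amount": 250.0},
    {"bu": "BU3", "customer": "ab-4", "amount": 40.0},
]


def corporate_totals(invoices, xref):
    """Sum invoice amounts per corporate customer using the x-ref."""
    totals = {}
    for inv in invoices:
        master_id = xref.get((inv["bu"], inv["customer"]))
        if master_id is None:
            continue  # records without a link cannot join the corporate view
        totals[master_id] = totals.get(master_id, 0.0) + inv["amount"]
    return totals


print(corporate_totals(local_invoices, customer_xref))
```

The local views are left untouched; only the mapping is shared, which is what lets the corporate view exist without forcing every business unit onto one customer key.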
19. #INFA16
The Quality mess
[Diagram labels: Internal systems (CRM, Back Office, Trading); Market Data; ETL (x3); Trading (x2); Reports (x4)]
20. #INFA16
The current situation – add a new System
[Diagram labels: Internal systems (CRM, Back Office, Trading); Market Data; ETL (x3); Trading (x2); Reports (x4); added for the new system: ETL, Reports]
21. #INFA16
The current situation – Where does Data Quality go?
[Diagram labels: Internal systems (CRM, Back Office, Trading); Market Data; ETL (x4); Trading (x2); Reports (x5); Fix at Source; Patch (x5); DQ (x4)]
22. #INFA16
Let's try another way – Fix at Source is still the ideal but
[Diagram labels: Internal systems (CRM, Back Office, Trading); Market Data; Data Lake; Data Lake + Quality; Data Quality & MDM; Trading (x2); Reports (x4)]
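The contrast with the previous slides is that quality rules are applied once, in a shared "Data Lake + Quality" layer, instead of being patched into every ETL feed. A minimal sketch of that arrangement follows. The cleaning rules, the `apply_quality` and `customers_by_country` functions, and the sample records are hypothetical and stand in for whatever data quality tooling is actually used.

```python
from collections import Counter

# Records land in the lake as received (invented sample data).
raw_lake = [
    {"name": "  alice SMITH ", "country": "gb"},
    {"name": "Bob Jones", "country": " GB"},
    {"name": "carla diaz", "country": "es "},
]


def apply_quality(record):
    """One shared rule set, applied once (hypothetical rules)."""
    return {
        "name": record["name"].strip().title(),
        "country": record["country"].strip().upper(),
    }


# The quality layer: cleaned once, then read by every consumer.
quality_layer = [apply_quality(r) for r in raw_lake]


def customers_by_country(rows):
    """Example report that relies on the shared quality layer."""
    return dict(Counter(row["country"] for row in rows))


print(customers_by_country(quality_layer))
```

Adding another report in this sketch means adding another reader of `quality_layer`, not another copy of the cleaning logic.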
23. #INFA16
Then let's go further
[Diagram labels: Internal systems (CRM, Back Office, Trading); Market Data; Data Lake; Data Lake + Quality; Data Quality & MDM; Data Science; Trading (x2); Reports (x4)]
24. #INFA16
Why the Business Data Lake succeeds
[Diagram labels: Division1 to Division4, each with Sales, Finance, Supply chain, Marketing, R&D, Personal KPIs and Divisional KPIs; Corporate with Corporate KPIs; callout: Now agree where it counts]
25. #INFA16
Help me ingestion – you’re my only hope
[Architecture diagram labels: Sources; Ingestion tier (Real-time ingestion, Micro batch ingestion, Batch ingestion); Unified operations tier (System monitoring, System management); Unified data management tier (Data mgmt. services, MDM, RDM, Audit and policy mgmt.); Processing tier; Workflow management; Distillation tier; HDFS storage (Unstructured and structured data); In-memory; MPP database; Real time, Micro batch, Mega batch; SQL, NoSQL, SparkSQL; Query interfaces; Insights tier (Real-time insights, Interactive insights, Batch insights); Action tier]
Ingestion: The first and LAST place you can have control
Distillation: Think ‘Excel’ not EDW
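If ingestion is the first and last place where control is possible, then it is the natural point to insist on meta-data before anything lands in storage. The sketch below shows one way that could look. It is an assumption-laden illustration: the `ingest` function, the catalog structure, and the required `source` and `owner` fields are hypothetical, and a real lake would write the payload to HDFS or similar rather than just cataloguing it.

```python
import hashlib
from datetime import datetime, timezone


def ingest(payload: bytes, source: str, owner: str, catalog: list) -> dict:
    """Record meta-data at the moment data enters the lake.

    The payload is accepted 'as is'; only the meta-data is mandatory.
    """
    if not source or not owner:
        raise ValueError("source and owner are required at ingestion")
    entry = {
        "source": source,
        "owner": owner,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
    catalog.append(entry)
    return entry


catalog = []
entry = ingest(b"a,b\n1,2\n", source="crm", owner="sales-ops", catalog=catalog)
print(entry["source"], entry["size_bytes"])
```

Note that the check is on the meta-data, not on the shape of the data itself, which keeps the "store everything as is" principle intact.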
26. #INFA16
If your business isn’t trivial – you’ll end up with multiple lakes
[Diagram labels: Client Data Lake; Partner Data Lake; Geo specific (x3); Corporate Data Lake; Analytics; Results; Distillation; Summary]
27. #INFA16
Federation requires THREE things
The cross-reference
• If you can’t link data sets, it’s impossible
Rationalization
• The lesson of Google & Amazon is less choice, not more
Business Meta-Data
• Technical Meta-Data is no longer enough
28. #INFA16
Building a Lake and the challenge of federation
[Diagram labels: Client specific (x2); Geo specific (x3); Corporate Data Lake]
To do this you need to standardize technology
29. #INFA16
Business Meta-Data and the x-ref – the two things that must be shared
[Diagram labels: Client specific (x2); Geo specific (x3); Corporate Data Lake; Business Meta-Data & x-ref]