1. Filling the Data Lake
June 29, 2016
Chuck Yarbrough
Sr Director, Solutions Marketing and Management
@cyarbrough
Mark Burnette
Enterprise Sales Engineer
@MarkCBurnette
2. © 2015, Pentaho. All rights reserved. pentaho.com. Worldwide +1 (866) 660-7555
Emerging Big Data Use Cases
Improve operational effectiveness
§ Machines/sensors: predict failures, network attacks
§ Financial risk management: reduce fraud, increase security
§ Reduce data warehouse cost
Improve customer experience
§ Build a 360° view to fully understand and serve the customer
§ Drive personalized and adjusted interaction
§ Use automated recommendations logic
Drive incremental revenue
§ Predict customer behavior across all channels
§ Understand and monetize customer behavior
§ Begin to monetize data as a service
3.
Spectrum of Big Data Use Cases
[Figure: use cases plotted by Use Case Complexity against Business Impact. Stage labels: Entry, Transform, Advanced, Optimize.]
Use cases shown: Data Warehouse Optimization; Streamlined Data Refinery; Big Data Exploration; Customer 360 Degree View; Harnessing Machine & Sensor Data; Next Generation Applications; Internal Big Data as a Service; On-Demand Big Data Blending; Big Data Predictive Analytics; Monetize My Data.
Labels repeated on the slide: Data Warehouse Optimization; Streamlined Data Refinery; 360 Degree View; Big Data Onboarding; Filling the Data Lake.
5.
Managing and Automating the Pipeline
[Diagram labels: Data Pipeline; Data Engineering; Data Preparation; Analytics; Data Lake.]
Pipeline management functions: Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation.
6.
The Data Swamp
7.
The Data Lake
8.
Does Hadoop Have to be Hard?
Things that can help ease the pain:
§ Empower team members to integrate and process Hadoop data
§ Establish a modern data onboarding process that is flexible and scalable
§ Deliver governed analytic insights for large production user bases
9.
Proper Care and Feeding of the Data Lake
11.
More Data, More Problems
Even with good integration tools, major data onboarding
projects can be painful:
User Challenges
§ Repetitive manual design
§ Very time-consuming
§ Difficult to maintain
Business Challenges
§ Takes too long
§ Business deadlines at risk
§ Opportunity cost
12.
More Data, More Problems
How do we effectively scale data pipelines to accommodate exploding data sources, volumes, and complexity?
Have you ever had the pleasure of…
§ Migrating hundreds of sources between systems?
§ Enabling business users to onboard a variety of data themselves?
§ Ingesting hundreds of changing data sources into Hadoop?
13.
More Data, More Problems
Modern data onboarding is more than just “dumping data”. It includes:
§ Managing a changing array of data sources
§ Establishing repeatable processes at scale
§ Maintaining control and governance
14.
Data Onboarding: Filling the Data Lake
[Diagram: Disparate Data Sources (CSV, RDBMS) -> Ingest Procedures (Integration Processes, Transformations) -> Hadoop (AVRO).]
15.
Data Onboarding at Scale
[Diagram: Disparate Data Sources (multiple CSV and RDBMS sources) -> Ingest Procedures (Integration Processes, Transformations) -> Hadoop (AVRO).]
16.
Filling the Data Lake
A Modern Data Onboarding Blueprint
§ Streamline data ingest from a wide variety of source data
§ Reduce dependence on hard-coded data movement procedures
§ Simplify regular data movement at scale into the Data Lake
18.
Dynamic ELT Ingest Templates
[Diagram: Disparate Data Sources (multiple CSV and RDBMS sources) -> Dynamic Integration Processes and Dynamic Transformations -> Hadoop.]
Pass metadata in at run time to generate jobs on the fly (metadata injection).
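The idea of passing metadata in at run time can be sketched in a few lines of Python. This is only an illustration of the concept, not Pentaho's implementation or API: the names IngestTemplate and inject, the step names, and the file paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestTemplate:
    """A reusable job skeleton; every source-specific detail is left open."""
    name: str
    steps: tuple = ("read_source", "map_fields", "write_target")


def inject(template, metadata):
    """Combine one generic template with per-source metadata into a concrete job."""
    missing = {"source", "target", "fields"} - metadata.keys()
    if missing:
        raise ValueError(f"metadata is missing keys: {sorted(missing)}")
    return {
        "template": template.name,
        "steps": list(template.steps),
        "source": metadata["source"],
        "target": metadata["target"],
        "fields": list(metadata["fields"]),
    }


# Hypothetical metadata for two sources; only the metadata differs.
template = IngestTemplate(name="csv_to_avro")
sources = [
    {"source": "/landing/orders.csv", "target": "/lake/orders.avro",
     "fields": ["id", "total"]},
    {"source": "/landing/users.csv", "target": "/lake/users.avro",
     "fields": ["id", "email", "country"]},
]

# Jobs are generated on the fly, one per metadata record.
jobs = [inject(template, m) for m in sources]
for job in jobs:
    print(job["source"], "->", job["target"], job["fields"])
```

Adding a new source in this sketch means adding one metadata record rather than designing another job by hand.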
19.
Templated workflows
[Diagram: Disparate Data Sources (multiple CSV and RDBMS sources) -> Dynamic Integration Processes and Dynamic Transformations -> Hadoop.]
Templates:
§ RDBMS -> AVRO Template
§ CSV -> AVRO Template
§ CSV -> HDFS Template
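One way to picture templated workflows is a small registry that maps a source type and target format to the template that handles it. The sketch below assumes exactly the three templates named on the slide; the function names and the example sources are hypothetical.

```python
# The three templates named on the slide, keyed by (source type, target format).
TEMPLATES = {
    ("rdbms", "avro"): "RDBMS -> AVRO Template",
    ("csv", "avro"): "CSV -> AVRO Template",
    ("csv", "hdfs"): "CSV -> HDFS Template",
}


def pick_template(source_type, target_format):
    """Return the template for a source/target pair, or fail loudly."""
    key = (source_type.lower(), target_format.lower())
    try:
        return TEMPLATES[key]
    except KeyError:
        raise LookupError(
            f"no template registered for {key[0]} -> {key[1]}"
        ) from None


def plan(sources):
    """Group source names under the template that would process them."""
    grouped = {}
    for src in sources:
        template = pick_template(src["type"], src["target"])
        grouped.setdefault(template, []).append(src["name"])
    return grouped


# Hypothetical sources: many inputs, few templates.
sources = [
    {"name": "crm.customers", "type": "RDBMS", "target": "AVRO"},
    {"name": "erp.invoices", "type": "RDBMS", "target": "AVRO"},
    {"name": "clicks.csv", "type": "CSV", "target": "AVRO"},
    {"name": "raw_logs.csv", "type": "CSV", "target": "HDFS"},
]

for template, names in plan(sources).items():
    print(template, ":", ", ".join(names))
```

The number of templates stays fixed while the list of sources grows, which is the point the slide makes.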
20.
Variety – different metadata, one template
[Diagram: Disparate Data Sources -> Dynamic Integration Processes and Dynamic Transformations (CSV -> AVRO Template) -> Hadoop.]
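As a rough illustration of "different metadata, one template", the sketch below uses a single function to build Avro record schemas from two unrelated field lists. The type mapping, the choice to make every field nullable, and the example fields are assumptions made for this example. Only the schema JSON is produced; writing actual Avro files would need an Avro library, which is not shown.

```python
import json

# Assumed mapping from generic field types to Avro primitive types.
AVRO_TYPES = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def avro_schema(record_name, fields):
    """Build an Avro record schema (as a dict) from field descriptions."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            # Every field is nullable here, which is a simplifying assumption.
            {
                "name": f["name"],
                "type": ["null", AVRO_TYPES[f["type"]]],
                "default": None,
            }
            for f in fields
        ],
    }


# Two hypothetical CSV sources with different metadata.
orders_meta = [
    {"name": "order_id", "type": "integer"},
    {"name": "total", "type": "number"},
]
users_meta = [
    {"name": "email", "type": "string"},
    {"name": "active", "type": "boolean"},
    {"name": "age", "type": "integer"},
]

# The same "template" (one function) serves both.
orders_schema = avro_schema("orders", orders_meta)
users_schema = avro_schema("users", users_meta)
print(json.dumps(orders_schema, indent=2))
```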
21.
Key Takeaway: Metadata Injection
Managing ETL and ELT procedures
Managing metadata
23.
RDBMS Ingestion
Automated Metadata Extraction
Extract table and store in AVRO
§ Database connection details
§ Table(s)
§ Field names (if available)
§ Data types
§ String length
§ Mask for numbers and dates
§ …
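A minimal sketch of automated metadata extraction from a relational source, using Python's built-in sqlite3 module as a stand-in for whatever RDBMS is being onboarded. The function names and the example table are hypothetical, and other databases expose their catalogs differently. It covers the table list, field names, data types, and string length from the list above; masks for numbers and dates are not handled.

```python
import re
import sqlite3


def list_tables(conn):
    """Return the names of the tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def extract_table_metadata(conn, table):
    """Read field names, declared types, and lengths from the catalog."""
    quoted = '"' + table.replace('"', '""') + '"'
    fields = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    for _cid, name, declared, _notnull, _default, _pk in conn.execute(
        f"PRAGMA table_info({quoted})"
    ):
        match = re.fullmatch(r"\s*(\w+)\s*(?:\((\d+)\))?\s*", declared)
        fields.append({
            "name": name,
            "data_type": match.group(1).upper() if match else declared,
            "length": int(match.group(2)) if match and match.group(2) else None,
        })
    return {"table": table, "fields": fields}


# Hypothetical source table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name VARCHAR(40) NOT NULL, balance REAL)"
)

metadata = extract_table_metadata(conn, list_tables(conn)[0])
for field in metadata["fields"]:
    print(field)
```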
24.
CSV Ingestion
Automated Metadata Extraction
Option 1: Ingest RAW files into HDFS (no parsing)
§ Path to CSVs
Option 2: Parse and store in AVRO
§ Path to CSVs
§ Delimiter
§ Field names (if available)
§ Data types
§ String length
§ Mask for numbers and dates
§ …
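For the parsing option, a rough sketch of how delimiter, field names, data types, and string lengths might be derived from a CSV sample. The heuristics are deliberately simple and are assumptions of this example: the first line is taken to be a header, the delimiter is whichever candidate character occurs most often in that header, and every row is assumed to have the same number of columns. Masks for numbers and dates are not handled.

```python
import csv
import io


def infer_type(values):
    """Return 'integer', 'number', or 'string' for a column of text values."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


def extract_csv_metadata(text, candidates=",;\t|"):
    """Derive delimiter and per-field metadata from CSV text with a header."""
    header = text.splitlines()[0]
    # Pick the candidate delimiter that occurs most often in the header line.
    delimiter = max(candidates, key=header.count)
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    names, data = rows[0], rows[1:]
    fields = []
    for index, name in enumerate(names):
        column = [row[index] for row in data]
        fields.append({
            "name": name,
            "type": infer_type(column),
            "max_length": max((len(value) for value in column), default=0),
        })
    return {"delimiter": delimiter, "fields": fields}


# Hypothetical sample standing in for the first lines of a CSV file.
sample = "id;name;amount\n1;alice;10.5\n2;bob;7\n"
csv_metadata = extract_csv_metadata(sample)
print(csv_metadata)
```

In practice the inference would run on a sample of each file, and the result would be reviewed before it drives an ingest job.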
26.
Key Takeaway: Automated Metadata Extraction
ELT development: DAYS
Provisioning: MINUTES
28.
Key Takeaways: Data Onboarding Blueprint
§ Template-based data integration: manage metadata vs. ELT procedures
§ Automated metadata extraction: provide minimum required configuration
§ Reduce risk: maintain an organized, standardized, and clean data lake
29.
What Next?
§ Learn more about Big Data Onboarding at Pentaho.com
§ Download Pentaho Platform at Pentaho.com