Filling the Data Lake
June 29, 2016
Chuck Yarbrough
Sr Director, Solutions Marketing and Management
@cyarbrough
Mark Burnette
Enterprise Sales Engineer
@MarkCBurnette
© 2015, Pentaho. All rights reserved. pentaho.com. Worldwide +1 (866) 660-7555
Emerging Big Data Use Cases
Improve operational effectiveness
• Machines/sensors: predict failures, network attacks
• Financial risk management: reduce fraud, increase security
• Reduce data warehouse cost
Improve customer experience
• Build a 360° view to fully understand and serve the customer
• Drive personalized and adjusted interaction
• Use automated recommendations logic
Drive incremental revenue
• Predict customer behavior across all channels
• Understand and monetize customer behavior
• Begin to monetize data as a service
Spectrum of Big Data Use Cases
Chart axes: Use Case Complexity, Business Impact
Categories: Entry, Transform, Advanced, Optimize
Use cases:
• Data Warehouse Optimization
• Streamlined Data Refinery
• Big Data Exploration
• Customer 360 Degree View
• Harnessing Machine & Sensor Data
• Next Generation Applications
• Internal Big Data as a Service
• On-Demand Big Data Blending
• Big Data Predictive Analytics
• Monetize My Data
• Big Data Onboarding
Filling the Data Lake
What Does Pentaho Do?
Managing and Automating the Pipeline
Data Pipeline: Data Engineering, Data Preparation, Analytics
Pipeline management: Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation
Data Lake
The Data Swamp
The Data Lake
Does Hadoop Have to be Hard?
Things that can help ease the pain:
• Empower team members to integrate and process Hadoop data
• Establish a modern data onboarding process that is flexible and scalable
• Deliver governed analytic insights for large production user bases
Proper Care and Feeding of the Data Lake
Data Onboarding Challenges
More Data, More Problems
Even with good integration tools, major data onboarding projects can be painful:
User Challenges
• Repetitive manual design
• Very time-consuming
• Difficult to maintain
Business Challenges
• Takes too long
• Business deadlines at risk
• Opportunity cost
More Data, More Problems
How do we effectively scale data pipelines to accommodate exploding data sources, volumes, and complexity?
Have you ever had the pleasure of…
• Migrating hundreds of sources between systems?
• Enabling business users to onboard a variety of data themselves?
• Ingesting hundreds of changing data sources into Hadoop?
More Data, More Problems
Modern data onboarding is more than just “dumping data”. It includes:
• Managing a changing array of data sources
• Establishing repeatable processes at scale
• Maintaining control and governance
Data Onboarding: Filling the Data Lake
Diagram: Disparate Data Sources (CSV, RDBMS) -> Integration Processes and Transformations (Ingest Procedures) -> Hadoop (AVRO)
Data Onboarding at Scale
Diagram: many Disparate Data Sources (multiple CSV and RDBMS) -> Integration Processes and Transformations (Ingest Procedures) -> Hadoop (AVRO)
Filling the Data Lake
A Modern Data Onboarding Blueprint
• Streamline data ingest from a wide variety of source data
• Reduce dependence on hard-coded data movement procedures
• Simplify regular data movement at scale into the Data Lake
Template-based Approach
Dynamic ELT
Diagram: Disparate Data Sources (multiple CSV and RDBMS) -> Dynamic Integration Processes and Dynamic Transformations (Ingest Templates) -> Hadoop
Pass metadata in at run time to generate jobs on the fly (metadata injection)
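The idea of passing metadata in at run time can be illustrated with a minimal Python sketch. It is not Pentaho's implementation or API: the template layout, the `inject_metadata` helper, the connection string, and the target path are all hypothetical, chosen only to show a fixed template being turned into a concrete job definition.

```python
import copy
import json

# Hypothetical ingest template: the step layout is fixed, the details are not.
INGEST_TEMPLATE = {
    "name": None,
    "steps": [
        {"type": "read", "source": None, "fields": None},
        {"type": "write", "format": "avro", "target": None},
    ],
}


def inject_metadata(template, metadata):
    """Return a concrete job definition by filling the template at run time."""
    job = copy.deepcopy(template)  # never mutate the shared template
    job["name"] = f"ingest_{metadata['name']}"
    read_step, write_step = job["steps"]
    read_step["source"] = metadata["source"]
    read_step["fields"] = list(metadata["fields"])
    write_step["target"] = f"{metadata['target_dir']}/{metadata['name']}.avro"
    return job


orders_metadata = {
    "name": "orders",
    "source": "jdbc:postgresql://db.example.com/sales",  # made-up connection string
    "fields": ["order_id", "customer_id", "total"],
    "target_dir": "/datalake/raw",  # made-up data lake path
}

orders_job = inject_metadata(INGEST_TEMPLATE, orders_metadata)
print(json.dumps(orders_job, indent=2))
```

Onboarding another source in this sketch means supplying another metadata dictionary; the template itself stays untouched.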
Templated workflows
Diagram: Disparate Data Sources (multiple CSV and RDBMS) -> Dynamic Integration Processes and Dynamic Transformations -> Hadoop
Templates:
• RDBMS -> AVRO Template
• CSV -> AVRO Template
• CSV -> HDFS Template
Variety – different metadata, one template
Diagram: Disparate Data Sources -> Dynamic Integration Processes and Dynamic Transformations (CSV -> AVRO Template) -> Hadoop
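A small Python sketch of "different metadata, one template": a single function builds an Avro record schema for any source, driven only by field metadata. The source names, field lists, and the `AVRO_TYPES` mapping are assumptions made for illustration; only the schema layout (a JSON record with named, typed fields) follows the Avro format.

```python
import json

# Assumed mapping from source type names to Avro primitive types.
AVRO_TYPES = {"integer": "long", "decimal": "double", "string": "string"}


def avro_schema(record_name, fields):
    """Build an Avro record schema (as a dict) from (name, type) pairs."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            # Each field is a nullable union so missing values can be stored.
            {"name": name, "type": ["null", AVRO_TYPES[kind]], "default": None}
            for name, kind in fields
        ],
    }


# Two sources with different layouts, described only by metadata.
sources = {
    "customers": [("id", "integer"), ("email", "string")],
    "payments": [("id", "integer"), ("amount", "decimal"), ("currency", "string")],
}

schemas = {name: avro_schema(name, fields) for name, fields in sources.items()}
print(json.dumps(schemas["payments"], indent=2))
```

Both schemas come from the same function; only the metadata passed in differs.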
Key Takeaway
Metadata Injection
Managing ETL and ELT procedures vs. managing metadata
Metadata Acquisition
RDBMS Ingestion
Automated Metadata Extraction
Extract table and store in AVRO
• Database connection details
• Table(s)
• Field names (if available)
• Data types
• String length
• Mask for numbers and dates
• …
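As a rough illustration of automated metadata extraction from a database, the Python sketch below reads field names and declared types from a table's catalog. It uses an in-memory SQLite database as a stand-in for the RDBMS; the `orders` table and the `extract_table_metadata` helper are made up for the example, and other databases expose this information through their own catalogs.

```python
import sqlite3


def extract_table_metadata(conn, table):
    """Read column names and declared types from SQLite's catalog."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [{"name": r[1], "type": r[2], "nullable": not r[3]} for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER NOT NULL, customer VARCHAR(40), total REAL)"
)
metadata = extract_table_metadata(conn, "orders")
for column in metadata:
    print(column)
conn.close()
```

The declared type carries the string length where the table defines one (here `VARCHAR(40)`), so it can be collected without anyone typing it in.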
CSV Ingestion
Option 1: Ingest RAW files into HDFS (no parsing)
• Path to CSVs
Option 2: Parse and store in AVRO
• Path to CSVs
• Delimiter
• Field names (if available)
• Data types
• String length
• Mask for numbers and dates
• …
Automated Metadata Extraction
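For the parse-and-store option, the Python sketch below shows one naive way the delimiter, field names, data types, and string lengths might be derived from a CSV sample. The `extract_csv_metadata` and `infer_type` helpers, the candidate delimiters, and the sample data are all assumptions for illustration; it assumes a header row is present and does not handle masks for numbers and dates.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that parses every sample value."""
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "string"


def extract_csv_metadata(text, candidates=",;|\t"):
    """Guess the delimiter, then derive field names, types, and string lengths."""
    header = text.splitlines()[0]
    delimiter = max(candidates, key=header.count)  # naive: most frequent candidate
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    names, data = rows[0], rows[1:]
    fields = []
    for index, name in enumerate(names):
        column = [row[index] for row in data]
        fields.append({
            "name": name,
            "type": infer_type(column),
            "length": max((len(v) for v in column), default=0),
        })
    return {"delimiter": delimiter, "fields": fields}


sample = "id;name;amount\n1;alice;10.5\n2;bob;20.25\n"
csv_metadata = extract_csv_metadata(sample)
print(csv_metadata)
```

The resulting dictionary is the kind of metadata that could be handed to an ingest template instead of being configured by hand.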
Demonstration
Key Takeaway
ELT development: DAYS
Provisioning: MINUTES
Automated Metadata Extraction
Summary
Key Takeaways: Data Onboarding Blueprint
• Template-based Data Integration: manage metadata vs. ELT procedures
• Automated Metadata Extraction: provide minimum required configuration
• Reduce Risk: maintain an organized, standardized, and clean data lake
What Next?
• Learn more about Big Data Onboarding at Pentaho.com
• Download Pentaho Platform at Pentaho.com
Q&A
Thank You

Filling the Data Lake

  • 1. Filling the Data Lake. June 29, 2016. Chuck Yarbrough, Sr Director, Solutions Marketing and Management (@cyarbrough); Mark Burnette, Enterprise Sales Engineer (@MarkCBurnette).
  • 2. Emerging Big Data Use Cases. Improve operational effectiveness: machines/sensors (predict failures, network attacks); financial risk management (reduce fraud, increase security); reduce data warehouse cost. Improve customer experience: build a 360° view to fully understand and serve the customer; drive personalized and adjusted interaction; use automated recommendations logic. Drive incremental revenue: predict customer behavior across all channels; understand and monetize customer behavior; begin to monetize data as a service. Footer: © 2015, Pentaho. All rights reserved. pentaho.com. Worldwide +1 (866) 660-7555.
  • 3. Spectrum of Big Data Use Cases. [Chart. Axis labels: Use Case Complexity, Business Impact. Scale labels: Entry, Transform, Advanced, Optimize.] Use cases shown: Data Warehouse Optimization; Streamlined Data Refinery; Big Data Exploration; Customer 360 Degree View; Harnessing Machine & Sensor Data; Next Generation Applications; Internal Big Data as a Service; On-Demand Big Data Blending; Big Data Predictive Analytics; Monetize My Data. Overlay labels: Data Warehouse Optimization; Streamlined Data Refinery; 360 Degree View; Big Data Onboarding; Filling the Data Lake.
  • 5. Managing and Automating the Pipeline. [Diagram labels: Data Pipeline; Data Lake; stages Data Engineering, Data Preparation, Analytics; capabilities Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation.]
  • 6. The Data Swamp
  • 7. The Data Lake
  • 8. Does Hadoop Have to be Hard? Things that can help ease the pain: empower team members to integrate and process Hadoop data; establish a modern data onboarding process that is flexible and scalable; deliver governed analytic insights for large production user bases.
  • 9. Proper Care and Feeding of the Data Lake
  • 11. More Data, More Problems. Even with good integration tools, major data onboarding projects can be painful. User challenges: repetitive manual design; very time-consuming; difficult to maintain. Business challenges: takes too long; business deadlines at risk; opportunity cost.
  • 12. More Data, More Problems. How do we effectively scale data pipelines to accommodate exploding data sources, volumes, and complexity? Have you ever had the pleasure of migrating hundreds of sources between systems? Enabling business users to onboard a variety of data themselves? Ingesting hundreds of changing data sources into Hadoop?
  • 13. More Data, More Problems. Modern data onboarding is more than just “dumping data”. It includes: managing a changing array of data sources; establishing repeatable processes at scale; maintaining control and governance.
  • 14. Data Onboarding: Filling the Data Lake. [Diagram: Disparate Data Sources (CSV, RDBMS) -> Ingest Procedures (Integration Processes, Transformations) -> Hadoop (AVRO).]
  • 15. Data Onboarding at Scale. [Diagram: many Disparate Data Sources (multiple CSV and RDBMS) -> Ingest Procedures (Integration Processes, Transformations) -> Hadoop (AVRO).]
  • 16. Filling the Data Lake: A Modern Data Onboarding Blueprint. Streamline data ingest from a wide variety of source data; reduce dependence on hard-coded data movement procedures; simplify regular data movement at scale into the Data Lake.
  • 18. Dynamic ELT Ingest Templates. Pass metadata in at run time to generate jobs on the fly (metadata injection). [Diagram: Disparate Data Sources (CSV, RDBMS) -> Dynamic Integration Processes, Dynamic Transformations -> Hadoop.]
  • 19. Templated workflows. Templates: RDBMS -> AVRO Template; CSV -> AVRO Template; CSV -> HDFS Template. [Diagram: Disparate Data Sources (CSV, RDBMS) -> Dynamic Integration Processes, Dynamic Transformations -> Hadoop.]
  • 20. Variety: different metadata, one template. [Diagram: Disparate Data Sources -> CSV -> AVRO Template (Dynamic Integration Processes, Dynamic Transformations) -> Hadoop.]
  • 21. Key Takeaway: Metadata Injection. Manage metadata rather than ETL and ELT procedures.
  • 23. RDBMS Ingestion: Automated Metadata Extraction. Extract table and store in AVRO. Metadata: database connection details; table(s); field names (if available); data types; string length; mask for numbers and dates; …
  • 24. CSV Ingestion: Automated Metadata Extraction. Option 1: ingest RAW files into HDFS (no parsing). Metadata: path to CSVs. Option 2: parse and store in AVRO. Metadata: path to CSVs; delimiter; field names (if available); data types; string length; mask for numbers and dates; …
  • 26. Key Takeaway: Automated Metadata Extraction. ELT development: DAYS. Provisioning: MINUTES.
  • 28. Key Takeaways: Data Onboarding Blueprint. Template-based Data Integration: manage metadata vs. ELT procedures. Automated Metadata Extraction: provide minimum required configuration. Reduce Risk: maintain an organized, standardized, and clean data lake.
  • 29. What Next? Learn more about Big Data Onboarding at Pentaho.com. Download Pentaho Platform at Pentaho.com.
  • 30. Q&A
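
The "different metadata, one template" idea from slides 18 to 24 can be illustrated with a small Python sketch. It is not Pentaho's metadata injection feature or API. The function names (`extract_metadata`, `run_template`), the delimiter heuristic, the type inference rules, and the sample data are all hypothetical. The output is an Avro-style schema dictionary plus in-memory records, not an actual Avro file.

```python
import csv
import io
import json


def infer_type(values):
    """Pick the narrowest Avro primitive type name that parses every value."""
    for avro_type, cast in (("long", int), ("double", float)):
        try:
            for value in values:
                cast(value)
        except ValueError:
            continue  # this type does not fit; try the next one
        return avro_type
    return "string"


def extract_metadata(name, text):
    """Derive delimiter, field names, and field types from CSV text.

    Assumes the first row is a header (hypothetical simplification).
    """
    header = text.splitlines()[0]
    # Naive guess: the candidate character that occurs most often in the header.
    delimiter = max(",;|\t", key=header.count)
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    names, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(names)
    return {
        "name": name,
        "delimiter": delimiter,
        "fields": [
            {"name": n, "type": infer_type(col)} for n, col in zip(names, columns)
        ],
    }


CASTS = {"long": int, "double": float, "string": str}


def run_template(metadata, text):
    """One generic CSV template: every source-specific detail comes from metadata."""
    reader = csv.DictReader(io.StringIO(text), delimiter=metadata["delimiter"])
    schema = {"type": "record", "name": metadata["name"], "fields": metadata["fields"]}
    records = [
        {f["name"]: CASTS[f["type"]](row[f["name"]]) for f in metadata["fields"]}
        for row in reader
    ]
    return schema, records


# Two made-up sources with different layouts run through the same template.
orders = "id;customer;amount\n1;Ann;9.5\n2;Bob;12\n"
stock = "sku,qty\nA1,3\nB2,7\n"

for source_name, source_text in (("orders", orders), ("stock", stock)):
    meta = extract_metadata(source_name, source_text)
    schema, records = run_template(meta, source_text)
    print(json.dumps(schema))
    print(records)
```

Onboarding a new source in this sketch means supplying new metadata, not writing a new procedure, which is the trade the deck describes as managing metadata vs. ELT procedures.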