SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Downloaden Sie, um offline zu lesen
Pentaho Data Integration/Kettle
● Dan Moore
● 8z Real Estate
● Kettle user for two years
Questions
● Who has used a relational database?
● Who has written scripts or java code to munge
data from one source and load it to another?
– What did you use?
– Scripts
– Custom java code
– ETL tool
What is Kettle?
● Batch data integration and processing tool
written in Java
● Exists to retrieve, process and load data
● ETL
– Extract, transform and load
● PDI synonomous
What is Kettle good for
● Mirroring data from master to slave
● Syncing two data sources
● Processing data retrieved from multiple sources
and pushed to multiple destinations
● Loading data to RDBMS
● Datamart/data warehouse
– Dimension lookup/update step
● Graphical manipulation of data
Alternatives
● Code
– Custom java
– Spring batch
● Scripts
– perl, python, shell, etc
– Possibly + db loader tool and cron
● Commercial ETL tools
– Oracle Warehouse Builder
– Datastage
– Informatica
– SQL Server Integration services
● Open source ETL tools:
– Talend
– KETL
– Clover.ETL
● Special case tools
– SymmetricDS
– Db replication
Why Kettle is better
● Higher level than code
– Graphical interface
– No connection pooling to worry about
– No DDL to write
– Validation/business rules
● Well tested full suite of components
● Data analysis tools
– Preview
– Data profiling with data cleaner (add on)
● Free (as in beer and speech)
– Two editions
– GPLv2
● Performant?
– Developer vs computer performant
– Depends, right?
– Sitemailsame job 10k rows/second for 125M rows
● Leverage java
– jvm tuning skills
– java libraries and logic (in jars)
Data sources
● Files
● Databases
● No SQL
● REST
● XML
● Hadoop/HBase
● JSON
● Excel
● EDI
● RSS
● Google Analytics
Kettle concepts
● Repository
● Rows/Stream
● Steps
● Job
● Transformation
Demo 1: one way sync
● Sync tables
Demo 2: processing
● Process data from one table and replace some
values, filter some values
● Lookup table
Demo 3: log file processing
● Load apache logs for analysis
What it is not good for
● User interfaces/user interaction
● Small data sets
– 500 (from experience)
● Web applications
● One off processes?
– One off becomes regular
Who uses
● Survey results
– ~20 people
● Number of downloads: 110K downloads of Kettle 4.4
– Since Nov 2012
● Our specific use
– MLS data
● Different data source formats and types (jdbc, local csv, ftp)
– Public records data
● Fixed width files
Larger picture
● Kettle 10 years old
– joined Pentaho about 7 years ago
● Open source, at version 4.4
– GPLv2 license
– EE edition available
● BI suite
– Reporting
– Analytics
– Dashboards
– Machine Learning (weka)
Kettle tools
● Spoon
● Kitchen
● Pan
● Carte
– Clustering tool
Advanced topics
● Existing java logic
– Embedded
– Polygon example
– Demo 4
● Deployment
– Variables Config files are your friend
● Mapping/Parameterization
– Subroutines of logic
Advanced Topics Continued
● Testing
– Who tests
● Version control
– Who uses version control
● Error handling
– Email
– Log files
Getting started
● Download
– sourceforge
● Includes over 150 example transformations
– Mysql 3.14 jdbc driver
● Helpful sites
– Forums: http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration-
Kettle
– Wiki: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps
– Testing: http://www.mooreds.com/wordpress/pentaho-kettle-testing
● Helpful books
– Pentaho Kettle Solutions: Casters, Bouman, van Dongen
● Barely scratched surface
● Don't like tools that turn me into a mechanic

Weitere ähnliche Inhalte

Was ist angesagt?

From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
Big Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformBig Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data Platform
Navneet Gupta
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge base
Laurent Alquier
 
final_proj_Implementation of the ETL system
final_proj_Implementation of the ETL systemfinal_proj_Implementation of the ETL system
final_proj_Implementation of the ETL system
R-uturaj R-aval
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
Databricks
 

Was ist angesagt? (20)

Talend Open Studio Data Integration
Talend Open Studio Data IntegrationTalend Open Studio Data Integration
Talend Open Studio Data Integration
 
A deep dive into neuton
A deep dive into neutonA deep dive into neuton
A deep dive into neuton
 
Intro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data IntegrationIntro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data Integration
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau Architecture
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Big Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformBig Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data Platform
 
Data provenance in Hopsworks
Data provenance in HopsworksData provenance in Hopsworks
Data provenance in Hopsworks
 
Open Source ETL using Talend Open Studio
Open Source ETL using Talend Open StudioOpen Source ETL using Talend Open Studio
Open Source ETL using Talend Open Studio
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge base
 
final_proj_Implementation of the ETL system
final_proj_Implementation of the ETL systemfinal_proj_Implementation of the ETL system
final_proj_Implementation of the ETL system
 
Sql Server 2005 Business Inteligence
Sql Server 2005 Business InteligenceSql Server 2005 Business Inteligence
Sql Server 2005 Business Inteligence
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODI
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Datastage ppt
Datastage pptDatastage ppt
Datastage ppt
 

Andere mochten auch (11)

Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introduction
 
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
 
Introduction To Pentaho
Introduction To PentahoIntroduction To Pentaho
Introduction To Pentaho
 
Pentaho technical whitepaper-1-6
Pentaho technical whitepaper-1-6Pentaho technical whitepaper-1-6
Pentaho technical whitepaper-1-6
 
Introduction To Pentaho
Introduction To PentahoIntroduction To Pentaho
Introduction To Pentaho
 
Open Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLOpen Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETL
 
Pentaho-BI
Pentaho-BIPentaho-BI
Pentaho-BI
 
SQL Basics
SQL BasicsSQL Basics
SQL Basics
 
SQL Tutorial - Basic Commands
SQL Tutorial - Basic CommandsSQL Tutorial - Basic Commands
SQL Tutorial - Basic Commands
 
Sql ppt
Sql pptSql ppt
Sql ppt
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 

Ähnlich wie Introduction To Pentaho Kettle

A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
Stitch Fix Algorithms
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Lviv Startup Club
 

Ähnlich wie Introduction To Pentaho Kettle (20)

NoSQL
NoSQLNoSQL
NoSQL
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and Spark
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Are we there yet?
Are we there yet?Are we there yet?
Are we there yet?
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
 
Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013
 
Proud to be polyglot
Proud to be polyglotProud to be polyglot
Proud to be polyglot
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Engine
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to choose
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 

Mehr von Boulder Java User's Group

Mehr von Boulder Java User's Group (8)

Spring insight what just happened
Spring insight   what just happenedSpring insight   what just happened
Spring insight what just happened
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Json at work overview and ecosystem-v2.0
Json at work   overview and ecosystem-v2.0Json at work   overview and ecosystem-v2.0
Json at work overview and ecosystem-v2.0
 
Restful design at work v2.0
Restful design at work v2.0Restful design at work v2.0
Restful design at work v2.0
 
Introduction To JavaFX 2.0
Introduction To JavaFX 2.0Introduction To JavaFX 2.0
Introduction To JavaFX 2.0
 
55 New Features in Java 7
55 New Features in Java 755 New Features in Java 7
55 New Features in Java 7
 
Watson and Open Source Tools
Watson and Open Source ToolsWatson and Open Source Tools
Watson and Open Source Tools
 
Intro to Redis
Intro to RedisIntro to Redis
Intro to Redis
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Introduction To Pentaho Kettle

  • 1. Pentaho Data Integration/Kettle ● Dan Moore ● 8z Real Estate ● Kettle user for two years
  • 2. Questions ● Who has used a relational database? ● Who has written scripts or java code to munge data from one source and load it to another? – What did you use? – Scripts – Custom java code – ETL tool
  • 3. What is Kettle? ● Batch data integration and processing tool written in Java ● Exists to retrieve, process and load data ● ETL – Extract, transform and load ● PDI synonomous
  • 4. What is Kettle good for ● Mirroring data from master to slave ● Syncing two data sources ● Processing data retrieved from multiple sources and pushed to multiple destinations ● Loading data to RDBMS ● Datamart/data warehouse – Dimension lookup/update step ● Graphical manipulation of data
  • 5. Alternatives ● Code – Custom java – Spring batch ● Scripts – perl, python, shell, etc – Possibly + db loader tool and cron ● Commercial ETL tools – Oracle Warehouse Builder – Datastage – Informatica – SQL Server Integration services ● Open source ETL tools: – Talend – KETL – Clover.ETL ● Special case tools – SymmetricDS – Db replication
  • 6. Why Kettle is better ● Higher level than code – Graphical interface – No connection pooling to worry about – No DDL to write – Validation/business rules ● Well tested full suite of components ● Data analysis tools – Preview – Data profiling with data cleaner (add on) ● Free (as in beer and speech) – Two editions – GPLv2 ● Performant? – Developer vs computer performant – Depends, right? – Sitemailsame job 10k rows/second for 125M rows ● Leverage java – jvm tuning skills – java libraries and logic (in jars)
  • 7. Data sources ● Files ● Databases ● No SQL ● REST ● XML ● Hadoop/HBase ● JSON ● Excel ● EDI ● RSS ● Google Analytics
  • 8. Kettle concepts ● Repository ● Rows/Stream ● Steps ● Job ● Transformation
  • 9. Demo 1: one way sync ● Sync tables
  • 10. Demo 2: processing ● Process data from one table and replace some values, filter some values ● Lookup table
  • 11. Demo 3: log file processing ● Load apache logs for analysis
  • 12. What it is not good for ● User interfaces/user interaction ● Small data sets – 500 (from experience) ● Web applications ● One off processes? – One off becomes regular
  • 13. Who uses ● Survey results – ~20 people ● Number of downloads: 110K downloads of Kettle 4.4 – Since Nov 2012 ● Our specific use – MLS data ● Different data source formats and types (jdbc, local csv, ftp) – Public records data ● Fixed width files
  • 14. Larger picture ● Kettle 10 years old – joined Pentaho about 7 years ago ● Open source, at version 4.4 – GPLv2 license – EE edition available ● BI suite – Reporting – Analytics – Dashboards – Machine Learning (weka)
  • 15. Kettle tools ● Spoon ● Kitchen ● Pan ● Carte – Clustering tool
  • 16. Advanced topics ● Existing java logic – Embedded – Polygon example – Demo 4 ● Deployment – Variables Config files are your friend ● Mapping/Parameterization – Subroutines of logic
  • 17. Advanced Topics Continued ● Testing – Who tests ● Version control – Who uses version control ● Error handling – Email – Log files
  • 18. Getting started ● Download – sourceforge ● Includes over 150 example transformations – Mysql 3.14 jdbc driver ● Helpful sites – Forums: http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration- Kettle – Wiki: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps – Testing: http://www.mooreds.com/wordpress/pentaho-kettle-testing ● Helpful books – Pentaho Kettle Solutions: Casters, Bouman, van Dongen ● Barely scratched surface ● Don't like tools that turn me into a mechanic