Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector
Guglielmo Iozzia,
Big Data Infrastructure Engineer @ IBM Ireland
Data Ingestion for Analytics: a real scenario
In the business area my team belongs to (cloud applications), there were many
questions to answer. They related to:
● Defect analysis
● Outage analysis
● Cyber-Security
“Data is the second most important thing in analytics”
Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services
… and so many tools available to get the data
What are we going to do with all that data?
Issues
● The need to collect data from multiple sources introduces redundancy, which
costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in
general) in managing Big Data tools.
● Low budget.
Alternatives
#1 Panic
#2 Cloning team members
#3 Find a smart way to simplify the data ingestion process
A single tool needed...
● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy for everyone to use.
● Zero downtime when upgrading the infrastructure, thanks to the logical
isolation of each flow stage.
● Open source.
… something like this
Streamsets Data Collector
Streamsets Data Collector: supported origins
Streamsets Data Collector: available destinations
Streamsets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
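As a rough illustration of what chaining processors means, the sketch below imitates a JSON Parser followed by a Stream Selector, with failed records routed to an error list. It is plain Python, not the Data Collector API: the function names, the "raw" and "severity" fields and the lane names are all hypothetical and exist only for this example.

```python
import json


def json_parser(record, field="raw"):
    """Expand a JSON string field into top-level fields (JSON Parser analogue)."""
    parsed = dict(record)
    parsed.update(json.loads(record[field]))
    del parsed[field]
    return parsed


def stream_selector(record):
    """Choose an output lane from a condition (Stream Selector analogue)."""
    # Hypothetical rule: critical events go to their own lane.
    return "outages" if record.get("severity") == "critical" else "default"


def run_pipeline(records):
    """Push each record through both stages; collect failures separately."""
    lanes = {"outages": [], "default": []}
    errors = []
    for record in records:
        try:
            parsed = json_parser(record)
        except (KeyError, TypeError, ValueError) as exc:
            # Bad records do not stop the flow; they are kept with the reason.
            errors.append({"record": record, "error": str(exc)})
            continue
        lanes[stream_selector(parsed)].append(parsed)
    return lanes, errors


if __name__ == "__main__":
    sample = [
        {"raw": '{"severity": "critical", "msg": "db down"}'},
        {"raw": '{"severity": "info", "msg": "ok"}'},
        {"raw": "not json"},
    ]
    lanes, errors = run_pipeline(sample)
    print(len(lanes["outages"]), len(lanes["default"]), len(errors))
```

In the real tool this wiring is done in the graphical pipeline designer rather than in code; the sketch only shows the record flow.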
Streamsets Data Collector
Demo
Streamsets DC: performance and reliability
● Two execution modes are available: standalone or cluster
● Implemented in Java, so any performance best practice or recommendation for
Java applications applies here
● REST services are available for performance monitoring
● Rules and alerts (both metric and data)
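To make the idea of a metric rule concrete, here is a minimal sketch of a threshold check over pipeline counters. It is plain Python and does not use the Data Collector rule engine or its REST services: the MetricRule class, the metric names and the thresholds are assumptions made up for this example, and in practice the counters would come from the monitoring services mentioned above.

```python
from dataclasses import dataclass


@dataclass
class MetricRule:
    """Hypothetical rule: alert when a metric goes above a threshold."""
    name: str
    metric: str
    threshold: float


def error_rate(input_records, error_records):
    """Fraction of records that ended up as errors (0.0 for an empty batch)."""
    return error_records / input_records if input_records else 0.0


def evaluate_rules(metrics, rules):
    """Return the names of all rules whose metric exceeds its threshold."""
    return [r.name for r in rules if metrics.get(r.metric, 0) > r.threshold]


if __name__ == "__main__":
    # Made-up counters standing in for values read from monitoring.
    metrics = {"input_records": 1000, "error_rate": error_rate(1000, 50)}
    rules = [
        MetricRule("High error rate", "error_rate", 0.01),
        MetricRule("Input spike", "input_records", 5000),
    ]
    for alert in evaluate_rules(metrics, rules):
        print("ALERT:", alert)
```

A data rule works the same way in spirit, except that the condition is evaluated against field values of sampled records rather than against counters.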
Streamsets Data Collector: security
● User accounts can be authenticated against LDAP
● Authorization: the Data Collector provides several roles (admin, manager,
creator, guest)
● You can use Kerberos authentication to connect to origin and destination
systems
● Follow the usual security best practices in terms of iptables, networking, etc.
for Java web applications running on Linux machines.
Useful Links
Streamsets Data Collector:
https://streamsets.com/product/
Thanks!
My contacts:
LinkedIn: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia