Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector

Guglielmo Iozzia,
Big Data Infrastructure Engineer @ IBM Ireland

Data Ingestion for Analytics: a real scenario
In my team's business area (cloud applications), many questions needed
answers. They concerned:
● Defect analysis
● Outage analysis
● Cyber-Security

“Data is the second most important thing in analytics”

Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services

… and so many tools available to get the data

What are we going to do with all that data?

Issues
● The need to collect data from multiple sources introduces redundancy, which
costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in
general) in managing Big Data tools.
● Low budget.

Alternatives
#2 Cloning team members

Alternatives
#3 Find a smart way to simplify the data ingestion
process

A single tool needed...
● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy for everyone to use.
● Zero downtime when upgrading the infrastructure, thanks to the logical isolation of
each flow stage.
● Open Source

Streamsets Data Collector: supported origins

Streamsets Data Collector: available destinations

Streamsets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
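
As an illustration of the scripting processors listed above, here is a minimal sketch of the kind of per-batch logic a Jython Evaluator script typically contains. It assumes the script bindings `records`, `output`, and `error` that Data Collector exposed to scripts at the time; the `Record` and `Sink` classes, the `process_batch` wrapper, and the `message` and `level` field names are hypothetical stand-ins added only so the sketch runs outside Data Collector.

```python
class Record:
    """Hypothetical stand-in for a Data Collector record; `value` holds the fields."""
    def __init__(self, value):
        self.value = value


class Sink:
    """Hypothetical stand-in for the `output` and `error` bindings; collects writes."""
    def __init__(self):
        self.items = []

    def write(self, record, message=None):
        self.items.append((record, message))


def process_batch(records, output, error):
    # Same shape as an evaluator script body: loop over the batch, pass good
    # records downstream, and route failures to the error stream.
    for record in records:
        try:
            # Derive a `level` field from the first word of the log message.
            first_word = record.value["message"].split(" ", 1)[0]
            record.value["level"] = first_word.upper()
            output.write(record)
        except Exception as exc:
            error.write(record, str(exc))


# Two sample records: one well formed, one with a missing message.
records = [Record({"message": "error disk full"}), Record({"message": None})]
output, error = Sink(), Sink()
process_batch(records, output, error)
```

Inside the processor itself only the body of the loop would be pasted into the script field; the record with the missing message ends up on the error stream instead of stopping the pipeline.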

Streamsets Data Collector
Demo

Streamsets DC: performance and reliability
● Two execution modes are available: standalone or cluster
● Implemented in Java, so any performance best practice or recommendation for
Java applications applies here
● REST services are available for performance monitoring
● Rules and alerts (for both metrics and data)
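
The REST services mentioned above can be polled from a script. The following is a rough sketch, not a reference client: it assumes a Data Collector listening on the default port 18630, a metrics endpoint at `/rest/v1/pipeline/<pipeline>/metrics`, HTTP basic authentication, and a response that contains a `meters` section whose entries carry a `count`. The function names are hypothetical; check the REST API page of your own Data Collector instance for the exact paths and payload.

```python
import base64
import json
import urllib.parse
import urllib.request


def metrics_url(base_url, pipeline_id, rev="0"):
    """Build the assumed metrics URL for one pipeline."""
    quoted = urllib.parse.quote(pipeline_id, safe="")
    return f"{base_url.rstrip('/')}/rest/v1/pipeline/{quoted}/metrics?rev={rev}"


def meter_counts(metrics):
    """Reduce a metrics payload to {meter name: count}; tolerates a missing section."""
    return {name: meter.get("count")
            for name, meter in metrics.get("meters", {}).items()}


def fetch_metrics(base_url, pipeline_id, user, password):
    """Fetch the metrics JSON (assumes basic authentication is accepted)."""
    request = urllib.request.Request(metrics_url(base_url, pipeline_id))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Offline example with a made-up payload, so no running instance is needed.
sample = {"meters": {"hypothetical.inputRecords.meter": {"count": 120}}}
counts = meter_counts(sample)
url = metrics_url("http://localhost:18630/", "my pipeline")
```

Against a live instance, `meter_counts(fetch_metrics(base_url, pipeline, user, password))` would return the current counters, which can then be fed into whatever monitoring the team already uses.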

Streamsets Data Collector: security
● User accounts can be authenticated against LDAP
● Authorization: the Data Collector provides several roles (admin, manager,
creator, guest)
● You can use Kerberos authentication to connect to origin and destination
systems
● Follow the usual security best practices in terms of iptables, networking, etc.
for Java web applications running on Linux machines.

Useful Links
Streamsets Data Collector:
https://streamsets.com/product/

Thanks!
My contacts:
LinkedIn: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia
