ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
1. Scalable ETL with Talend and Hadoop
Talend, Global Leader in Open Source Integration Solutions
Cédric Carbone – Talend CTO
Twitter : @carbone
ccarbone@talend.com
2. Why talk about ETL with Hadoop?
Hadoop is complex
BI consultants don't have the skills to manipulate Hadoop
➜ The biggest issue in a Hadoop project is finding skilled Hadoop engineers
➜ An ETL tool like Talend can help democratize Hadoop
4. to
this…
Why Talend…
Talend generates code that is executed within map reduce. This open
approach removes the limitation of a proprietary “engine” to provide a truly
unique and powerful set of tools for big data.
5. BIG Data Management
Big Data Production: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated
Big Data Management: Big Data Integration, Big Data Quality
Big Data Consumption: Mining, Analytics
Storage, Processing, Filtering, Parsing, Checking, Search, Enrichment
Turn Big Data into actionable information
6. Two methods for inserting data quality into a big data job
1. Pipelining: as part of the load process
2. Load the cluster, then implement and execute a data quality MapReduce job
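The two methods can be contrasted with a minimal sketch. This is an illustration only, not Talend-generated code: the `clean` rule, the record fields, and the in-memory list standing in for the cluster are all assumptions made for the example.

```python
def clean(record):
    """Hypothetical data quality rule: trim whitespace and lowercase the email."""
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }


def load_with_pipelining(source, cluster):
    """Method 1: apply data quality as part of the load process."""
    for record in source:
        cluster.append(clean(record))


def load_then_improve(source, cluster):
    """Method 2: load raw records first, then run a separate quality pass.

    On a real cluster the second step would be a MapReduce job; here it is
    a plain loop over the already-loaded data.
    """
    cluster.extend(source)                         # load first
    cluster[:] = [clean(r) for r in cluster]       # improve later


raw = [{"name": " Ada ", "email": "ADA@Example.com "}]
pipelined, improved = [], []
load_with_pipelining(raw, pipelined)
load_then_improve(raw, improved)
```

Both paths end with the same cleaned records. The difference is where the work runs: before the data reaches the cluster, or inside it once everything is loaded.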
9. Pipelining: data quality with big data
Sources, each passing through DQ before loading into Big Data: CRM, ERP, Finance, Social Networking, Mobile Devices
• Use traditional data quality tools
• Once and done
10. Big data alternative: Load and improve within the cluster
Sources loaded into Big Data, with DQ applied afterwards: CRM, ERP, Finance, Social Networking, Mobile Devices
• Load first, improve later
• Complex matching cannot be done outside
11. One key DQ rule: Match
Find duplicates within Hadoop
➜ Today's matching algorithms are processor-intensive
➜ Tomorrow's matching algorithms could be more precise, more intensive
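To see why matching is processor-intensive, here is a minimal sketch of duplicate detection. It is not Talend's matching algorithm: the blocking key (first letter), the similarity measure (`difflib.SequenceMatcher` from the Python standard library), and the 0.85 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: first letter of the normalized name."""
    return name.strip().lower()[:1]


def find_duplicates(names, threshold=0.85):
    """Group names by blocking key, then compare every pair inside each group."""
    blocks = defaultdict(list)
    for name in names:                      # "map" side: assign a key to each record
        blocks[blocking_key(name)].append(name)

    matches = []
    for group in blocks.values():           # "reduce" side: pairwise comparison per key
        for a, b in combinations(group, 2):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                matches.append((a, b))
    return matches


duplicates = find_duplicates(["Jon Smith", "John Smith", "Jane Doe"])
```

The pairwise comparison grows quadratically with the size of each group. That cost is the reason to run matching inside the cluster, where groups can be processed in parallel.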
13. What's Hadoop?
➜ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
➜ Java framework for storing data and running data transformations on large clusters of commodity hardware
➜ Licensed under the Apache v2 license
➜ Created from Google's MapReduce, BigTable and Google File System (GFS) papers
17. WordCount
➜ WordCount
• Comes with Hadoop
• “First” demo that everyone tries!
➜ How-to in Talend Big Data
• Simple read, count, load results
• No coding, just drag-n-drop
• Runs remotely
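For readers who have not seen WordCount, the read, count, load steps can be sketched in plain Python. This is neither the example shipped with Hadoop nor code generated by Talend; the function names and the in-memory shuffle are assumptions that only mimic the map, shuffle, and reduce phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


result = word_count(["big data", "Big Data Hadoop"])
```

On a cluster, the map and reduce functions run on many nodes and the framework performs the shuffle. The logic per record stays this small.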
32. Apache Mahout
Big Data can also be a blob of data to an organization
➜ Apache Mahout provides algorithms for understanding data (data mining)
➜ “You don't know what you don't know,” and Mahout will tell you.
33. Metadata Management
➜ Centralized metadata repository for Hadoop Cluster, HDFS, Hive…
• Versioning
• Impact Analysis and Data Lineage
➜ HCatalog across
• HDFS
• Hive
• Pig
35. Choose your Hadoop distro
➜ Widely adopted; management tooling is not OSS
➜ Fully open source; strong developer ecosystem
➜ More proprietary; GTM partner with AWS
➜ Many more are coming
36. Choose your Hadoop distro
Provide tooling:
➜ For installation
➜ For server monitoring
But
➜ No GUI for parsing, transforming, easily loading. No data management
37. Parse and Standardize
Big Data is not always structured
➜ Correct big data so that data conforms to the same rules
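A minimal sketch of parsing and standardizing, assuming a hypothetical semicolon-separated input with a name and a sign-up date. The field names, the accepted date formats, and the target rules (title-cased names, ISO dates) are illustrative assumptions, not Talend components.

```python
from datetime import datetime

# Assumed input date formats; a real job would list the formats seen in its sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardize_date(raw):
    """Return the date in ISO format, or None when no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_line(line):
    """Parse a hypothetical 'name;date' line and apply the same rules to every record."""
    name, date = line.split(";", 1)
    return {
        "name": " ".join(name.split()).title(),   # collapse spaces, normalize case
        "signup_date": standardize_date(date),
    }


record = parse_line("  ada   LOVELACE ;05/11/2013")
```

Records that fail a rule keep a `None` value instead of raising an error, so a later step can count or route them.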
39. Implications for Integration
DATA
Challenge: Information explosion increases complexity of integration and requires governance to maintain data quality
Requirement: Information processing must scale
40. Implications for Integration
APPLICATION
Challenge: Brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies
Requirement: Application architecture must scale
41. Implications for Integration
PROCESS
Challenge: Competitive market forces drive frequent process changes and increased process complexity
Requirement: Business processes must scale
42. Implications for Integration
Challenge: Interdependencies across data, applications and processes require more resources and budget
Requirement: Resources and skillsets must scale
43. Integration at Any Scale
True scalability for
• Any integration challenge
• Any data volume
• Any project size
Enables integration convergence
44. Technology that Scales
STANDARDS-BASED: Java, SQL, Camel, Map Reduce, ……
Easy to learn, flexible to adopt, reduces vendor lock-in
CODE GENERATOR
No black-box engine means faster maintenance and deployment with improved quality
The “engine” for Big Data is “Hadoop”, making it uniquely run at infinite scale.
46. Talend Overview
At a glance
▶ Founded in 2005
▶ Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM
▶ Provides:
§ Subscriptions including 24/7 support and indemnification
§ Worldwide training and services
▶ Recognized as the open source leader in each of its market categories
Talend today
➜ 400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
➜ Over 4,000 paying customers across different industry verticals and company sizes
➜ Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners
➜ High growth through a proven model
Brand Awareness: 20 million downloads
Adoption: 1,000,000 users
Monetization: 4,000 customers
Market Momentum: +50 new customers / month
47. Talend’s Unique Integration Solution
Best-of-Breed Solutions: Data Quality, Data Integration, MDM, ESB, BPM
+ Talend Unified Platform: Studio, Repository, Deployment, Execution, Monitoring
= Talend Unique Integration Solution
• Comprehensive Eclipse-based user interface
• Consolidated metadata & project information
• Web-based deployment & scheduling
• Same container for batch processing, message routing & services
• Single web-based monitoring console
➜ Reduce costs
➜ Eliminate risk
➜ Reuse skills
➜ Economies of scale
➜ Incremental adoption
Recognized as the open source leader in each of its market categories by all industry analysts
48. Solutions that Scale
Solutions: Big Data, Data Quality, Data Integration, MDM, ESB, BPM
TALEND UNIFIED PLATFORM: Studio, Repository, Deployment, Execution, Monitoring
UNIFIED PLATFORM
A shared foundation and toolset increases resource reuse
CONVERGED INTEGRATION
Use for any data, application and process project
49. The 6 Dimensions of BIG Data
Primary challenges
• Volume
• Velocity
• Variety
And also
• Complexity
• Validation
• Lineage
50. What Is BIG Data?
“Big data” is information of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011
• 3,500 tweets per second (June 2011)
• 1,000,000 transactions per day at Walmart
• 200 billion (200,000,000,000) intelligent devices (2015)
• 275 exabytes (275,000,000,000,000,000,000 bytes) of data flowing over the Internet each day (2020)
52. How to define Big data is….
Hans Rosling uses big data to analyze world health trends
Key Takeaway #1: volume, variety, velocity
53. Traditional Data Flows
Flow: CRM, ERP, Finance → ETL, Data Quality → Normalized Data → Traditional Data Warehouse
Users: Business Analyst, Warehouse Administrator, Business User, Executives
• Scheduled: daily or weekly, sometimes more frequently
• Volumes rarely exceed terabytes
54. The new world of big data
Sources feeding Big Data: CRM, Social Networking, ERP, Finance
55. The new world of big data
Sources feeding Big Data: Social Networking, CRM, ERP, Finance, Mobile Devices
56. The new world of big data
Sources feeding Big Data: Social Networking, CRM, ERP, Mobile Devices, Transactions, Finance, Network Devices, Sensors
58. Data driven business
Data governance → enables → information → supports → decisions → drives → your business
Information provides value to the business
“If you can't rely on your information then the result can be missed opportunities, or higher costs.”
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
59. BIG data driven business
BIG data governance → enables → BIG information → supports → BIG decisions → drives → BIG business
Information provides value to the business
“If you can't rely on your information then the result can be missed opportunities, or higher costs.”
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
60. How is big data integration being used?
Use Cases
• Recommendation Engine
• Sentiment Analysis
• Risk Modeling
• Fraud Detection
• Behavior Analysis
• Marketing Campaign Analysis
• Customer Churn Analysis
• Social Graph Analysis
• Customer Experience Analytics
• Network Monitoring
BUT: to what level is DQ required for your use case?