Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
1. Large Scale ETL with Hadoop
Eric Sammer | Principal Solution Architect
@esammer
Strata + Hadoop World 2012
6. ETL is like “REST” or “Disaster Recovery”
Everyone defines it differently (and loves to fight about it)
It’s more of a problem/solution space than a thing
Hard to generalize without being lossy in some way
Worst, it’s trivial at face value, complicated in practice
13. So why is ETL hard?
It’s not because ƒ(A) → B is hard (anymore)
Data integration
Organization and management
Process orchestration and scheduling
Accessibility
How it all fits together
26. The ecosystem brings additional functionality
Higher level languages and abstractions on MapReduce: Hive, Pig, Cascading, ...
File, relational, and streaming data integration: Flume, Sqoop, WebHDFS, ...
Process orchestration and scheduling: Oozie, Azkaban, ...
Libraries for parsing and text extraction: Tika, ?, ...
...and now low latency query with Impala
29. To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).
36. The services of ETL
Process Repository
Metadata Repository
Scheduling
Process Orchestration
Integration Adapters or Channels
Service and Process Instrumentation and Collection
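To make the "Process Orchestration" service more concrete, the following is a minimal sketch of running processes in dependency order. It is not taken from the talk: the `run_in_order` helper and the process names in the usage note are hypothetical, and a production platform would normally delegate this to a workflow engine such as Oozie.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_in_order(dependencies, actions):
    """Run each process only after the processes it depends on.

    dependencies: maps a process name to the set of names it depends on.
    actions: maps a process name to a zero-argument callable (hypothetical
    stand-in for submitting a real job).
    Returns the order in which the processes were run.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        actions[name]()
    return order
```

For example, with `{"ingest": set(), "cleanse": {"ingest"}, "aggregate": {"cleanse"}}` the ingest step runs first, then cleansing, then aggregation. A dependency cycle raises `graphlib.CycleError` before any process runs.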
42. What do we have today?
HDFS and MapReduce – The core
Flume – Streaming event data integration
Sqoop – Batch exchange of relational database tables
Oozie – Process orchestration and basic scheduling
Impala – Fast analysis of data quality
47. MapReduce is the assembly language of data processing
“Simple things are hard, but hard things are possible”
Comparatively low level
Java knowledge required
Use higher level tools where possible
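To illustrate why the MapReduce model counts as low level, the sketch below simulates the map, shuffle, and reduce phases of a word count in plain Python. It is not Hadoop API code; the function names are hypothetical and exist only to show how much plumbing a simple aggregation needs in this model.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Wire the three phases together for an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

A higher level tool typically expresses the same computation as a single group-and-count statement, which is the motivation for reaching for those tools first.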
61. Structure data in tiers
A clear hierarchy of source/derived relationships
One step on the road to proper lineage
Simple “fault and rebuild” processes
Examples
Tier 0 – Raw data from source systems
Tier 1 – Derived from 0, cleansed, normalized
Tier 2 – Derived from 1, aggregated
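The tier structure above could be sketched as a path convention plus a "fault and rebuild" rule. This is an illustration under stated assumptions, not the presenter's implementation: the `/data/tier<N>/<dataset>/<date>` layout and the helpers `tier_path` and `rebuild_plan` are hypothetical.

```python
from datetime import date


def tier_path(tier, dataset, day, root="/data"):
    """Build a path under the assumed layout <root>/tier<N>/<dataset>/<date>."""
    if tier < 0:
        raise ValueError("tier must be non-negative")
    return f"{root}/tier{tier}/{dataset}/{day.isoformat()}"


def rebuild_plan(faulted_tier, dataset, day, max_tier=2):
    """List (source, target) path pairs needed to rebuild a faulted tier.

    Because each tier is derived only from the tier below it, a fault in
    tier N means rebuilding N and every tier above it, in order.
    """
    if faulted_tier < 1:
        raise ValueError("tier 0 is raw source data; re-ingest it instead")
    return [
        (tier_path(t - 1, dataset, day), tier_path(t, dataset, day))
        for t in range(faulted_tier, max_tier + 1)
    ]
```

Calling `rebuild_plan(1, "weblogs", date(2012, 10, 24))` yields two steps: tier 0 to tier 1, then tier 1 to tier 2. The strict source/derived hierarchy is what keeps the rebuild rule this simple.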
68. There’s a lot to do
Build libraries or services to reveal higher-level interfaces
Data management and lifecycle events
Instrument jobs and services for performance/quality
Metadata, metadata, metadata (metadata)
Process (job) deployment, service location,
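One way to approach the "instrument jobs and services" item is a thin wrapper that records duration and record counts for every run. The sketch below is an assumption-laden illustration: the `instrumented` decorator, the in-memory `METRICS` list, and the `drop_blank_lines` job are all hypothetical stand-ins for a real job and a real metrics collection service.

```python
import time
from functools import wraps

# Hypothetical in-memory sink; a real platform would ship these elsewhere.
METRICS = []


def instrumented(job_name):
    """Wrap a records-in, records-out job and record one metric per run."""
    def decorator(func):
        @wraps(func)
        def wrapper(records):
            records = list(records)
            start = time.perf_counter()
            status = "failed"
            output = []
            try:
                output = list(func(records))
                status = "succeeded"
                return output
            finally:
                # Runs on success and on failure, so every attempt is counted.
                METRICS.append({
                    "job": job_name,
                    "status": status,
                    "records_in": len(records),
                    "records_out": len(output),
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@instrumented("cleanse")
def drop_blank_lines(records):
    """Example job: strip whitespace and drop empty records."""
    return (r.strip() for r in records if r.strip())
```

Comparing `records_in` with `records_out` across runs gives a crude data quality signal, while `seconds` covers the performance side.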