Gobblin' Big Data With Ease @ QConSF 2014

1. Gobblin’ Big Data with Ease
Lin Qiao
Data Analytics Infra @ LinkedIn
©2014 LinkedIn Corporation. All Rights Reserved.
2. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
4. Perception

[Diagram: primary data sources flow through an ingest framework into the analytics platform, where transformation, validation, and load steps produce business-facing insights and member-facing insights and data products.]
5. Reality

[Diagram: many source-specific pipelines. Kafka activity (tracking) data is ingested by Camus; profile data and R/W stores (Oracle/Espresso) emit Databus changes and change dumps on the filer, ingested by Lumos; external partner data arrives through ingest utilities; an ingest framework loads Teradata, where DWH ETL builds fact tables for product, sciences, and enterprise analytics. On Hadoop, Lassen builds facts and dimensions from the core data set (tracking, database, external) and derived data sets; computed results for member-facing products are served to the site from a read store (Voldemort) and also feed enterprise products.]
6. Challenges @ LinkedIn
• Large variety of data sources
• Multi-paradigm: streaming data, batch data
• Different types of data: facts, dimensions, logs,
snapshots, increments, changelog
• Operational complexity of multiple pipelines
• Data quality
• Data availability and predictability
• Engineering cost
7. Open source solutions
Sqoop
Aegisthus
Flume
Morphlines
Logstash
Camus
RDBMS vendor-specific connectors
8. Goals
• Unified and Structured Data Ingestion Flow
– RDBMS -> Hadoop
– Event Streams -> Hadoop
• Higher level abstractions
– Facts, Dimensions
– Snapshots, increments, changelog
• ELT oriented
– Minimize transformation in the ingest pipeline
9. Central Ingestion Pipeline

[Diagram: Gobblin replaces the source-specific pipelines. OLTP data, Kafka tracking data, R/W stores (Oracle/Espresso) via Databus changes and change dumps on the filer, and external partner data all flow through Gobblin (over REST, JDBC, SOAP, and custom protocols) into Hadoop, where compaction maintains the core data set (tracking, database, external) and derived data sets. Teradata DWH ETL (fact tables) serves product, sciences, and enterprise analytics; the site (member-facing products) and enterprise products consume the results.]
10. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
11. Gobblin Usage @ LinkedIn
• Business Analytics
– Source data for sales analysis, product sentiment
analysis, etc.
• Engineering
– Source data for issue tracking, monitoring, product
release, security compliance, A/B testing
• Consumer product
– Source data for acquisition integration
– Performance analysis for email campaign, ads
campaign, etc.
12. Key Features
Horizontally scalable and robust framework
Unified computation paradigm
Turn-key solution
Customize your own Ingestion
13. Scalable and Robust Framework
Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
Scalable: Jobs are partitioned into tasks that run concurrently.
Fault Tolerant: Framework gracefully deals with machine and job failures.
Quality Assurance: Baked-in quality checking throughout the flow.
14. Unified computation paradigm
Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines.
Shared infra components: Shared job state management, job metrics store, metadata management.
15. Turn Key Solution
Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.).
Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, and internal dropbox.
Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets.
Policy-driven flow execution & tuning: Flow owners just need to specify pre-defined policies for handling job failure, degree of parallelism, what data to publish, etc.
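As an illustration of the policy-driven configuration described above, a job definition might look like the following sketch. The key names here are illustrative stand-ins, not necessarily Gobblin's actual configuration keys.

```properties
# Hypothetical Gobblin-style job configuration (illustrative key names)
job.name=SalesforceAccountIngest
source.class=com.example.SalesforceSource          # which source adapter to use
source.max.parallelism=8                           # degree-of-parallelism policy
qualitychecker.row.policy=ERR_FILE                 # offending rows go to an error file
publisher.commit.policy=COMMIT_ON_FULL_SUCCESS     # only publish if every task succeeds
```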
16. Customize Your Own Ingestion Pipeline
Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API.
Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code.
17. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Lookahead
19. Computation Model
• Gobblin standalone
– single process, multi-threading
– Testing, small data, sampling
• Gobblin on Map/Reduce
– Large datasets, horizontally scalable
• Gobblin on Yarn
– Better resource utilization
– More scheduling flexibility
20. Scalable Ingestion Flow

[Diagram: a Source creates Work Units; each Work Unit runs as a Task consisting of an Extractor, Converter, Quality Checker, and Writer; the tasks' output is committed by a Data Publisher.]
21. Sources
• Determines how to partition work
- Partitioning algorithm can leverage source sharding
- Group partitions intelligently for performance
• Creates work-units to be scheduled
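To make the partitioning step concrete, here is a minimal sketch of how a source might split a numeric key range into work units, each bounded by a low and high watermark. The class and method names are illustrative, not the actual Gobblin API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a Gobblin-style Source partitioning a numeric
// key range into work units; names are illustrative, not Gobblin's API.
public class RangePartitioner {
    public static class WorkUnit {
        public final long low, high;   // inclusive low, exclusive high watermark
        public WorkUnit(long low, long high) { this.low = low; this.high = high; }
    }

    // Split [low, high) into at most maxUnits contiguous work units.
    public static List<WorkUnit> partition(long low, long high, int maxUnits) {
        List<WorkUnit> units = new ArrayList<>();
        long span = high - low;
        long step = Math.max(1, (span + maxUnits - 1) / maxUnits); // ceiling division
        for (long start = low; start < high; start += step) {
            units.add(new WorkUnit(start, Math.min(start + step, high)));
        }
        return units;
    }
}
```

A smarter implementation could group small partitions together, or align unit boundaries with the source's own shards, as the slide suggests.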
22. Job Management

[Diagram: successive job runs hand state to each other through a central State Store.]

• Job execution states
– Watermark
– Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
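The watermark hand-off between runs can be sketched as follows; an in-memory map stands in for Gobblin's persistent state store, and the names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of watermark hand-off between job runs via a state
// store. An in-memory map stands in for a persistent store on HDFS.
public class StateStore {
    private final Map<String, Long> watermarks = new HashMap<>();

    // The next run starts extracting from the last committed watermark.
    public long lowWatermark(String job) {
        return watermarks.getOrDefault(job, 0L);
    }

    // At the end of a successful run, the high watermark becomes the
    // next run's low watermark.
    public void commit(String job, long highWatermark) {
        watermarks.put(job, highWatermark);
    }
}
```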
23. Gobblin Operator Flow

Extract Schema → Convert Schema → Extract Record → Convert Record → Check Record Data Quality → Write Record → Check Task Data Quality → Commit Task Data
24. Extractors
• Specifies how to get the schema and pull data from the source
• Returns a ResultSet iterator
• Tracks high watermark
• Tracks extraction metrics
25. Converters
• Allow for schema and data transformation
– Filtering
– Projection
– Type conversion
– Structural change
• Composable: can specify a list of converters to be applied in the given order
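The composability described above amounts to function composition: each converter consumes the previous converter's output. A minimal sketch, with String records standing in for real schema-aware records (not the actual Gobblin Converter API):

```java
import java.util.List;
import java.util.function.Function;

// Illustrative sketch of composable converters: each one is applied to
// the previous converter's output, in the configured order.
public class ConverterChain {
    private final List<Function<String, String>> converters;

    public ConverterChain(List<Function<String, String>> converters) {
        this.converters = converters;
    }

    public String convert(String record) {
        String out = record;
        for (Function<String, String> c : converters) {
            out = c.apply(out);   // each converter sees the prior output
        }
        return out;
    }
}
```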
26. Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per record, per task, or per job basis
• Can specify a list of quality checkers to be applied
– Schema compatibility
– Audit check
– Sensitive fields
– Unique key
• Policy driven
– FAIL – if the check fails then so does the job
– OPTIONAL – if the check fails the job continues
– ERR_FILE – the offending row is written to an error file
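A record-level checker under the three policies above might look like this sketch; the class is hypothetical, and a list stands in for the HDFS error file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of policy-driven record-level quality checking, mirroring the
// FAIL / OPTIONAL / ERR_FILE policies. Names are illustrative.
public class RecordQualityChecker {
    public enum Policy { FAIL, OPTIONAL, ERR_FILE }

    private final Predicate<String> check;
    private final Policy policy;
    public final List<String> errorFile = new ArrayList<>(); // stand-in for an error file

    public RecordQualityChecker(Predicate<String> check, Policy policy) {
        this.check = check;
        this.policy = policy;
    }

    // Returns true if the record should be passed on to the writer.
    public boolean process(String record) {
        if (check.test(record)) return true;
        switch (policy) {
            case FAIL:     throw new RuntimeException("Quality check failed: " + record);
            case ERR_FILE: errorFile.add(record); return false;
            default:       return true; // OPTIONAL: record the failure, keep going
        }
    }
}
```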
27. Writers
• Writes data in Avro format onto HDFS
– One writer per task
• Flexibility
– Configurable compression codec (Deflate, Snappy)
– Configurable buffer size
• Plans to support other data formats (Parquet, ORC)
28. Publishers
• Determines job success based on policy
- COMMIT_ON_FULL_SUCCESS
- COMMIT_ON_PARTIAL_SUCCESS
• Commits data to final directories based on job success

[Diagram: Tasks 1–3 write Files 1–3 into a tmp dir; on success the publisher moves them to the final dir.]
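The commit decision under the two policies named above reduces to a simple rule over per-task outcomes; this is an illustrative sketch, with boolean flags standing in for real task states.

```java
import java.util.List;

// Sketch of the publisher's commit decision: full-success requires every
// task to succeed; partial-success publishes whatever succeeded.
public class CommitPolicyDecider {
    public enum Policy { COMMIT_ON_FULL_SUCCESS, COMMIT_ON_PARTIAL_SUCCESS }

    // Decide whether staged output may move from the tmp dir to the final dir.
    public static boolean shouldCommit(List<Boolean> taskSucceeded, Policy policy) {
        boolean all = taskSucceeded.stream().allMatch(s -> s);
        boolean any = taskSucceeded.stream().anyMatch(s -> s);
        return policy == Policy.COMMIT_ON_FULL_SUCCESS ? all : any;
    }
}
```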
29. Gobblin Compaction
Ingestion → HDFS → Compaction
• Dimensions:
– Initial full dump followed by incremental extracts in
Gobblin
– Maintain a consistent snapshot by doing regularly
scheduled compaction
• Facts:
– Merge small files
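The dimension compaction described above can be sketched as a last-write-wins merge of a full snapshot with later incremental extracts; the (key, value) representation is a simplification for illustration, not Gobblin's actual compaction code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of snapshot compaction for dimension data: merge a full
// snapshot with incremental extracts, keeping the latest record per key.
public class SnapshotCompactor {
    public static Map<String, String> compact(Map<String, String> snapshot,
                                              List<Map<String, String>> increments) {
        Map<String, String> result = new LinkedHashMap<>(snapshot);
        for (Map<String, String> delta : increments) {
            result.putAll(delta); // later increments override earlier records
        }
        return result;
    }
}
```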
30. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
31. Gobblin in Production
• Data volume: > 350 datasets, ~ 60 TB per day
• Production instances: Salesforce, Responsys, RightNow, Timeforce, Slideshare, Newsle, A/B testing, LinkedIn JIRA, data retention
32. Lessons Learned
• Data quality still needs a lot more work
• The small data problem is not small
• Performance optimization opportunities
• Operational traits
33. Gobblin Roadmap
• Gobblin on Yarn
• Streaming Sources
• Gobblin Workbench with ingestion DSL
• Data Profiling for richer quality checking
• Open source in Q4’14
Editor's Notes
Custom data pipelines: Developing a data pipeline per source.
Data model: Tightly bundled with RDBMS with strict DDL.
Operations effort: Large number of pipelines to monitor, maintain, and troubleshoot.
Data quality: No source of truth.
High investment cost and low productivity!