Gobblin’ Big Data with Ease 
Lin Qiao 
Data Analytics Infra @ LinkedIn 
©2014 LinkedIn Corporation. All Rights Reserved.
Gobblin' Big Data With Ease @ QConSF 2014

QConSF 2014 talk

Published in: Data & Analytics
  • Custom data pipelines: a separate data pipeline is developed per source
    Data model: tightly bundled with the RDBMS, with strict DDL
    Operations effort: a large number of pipelines to monitor, maintain, and troubleshoot
    Data quality: no source of truth
    High investment cost and low productivity!
  • Gobblin' Big Data With Ease @ QConSF 2014

    1. Gobblin’ Big Data with Ease
       Lin Qiao, Data Analytics Infra @ LinkedIn
    2. Overview
       • Challenges
       • What does Gobblin provide?
       • How does Gobblin work?
       • Retrospective and lookahead
    3. Overview
       • Challenges
       • What does Gobblin provide?
       • How does Gobblin work?
       • Retrospective and lookahead
    4. Perception
       [Diagram. Labels: Primary Data Sources; Ingest Framework; Load; Validation; Analytics Platform; Transformations; Business Facing Insights; Member Facing Insights and Data Products]
    5. Reality
       [Diagram. Labels: Profile Data; Hadoop; Camus; Lumos; Teradata; External Partner Data; Ingest Framework; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Databus; Changes; Core Data Set (Tracking, Database, External); Derived Data Set; Computed Results for Member Facing Products; Enterprise Products; Change dump on filer; Ingest utilities; Lassen (facts and dimensions); Read store (Voldemort)]
    6. Challenges @ LinkedIn
       • Large variety of data sources
       • Multi-paradigm: streaming data, batch data
       • Different types of data: facts, dimensions, logs, snapshots, increments, changelog
       • Operational complexity of multiple pipelines
       • Data quality
       • Data availability and predictability
       • Engineering cost
    7. Open source solutions
       • sqoop
       • aegisthus
       • flume
       • morphline
       • logstash
       • Camus
       • RDBMS vendor-specific connectors
    8. Goals
       • Unified and Structured Data Ingestion Flow
         – RDBMS -> Hadoop
         – Event Streams -> Hadoop
       • Higher level abstractions
         – Facts, Dimensions
         – Snapshots, increments, changelog
       • ELT oriented
         – Minimize transformation in the ingest pipeline
    9. Central Ingestion Pipeline
       [Diagram. Labels: OLTP Data; Hadoop; Teradata; External Partner Data; Gobblin; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Tracking; R/W store (Oracle/Espresso); Databus; Changes; Core Data Set (Tracking, Database, External); Derived Data Set; Enterprise Products; Change dump on filer; REST; JDBC; SOAP; Custom; Compaction]
    10. Overview
        • Challenges
        • What does Gobblin provide?
        • How does Gobblin work?
        • Retrospective and lookahead
    11. Gobblin Usage @ LinkedIn
        • Business Analytics
          – Source data for sales analysis, product sentiment analysis, etc.
        • Engineering
          – Source data for issue tracking, monitoring, product release, security compliance, A/B testing
        • Consumer product
          – Source data for acquisition integration
          – Performance analysis for email campaigns, ads campaigns, etc.
    12. Key Features
        • Horizontally scalable and robust framework
        • Unified computation paradigm
        • Turn-key solution
        • Customize your own Ingestion
    13. Scalable and Robust Framework
        • Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
        • Scalable: Jobs are partitioned into tasks that run concurrently
        • Fault Tolerant: Framework gracefully deals with machine and job failures
        • Quality Assurance: Baked-in quality checking throughout the flow
    14. Unified computation paradigm
        • Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines
        • Shared infra components: Shared job state management, job metrics store, metadata management
    15. Turn Key Solution
        • Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP)
        • Built-in Source Integration: Fully integrated with commonly used sources (including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, internal dropbox)
        • Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets
        • Policy driven flow execution & tuning: Flow owners just need to specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
    16. Customize Your Own Ingestion Pipeline
        • Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API
        • Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code
    17. Overview
        • Challenges
        • What does Gobblin provide?
        • How does Gobblin work?
        • Lookahead
    18. Under the Hood
    19. Computation Model
        • Gobblin standalone
          – Single process, multi-threading
          – Testing, small data, sampling
        • Gobblin on Map/Reduce
          – Large datasets, horizontally scalable
        • Gobblin on Yarn
          – Better resource utilization
          – More scheduling flexibility
    20. Scalable Ingestion Flow
        [Diagram. Labels: Source; Work Unit (×3); Task (×3), each with Extractor, Converter, Quality Checker, Writer; Data Publisher]
    21. Sources
        • Determines how to partition work
          – Partitioning algorithm can leverage source sharding
          – Group partitions intelligently for performance
        • Creates work-units to be scheduled
    22. Job Management
        • Job execution states
          – Watermark
          – Task state, job state, quality checker output, error code
        • Job synchronization
        • Job failure handling: policy driven
        [Diagram. Labels: Job run 1; Job run 2; Job run 3; State Store]
    23. Gobblin Operator Flow
        [Diagram. Steps: Extract Schema; Convert Schema; Extract Record; Convert Record; Check Record Data Quality; Write Record; Check Task Data Quality; Commit Task Data]
    24. Extractors
        • Specifies how to get the schema and pull data from the source
        • Returns a ResultSet iterator
        • Tracks the high watermark
        • Tracks extraction metrics
    25. Converters
        • Allow for schema and data transformation
          – Filtering
          – Projection
          – Type conversion
          – Structural change
        • Composable: can specify a list of converters to be applied in the given order
    26. Quality Checkers
        • Ensure quality of any data produced by Gobblin
        • Can be run on a per record, per task, or per job basis
        • Can specify a list of quality checkers to be applied
          – Schema compatibility
          – Audit check
          – Sensitive fields
          – Unique key
        • Policy driven
          – FAIL – if the check fails then so does the job
          – OPTIONAL – if the check fails the job continues
          – ERR_FILE – the offending row is written to an error file
    27. Writers
        • Writing data in Avro format onto HDFS
          – One writer per task
        • Flexibility
          – Configurable compression codec (Deflate, Snappy)
          – Configurable buffer size
        • Plan to support other data formats (Parquet, ORC)
    28. Publishers
        • Determines job success based on policy
          – COMMIT_ON_FULL_SUCCESS
          – COMMIT_ON_PARTIAL_SUCCESS
        • Commits data to final directories based on job success
        [Diagram. Labels: Task 1, Task 2, Task 3; Tmp Dir (File 1, File 2, File 3); Final Dir (File 1, File 2, File 3)]
    29. Gobblin Compaction
        [Diagram. Labels: Ingestion; HDFS; Compaction]
        • Dimensions:
          – Initial full dump followed by incremental extracts in Gobblin
          – Maintain a consistent snapshot by doing regularly scheduled compaction
        • Facts:
          – Merge small files
    30. Overview
        • Challenges
        • What does Gobblin provide?
        • How does Gobblin work?
        • Retrospective and lookahead
    31. Gobblin in Production
        • Data Volume
          – > 350 datasets
          – ~ 60 TB per day
        • Production Instances
          – Salesforce
          – Responsys
          – RightNow
          – Timeforce
          – Slideshare
          – Newsle
          – A/B testing
          – LinkedIn JIRA
          – Data retention
    32. Lessons Learned
        • Data quality still needs a lot more work
        • The small data problem is not small
        • Performance optimization opportunities
        • Operational traits
    33. Gobblin Roadmap
        • Gobblin on Yarn
        • Streaming Sources
        • Gobblin Workbench with ingestion DSL
        • Data Profiling for richer quality checking
        • Open source in Q4’14
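The flow on slides 20 to 25 (a source partitions work into work units, and each task extracts, converts, quality-checks, and writes records) could look roughly like the sketch below. Every name in it is hypothetical and written for this illustration; none of it is Gobblin's API. Schema handling, the task-level quality check, and the commit step are left out, and the "writer" is just an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List

Record = dict  # hypothetical record type for this sketch


@dataclass
class WorkUnit:
    """A slice of the source to pull: a half-open range of row ids."""
    low: int
    high: int


@dataclass
class TaskResult:
    written: List[Record] = field(default_factory=list)
    rejected: List[Record] = field(default_factory=list)
    high_watermark: int = -1


def make_work_units(low: int, high: int, size: int) -> List[WorkUnit]:
    """Source step: partition [low, high) into work units of at most `size` ids."""
    return [WorkUnit(s, min(s + size, high)) for s in range(low, high, size)]


def extract(unit: WorkUnit, table: List[Record]) -> Iterator[Record]:
    """Extractor step: yield the records that fall inside this work unit."""
    for row in table:
        if unit.low <= row["id"] < unit.high:
            yield row


def run_task(unit: WorkUnit, table: List[Record],
             converters: List[Callable[[Record], Record]],
             row_check: Callable[[Record], bool]) -> TaskResult:
    """Run one task: extract, convert, check each record, then write it."""
    result = TaskResult()
    for record in extract(unit, table):
        for convert in converters:            # converters apply in the given order
            record = convert(record)
        if row_check(record):
            result.written.append(record)     # writer step (in-memory here)
        else:
            result.rejected.append(record)    # similar in spirit to ERR_FILE
        result.high_watermark = max(result.high_watermark, record["id"])
    return result


# Made-up source table: six rows, one of them missing an email address.
table = [
    {"id": i, "name": f"user{i}", "email": None if i == 3 else f"u{i}@example.com"}
    for i in range(6)
]


def project(record: Record) -> Record:
    """Converter: keep only the id and email fields (projection)."""
    return {"id": record["id"], "email": record["email"]}


def has_email(record: Record) -> bool:
    """Row-level quality check: the email field must be present."""
    return record["email"] is not None


units = make_work_units(0, 6, size=4)
results = [run_task(u, table, [project], has_email) for u in units]
for unit, result in zip(units, results):
    print(unit, len(result.written), len(result.rejected), result.high_watermark)
```

Each work unit is processed independently here, which is what would let tasks run concurrently; the per-task high watermark is the kind of state a later job run could pick up from a state store.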

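The publisher behaviour on slide 28 (tasks write to a temporary directory, and data moves to the final directory depending on the commit policy) might be sketched as follows. The policy names come from the slide; the `publish` function, its arguments, and the file layout are assumptions for illustration. In particular, the slide does not spell out what COMMIT_ON_PARTIAL_SUCCESS does, so this sketch assumes it publishes the output of the tasks that succeeded.

```python
import os
import shutil
import tempfile
from enum import Enum
from typing import Dict, List


class CommitPolicy(Enum):
    COMMIT_ON_FULL_SUCCESS = "full"
    COMMIT_ON_PARTIAL_SUCCESS = "partial"


def publish(task_outputs: Dict[str, bool], final_dir: str,
            policy: CommitPolicy) -> List[str]:
    """Move staged files into final_dir according to the commit policy.

    task_outputs maps a staged file path to whether its task succeeded.
    Returns the sorted names of the files that were published.
    """
    succeeded = [path for path, ok in task_outputs.items() if ok]
    all_ok = len(succeeded) == len(task_outputs)
    if policy is CommitPolicy.COMMIT_ON_FULL_SUCCESS and not all_ok:
        return []  # at least one task failed, so nothing is published
    os.makedirs(final_dir, exist_ok=True)
    published = []
    for path in succeeded:
        name = os.path.basename(path)
        shutil.move(path, os.path.join(final_dir, name))
        published.append(name)
    return sorted(published)


# Stage three task outputs in a temporary directory; task 2 is marked as failed.
root = tempfile.mkdtemp()
tmp_dir = os.path.join(root, "tmp")
os.makedirs(tmp_dir)
staged = {}
for i, ok in enumerate([True, False, True], start=1):
    path = os.path.join(tmp_dir, f"file{i}.avro")
    with open(path, "w") as handle:
        handle.write("data")
    staged[path] = ok

full = publish(staged, os.path.join(root, "final_full"),
               CommitPolicy.COMMIT_ON_FULL_SUCCESS)
partial = publish(staged, os.path.join(root, "final_partial"),
                  CommitPolicy.COMMIT_ON_PARTIAL_SUCCESS)
print(full, partial)
```

Writing to a staging directory first means readers of the final directory never see the output of a job that the policy decided not to commit.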
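For dimension data, slide 29 describes an initial full dump followed by incremental extracts, with scheduled compaction keeping a consistent snapshot. One plausible way to do that is a keyed "latest version wins" merge, sketched below. The slides do not describe the actual merge rule, so the key and version columns, the function name, and the sample rows are all assumptions; deletes are not handled.

```python
from typing import Dict, Iterable, List


def compact(snapshot: Iterable[dict], increments: Iterable[Iterable[dict]],
            key: str = "id", version: str = "modified") -> List[dict]:
    """Merge incremental extracts into a snapshot, keeping the newest row per key."""
    latest: Dict[object, dict] = {row[key]: row for row in snapshot}
    for batch in increments:
        for row in batch:
            current = latest.get(row[key])
            # Take the incoming row if the key is new or the row is not older.
            if current is None or row[version] >= current[version]:
                latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Made-up data: a full dump plus two incremental extracts.
snapshot = [
    {"id": 1, "title": "Engineer", "modified": 10},
    {"id": 2, "title": "Analyst", "modified": 10},
]
increments = [
    [{"id": 2, "title": "Senior Analyst", "modified": 20}],
    [
        {"id": 3, "title": "Designer", "modified": 30},
        {"id": 2, "title": "Analyst", "modified": 15},  # stale change, ignored
    ],
]

compacted = compact(snapshot, increments)
for row in compacted:
    print(row)
```

Comparing version values rather than relying on arrival order keeps the result stable when a late-arriving extract carries an older change.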