Gobblin’ Big Data with Ease 
Lin Qiao 
Data Analytics Infra @ LinkedIn 
©2014 LinkedIn Corporation. All Rights Reserved.
Overview 
• Challenges 
• What does Gobblin provide? 
• How does Gobblin work? 
• Retrospective and lookahead 
Perception
[Diagram components: Primary Data Sources, Ingest Framework, Analytics Platform (Transformations), Business Facing Insights, Member Facing Insights and Data Products, with Load and Validation steps]
Reality
[Diagram components: Site (Member Facing Products); Kafka with Activity (tracking) Data; R/W store (Oracle/Espresso) with Databus Changes; Profile Data; External Partner Data; Change dump on filer; Camus, Lumos, Ingest Framework, and Ingest utilities; Hadoop with Core Data Set (Tracking, Database, External) and Derived Data Set; Teradata with DWH ETL (fact tables) and Lassen (facts and dimensions); Product, Sciences, Enterprise Analytics; Enterprise Products; Read store (Voldemort) with Computed Results for Member Facing Products]
Challenges @ LinkedIn 
• Large variety of data sources 
• Multi-paradigm: streaming data, batch data 
• Different types of data: facts, dimensions, logs, 
snapshots, increments, changelog 
• Operational complexity of multiple pipelines 
• Data quality 
• Data availability and predictability 
• Engineering cost 
Open source solutions
• sqoop
• aegisthus
• flume
• morphline
• logstash
• Camus
• RDBMS vendor-specific connectors
Goals 
• Unified and Structured Data Ingestion Flow 
– RDBMS -> Hadoop 
– Event Streams -> Hadoop 
• Higher level abstractions 
– Facts, Dimensions 
– Snapshots, increments, changelog 
• ELT oriented 
– Minimize transformation in the ingest pipeline 
Central Ingestion Pipeline
[Diagram components: Site (Member Facing Products); Kafka with Tracking; R/W store (Oracle/Espresso) with Databus Changes; OLTP Data; External Partner Data; Change dump on filer; Gobblin with REST, JDBC, SOAP, and Custom connectors plus Compaction; Hadoop with Core Data Set (Tracking, Database, External) and Derived Data Set; Teradata with DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Enterprise Products]
Overview 
• Challenges 
• What does Gobblin provide? 
• How does Gobblin work? 
• Retrospective and lookahead 
Gobblin Usage @ LinkedIn 
• Business Analytics 
– Source data for sales analysis, product sentiment
analysis, etc.
• Engineering 
– Source data for issue tracking, monitoring, product 
release, security compliance, A/B testing 
• Consumer product 
– Source data for acquisition integration 
– Performance analysis for email campaign, ads 
campaign, etc. 
Key Features 
• Horizontally scalable and robust framework
• Unified computation paradigm
• Turn-key solution
• Customize your own ingestion
Scalable and Robust Framework
• Centralized State Management
– State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
• Scalable
– Jobs are partitioned into tasks that run concurrently
• Fault Tolerant
– Framework gracefully deals with machine and job failures
• Quality Assurance
– Baked-in quality checking throughout the flow
Unified computation paradigm
• Common execution flow
– Common execution flow between batch ingestion and streaming ingestion pipelines
• Shared infra components
– Shared job state management, job metrics store, metadata management
Turn Key Solution
• Built-in Exchange Protocols
– Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.)
• Built-in Source Integration
– Fully integrated with commonly used sources (including MySQL, SQL Server, Oracle, Salesforce, HDFS, filer, internal dropbox)
• Built-in Data Ingestion Semantics
– Covers full dump and incremental ingestion for fact and dimension datasets
• Policy driven flow execution & tuning
– Flow owners just need to specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
Customize Your Own Ingestion Pipeline
• Extendable Operators
– Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API
• Configurable Operator Flow
– Configuration allows for multiple plugin points to add in customized logic and code
Overview 
• Challenges 
• What does Gobblin provide? 
• How does Gobblin work? 
• Retrospective and lookahead
Under the Hood 
Computation Model 
• Gobblin standalone 
– single process, multi-threading 
– Testing, small data, sampling 
• Gobblin on Map/Reduce 
– Large datasets, horizontally scalable 
• Gobblin on Yarn 
– Better resource utilization 
– More scheduling flexibility
Scalable Ingestion Flow
[Diagram: a Source creates Work Units; each Work Unit runs as a Task through an Extractor, Converter, Quality Checker, and Writer; a Data Publisher publishes the output of the Tasks]
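The flow in this diagram can be sketched in a few lines of Python. This is an illustrative sketch only, not Gobblin's API: `run_task`, `run_job`, and the list-based work units are hypothetical names, the per-record steps are trivial stand-ins, and a thread pool stands in for concurrently running tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(work_unit):
    """One task: extract -> convert -> quality check -> write (all hypothetical)."""
    written = []
    for record in work_unit:                 # extractor: pull records for this unit
        converted = str(record).upper()      # converter: a stand-in transformation
        if converted:                        # quality checker: drop empty records
            written.append(converted)        # writer: buffer this task's output
    return written

def run_job(work_units):
    """Run every work unit as a concurrent task, then publish all task outputs."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        task_outputs = list(pool.map(run_task, work_units))  # input order is kept
    # publisher: gather the per-task outputs into the final result
    return [row for output in task_outputs for row in output]

published = run_job([["a", "b"], ["c"], ["", "d"]])
```

Each work unit is processed independently, which is what lets the number of tasks grow with the number of work units.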
Sources
• Determines how to partition work
– Partitioning algorithm can leverage source sharding
– Group partitions intelligently for performance
• Creates work-units to be scheduled
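As a rough illustration of how a source might partition work, the sketch below splits a numeric key range into contiguous work units. The function name `create_work_units` and the range-based partitioning are assumptions made for this example; the slides do not describe the actual partitioning algorithm.

```python
def create_work_units(low, high, num_units):
    """Split the half-open key range [low, high) into contiguous work units."""
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    size, extra = divmod(high - low, num_units)
    units, start = [], low
    for i in range(num_units):
        # the first `extra` units take one additional key so sizes stay balanced
        end = start + size + (1 if i < extra else 0)
        if end > start:                      # skip empty units
            units.append((start, end))
        start = end
    return units

units = create_work_units(0, 10, 3)
```

A source backed by a sharded store could instead emit one work unit per shard, or group several small shards into one unit.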
Job Management
[Diagram: Job run 1, Job run 2, and Job run 3 with a State Store]
• Job execution states
– Watermark
– Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
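A minimal sketch of carrying a watermark from one job run to the next is shown below. The `StateStore` class, its JSON file layout, and the `id`-based watermark are hypothetical choices for illustration, not Gobblin's state store format.

```python
import json
import os
import tempfile

class StateStore:
    """Hypothetical file-backed store that keeps one watermark for a job."""

    def __init__(self, path):
        self.path = path

    def load_watermark(self, default=0):
        if not os.path.exists(self.path):
            return default                   # first run: nothing stored yet
        with open(self.path) as f:
            return json.load(f)["watermark"]

    def save_watermark(self, watermark):
        with open(self.path, "w") as f:
            json.dump({"watermark": watermark}, f)

def run_job(store, source_rows):
    """Pull only rows newer than the stored watermark, then advance it."""
    low = store.load_watermark()
    pulled = [row for row in source_rows if row["id"] > low]
    if pulled:
        store.save_watermark(max(row["id"] for row in pulled))
    return pulled

store = StateStore(os.path.join(tempfile.mkdtemp(), "job_state.json"))
rows = [{"id": 1}, {"id": 2}]
first = run_job(store, rows)                 # job run 1 pulls everything
rows.append({"id": 3})
second = run_job(store, rows)                # job run 2 pulls only the new row
```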
Gobblin Operator Flow
[Diagram: operator steps Extract Schema, Convert Schema, Extract Record, Convert Record, Check Record Data Quality, Write Record, Check Task Data Quality, Commit Task Data]
Extractors
• Specifies how to get the schema and pull data from the source
• Returns a ResultSet iterator
• Tracks the high watermark
• Tracks extraction metrics
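The extractor responsibilities listed here can be sketched as a small iterator class. `ListExtractor`, its in-memory rows, and the `records_read` counter are hypothetical stand-ins chosen for this example; a real extractor would pull from an external source.

```python
class ListExtractor:
    """Hypothetical extractor over in-memory (timestamp, value) rows."""

    def __init__(self, rows, low_watermark):
        self.rows = rows
        self.low_watermark = low_watermark
        self.high_watermark = low_watermark  # highest timestamp seen so far
        self.records_read = 0                # a simple extraction metric

    def get_schema(self):
        return ("timestamp", "value")

    def __iter__(self):
        for ts, value in self.rows:
            if ts <= self.low_watermark:     # already pulled by an earlier run
                continue
            self.high_watermark = max(self.high_watermark, ts)
            self.records_read += 1
            yield ts, value

extractor = ListExtractor([(5, "a"), (9, "b"), (7, "c")], low_watermark=5)
records = list(extractor)
```

After the iteration finishes, `extractor.high_watermark` is the value a later run would use as its low watermark.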
Converters
• Allow for schema and data transformation
– Filtering
– Projection
– Type conversion
– Structural change
• Composable: can specify a list of converters to be applied in the given order
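To make the idea of composable converters concrete, here is a small sketch that applies a list of converters in order. The function names and the convention that `None` means "filtered out" are assumptions for this example, not Gobblin's converter interface.

```python
def keep_active(record):
    """Filtering: drop records that are not active."""
    return record if record.get("active") else None

def project_fields(record):
    """Projection: keep only the fields downstream consumers need."""
    return {"id": record["id"], "score": record["score"]}

def score_to_int(record):
    """Type conversion: parse the score string as an integer."""
    return {**record, "score": int(record["score"])}

def apply_converters(record, converters):
    """Apply converters in the given order; None means the record was dropped."""
    for convert in converters:
        record = convert(record)
        if record is None:
            return None
    return record

chain = [keep_active, project_fields, score_to_int]
out = apply_converters({"id": 1, "score": "42", "active": True, "note": "x"}, chain)
dropped = apply_converters({"id": 2, "score": "7", "active": False}, chain)
```

Because each converter takes a record and returns a record, reordering or extending the chain is a configuration change rather than a code change.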
Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per-record, per-task, or per-job basis
• Can specify a list of quality checkers to be applied
– Schema compatibility
– Audit check
– Sensitive fields
– Unique key
• Policy driven
– FAIL – if the check fails then so does the job
– OPTIONAL – if the check fails the job continues
– ERR_FILE – the offending row is written to an error file
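The three policies can be illustrated with a row-level check. This sketch is hypothetical: `check_rows` and `QualityCheckFailed` are made-up names, a list stands in for the error file, and keeping the failing row under OPTIONAL is an assumption, since the slide only says the job continues.

```python
class QualityCheckFailed(Exception):
    """Raised when a check fails under the FAIL policy."""

def check_rows(rows, check, policy, error_rows):
    """Apply one row-level check under a FAIL, OPTIONAL, or ERR_FILE policy."""
    passed = []
    for row in rows:
        if check(row):
            passed.append(row)
        elif policy == "FAIL":
            raise QualityCheckFailed(f"row failed check: {row!r}")
        elif policy == "OPTIONAL":
            passed.append(row)               # assumption: keep the row, continue
        elif policy == "ERR_FILE":
            error_rows.append(row)           # stand-in for writing an error file
        else:
            raise ValueError(f"unknown policy: {policy}")
    return passed

def has_key(row):
    """Example check: the row must carry a non-null id."""
    return row.get("id") is not None

errors = []
good = check_rows([{"id": 1}, {"id": None}], has_key, "ERR_FILE", errors)
```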
Writers
• Write data in Avro format onto HDFS
– One writer per task
• Flexibility
– Configurable compression codec (Deflate, Snappy)
– Configurable buffer size
• Plan to support other data formats (Parquet, ORC)
Publishers
• Determines job success based on policy
– COMMIT_ON_FULL_SUCCESS
– COMMIT_ON_PARTIAL_SUCCESS
• Commits data to final directories based on job success
[Diagram: Task 1, Task 2, and Task 3 write File 1, File 2, and File 3 to the Tmp Dir; the files are then committed to the Final Dir]
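The commit step can be sketched as moving files from a temporary directory to a final one. The `publish` function and the exact meaning given to each policy are assumptions for this example: full success is taken to mean "publish nothing unless every task succeeded", and partial success to mean "publish the output of the tasks that succeeded".

```python
import os
import shutil
import tempfile

POLICIES = ("COMMIT_ON_FULL_SUCCESS", "COMMIT_ON_PARTIAL_SUCCESS")

def publish(task_results, tmp_dir, final_dir, policy):
    """Move task output from tmp_dir to final_dir according to the commit policy.

    task_results maps a file name to True (task succeeded) or False (task failed).
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "COMMIT_ON_FULL_SUCCESS" and not all(task_results.values()):
        return []                            # one failure blocks the whole commit
    os.makedirs(final_dir, exist_ok=True)
    committed = []
    for name, ok in sorted(task_results.items()):
        if ok:
            shutil.move(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
            committed.append(name)
    return committed

root = tempfile.mkdtemp()
tmp_dir, final_dir = os.path.join(root, "tmp"), os.path.join(root, "final")
os.makedirs(tmp_dir)
for name in ("file1", "file2", "file3"):
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write(name)

results = {"file1": True, "file2": False, "file3": True}
full = publish(results, tmp_dir, final_dir, "COMMIT_ON_FULL_SUCCESS")
partial = publish(results, tmp_dir, final_dir, "COMMIT_ON_PARTIAL_SUCCESS")
```

Writing to a temporary directory first means readers of the final directory never see the output of a job that was not committed.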
Gobblin Compaction
[Diagram: Ingestion, HDFS, Compaction]
• Dimensions:
– Initial full dump followed by incremental extracts in Gobblin
– Maintain a consistent snapshot by doing regularly scheduled compaction
• Facts:
– Merge small files
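For dimension data, compaction amounts to folding incremental extracts into the last snapshot. The sketch below assumes each row carries a primary key (`id`) and a modification version (`modified`), keeps the newest row per key, and ignores deletes; these field names and rules are illustrative assumptions, not details from the slides.

```python
def compact(snapshot, increments, key="id", version="modified"):
    """Merge incremental rows into a snapshot, keeping the newest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row           # new key, or a newer version of it
    return sorted(latest.values(), key=lambda row: row[key])

snapshot = [
    {"id": 1, "modified": 1, "name": "a"},
    {"id": 2, "modified": 1, "name": "b"},
]
increments = [
    {"id": 2, "modified": 3, "name": "b2"},  # update to an existing row
    {"id": 3, "modified": 2, "name": "c"},   # brand-new row
]
compacted = compact(snapshot, increments)
```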
Overview 
• Challenges 
• What does Gobblin provide? 
• How does Gobblin work? 
• Retrospective and lookahead 
Gobblin in Production
• Data Volume
– > 350 datasets
– ~ 60 TB per day
• Production Instances
– Salesforce
– Responsys
– RightNow
– Timeforce
– Slideshare
– Newsle
– A/B testing
– LinkedIn JIRA
– Data retention
Lessons Learned
• Data quality still needs a lot more work
• Small data problem is not small 
• Performance optimization opportunities 
• Operational traits 
Gobblin Roadmap 
• Gobblin on Yarn 
• Streaming Sources 
• Gobblin Workbench with ingestion DSL 
• Data Profiling for richer quality checking 
• Open source in Q4’14 
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge Graphs
 
Understanding the Impact of video length on student performance
Understanding the Impact of video length on student performanceUnderstanding the Impact of video length on student performance
Understanding the Impact of video length on student performance
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptx
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdf
 
Target_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110millionTarget_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110million
 
Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potx
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdf
 
Brain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxBrain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptx
 
Unleashing Datas Potential - Mastering Precision with FCO-IM
Unleashing Datas Potential - Mastering Precision with FCO-IMUnleashing Datas Potential - Mastering Precision with FCO-IM
Unleashing Datas Potential - Mastering Precision with FCO-IM
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data Warehouse
 
Empowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded AnalyticsEmpowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded Analytics
 
Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media Platforms
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 
PPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggPPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfgggg
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
 

Gobblin' Big Data With Ease @ QConSF 2014

  • 1. Gobblin’ Big Data with Ease Lin Qiao Data Analytics Infra @ LinkedIn ©2014 LinkedIn Corporation. All Rights Reserved.
  • 2. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 3. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 4. Perception [diagram labels] Analytics Platform; Ingest Framework; Primary Data Sources; Transformations; Business Facing Insights; Member Facing Insights and Data Products; Load; Validation
  • 5. Reality [diagram labels] Profile Data; Hadoop; Camus; Lumos; Teradata; External Partner Data; Ingest Framework; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Databus Changes; Core Data Set (Tracking, Database, External); Derived Data Set; Computed Results for Member Facing Products; Enterprise Products; Change dump on filer; Ingest utilities; Lassen (facts and dimensions); Read store (Voldemort)
  • 6. Challenges @ LinkedIn • Large variety of data sources • Multi-paradigm: streaming data, batch data • Different types of data: facts, dimensions, logs, snapshots, increments, changelog • Operational complexity of multiple pipelines • Data quality • Data availability and predictability • Engineering cost
  • 7. Open source solutions • Sqoop • Aegisthus • Flume • Morphline • Logstash • Camus • RDBMS vendor-specific connectors
  • 8. Goals • Unified and Structured Data Ingestion Flow – RDBMS -> Hadoop – Event Streams -> Hadoop • Higher level abstractions – Facts, Dimensions – Snapshots, increments, changelog • ELT oriented – Minimize transformation in the ingest pipeline
  • 9. Central Ingestion Pipeline [diagram labels] Hadoop; OLTP Data; Teradata; External Partner Data; Gobblin; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Tracking; R/W store (Oracle/Espresso); Databus Changes; Core Data Set (Tracking, Database, External); Derived Data Set; Enterprise Products; Change dump on filer; REST; JDBC; SOAP; Custom; Compaction
  • 10. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 11. Gobblin Usage @ LinkedIn • Business Analytics – Source data for sales analysis, product sentiment analysis, etc. • Engineering – Source data for issue tracking, monitoring, product release, security compliance, A/B testing • Consumer product – Source data for acquisition integration – Performance analysis for email campaigns, ad campaigns, etc.
  • 12. Key Features • Horizontally scalable and robust framework • Unified computation paradigm • Turn-key solution • Customize your own Ingestion
  • 13. Scalable and Robust Framework • Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc. • Scalable: Jobs are partitioned into tasks that run concurrently • Fault Tolerant: Framework gracefully deals with machine and job failures • Quality Assurance: Baked-in quality checking throughout the flow
  • 14. Unified computation paradigm • Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines • Shared infra components: Shared job state management, job metrics store, metadata management.
  • 15. Turn Key Solution • Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.) • Built-in Source Integration: Fully integrated with commonly used sources (including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, internal dropbox) • Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets. • Policy driven flow execution & tuning: Flow owners just need to specify pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
  • 16. Customize Your Own Ingestion Pipeline • Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against common API. • Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code
  • 17. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Lookahead
  • 18. Under the Hood
  • 19. Computation Model • Gobblin standalone – single process, multi-threading – Testing, small data, sampling • Gobblin on Map/Reduce – Large datasets, horizontally scalable • Gobblin on Yarn – Better resource utilization – More scheduling flexibilities
  • 20. Scalable Ingestion Flow [diagram] Source → Work Units → Tasks (each task: Extractor → Converter → Quality Checker → Writer) → Data Publisher
  • 21. Sources • Determines how to partition work – Partitioning algorithm can leverage source sharding – Group partitions intelligently for performance • Creates work-units to be scheduled
  • 22. Job Management [diagram labels: Job run 1; Job run 2; Job run 3; State Store] • Job execution states – Watermark – Task state, job state, quality checker output, error code • Job synchronization • Job failure handling: policy driven
  • 23. Gobblin Operator Flow [diagram] Extract Schema → Convert Schema → Extract Record → Convert Record → Check Record Data Quality → Write Record → Check Task Data Quality → Commit Task Data
  • 24. Extractors • Specifies how to get the schema and pull data from the source • Return ResultSet iterator • Track high watermark • Track extraction metrics
  • 25. Converters • Allow for schema and data transformation – Filtering – Projection – Type conversion – Structural change • Composable: can specify a list of converters to be applied in the given order
  • 26. Quality Checkers • Ensure quality of any data produced by Gobblin • Can be run on a per record, per task, or per job basis • Can specify a list of quality checkers to be applied – Schema compatibility – Audit check – Sensitive fields – Unique key • Policy driven – FAIL – if the check fails then so does the job – OPTIONAL – if the check fails the job continues – ERR_FILE – the offending row is written to an error file
  • 27. Writers • Writing data in Avro format onto HDFS – One writer per task • Flexibility – Configurable compression codec (Deflate, Snappy) – Configurable buffer size • Plan to support other data formats (Parquet, ORC)
  • 28. Publishers • Determines job success based on policy – COMMIT_ON_FULL_SUCCESS – COMMIT_ON_PARTIAL_SUCCESS • Commits data to final directories based on job success [diagram] Task 1, Task 2, Task 3 → Tmp Dir (File 1, File 2, File 3) → Final Dir (File 1, File 2, File 3)
  • 29. Gobblin Compaction [diagram labels: Ingestion; HDFS; Compaction] • Dimensions: – Initial full dump followed by incremental extracts in Gobblin – Maintain a consistent snapshot by doing regularly scheduled compaction • Facts: – Merge small files
  • 30. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 31. Gobblin in Production • Data Volume: > 350 datasets; ~ 60 TB per day • Production Instances: Salesforce; Responsys; RightNow; Timeforce; Slideshare; Newsle; A/B testing; LinkedIn JIRA; Data retention
  • 32. Lessons Learned • Data quality still needs a lot more work • Small data problem is not small • Performance optimization opportunities • Operational traits
  • 33. Gobblin Roadmap • Gobblin on Yarn • Streaming Sources • Gobblin Workbench with ingestion DSL • Data Profiling for richer quality checking • Open source in Q4’14
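
The flow described on slides 20 to 28 (a source creates work units, each task runs extract, convert, quality check and write, and a publisher commits staged output according to a policy) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Gobblin's actual API: every name (`WorkUnit`, `TaskResult`, `make_work_units`, `run_task`, `publish`) is hypothetical, in-memory lists stand in for the source and for files on HDFS, and watermarks are plain integer ids. The slides only name the policies, so the reading of OPTIONAL (keep the record and continue) and of COMMIT_ON_PARTIAL_SUCCESS (commit only the output of successful tasks) is a guess.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical names throughout; this is not Gobblin's API.
Record = Dict[str, object]


@dataclass
class WorkUnit:
    low: int   # low watermark (exclusive)
    high: int  # high watermark (inclusive)


@dataclass
class TaskResult:
    work_unit: WorkUnit
    records: List[Record]     # staged output (stands in for a file in the tmp dir)
    error_rows: List[Record]  # rows diverted under the ERR_FILE policy
    success: bool


def make_work_units(last_watermark: int, latest: int, size: int) -> List[WorkUnit]:
    """Source: partition the id range (last_watermark, latest] into work units."""
    return [WorkUnit(lo, min(lo + size, latest))
            for lo in range(last_watermark, latest, size)]


def extract(rows: Iterable[Record], unit: WorkUnit) -> Iterator[Record]:
    """Extractor: yield only the rows that fall inside this work unit."""
    return (r for r in rows if unit.low < r["id"] <= unit.high)


def run_task(unit: WorkUnit,
             rows: Iterable[Record],
             converters: List[Callable[[Record], Record]],
             check: Callable[[Record], bool],
             policy: str) -> TaskResult:
    """Task: extract -> convert (in the given order) -> quality check -> write."""
    staged: List[Record] = []
    errors: List[Record] = []
    for record in extract(rows, unit):
        for convert in converters:
            record = convert(record)
        if check(record):
            staged.append(record)
        elif policy == "FAIL":
            # The failed check fails the task; keep what was staged so far.
            return TaskResult(unit, staged, errors, False)
        elif policy == "ERR_FILE":
            errors.append(record)
        elif policy == "OPTIONAL":
            staged.append(record)  # assumption: the record is kept
        else:
            raise ValueError(f"unknown quality policy: {policy}")
    return TaskResult(unit, staged, errors, True)


def publish(results: List[TaskResult], commit_policy: str) -> List[Record]:
    """Publisher: decide which staged task outputs reach the final location."""
    if commit_policy == "COMMIT_ON_FULL_SUCCESS":
        committed = results if all(r.success for r in results) else []
    elif commit_policy == "COMMIT_ON_PARTIAL_SUCCESS":
        committed = [r for r in results if r.success]  # assumed semantics
    else:
        raise ValueError(f"unknown commit policy: {commit_policy}")
    return [rec for r in committed for rec in r.records]
```

A caller would build work units from the last stored watermark, run one `run_task` per unit (concurrently in a real system), and pass the results to `publish`. Keeping each task's output separate until `publish` runs mirrors the tmp-dir-then-final-dir step on slide 28.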

Editor's Notes

  1. Custom data pipelines: developing a data pipeline per source. Data model: tightly bundled with RDBMS with strict DDL. Operations effort: large number of pipelines to monitor, maintain and troubleshoot. Data quality: no source of truth. High investment cost and low productivity!