Gobblin' Big Data With Ease @ QConSF 2014

Lin Qiao, Tech Lead / Manager at LinkedIn
Gobblin’ Big Data with Ease 
Lin Qiao 
Data Analytics Infra @ LinkedIn 
©2014 LinkedIn Corporation. All Rights Reserved.
Overview 
• Challenges 
• What does Gobblin provide? 
• How does Gobblin work? 
• Retrospective and lookahead 
Perception
[Diagram labels: Primary Data Sources; Ingest Framework; Load; Validation; Analytics Platform; Transformations; Business Facing Insights; Member Facing Insights and Data Products]
Reality
[Diagram labels: Site (Member Facing Products); Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Profile Data; Databus; Changes; Change dump on filer; External Partner Data; Ingest utilities; Ingest Framework; Camus; Lumos; Hadoop; Teradata; DWH ETL (fact tables); Lassen (facts and dimensions); Core Data Set (Tracking, Database, External); Derived Data Set; Computed Results for Member Facing Products; Read store (Voldemort); Product, Sciences, Enterprise Analytics; Enterprise Products]
Challenges @ LinkedIn 
• Large variety of data sources 
• Multi-paradigm: streaming data, batch data 
• Different types of data: facts, dimensions, logs, snapshots, increments, changelog
• Operational complexity of multiple pipelines 
• Data quality 
• Data availability and predictability 
• Engineering cost 
Open source solutions 
• Sqoop
• Aegisthus
• Flume
• Morphline
• Logstash
• Camus
• RDBMS vendor-specific connectors
Goals 
• Unified and Structured Data Ingestion Flow 
– RDBMS -> Hadoop 
– Event Streams -> Hadoop 
• Higher level abstractions 
– Facts, Dimensions 
– Snapshots, increments, changelog 
• ELT oriented 
– Minimize transformation in the ingest pipeline 
Central Ingestion Pipeline
[Diagram labels: Site (Member Facing Products); Kafka; Tracking; R/W store (Oracle/Espresso); OLTP Data; Databus; Changes; Change dump on filer; External Partner Data; REST; JDBC; SOAP; Custom; Gobblin; Compaction; Hadoop; Teradata; DWH ETL (fact tables); Core Data Set (Tracking, Database, External); Derived Data Set; Product, Sciences, Enterprise Analytics; Enterprise Products]
Gobblin Usage @ LinkedIn 
• Business Analytics
– Source data for sales analysis, product sentiment analysis, etc.
• Engineering
– Source data for issue tracking, monitoring, product release, security compliance, A/B testing
• Consumer product
– Source data for acquisition integration
– Performance analysis for email campaigns, ad campaigns, etc.
Key Features
• Horizontally scalable and robust framework
• Unified computation paradigm
• Turn-key solution
• Customize your own ingestion
Scalable and Robust Framework 
• Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
• Scalable: Jobs are partitioned into tasks that run concurrently.
• Fault Tolerant: The framework gracefully deals with machine and job failures.
• Quality Assurance: Quality checking is baked in throughout the flow.
Unified computation paradigm 
• Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines.
• Shared infra components: Shared job state management, job metrics store, and metadata management.
Turn Key Solution 
• Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP).
• Built-in Source Integration: Fully integrated with commonly used sources (including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, internal dropbox).
• Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets.
• Policy driven flow execution & tuning: Flow owners just need to specify pre-defined policies for handling job failure, degree of parallelism, what data to publish, etc.
Customize Your Own Ingestion Pipeline 
• Extendable Operators: Operators for extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API.
• Configurable Operator Flow: Configuration allows for multiple plugin points to add customized logic and code.
Under the Hood 
Computation Model 
• Gobblin standalone 
– single process, multi-threading 
– Testing, small data, sampling 
• Gobblin on MapReduce
– Large datasets, horizontally scalable
• Gobblin on YARN
– Better resource utilization
– More scheduling flexibility
Scalable Ingestion Flow
[Diagram: a Source creates Work Units; each Work Unit runs as a Task through Extractor, Converter, Quality Checker, and Writer; a Data Publisher publishes the output of the Tasks.]
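The ingestion flow on this slide can be sketched in a few lines of Python. This is an illustration only, not Gobblin's API: `make_work_units`, `run_task`, and `run_job` are hypothetical names, the "source" is just a range of integer ids, and the writer stages rows in memory instead of on HDFS. It assumes `num_units` is at least 1.

```python
from concurrent.futures import ThreadPoolExecutor


def make_work_units(low, high, num_units):
    """Source: split the half-open id range [low, high) into work units."""
    step = max(1, -(-(high - low) // num_units))  # ceiling division
    return [(start, min(start + step, high)) for start in range(low, high, step)]


def run_task(work_unit):
    """Task: extract -> convert -> quality check -> write for one work unit."""
    start, end = work_unit
    extracted = range(start, end)                               # Extractor (stand-in source)
    converted = ({"id": i, "value": i * 2} for i in extracted)  # Converter
    checked = [row for row in converted if row["value"] >= 0]   # Quality Checker
    return checked                                              # Writer (in-memory staging)


def run_job(low, high, num_units):
    """Run all tasks concurrently, then publish their staged output."""
    work_units = make_work_units(low, high, num_units)
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        staged = list(pool.map(run_task, work_units))
    return [row for task_rows in staged for row in task_rows]   # Data Publisher
```

A thread pool corresponds roughly to the standalone mode (single process, multi-threading); on a cluster each work unit would instead be handed to a separate worker.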
Sources
• Determines how to partition work
– Partitioning algorithm can leverage source sharding
– Group partitions intelligently for performance
• Creates work-units to be scheduled
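One way to "group partitions intelligently" is to balance estimated partition sizes across a fixed number of work units. The slides do not say which algorithm Gobblin uses; the sketch below assumes a simple greedy heuristic (largest partition first, always into the lightest work unit), and `group_partitions` and the shard names are hypothetical.

```python
import heapq


def group_partitions(partition_sizes, num_work_units):
    """Pack source partitions into work units with roughly equal total size.

    partition_sizes maps a partition (shard) name to its estimated size.
    Returns one sorted list of partition names per work unit.
    """
    # Each heap entry is (current load, work unit index, member partitions).
    # The index is unique, so the member lists are never compared.
    heap = [(0, index, []) for index in range(num_work_units)]
    heapq.heapify(heap)
    by_size_desc = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in by_size_desc:
        load, index, members = heapq.heappop(heap)  # lightest work unit so far
        members.append(name)
        heapq.heappush(heap, (load + size, index, members))
    ordered = sorted(heap, key=lambda entry: entry[1])
    return [sorted(members) for _, _, members in ordered]
```

With sizes `{"shard-a": 90, "shard-b": 40, "shard-c": 30, "shard-d": 20}` and two work units, the large shard gets a work unit to itself and the three small shards share the other.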
Job Management
[Diagram: Job run 1, Job run 2, and Job run 3 share a State Store.]
• Job execution states
– Watermark
– Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
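How a watermark carried in the state store turns successive job runs into incremental pulls can be sketched as follows. This is a minimal illustration under stated assumptions: `InMemoryStateStore` and `run_incremental` are hypothetical names, the store is a plain dictionary rather than a durable service, and rows carry a numeric `ts` field that serves as the watermark column.

```python
class InMemoryStateStore:
    """Hypothetical state store keeping one state dictionary per job."""

    def __init__(self):
        self._states = {}

    def load(self, job_name):
        return dict(self._states.get(job_name, {}))

    def save(self, job_name, state):
        self._states[job_name] = dict(state)


def run_incremental(store, job_name, source_rows):
    """Pull only rows newer than the watermark left by the previous run."""
    state = store.load(job_name)
    low = state.get("watermark")  # None on the very first run
    new_rows = [row for row in source_rows if low is None or row["ts"] > low]
    if new_rows:
        state["watermark"] = max(row["ts"] for row in new_rows)
    state["runs"] = state.get("runs", 0) + 1
    store.save(job_name, state)   # carried over to the next run
    return new_rows
```

The first run behaves like a full dump because no watermark exists yet; later runs pick up only rows whose `ts` exceeds the stored value.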
Gobblin Operator Flow
[Diagram: Extract Schema → Convert Schema → Extract Record → Convert Record → Check Record Data Quality → Write Record → Check Task Data Quality → Commit Task Data]
Extractors
• Specifies how to get the schema and pull data from the source
• Returns a ResultSet iterator
• Tracks the high watermark
• Tracks extraction metrics
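The four extractor responsibilities listed above can be illustrated with a small class. This is a sketch, not Gobblin's extractor interface: `ListExtractor`, its method names, and the `ts` watermark column are all assumptions, and an in-memory list stands in for the real source.

```python
class ListExtractor:
    """Hypothetical extractor over an in-memory list of row dictionaries."""

    def __init__(self, schema, rows, low_watermark):
        self._schema = list(schema)
        # Only rows strictly above the low watermark belong to this pull.
        self._rows = [row for row in rows if row["ts"] > low_watermark]
        self.high_watermark = low_watermark
        self.records_read = 0  # a simple extraction metric

    def get_schema(self):
        return list(self._schema)

    def __iter__(self):
        # Records are handed out one at a time, like a result-set iterator.
        for row in self._rows:
            self.high_watermark = max(self.high_watermark, row["ts"])
            self.records_read += 1
            yield row
```

After the iterator is exhausted, `high_watermark` holds the largest `ts` seen, which is the value a later run would use as its low watermark.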
Converters
• Allow for schema and data transformation
– Filtering
– Projection
– Type conversion
– Structural change
• Composable: can specify a list of converters to be applied in the given order
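Composability can be shown with converters that each transform both the schema and the records, applied in list order. The class names below (`CastInt`, `DropIf`, `Project`) and the `run_chain` helper are hypothetical and only mirror the filtering, projection, and type-conversion cases named on the slide; a converter signals "filter this record out" by returning `None`.

```python
class CastInt:
    """Type conversion: parse one field as an integer."""
    def __init__(self, field):
        self.field = field
    def convert_schema(self, schema):
        return schema
    def convert_record(self, record):
        return {**record, self.field: int(record[self.field])}


class DropIf:
    """Filtering: drop records for which the predicate is true."""
    def __init__(self, predicate):
        self.predicate = predicate
    def convert_schema(self, schema):
        return schema
    def convert_record(self, record):
        return None if self.predicate(record) else record


class Project:
    """Projection: keep only the listed fields."""
    def __init__(self, fields):
        self.fields = fields
    def convert_schema(self, schema):
        return [name for name in schema if name in self.fields]
    def convert_record(self, record):
        return {name: record[name] for name in self.fields if name in record}


def run_chain(converters, schema, records):
    """Apply the converters in the given order to the schema and each record."""
    for converter in converters:
        schema = converter.convert_schema(schema)
    converted = []
    for record in records:
        for converter in converters:
            record = converter.convert_record(record)
            if record is None:  # filtered out, skip the remaining converters
                break
        if record is not None:
            converted.append(record)
    return schema, converted
```

Order matters: here the cast has to run before the filter, because the filter compares `age` as a number.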
Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per record, per task, or per job basis
• Can specify a list of quality checkers to be applied
– Schema compatibility
– Audit check
– Sensitive fields
– Unique key
• Policy driven
– FAIL – if the check fails then so does the job
– OPTIONAL – if the check fails the job continues
– ERR_FILE – the offending row is written to an error file
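The three policies can be sketched for record-level checks as follows. The policy names come from the slide; everything else (`run_row_checks`, `QualityCheckError`, the check tuples, and the use of a list in place of an actual error file) is an assumption made for illustration.

```python
class QualityCheckError(Exception):
    """Raised when a check with policy FAIL does not pass."""


def run_row_checks(records, checks):
    """Apply record-level checks; each check is (name, predicate, policy).

    Returns (passed records, rows destined for the error file, warnings).
    """
    passed, err_rows, warnings = [], [], []
    for record in records:
        keep = True
        for name, predicate, policy in checks:
            if predicate(record):
                continue
            if policy == "FAIL":
                raise QualityCheckError(name)    # the whole job fails
            elif policy == "ERR_FILE":
                err_rows.append(record)          # offending row is set aside
                keep = False
                break
            elif policy == "OPTIONAL":
                warnings.append((name, record))  # noted, job continues
            else:
                raise ValueError("unknown policy: " + policy)
        if keep:
            passed.append(record)
    return passed, err_rows, warnings
```

Task-level and job-level checks would follow the same pattern, with the predicate applied to aggregate output (for example row counts) instead of a single record.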
Writers
• Writing data in Avro format onto HDFS
– One writer per task
• Flexibility
– Configurable compression codec (Deflate, Snappy)
– Configurable buffer size
• Plan to support other data formats (Parquet, ORC)
Publishers
• Determines job success based on policy
– COMMIT_ON_FULL_SUCCESS
– COMMIT_ON_PARTIAL_SUCCESS
• Commits data to final directories based on job success
[Diagram: Task 1, Task 2, and Task 3 write File 1, File 2, and File 3 to the Tmp Dir; the Publisher moves them to the Final Dir.]
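The publish step can be sketched with local directories standing in for HDFS. The two policy names are from the slide; their exact semantics are assumed here: full success commits only when every task succeeded, partial success commits the output of whichever tasks succeeded. `publish` and `demo` are hypothetical helpers.

```python
import os
import shutil
import tempfile

POLICIES = ("COMMIT_ON_FULL_SUCCESS", "COMMIT_ON_PARTIAL_SUCCESS")


def publish(task_status, tmp_dir, final_dir, policy):
    """Move staged task files from tmp_dir to final_dir per the commit policy.

    task_status maps a staged file name to True (task succeeded) or False.
    Returns the list of committed file names.
    """
    if policy not in POLICIES:
        raise ValueError("unknown commit policy: " + policy)
    if policy == "COMMIT_ON_FULL_SUCCESS" and not all(task_status.values()):
        return []  # job failed: nothing becomes visible in final_dir
    os.makedirs(final_dir, exist_ok=True)
    committed = []
    for name in sorted(task_status):
        if task_status[name]:
            shutil.move(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
            committed.append(name)
    return committed


def demo(policy):
    """Stage three task files, mark the second task as failed, and publish."""
    with tempfile.TemporaryDirectory() as root:
        tmp_dir = os.path.join(root, "tmp")
        final_dir = os.path.join(root, "final")
        os.makedirs(tmp_dir)
        status = {"file1": True, "file2": False, "file3": True}
        for name in status:
            with open(os.path.join(tmp_dir, name), "w") as handle:
                handle.write("data")
        return publish(status, tmp_dir, final_dir, policy)
```

Staging in a temporary directory and moving files only at the end keeps downstream readers from seeing half-written output.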
Gobblin Compaction
[Diagram: Ingestion → HDFS → Compaction]
• Dimensions:
– Initial full dump followed by incremental extracts in Gobblin
– Maintain a consistent snapshot by doing regularly scheduled compaction
• Facts:
– Merge small files
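For dimension data, compaction amounts to folding the incremental extracts into the last full snapshot and keeping the newest version of each key. The sketch below assumes each row has a primary key (`id`) and a monotonically increasing version column (`modified`); both column names and the `compact` function are hypothetical, and deletes are not handled.

```python
def compact(snapshot, increments, key="id", version="modified"):
    """Merge incremental rows into a snapshot, keeping the latest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        # Accept the row if the key is new or the row is at least as recent.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda row: row[key])
```

Comparing versions instead of relying on arrival order means late or out-of-order increments cannot overwrite a newer row.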
Gobblin in Production
• Data Volume
– > 350 datasets
– ~ 60 TB per day
• Production Instances
– Salesforce
– Responsys
– RightNow
– Timeforce
– Slideshare
– Newsle
– A/B testing
– LinkedIn JIRA
– Data retention
Lessons Learned
• Data quality still needs a lot more work
• The small data problem is not small
• Performance optimization opportunities
• Operational traits
Gobblin Roadmap 
• Gobblin on YARN
• Streaming Sources 
• Gobblin Workbench with ingestion DSL 
• Data Profiling for richer quality checking 
• Open source in Q4’14 
©2014 LinkedIn Corporation. All Rights Reserved. 
33
©2014 LinkedIn Corporation. All Rights Reserved.
1 von 34

Recomendados

SSD Deployment Strategies for MySQL von
SSD Deployment Strategies for MySQLSSD Deployment Strategies for MySQL
SSD Deployment Strategies for MySQLYoshinori Matsunobu
18.9K views52 Folien
NATS Streaming - an alternative to Apache Kafka? von
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?Anton Zadorozhniy
4.6K views13 Folien
A Reference Architecture for ETL 2.0 von
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
21.5K views31 Folien
Migrating to Apache Spark at Netflix von
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixDatabricks
2.2K views34 Folien
High Performance Data Lake with Apache Hudi and Alluxio at T3Go von
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
267 views29 Folien
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... von
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
4.2K views45 Folien

Más contenido relacionado

Was ist angesagt?

Using Kafka to scale database replication von
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replicationVenu Ryali
732 views46 Folien
Druid deep dive von
Druid deep diveDruid deep dive
Druid deep diveKashif Khan
3K views45 Folien
Open vSwitch - Stateful Connection Tracking & Stateful NAT von
Open vSwitch - Stateful Connection Tracking & Stateful NATOpen vSwitch - Stateful Connection Tracking & Stateful NAT
Open vSwitch - Stateful Connection Tracking & Stateful NATThomas Graf
4.3K views17 Folien
Supporting Apache HBase : Troubleshooting and Supportability Improvements von
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
1.8K views47 Folien
Spark and S3 with Ryan Blue von
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
3.9K views29 Folien
Kafka replication apachecon_2013 von
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
21.3K views31 Folien

Was ist angesagt?(20)

Using Kafka to scale database replication von Venu Ryali
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replication
Venu Ryali732 views
Open vSwitch - Stateful Connection Tracking & Stateful NAT von Thomas Graf
Open vSwitch - Stateful Connection Tracking & Stateful NATOpen vSwitch - Stateful Connection Tracking & Stateful NAT
Open vSwitch - Stateful Connection Tracking & Stateful NAT
Thomas Graf4.3K views
Supporting Apache HBase : Troubleshooting and Supportability Improvements von DataWorks Summit
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit1.8K views
Spark and S3 with Ryan Blue von Databricks
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
Databricks3.9K views
Kafka replication apachecon_2013 von Jun Rao
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
Jun Rao21.3K views
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... von Databricks
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks2.8K views
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase von DataWorks Summit
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
DataWorks Summit2.6K views
Apache Iceberg - A Table Format for Hige Analytic Datasets von Alluxio, Inc.
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.6.6K views
How YugaByte DB Implements Distributed PostgreSQL von Yugabyte
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQL
Yugabyte1.3K views
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im... von Databricks
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Databricks858 views
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia von Databricks
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks11.7K views
Espresso: LinkedIn's Distributed Data Serving Platform (Paper) von Amy W. Tang
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Amy W. Tang40.1K views
Apache Hudi: The Path Forward von Alluxio, Inc.
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.496 views
Espresso: LinkedIn's Distributed Data Serving Platform (Talk) von Amy W. Tang
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Amy W. Tang14.2K views
Pinot: Near Realtime Analytics @ Uber von Xiang Fu
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
Xiang Fu21.8K views
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake von Databricks
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks2.2K views
Schema-on-Read vs Schema-on-Write von Amr Awadallah
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
Amr Awadallah26.9K views

Destacado

gobblin-meetup-yarn von
gobblin-meetup-yarngobblin-meetup-yarn
gobblin-meetup-yarnYinan Li
769 views10 Folien
Gobblin for Data Analytics von
Gobblin for Data AnalyticsGobblin for Data Analytics
Gobblin for Data AnalyticsIntel IT Center
2.4K views14 Folien
Bringing OLTP woth OLAP: Lumos on Hadoop von
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
3.9K views30 Folien
Gobblin @ NerdWallet (Nov 2015) von
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)NerdWalletHQ
2.2K views15 Folien
19 Sure Ways To Sabotage Your Job Search von
19 Sure Ways To Sabotage Your Job Search19 Sure Ways To Sabotage Your Job Search
19 Sure Ways To Sabotage Your Job SearchJarkko Sjöman
328.6K views43 Folien
Gobblin: Unifying Data Ingestion for Hadoop von
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopYinan Li
2.9K views19 Folien

Destacado(12)

gobblin-meetup-yarn von Yinan Li
gobblin-meetup-yarngobblin-meetup-yarn
gobblin-meetup-yarn
Yinan Li769 views
Bringing OLTP woth OLAP: Lumos on Hadoop von DataWorks Summit
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
DataWorks Summit3.9K views
Gobblin @ NerdWallet (Nov 2015) von NerdWalletHQ
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)
NerdWalletHQ2.2K views
19 Sure Ways To Sabotage Your Job Search von Jarkko Sjöman
19 Sure Ways To Sabotage Your Job Search19 Sure Ways To Sabotage Your Job Search
19 Sure Ways To Sabotage Your Job Search
Jarkko Sjöman328.6K views
Gobblin: Unifying Data Ingestion for Hadoop von Yinan Li
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li2.9K views
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos... von DataWorks Summit
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
DataWorks Summit3.5K views
Data Ingestion, Extraction & Parsing on Hadoop von skaluska
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
skaluska13.9K views
Apache NiFi- MiNiFi meetup Slides von Isheeta Sanghi
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
Isheeta Sanghi110.7K views
Introduction to Databus von Amy W. Tang
Introduction to DatabusIntroduction to Databus
Introduction to Databus
Amy W. Tang6.5K views
Fine-Grained Scheduling with Helix (ApacheCon NA 2014) von Kanak Biscuitwala
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Kanak Biscuitwala2.1K views
Data Infrastructure at LinkedIn von Amy W. Tang
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
Amy W. Tang6.9K views

Similar a Gobblin' Big Data With Ease @ QConSF 2014

eBay Experimentation Platform on Hadoop von
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
2.3K views34 Folien
Experimentation Platform on Hadoop von
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
838 views34 Folien
Big Data Ready Enterprise von
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise DataWorks Summit/Hadoop Summit
2.2K views25 Folien
InfoSphere BigInsights - Analytics power for Hadoop - field experience von
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceWilfried Hoge
2.6K views33 Folien
Hadoop and SQL: Delivery Analytics Across the Organization von
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the OrganizationSeeling Cheung
2.5K views30 Folien
The Essential Guide for Automating CMDB population and maintenance von
The Essential Guide for Automating CMDB population and maintenanceThe Essential Guide for Automating CMDB population and maintenance
The Essential Guide for Automating CMDB population and maintenanceStefan Bergstein
1.3K views20 Folien

Similar a Gobblin' Big Data With Ease @ QConSF 2014(20)

eBay Experimentation Platform on Hadoop von Tony Ng
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
Tony Ng2.3K views
InfoSphere BigInsights - Analytics power for Hadoop - field experience von Wilfried Hoge
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experience
Wilfried Hoge2.6K views
Hadoop and SQL: Delivery Analytics Across the Organization von Seeling Cheung
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
Seeling Cheung2.5K views
The Essential Guide for Automating CMDB population and maintenance von Stefan Bergstein
The Essential Guide for Automating CMDB population and maintenanceThe Essential Guide for Automating CMDB population and maintenance
The Essential Guide for Automating CMDB population and maintenance
Stefan Bergstein1.3K views
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse von Rizaldy Ignacio
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Rizaldy Ignacio197 views
Key Methodologies for Migrating from Oracle to Postgres von EDB
Key Methodologies for Migrating from Oracle to PostgresKey Methodologies for Migrating from Oracle to Postgres
Key Methodologies for Migrating from Oracle to Postgres
EDB2.5K views
Tame Big Data with Oracle Data Integration von Michael Rainey
Tame Big Data with Oracle Data IntegrationTame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data Integration
Michael Rainey1.2K views
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En... von MapR Technologies
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
MapR Technologies2.1K views
Which Change Data Capture Strategy is Right for You? von Precisely
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
Precisely4.2K views
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud von DataWorks Summit
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
DataWorks Summit2.4K views
Big SQL 3.0 - Fast and easy SQL on Hadoop von Wilfried Hoge
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
Wilfried Hoge6.3K views
Apache Tez – Present and Future von Jianfeng Zhang
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
Jianfeng Zhang618 views
2014.07.11 biginsights data2014 von Wilfried Hoge
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014
Wilfried Hoge2.2K views
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov von Big Data Spain
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
Big Data Spain445 views
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi von Felicia Haggarty
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Felicia Haggarty349 views

Último

CRIJ4385_Death Penalty_F23.pptx von
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
7 views24 Folien
shivam tiwari.pptx von
shivam tiwari.pptxshivam tiwari.pptx
shivam tiwari.pptxAanyaMishra4
5 views14 Folien
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx von
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptxDataScienceConferenc1
6 views16 Folien
Organic Shopping in Google Analytics 4.pdf von
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdfGA4 Tutorials
16 views13 Folien
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx von
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptxDataScienceConferenc1
5 views12 Folien
LIVE OAK MEMORIAL PARK.pptx von
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptxms2332always
7 views6 Folien

Último(20)

CRIJ4385_Death Penalty_F23.pptx von yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx von DataScienceConferenc1
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
Organic Shopping in Google Analytics 4.pdf von GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials16 views
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx von DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
Short Story Assignment by Kelly Nguyen von kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0119 views
Ukraine Infographic_22NOV2023_v2.pdf von AnastosiyaGurin
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdf
AnastosiyaGurin1.4K views
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ... von DataScienceConferenc1
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
Data Journeys Hard Talk workshop final.pptx von info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821710 views
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ... von DataScienceConferenc1
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
CRM stick or twist.pptx von info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 views
Data about the sector workshop von info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821715 views

Gobblin' Big Data With Ease @ QConSF 2014

  • 1. Gobblin’ Big Data with Ease Lin Qiao Data Analytics Infra @ LinkedIn ©2014 LinkedIn Corporation. All Rights Reserved.
  • 2. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead ©2014 LinkedIn Corporation. All Rights Reserved.
  • 3. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead ©2014 LinkedIn Corporation. All Rights Reserved.
  • 4. Perception Analytics Platform Ingest Framework Primary Data Sources Transformations Business ©2014 LinkedIn Corporation. All Rights Reserved. Facing Insights Member Facing Insights and Data Products Load Load Validation Validation
  • 5. Reality Profile Data ©2014 LinkedIn Corporation. All Rights Reserved. 5 Hadoop Camus Lumos Teradata External Partner Data Ingest Framework DWH ETL (fact tables) Product, Sciences, Enterprise Analytics Site (Member Facing Products) Kafka Activity (tracking) Data R/W store (Oracle/ Espresso) Databus Changes Core Data Set (Tracking, Database, External) Derived Data Set Computed Results for Member Facing Products Enterprise Products Change dump on filer Ingest utilities Lassen (facts and dimensions) Read store (Voldemort)
  • 6. Challenges @ LinkedIn • Large variety of data sources • Multi-paradigm: streaming data, batch data • Different types of data: facts, dimensions, logs, snapshots, increments, changelog • Operational complexity of multiple pipelines • Data quality • Data availability and predictability • Engineering cost ©2014 LinkedIn Corporation. All Rights Reserved.
  • 7. Open source solutions sqoopp aegisthus flumep morphlinep logstash Camus RDBMS vendor-specific connectorsp ©2014 LinkedIn Corporation. All Rights Reserved.
  • 8. Goals • Unified and Structured Data Ingestion Flow – RDBMS -> Hadoop – Event Streams -> Hadoop • Higher level abstractions – Facts, Dimensions – Snapshots, increments, changelog • ELT oriented – Minimize transformation in the ingest pipeline ©2014 LinkedIn Corporation. All Rights Reserved.
  • 9. Central Ingestion Pipeline (diagram labels): Hadoop; OLTP Data; Teradata; External Partner Data; Gobblin; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka Tracking; R/W store (Oracle/Espresso); Databus Changes; Core Data Set (Tracking, Database, External); Derived Data Set; Enterprise Products; Change dump on filer; REST; JDBC; SOAP; Custom; Compaction
  • 10. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 11. Gobblin Usage @ LinkedIn • Business Analytics – Source data for sales analysis, product sentiment analysis, etc. • Engineering – Source data for issue tracking, monitoring, product release, security compliance, A/B testing • Consumer product – Source data for acquisition integration – Performance analysis for email campaigns, ads campaigns, etc.
  • 12. Key Features • Horizontally scalable and robust framework • Unified computation paradigm • Turn-key solution • Customize your own ingestion
  • 13. Scalable and Robust Framework • Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc. • Scalable: Jobs are partitioned into tasks that run concurrently. • Fault Tolerant: The framework gracefully deals with machine and job failures. • Quality Assurance: Quality checking is baked in throughout the flow.
  • 14. Unified computation paradigm • Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines. • Shared infra components: Shared job state management, job metrics store, and metadata management.
  • 15. Turn Key Solution • Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP). • Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, and internal dropbox. • Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets. • Policy driven flow execution & tuning: Flow owners just need to specify pre-defined policies for handling job failure, degree of parallelism, what data to publish, etc.
  • 16. Customize Your Own Ingestion Pipeline • Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API. • Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code.
  • 17. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Lookahead
  • 18. Under the Hood
  • 19. Computation Model • Gobblin standalone – Single process, multi-threaded – Testing, small data, sampling • Gobblin on Map/Reduce – Large datasets, horizontally scalable • Gobblin on Yarn – Better resource utilization – More scheduling flexibility
  • 20. Scalable Ingestion Flow (diagram): Source → three Work Units, each run as a Task (Extractor → Converter → Quality Checker → Writer) → Data Publisher
  • 21. Sources • Determines how to partition work - Partitioning algorithm can leverage source sharding - Group partitions intelligently for performance • Creates work-units to be scheduled
  • 22. Job Management (diagram labels: Job run 1, Job run 2, Job run 3, State Store) • Job execution states – Watermark – Task state, job state, quality checker output, error code • Job synchronization • Job failure handling: policy driven
  • 23. Gobblin Operator Flow: Extract Schema → Convert Schema → Extract Record → Convert Record → Check Record Data Quality → Write Record → Check Task Data Quality → Commit Task Data
  • 24. Extractors • Specifies how to get the schema and pull data from the source • Returns a ResultSet iterator • Tracks the high watermark • Tracks extraction metrics
  • 25. Converters • Allow for schema and data transformation – Filtering – Projection – Type conversion – Structural change • Composable: can specify a list of converters to be applied in the given order
  • 26. Quality Checkers • Ensure quality of any data produced by Gobblin • Can be run on a per record, per task, or per job basis • Can specify a list of quality checkers to be applied – Schema compatibility – Audit check – Sensitive fields – Unique key • Policy driven – FAIL – if the check fails then so does the job – OPTIONAL – if the check fails the job continues – ERR_FILE – the offending row is written to an error file
  • 27. Writers • Writing data in Avro format onto HDFS – One writer per task • Flexibility – Configurable compression codec (Deflate, Snappy) – Configurable buffer size • Plan to support other data formats (Parquet, ORC)
  • 28. Publishers • Determines job success based on policy - COMMIT_ON_FULL_SUCCESS - COMMIT_ON_PARTIAL_SUCCESS • Commits data to final directories based on job success (diagram: Task 1, Task 2, Task 3 produce File 1, File 2, File 3 in the Tmp Dir; the files then appear in the Final Dir)
  • 29. Gobblin Compaction (diagram labels: Ingestion, HDFS, Compaction) • Dimensions: – Initial full dump followed by incremental extracts in Gobblin – Maintain a consistent snapshot by doing regularly scheduled compaction • Facts: – Merge small files
  • 30. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 31. Gobblin in Production • Data Volume: > 350 datasets, ~ 60 TB per day • Production Instances: Salesforce, Responsys, RightNow, Timeforce, Slideshare, Newsle, A/B testing, LinkedIn JIRA, Data retention
  • 32. Lessons Learned • Data quality still needs a lot more work • The small data problem is not small • Performance optimization opportunities • Operational traits
  • 33. Gobblin Roadmap • Gobblin on Yarn • Streaming Sources • Gobblin Workbench with ingestion DSL • Data Profiling for richer quality checking • Open source in Q4’14
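
The operator flow, quality-check policies and publisher commit policies described in slides 20 to 28 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gobblin's actual API: every name below (RowCheck, TaskResult, run_task, publish, the commit_on_partial_success flag) is hypothetical, records are plain dicts, and in-memory lists stand in for the staging directory and the error file. The reading of COMMIT_ON_PARTIAL_SUCCESS as "publish the output of the tasks that succeeded" is likewise an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional

Record = dict
# A converter returns the transformed record, or None to filter it out.
Converter = Callable[[Record], Optional[Record]]


class CheckPolicy(Enum):
    FAIL = "FAIL"          # a failed check fails the run
    OPTIONAL = "OPTIONAL"  # a failed check is ignored
    ERR_FILE = "ERR_FILE"  # the offending row is set aside


@dataclass
class RowCheck:  # hypothetical per-record quality checker
    name: str
    passes: Callable[[Record], bool]
    policy: CheckPolicy


@dataclass
class TaskResult:
    staged: List[Record] = field(default_factory=list)  # writer output, not yet published
    errors: List[Record] = field(default_factory=list)  # stand-in for the error file
    succeeded: bool = True


def convert(record: Record, converters: List[Converter]) -> Optional[Record]:
    """Apply converters in the given order; stop if one filters the record out."""
    for converter in converters:
        record = converter(record)
        if record is None:
            return None
    return record


def run_task(work_unit: List[Record], converters: List[Converter],
             checks: List[RowCheck]) -> TaskResult:
    """Extract -> convert -> check -> write for one work unit."""
    result = TaskResult()
    for raw in work_unit:
        record = convert(raw, converters)
        if record is None:
            continue
        failed = [c for c in checks if not c.passes(record)]
        if any(c.policy is CheckPolicy.FAIL for c in failed):
            result.succeeded = False
            return result
        if any(c.policy is CheckPolicy.ERR_FILE for c in failed):
            result.errors.append(record)
            continue
        result.staged.append(record)  # only OPTIONAL checks failed, if any
    return result


def publish(results: List[TaskResult], commit_on_partial_success: bool) -> List[Record]:
    """Move staged output to the 'final' location according to the commit policy."""
    if not commit_on_partial_success and not all(r.succeeded for r in results):
        return []  # full-success policy: one failed task blocks the commit
    return [rec for r in results if r.succeeded for rec in r.staged]


converters = [
    lambda r: r if r["country"] else None,               # filtering
    lambda r: {"id": r["id"], "country": r["country"]},  # projection
    lambda r: {**r, "id": int(r["id"])},                 # type conversion
]
checks = [RowCheck("positive_id", lambda r: r["id"] > 0, CheckPolicy.ERR_FILE)]
work_units = [
    [{"id": "1", "country": "US", "note": "a"}, {"id": "2", "country": "", "note": "b"}],
    [{"id": "-3", "country": "DE", "note": "c"}],
]
results = [run_task(wu, converters, checks) for wu in work_units]
published = publish(results, commit_on_partial_success=False)
print(published)
```

Keeping the writer output staged until the publisher decides mirrors the Tmp Dir and Final Dir split on the Publishers slide: nothing becomes visible downstream until the commit policy is satisfied.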

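The watermark carried in the state store (slides 13, 22 and 24) and the compaction of dimension data (slide 29) fit together roughly as follows. Again this is a hedged sketch rather than Gobblin code: the function names, the dict used as a state store, and the `modified` column used as the watermark are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = dict


def extract_increment(source: List[Row], low_watermark: int) -> Tuple[List[Row], int]:
    """Pull rows modified after the low watermark and report the new high watermark."""
    rows = [r for r in source if r["modified"] > low_watermark]
    high_watermark = max((r["modified"] for r in rows), default=low_watermark)
    return rows, high_watermark


def compact(snapshot: List[Row], increments: List[Row], key: str = "id") -> List[Row]:
    """Merge increments into a snapshot, keeping the latest version of each key."""
    latest: Dict[object, Row] = {}
    for row in snapshot + increments:
        current = latest.get(row[key])
        if current is None or row["modified"] >= current["modified"]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical state store: state written by one job run is read by the next.
state_store = {"watermark": 0}
source = [
    {"id": 1, "name": "a", "modified": 10},
    {"id": 2, "name": "b", "modified": 20},
]

# Job run 1: with a watermark of 0 this amounts to the initial full dump.
snapshot, state_store["watermark"] = extract_increment(source, state_store["watermark"])

# The source changes between runs: one update and one insert.
source[0] = {"id": 1, "name": "a2", "modified": 30}
source.append({"id": 3, "name": "c", "modified": 40})

# Job run 2: only rows newer than the stored watermark are pulled.
increment, state_store["watermark"] = extract_increment(source, state_store["watermark"])

# Scheduled compaction folds the increment back into a consistent snapshot.
snapshot = compact(snapshot, increment)
print(snapshot, state_store)
```

Because the second run reads the watermark left by the first, it transfers two rows instead of the whole table, and compaction restores a single up-to-date row per key.
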
Editor's Notes

  1. Custom data pipelines: developing a data pipeline per source. Data model: tightly bundled with the RDBMS, with strict DDL. Operations effort: a large number of pipelines to monitor, maintain and troubleshoot. Data quality: no source of truth. High investment cost and low productivity!