The Big Data Analytics Ecosystem at LinkedIn
Rajappa Iyer
September 17, 2013
Agenda
• LinkedIn by the numbers
• An Overview of Data Driven Products / Insights
• The Big Data Analytics Ecosystem
– Storage and Compute Platforms
– Data Transport Pipelines
– Data Processing Pipelines
– Operational Tooling - Metadata
• Q&A
LinkedIn: The World’s Largest Professional Network
238M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent ↔ Opportunity. At scale

Data Driven Products and Insights
Products for Members (Professionals)
Products for Enterprises (Companies)
Insights (Analysts and Data Scientists)
Data, Platforms, Analytics
Products for Members
Products for Enterprises
Sell - Sales Navigator
Market - Marketing Solutions
Hire - Talent Solutions
Examples of Business Insights
Example of Deeper Insight
Job Migration After Financial Collapse
A Simplified Overview of Data Flow
[Diagram. Sources: Site (member-facing products), producing activity data (via Kafka) and member data in Espresso / Voldemort / Oracle (changes via Databus); external partner data (via ingest utilities). Ingestion: Camus, Lumos, DWH ETL. Platforms: Hadoop, Teradata. Data sets: core data set, derived data set. Consumers: product, sciences and enterprise analytics; enterprise products; computed results for member-facing products.]
Storage and Compute Platforms - Hadoop
LinkedIn Confidential ©2013 All Rights Reserved 10
Most data in Avro format
Access via Hive and Pig
Most ETL processes run here
Specialized batch processing
Algorithmic data mining
Storage and Compute Platforms - Teradata
Integrated Data Warehouse
Standard BI Tools
Interactive Querying
(Low latency)
Workload Management
Transport Pipeline - Kafka
• High-volume, low-latency messaging system
• Horizontally scalable
• Automatic load balancing
• Rewindability
• Intra-cluster replication
• Mainly used for log aggregation and queuing
Transport Pipeline - Databus
• Timeline consistent data change capture
• Works with Oracle, MySQL, Espresso
• Transactional semantics
• In-order, at least once delivery
• Low latency
• Has scaled to 100s of sources
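In-order, at-least-once delivery means a consumer can see the same change more than once, so a common way to cope is to apply changes idempotently. The following is a minimal sketch of that idea, not the Databus API: the `ChangeEvent` type, its `scn` field and the `apply_changes` function are hypothetical, and it assumes every event carries its own strictly increasing sequence number.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    scn: int                # hypothetical, strictly increasing sequence number
    key: str                # primary key of the changed row
    value: Optional[str]    # new row value; None stands for a delete


def apply_changes(state: Dict[str, str], last_scn: int,
                  events: Iterable[ChangeEvent]) -> int:
    """Apply events in arrival order and return the new high-water mark.

    Events at or below last_scn were applied by an earlier delivery and
    are skipped, which makes redelivery harmless.
    """
    for ev in events:
        if ev.scn <= last_scn:
            continue  # redelivered event, already applied
        if ev.value is None:
            state.pop(ev.key, None)
        else:
            state[ev.key] = ev.value
        last_scn = ev.scn
    return last_scn
```

A consumer would persist the returned sequence number together with its state, so that a restart resumes from the right place.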
Processing Pipeline: Camus
Camus: Framework for ingesting Kafka streams to HDFS
[Diagram: Camus gets topics from the topic registry (ZooKeeper) and data + offsets from the Kafka brokers. An incremental pull writes hourly data by topic into Hadoop (1 file per run, topic and partition) and records the last processed offset by topic. A daily compaction produces daily data by topic, with Hive registration. A Camus audit job collects audit counts into an audit DB.]
Camus: Features
• Highly scalable due to adaptive input format
– Handled 10x increase in data volume without change
• Restartable with checkpointing
• Robust auditing support
• Plays nicely with Hive and Pig
– Avro format support
– Hive metastore registration
• Open source
– https://github.com/linkedin/camus
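The incremental pull and checkpointing described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Camus code: the `incremental_pull` function and the dictionary-based log are hypothetical stand-ins for Kafka partitions and the stored offsets.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int]  # (topic, partition id)


def incremental_pull(log: Dict[Partition, List[bytes]],
                     offsets: Dict[Partition, int]
                     ) -> Tuple[Dict[Partition, List[bytes]], Dict[Partition, int]]:
    """Read each partition from its checkpointed offset to the current end.

    Returns the newly read messages and the offsets to checkpoint once the
    output has been written. The input offsets are not modified, so a failed
    run can be retried from the old checkpoint.
    """
    pulled: Dict[Partition, List[bytes]] = {}
    new_offsets = dict(offsets)
    for part, messages in log.items():
        start = offsets.get(part, 0)   # a new partition starts at offset 0
        batch = messages[start:]
        if batch:
            pulled[part] = batch
            new_offsets[part] = start + len(batch)
    return pulled, new_offsets
```

Committing the new offsets only after the data is safely written is what makes a run restartable: repeating a run against the old offsets reads the same messages again.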
Processing Pipeline: Lumos
Lumos: Framework to ingest database data to HDFS
[Diagram. Sources: PROD Oracle, PROD Espresso, external data. Transport: DB Extract, Databus. On the ETL Hadoop cluster: incremental/full data (internal), staging data (internal), virtual snapshot materializer, lazy snapshot materializer, metadata, published virtual snapshot. Consumers: Pig/Hive loaders, DWH processes.]
Lumos: Features
• Supports Espresso, Oracle and MySQL as sources
• Full snapshots and incremental dumps
• Automatic type translation for most database types
• Provides LAST UPDATE semantics for data
• Supports low latency requirements
– Reader API performs just-in-time compaction
• Snapshot constructed two ways:
– On demand compaction for upserts
– Periodic snapshotting that reflects deletes as well
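Compaction of upserts with last-update semantics can be sketched as a merge of incremental rows into a snapshot, keeping the most recently updated version of each key. This is an illustration only, not the Lumos reader API: the `Row` type, the `updated_at` field and the `compact` function are hypothetical, and deletes are left out because the slides do not say how they are detected.

```python
from typing import Dict, Iterable, NamedTuple


class Row(NamedTuple):
    key: str
    updated_at: int   # hypothetical last-update timestamp from the source row
    value: str


def compact(snapshot: Dict[str, Row], increments: Iterable[Row]) -> Dict[str, Row]:
    """Merge incremental rows into a snapshot, keeping the latest row per key."""
    merged = dict(snapshot)  # leave the input snapshot untouched
    for row in increments:
        current = merged.get(row.key)
        # An insert (no current row) or a newer update replaces what we have;
        # an older, out-of-order row is ignored.
        if current is None or row.updated_at >= current.updated_at:
            merged[row.key] = row
    return merged
```

Running the merge at read time gives a reader an up-to-date view without waiting for the next full snapshot.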
Operational Support - Metadata
• ETL pipeline is a complex graph of workflows
– Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies
• To manage this, we needed to capture:
– Process dependencies
– Data dependencies
– Process execution history
– Data load status
– Data consumption status (watermarks)
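Once process dependencies are captured, the depth of a flow such as the one mentioned above can be computed directly from the graph. A minimal sketch, assuming dependencies are stored as a mapping from each process to the processes it depends on; the `dependency_levels` function and the process names in the usage note are hypothetical. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Set


def dependency_levels(deps: Dict[str, Set[str]]) -> Dict[str, int]:
    """Map each process to its level in the dependency graph.

    A process with no dependencies is at level 1; any other process is one
    level below its deepest dependency. Raises graphlib.CycleError if the
    dependencies contain a cycle.
    """
    levels: Dict[str, int] = {}
    # static_order() yields every process after all of its dependencies.
    for node in TopologicalSorter(deps).static_order():
        levels[node] = 1 + max((levels[d] for d in deps.get(node, ())), default=0)
    return levels
```

For example, with `deps = {"load_dim": {"extract"}, "load_fact": {"extract", "load_dim"}, "dashboard": {"load_fact"}}`, the deepest level is `max(dependency_levels(deps).values())`.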
Operational Metadata – v1
• Capture process dependency graph
– Also capture useful metadata such as process owners
• Capture stats for each execution of a workflow
– Time of execution
– Status
– Pointer to error logs
• Has proved quite useful for monitoring critical chains
[Diagram: Workflow F runs from Start to Stop through workunits W1 to W5, connected by "on success" and "on failure" transitions]
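The workflow above, with its per-execution stats, can be sketched as a small runner that follows "on success" and "on failure" transitions and records one entry per workunit. Everything here is hypothetical (`Workunit`, `ExecutionRecord`, `run_workflow`); the error text stands in for a pointer to error logs, and the transitions are assumed to contain no cycles.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Workunit:
    run: Callable[[], None]
    on_success: Optional[str] = None   # next workunit name, or None to stop
    on_failure: Optional[str] = None


@dataclass
class ExecutionRecord:
    workunit: str
    started_at: float          # time of execution
    status: str                # "SUCCESS" or "FAILED"
    error_log: Optional[str]   # stand-in for a pointer to error logs


def run_workflow(units: Dict[str, Workunit], start: str) -> List[ExecutionRecord]:
    """Run workunits from `start`, following transitions until none remains."""
    history: List[ExecutionRecord] = []
    current: Optional[str] = start
    while current is not None:
        unit = units[current]
        started = time.time()
        try:
            unit.run()
            status, error, nxt = "SUCCESS", None, unit.on_success
        except Exception as exc:  # any failure takes the on_failure branch
            status, error, nxt = "FAILED", repr(exc), unit.on_failure
        history.append(ExecutionRecord(current, started, status, error))
        current = nxt
    return history
```

The returned history is the kind of record a monitoring view of critical chains could be built on.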
Operational Metadata – v2
[Diagram: Workflow F with data entities D1, D2 and D3, connected by two "consumes" edges and one "produces" edge]
• For each flow, capture input and output data elements
• For each execution, capture stats on data element, e.g.
– Number of records / lines read
– Number of records / lines written
– Error counts
– Last processed record (can be time based or sequence based; tracked per flow, as more than one flow can consume a data element)
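Per-flow tracking of the last processed record can be sketched with a small in-memory store keyed by (flow, data element). This is an illustration under assumptions, not the actual metadata schema: `RunStats`, `WatermarkStore` and the integer interval numbering are hypothetical, and only successful runs are assumed to be recorded.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RunStats:
    records_read: int
    records_written: int
    error_count: int
    last_processed: int   # time- or sequence-based position in the data element


class WatermarkStore:
    """In-memory stand-in for an operational metadata store (hypothetical)."""

    def __init__(self) -> None:
        self._runs: Dict[Tuple[str, str], List[RunStats]] = {}

    def record_run(self, flow: str, entity: str, stats: RunStats) -> None:
        """Append the stats of one successful execution."""
        self._runs.setdefault((flow, entity), []).append(stats)

    def watermark(self, flow: str, entity: str, default: int = 0) -> int:
        """Last position of `entity` that `flow` has processed."""
        runs = self._runs.get((flow, entity))
        return runs[-1].last_processed if runs else default

    def pending(self, flow: str, entity: str, available: List[int]) -> List[int]:
        """Intervals of `entity` that `flow` has not consumed yet, oldest first."""
        mark = self.watermark(flow, entity)
        return sorted(i for i in available if i > mark)
```

Because the key includes the flow, two flows reading the same data element each keep their own position.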
Operational Metadata – The Payoff
• Restartable ETL jobs
– Process new data since the last successful run
• Catch up mode for ETL jobs
– Single run can consume data from multiple intervals in one batch
– Next run will resume from correct place
• Data freshness and availability dashboard
• Coarse form of data lineage
– Impact analysis for the unfortunately all-too-common upstream changes
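The coarse lineage mentioned above follows from the captured "consumes" and "produces" relationships: starting from a changed data element, walk to every flow that reads it, then to whatever those flows produce. A minimal sketch, assuming both relationships are available as mappings from flow name to data elements; the `downstream_impact` function and all names in the usage note are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_impact(consumes: Dict[str, Set[str]],
                      produces: Dict[str, Set[str]],
                      changed_entity: str) -> List[str]:
    """Return the flows affected, directly or transitively, by a change."""
    # Invert "consumes": data entity -> flows that read it.
    readers: Dict[str, Set[str]] = {}
    for flow, entities in consumes.items():
        for entity in entities:
            readers.setdefault(entity, set()).add(flow)

    affected: List[str] = []
    seen_flows: Set[str] = set()
    seen_entities = {changed_entity}
    queue = deque([changed_entity])
    while queue:  # breadth-first walk over entity -> flow -> entity edges
        entity = queue.popleft()
        for flow in sorted(readers.get(entity, ())):
            if flow in seen_flows:
                continue
            seen_flows.add(flow)
            affected.append(flow)
            for out in sorted(produces.get(flow, ())):
                if out not in seen_entities:
                    seen_entities.add(out)
                    queue.append(out)
    return affected
```

For instance, if flow F consumes D1 and produces D3, and flow G consumes D3, then a change to D1 affects both F and G.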
Putting it all Together
Online Data Stores: Espresso, Voldemort
Data Transport Pipelines: Kafka, Databus
Data Processing Pipelines: Camus, Lumos
Offline Storage / Compute: Hadoop, Teradata
Analytics Applications
Operational Metadata and Tooling
`whoami`
• Sr. Manager / DWH Architect @ LinkedIn since 2011
• Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay
• www.linkedin.com/in/rajappaiyer/
Questions?
More at data.linkedin.com
We’re Hiring

 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

The Big Data Analytics Ecosystem at LinkedIn

  • 1. The Big Data Analytics Ecosystem at LinkedIn
      Rajappa Iyer, September 17, 2013
  • 2. Agenda
      • LinkedIn by the numbers
      • An Overview of Data Driven Products / Insights
      • The Big Data Analytics Ecosystem
          – Storage and Compute Platforms
          – Data Transport Pipelines
          – Data Processing Pipelines
          – Operational Tooling - Metadata
      • Q&A
  • 3. LinkedIn: The World’s Largest Professional Network
      • 238M+ Members Worldwide
      • 2 new Members Per Second
      • 100M+ Monthly Unique Visitors
      • 3M+ Company Pages
      Connecting Talent ↔ Opportunity. At scale.

  • 4. Data Driven Products and Insights
      • Insights (Analysts and Data Scientists)
      • Products for Members (Professionals)
      • Products for Enterprises (Companies)
      • Data, Platforms, Analytics
  • 6. Products for Enterprises
      • Sell - Sales Navigator
      • Market - Marketing Solutions
      • Hire - Talent Solutions
  • 8. Example of Deeper Insight: Job Migration After Financial Collapse
  • 9. A Simplified Overview of Data Flow
      [Diagram] Site (Member Facing Products); Kafka (Activity Data); Espresso / Voldemort / Oracle (Member Data); Databus (Changes); Camus; Lumos; Hadoop; Teradata; External Partner Data; Ingest Utilities; DWH ETL; Core Data Set; Derived Data Set; Computed Results for Member Facing Products; Product, Sciences, Enterprise Analytics; Enterprise Products
  • 10. Storage and Compute Platforms
      • Most data in Avro format
      • Access via Hive and Pig
      • Most ETL processes run here
      • Specialized batch processing
      • Algorithmic data mining
      LinkedIn Confidential ©2013 All Rights Reserved
  • 11. Storage and Compute Platforms
      • Integrated Data Warehouse
      • Standard BI Tools
      • Interactive Querying (Low latency)
      • Workload Management
  • 12. Transport Pipeline - Kafka
      • High-volume, low-latency messaging system
      • Horizontally scalable
      • Automatic load balancing
      • Rewindability
      • Intra-cluster replication
      • Mainly used for log aggregation and queuing
  • 13. Transport Pipeline - Databus
      • Timeline consistent data change capture
      • Works with Oracle, MySQL, Espresso
      • Transactional semantics
      • In-order, at least once delivery
      • Low latency
      • Has scaled to 100s of sources
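In-order, at-least-once delivery means a consumer may see the same change more than once, so applying changes idempotently is a common companion pattern. The following is a minimal sketch of that idea, not Databus's actual client API: the `ChangeEvent` and `IdempotentConsumer` names are hypothetical, and it assumes each event carries its own monotonically increasing sequence number.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record; `scn` is an assumed per-event sequence number."""
    scn: int
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    """Applies in-order events and ignores redelivered ones."""
    last_applied_scn: int = -1
    state: dict = field(default_factory=dict)

    def apply(self, event: ChangeEvent) -> bool:
        # With in-order delivery, anything at or below the last applied
        # sequence number can only be a redelivery, so it is skipped.
        if event.scn <= self.last_applied_scn:
            return False
        self.state[event.key] = event.value
        self.last_applied_scn = event.scn
        return True
```

Persisting `last_applied_scn` together with the applied state is what would let such a consumer restart safely; that part is left out of the sketch.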
  • 14. Processing Pipeline: Camus
      Camus: Framework for ingesting Kafka streams to HDFS
      [Diagram] Kafka Brokers; Topic Registry (ZooKeeper); Topics; Camus Incremental Pull; Data + offsets; 1 file / run / topic / partition; Last processed offset by topic; Hourly Data by Topic; Camus Daily Compaction; Daily Data by Topic; Camus Audit Job; audit counts; Audit DB; Hive Registration; Hadoop
  • 15. Camus: Features
      • Highly scalable due to adaptive input format
          – Handled 10x increase in data volume without change
      • Restartable with checkpointing
      • Robust auditing support
      • Plays nicely with Hive and Pig
          – Avro format support
          – Hive metastore registration
      • Open source
          – https://github.com/linkedin/camus
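The "restartable with checkpointing" and "last processed offset by topic" ideas can be illustrated with a small in-memory sketch. This is not Camus code: the `incremental_pull` function, the dictionary-based log, and the `max_records` cap are all assumptions made for illustration.

```python
def incremental_pull(log, offsets, max_records=None):
    """Pull records that arrived since the last checkpoint.

    log:     {(topic, partition): [record, ...]}   (stand-in for a broker)
    offsets: {(topic, partition): next offset to read}  (the checkpoint)
    Returns the pulled records and the offsets to commit once the
    output has been written successfully.
    """
    pulled = {}
    new_offsets = dict(offsets)
    for tp, records in log.items():
        start = offsets.get(tp, 0)
        end = len(records)
        if max_records is not None:
            end = min(end, start + max_records)
        end = max(end, start)  # never move a checkpoint backwards
        pulled[tp] = records[start:end]
        new_offsets[tp] = end
    return pulled, new_offsets
```

Because the function never mutates `offsets`, a run that fails before the new offsets are committed can simply be repeated with the old checkpoint and will pull the same records again.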
  • 16. Processing Pipeline: Lumos
      Lumos: Framework to ingest database data to HDFS
      [Diagram] PROD Oracle; PROD Espresso; Databus; DB Extract; External Data; Inc/Full (internal); Staging Data (internal); ETL Hadoop Cluster; Virtual Snapshot Materializer; Lazy Snapshot Materializer; Metadata; Published Virtual Snapshot; Pig/Hive Loaders; DWH processes
  • 17. Lumos: Features
      • Supports Espresso, Oracle and MySQL as sources
      • Full snapshots and incremental dumps
      • Automatic type translation for most database types
      • Provides LAST UPDATE semantics for data
      • Supports low latency requirements
          – Reader API performs just-in-time compaction
      • Snapshot constructed two ways:
          – On demand compaction for upserts
          – Periodic snapshotting that reflects deletes as well
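One way to picture LAST UPDATE semantics with on-demand compaction is a merge of a full snapshot with later incremental rows, keeping the most recently modified row per key. The sketch below is an illustration under assumptions, not Lumos's reader API: the `compact` function and the `id` and `modified` field names are hypothetical, and deletes are not handled (the slide attributes those to periodic snapshotting).

```python
def compact(snapshot, increments):
    """Merge upserts into a snapshot, keeping the latest row per key.

    Each row is a dict with a hypothetical primary key `id` and a
    hypothetical last-modified marker `modified`.
    """
    latest = {row["id"]: row for row in snapshot}
    for row in increments:
        current = latest.get(row["id"])
        # Keep whichever version was modified last; ties go to the increment.
        if current is None or row["modified"] >= current["modified"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])
```

Comparing on the modification marker rather than on arrival order keeps the result stable even if incremental rows are read out of order.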
  • 18. Operational Support - Metadata
      • ETL pipeline is a complex graph of workflows
          – Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies
      • To manage this, we needed to capture:
          – Process dependencies
          – Data dependencies
          – Process execution history
          – Data load status
          – Data consumption status (watermarks)
  • 19. Operational Metadata – v1
      • Capture process dependency graph
          – Also capture useful metadata such as process owners
      • Capture stats for each execution of a workflow
          – Time of execution
          – Status
          – Pointer to error logs
      • Has proved quite useful for monitoring critical chains
      [Diagram] Workflow F: Start; Workunits W1 to W5 linked by "on success" and "on failure" transitions; Stop
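Once a process dependency graph is captured, questions such as "how many levels deep is this flow?" become simple graph computations. The sketch below is a hypothetical illustration using Python's standard `graphlib` module (Python 3.9+); the `dependency_levels` function and the process names in the usage note are made up, not taken from the deck.

```python
from graphlib import TopologicalSorter


def dependency_levels(deps):
    """Return the level of each process in a dependency graph.

    deps maps a process name to the set of processes it depends on.
    A process with no dependencies is at level 1; otherwise its level
    is one more than the deepest process it depends on.
    """
    levels = {}
    # static_order() yields every process after all of its dependencies
    # and raises graphlib.CycleError if the graph contains a cycle.
    for node in TopologicalSorter(deps).static_order():
        upstream = deps.get(node, ())
        levels[node] = 1 + max((levels[d] for d in upstream), default=0)
    return levels
```

For example, with `deps = {"load_dim": {"extract"}, "load_fact": {"extract", "load_dim"}, "dashboard": {"load_fact"}}`, the deepest level returned is the depth of the chain that ends at `dashboard`.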
  • 20. Operational Metadata – v2
      [Diagram] Data Entities D1, D2, D3; Workflow F (consumes, consumes, produces)
      • For each flow, capture input and output data elements
      • For each execution, capture stats on data element, e.g.
          – Number of records / lines read
          – Number of records / lines written
          – Error counts
          – Last processed record
              • Can be time based or sequence based
              • This can be per flow as more than one flow can consume a data element
  • 21. Operational Metadata – The Payoff
      • Restartable ETL jobs
          – Process new data since last successful previous run
      • Catch up mode for ETL jobs
          – Single run can consume data from multiple intervals in one batch
          – Next run will resume from correct place
      • Data freshness and availability dashboard
      • Coarse form of data lineage
          – Impact analysis for unfortunately all-too-common changes upstream
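Restartable jobs and catch-up mode both follow from keeping a consumption watermark per flow and data entity. The following is a minimal sketch under assumptions, not LinkedIn's tooling: `run_flow`, the in-memory `watermarks` dictionary, and the use of sortable interval identifiers (such as ISO dates) are all hypothetical.

```python
def run_flow(flow, entity, available_intervals, watermarks, process):
    """Consume every interval newer than this flow's watermark.

    watermarks maps (flow, entity) to the last interval that the flow
    processed successfully; it is keyed per flow because several flows
    can consume the same data entity.
    """
    last = watermarks.get((flow, entity))
    pending = [i for i in sorted(available_intervals)
               if last is None or i > last]
    if not pending:
        return []
    # Catch-up mode: all pending intervals go through in one batch.
    # If process() raises, the watermark stays where it was, so the
    # next run starts again from the same place.
    process(pending)
    watermarks[(flow, entity)] = pending[-1]
    return pending
```

Advancing the watermark only after `process` returns is the design choice that makes a failed run safe to repeat.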
  • 22. Putting it all Together
      • Online Data Stores: Espresso, Voldemort
      • Data Transport Pipelines: Kafka, Databus
      • Data Processing Pipelines: Camus, Lumos
      • Offline Storage / Compute: Hadoop, Teradata
      • Analytics Applications
      • Operational Metadata and Tooling
  • 23. `whoami`
      • Sr. Manager / DWH Architect @ LinkedIn since 2011
      • Prior to that:
          – Director of Engineering at Digg
          – Enterprise Data Architect at eBay
      • www.linkedin.com/in/rajappaiyer/