Data Infrastructure at LinkedIn
Jun Rao and Sam Shah

LinkedIn Confidential ©2013 All Rights Reserved
Outline
1. LinkedIn introduction
2. Online/nearline infrastructure
3. Offline infrastructure
4. Conclusion

The World’s Largest Professional Network
Connecting Talent and Opportunity. At scale…

• 200M+ Members Worldwide
• 2 new Members Per Second
• 100M+ Monthly Unique Visitors
• 2M+ Company Pages

Two Product Families
For Members (Professionals)
• People You May Know
• Who’s Viewed My Profile
• Jobs You May Be Interested In
• News/Sharing
• Today
• Search
• Subscriptions

For Partners (Companies)
• Hire
• Market
• Sell

[Diagram labels: Science and Analytics; Data Infrastructure; Data (Actions, Profiles, Connections, Content)]

The Big-Data Feedback Loop
[Diagram: a feedback loop. Node labels: Member, Product, Data, Science, Infrastructure. Edge labels: Value, Engagement, Virality, Signals, Insights, Refinement, Analytics, Scale.]

LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram labels: Users, Application, Infrastructure (Online Data Infra, Near-Line Infra, Offline Data Infra)]

Online
Latency & freshness requirement: activity that should be reflected immediately
Products
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills

Near-Line
Latency & freshness requirement: activity that should be reflected soon
Products
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages

Offline
Latency & freshness requirement: activity that can be reflected later
Products
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…

LinkedIn Data Infrastructure: Sample Stack

• Infra challenges in the 3-phase ecosystem are diverse, complex and specific.
• Some components are off-the-shelf, with significant investment in home-grown, deep and interesting platforms.

LinkedIn Data Infrastructure Solutions

Voldemort: Highly-Available Distributed KV Store
• Key/value access at scale

Voldemort: Architecture

• Pluggable components
• Tunable consistency/availability
• Key/value model, server-side “views”

• 10 clusters, 100+ nodes
• Largest cluster: 10K+ qps
• Avg latency: 3 ms
• Hundreds of stores
• Largest store: 2.8TB+
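
The "tunable consistency/availability" bullet can be illustrated with a Dynamo-style quorum. The sketch below is not Voldemort's API; the class name and the parameters n, r and w are assumptions chosen for illustration. It only shows the arithmetic behind the trade-off: a read is guaranteed to overlap the latest acknowledged write when r + w > n, while smaller r and w tolerate more replica failures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    """Hypothetical per-store quorum settings (names are illustrative)."""
    n: int  # replicas kept for each key
    r: int  # replicas that must answer a read
    w: int  # replicas that must acknowledge a write

    def __post_init__(self):
        if not (1 <= self.r <= self.n and 1 <= self.w <= self.n):
            raise ValueError("r and w must be between 1 and n")

    def reads_see_latest_write(self) -> bool:
        # Any r replicas and any w replicas share a member when r + w > n.
        return self.r + self.w > self.n

    def tolerated_failures_for_reads(self) -> int:
        # Reads still succeed while at least r replicas are reachable.
        return self.n - self.r

    def tolerated_failures_for_writes(self) -> int:
        # Writes still succeed while at least w replicas are reachable.
        return self.n - self.w


# Leans toward consistency: read and write sets always overlap.
strict = QuorumConfig(n=3, r=2, w=2)
# Leans toward availability: survives two failed replicas, may read stale data.
fast = QuorumConfig(n=3, r=1, w=1)
```

Under these assumptions, `strict` survives one failed replica for both reads and writes, while `fast` survives two at the cost of possibly stale reads.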
LinkedIn Data Infrastructure Solutions

Espresso: Indexed Timeline-Consistent Distributed Data Store
• Fill in the gap between Oracle and KV store

Espresso: System Components
• Hierarchical data model
• Timeline consistency
• Rich functionality
  • Transactions
  • Secondary index
  • Text search
• Partitioning/replication
• Change propagation
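
One way a hierarchical data model can interact with partitioning is sketched below. The four-part path layout, the example names (`MailDB`, `Messages`, `member42`) and the CRC-based hash are assumptions for illustration, not Espresso's actual URI scheme or router. The idea shown is that hashing only the top of the hierarchy keeps all children of one parent in the same partition, which is one plausible way to make transactions and secondary indexes local to a partition.

```python
import zlib
from typing import NamedTuple


class ResourceKey(NamedTuple):
    database: str
    table: str
    parent_id: str  # top of the hierarchy, e.g. a member id (assumed)
    child_id: str   # item nested under the parent


def parse(path: str) -> ResourceKey:
    """Split a hypothetical /db/table/parent/child path into its parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"expected /db/table/parent/child, got {path!r}")
    return ResourceKey(*parts)


def partition_for(key: ResourceKey, num_partitions: int) -> int:
    # Hash only the parent id so every child of one parent lands together.
    return zlib.crc32(key.parent_id.encode("utf-8")) % num_partitions


a = parse("/MailDB/Messages/member42/msg1")
b = parse("/MailDB/Messages/member42/msg2")
same_partition = partition_for(a, 8) == partition_for(b, 8)
```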

Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
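
A generic distributed state model can be pictured as a table of legal state transitions plus per-state constraints that a cluster manager enforces for every partition. The sketch below is plain Python and does not use the Helix API; the state names, the `TRANSITIONS` table and the single-master limit are assumptions chosen for illustration.

```python
# Hypothetical three-state replica model: legal next states for each state.
TRANSITIONS = {
    "OFFLINE": {"SLAVE"},
    "SLAVE": {"MASTER", "OFFLINE"},
    "MASTER": {"SLAVE"},
}
# Assumed constraint: at most one MASTER replica per partition.
MAX_PER_STATE = {"MASTER": 1}


class PartitionStates:
    """Tracks the replica state of each node for a single partition."""

    def __init__(self):
        self.state = {}  # node name -> current state

    def add_node(self, node):
        self.state[node] = "OFFLINE"

    def transition(self, node, target):
        current = self.state[node]
        if target not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {target}")
        limit = MAX_PER_STATE.get(target)
        if limit is not None:
            count = sum(1 for s in self.state.values() if s == target)
            if count >= limit:
                raise ValueError(f"at most {limit} node(s) may be {target}")
        self.state[node] = target


p = PartitionStates()
p.add_node("n1")
p.add_node("n2")
p.transition("n1", "SLAVE")
p.transition("n1", "MASTER")
p.transition("n2", "SLAVE")
```

In this sketch, promoting `n2` to `MASTER` while `n1` still holds that state raises `ValueError`; failover would first demote or remove `n1`.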

LinkedIn Data Infrastructure Solutions

Databus: Timeline-Consistent Change Data Capture
• Deliver data store changes to apps
Databus at LinkedIn
[Diagram labels: DB, Capture Changes, Relay (Event Win), On-line Changes, Bootstrap (DB), Consistent Snapshot at U, Client (Databus Client Lib), Consumer 1 … Consumer n]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In-order, at-least-once delivery

• Tens of relays
• Hundreds of sources
• Low latency: milliseconds

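
In-order, at-least-once delivery means a consumer may see the same change more than once, so it has to apply changes idempotently. The sketch below is not the Databus client library; `ChangeEvent`, its `seq` field and `IdempotentConsumer` are hypothetical names, and it assumes each event carries one strictly increasing sequence number. Under that assumption, remembering the last applied sequence number is enough to ignore redelivered events.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    seq: int    # assumed: strictly increasing commit sequence number
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    last_applied: int = -1
    view: dict = field(default_factory=dict)

    def on_event(self, event: ChangeEvent) -> bool:
        """Apply the event once; return False if it was a redelivery."""
        if event.seq <= self.last_applied:
            return False
        self.view[event.key] = event.value
        self.last_applied = event.seq
        return True


consumer = IdempotentConsumer()
stream = [
    ChangeEvent(1, "a", "x"),
    ChangeEvent(2, "a", "y"),
    ChangeEvent(2, "a", "y"),  # redelivered after a simulated retry
    ChangeEvent(3, "b", "z"),
]
applied = [consumer.on_event(e) for e in stream]
```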
LinkedIn Data Infrastructure Solutions

Kafka: High-Volume Low-Latency Messaging System
• Log aggregation and queuing

Kafka Architecture
[Diagram labels: two Producers; Brokers 1 to 4, each hosting replicas of topic partitions (topic1-part1, topic1-part2, topic2-part1, topic2-part2); Zookeeper; two Consumers]

Key features
• Scale-out architecture
• Automatic load balancing
• High throughput/low latency
• Rewindability
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
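
Partitioned topics and rewindability can be pictured with an in-memory append-only log per partition, where the consumer (not the broker) owns its read position. The sketch below is a toy model, not the Kafka client API; the class names, the CRC-based key partitioner and the list-index offsets are assumptions for illustration.

```python
import zlib


class PartitionedLog:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Assumed partitioner: a stable hash of the key picks the partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_messages=10):
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """Keeps its own offset, so it can re-read old messages at will."""

    def __init__(self, log, partition):
        self.log = log
        self.partition = partition
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.partition, self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        self.offset = offset


log = PartitionedLog(num_partitions=2)
partition, _ = log.append("member42", "view")
log.append("member42", "click")

consumer = Consumer(log, partition)
first = consumer.poll()
consumer.rewind(0)      # reprocess from the start of the partition
again = consumer.poll()
```

Because messages stay in the log after being read, rewinding is just resetting an integer; that is the property the "Rewindability" bullet refers to.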
LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) vs. generic (leverage-able) platforms (*)
3. Balance open-source products with home-grown platforms (**)

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

LinkedIn Infrastructure (analytics@webscale, at fb 2013)

  • 1. Data Infrastructure at LinkedIn. Jun Rao and Sam Shah. LinkedIn Confidential ©2013 All Rights Reserved
  • 2. Outline: 1. LinkedIn introduction; 2. Online/nearline infrastructure; 3. Offline infrastructure; 4. Conclusion
  • 3. The World’s Largest Professional Network. Connecting Talent ↔ Opportunity. At scale… 200M+ members worldwide; 2 new members per second; 100M+ monthly unique visitors; 2M+ company pages
  • 4. Two Product Families. For Members (Professionals): People You May Know; Who’s Viewed My Profile; Jobs You May Be Interested In; News/Sharing; Today; Search; Subscriptions. For Partners (Companies): Hire; Market; Sell. Underlying layers: Science and Analytics; Data Infrastructure; Data (Actions, Profiles, Connections, Content)
  • 5. The Big-Data Feedback Loop. Diagram labels: Member, Product, Data, Science, Analytics, Infrastructure; Refinement, Engagement, Value, Insights, Virality, Signals, Scale
  • 6. LinkedIn Data Infrastructure: Three-Phase Abstraction. Diagram labels: Users, Application, Infrastructure (Online Data Infra, Near-Line Infra, Offline Data Infra). Latency & Freshness Requirements, by product: Online, activity that should be reflected immediately (Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills); Near-Line, activity that should be reflected soon (Activity Streams, Profile Standardization, News, Recommendations, Search, Messages); Offline, activity that can be reflected later (People You May Know, Connection Strength, News, Recommendations, Next best idea…)
  • 7. LinkedIn Data Infrastructure: Sample Stack. Infra challenges in the 3-phase ecosystem are diverse, complex and specific. Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
  • 8. LinkedIn Data Infrastructure Solutions. Voldemort: Highly-Available Distributed KV Store. Key/value access at scale
  • 9. Voldemort: Architecture. Pluggable components; tunable consistency / availability; key/value model, server-side “views”. Scale: 10 clusters, 100+ nodes; largest cluster, 10K+ qps; avg latency, 3ms; hundreds of stores; largest store, 2.8TB+
  • 10. LinkedIn Data Infrastructure Solutions. Espresso: Indexed Timeline-Consistent Distributed Data Store. Fills the gap between Oracle and the KV store
  • 11. Espresso: System Components. Hierarchical data model; timeline consistency; rich functionality (transactions, secondary index, text search); partitioning/replication; change propagation
  • 12. Generic Cluster Manager: Helix. Generic distributed state model: config management, automatic load balancing, fault tolerance, cluster expansion and rebalancing. Used by Espresso, Databus and Search. Open source Apr 2012: https://github.com/linkedin/helix
  • 13. LinkedIn Data Infrastructure Solutions. Databus: Timeline-Consistent Change Data Capture. Delivers data store changes to apps
  • 14. Databus at LinkedIn. Diagram labels: DB, Capture Changes, Relay, Event Win, On-line Changes, Databus Client Lib, Client, Consumer 1 … Consumer n, Bootstrap, Consistent Snapshot at U. Features: transport independent of data source (Oracle, MySQL, …); transactional semantics; in-order, at-least-once delivery. Scale: tens of relays; hundreds of sources; low latency (milliseconds)
  • 15. LinkedIn Data Infrastructure Solutions. Kafka: High-Volume Low-Latency Messaging System. Log aggregation and queuing
  • 16. Kafka Architecture. Diagram labels: Producers; Brokers 1 to 4, each hosting a subset of the topic partitions (topic1-part1, topic1-part2, topic2-part1, topic2-part2); Zookeeper; Consumers. Key features: scale-out architecture; automatic load balancing; high throughput/low latency; rewindability; intra-cluster replication. Per-day stats: writes, 10+ billion messages; reads, 50+ billion messages
  • 17. LinkedIn Data Infrastructure: A few take-aways. 1. Building infrastructure in a hyper-growth environment is challenging. 2. Few vs Many: balance over-specialized (agile) vs generic (leverage-able) platforms (*). 3. Balance open-source products with homegrown platforms (**)
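
Slide 16 lists partitioned topics and rewindability among Kafka's key features. The toy sketch below illustrates the idea only and is not Kafka's API: `ToyLog`, `ToyConsumer`, their methods, and the CRC32-based key routing are all hypothetical names and choices made for this example. It assumes the log keeps messages after they are read and that each consumer tracks its own offset per partition, so a consumer can re-read old data by moving its offset back.

```python
import zlib
from collections import defaultdict


class ToyLog:
    """Toy in-memory partitioned log (hypothetical; not the Kafka API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Assumption: route by a stable hash of the key, so all messages
        # for one key land in the same partition and keep their order.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_messages=10):
        # Reading does not delete anything from the log.
        return self.partitions[partition][offset:offset + max_messages]


class ToyConsumer:
    """Consumer that owns its read position (offset) per partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition, max_messages=10):
        batch = self.log.read(partition, self.offsets[partition], max_messages)
        self.offsets[partition] += len(batch)
        return batch

    def rewind(self, partition, offset):
        # Rewindability: move the read position back to re-consume.
        self.offsets[partition] = offset


# Small demonstration with made-up keys and messages.
log = ToyLog(num_partitions=2)
p, _ = log.append("member-42", "profile_view")
log.append("member-42", "search")

consumer = ToyConsumer(log)
first = consumer.poll(p)   # reads both messages for member-42
consumer.rewind(p, 0)
again = consumer.poll(p)   # same messages, read a second time
```

Keeping the offset on the consumer side rather than in the log is what makes re-reading cheap in this sketch: the log itself never changes when a consumer rewinds.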

Editor's notes

  1. Enterprise Facing is all about Segmentation and Connections. Our base data leads to revenue-generating products. Enterprise application-building problems with deterministic life-cycles. Science is key for targeting and matching (e.g. CAP, Marketing Solutions). Key back-office play for Hiring, Sales and Marketing for 85% of the Fortune 500.
  2. Transition needs to be good. Products => data infrastructure requirements in previous slide. Not all products place the same latency and freshness requirements on our data infrastructure. The way we bucketize this is… News and recommendations show up in both nearline and offline.
  3. Data Integration is hard. Having sane and same metadata across systems. Have a schema which works across the 3 phases. Want rich, evolving schemas, and push as much of the data cleaning and conforming as possible to the source and upstream, so that near-line and offline benefit. Sessionization logic is in the warehouse (WH), which makes it hard for near-line systems to use. Extensible system where changing schema in one phase does not break downstream systems. Don’t build over-specialized systems: e.g. rather than a monitoring system just for PYMK, build Azkaban.