Xin Fu, Carl Steinbach
Hadoop Summit
Tokyo, October 26, 2016
Path to 400M* Members: LinkedIn’s Data Powered Journey
* As of Q2 2016, LinkedIn had 450M members worldwide
[Figure: timeline marked 2004, 2009, 2011, 2012, 2015]
Real Time Visualization of New Sign-ups
What Does “Data-Driven” Mean at LinkedIn?
Monitoring & Learning
What is This Phase Comprised of?
● Dashboards
● Reports
● Trend explanation
○ Short-term fluctuation: investigation
○ Long-term trend: strategic analysis
Past Challenges
Reliability
● Easily broken without operational support; huge amounts of time spent on maintenance
Diverse technology
● Self-maintained pipelines
● Various UIs with different visualization capabilities
● Redundant computation
Standardized Reporting Tool
● Reduces dependency on third-party BI tools
● Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions
Towards Real Time Monitoring
[Figure: sign-up monitoring view with the dimensions Country, Platform, Language, Browser, Signup Type, OS]
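One way to picture a breakdown of sign-ups by these dimensions is a counter keyed by dimension and value. The sketch below is an illustration under stated assumptions, not LinkedIn's monitoring pipeline: the event field names (`country`, `platform`, and so on) and the helper `count_by_dimension` are hypothetical, chosen only to mirror the dimensions named on the slide.

```python
from collections import Counter, defaultdict

# Hypothetical field names mirroring the slide; the real event schema is not shown.
DIMENSIONS = ("country", "platform", "language", "browser", "signup_type", "os")

def count_by_dimension(events, dimensions=DIMENSIONS):
    """Return {dimension: Counter(value -> sign-up count)} for a batch of events."""
    counts = defaultdict(Counter)
    for event in events:
        for dim in dimensions:
            # Missing fields are bucketed as "unknown" rather than dropped.
            counts[dim][event.get(dim, "unknown")] += 1
    return counts

# Made-up sample events, for illustration only.
sample_events = [
    {"country": "JP", "platform": "mobile", "os": "iOS"},
    {"country": "JP", "platform": "desktop", "os": "Windows"},
    {"country": "US", "platform": "mobile", "os": "Android"},
]
signup_counts = count_by_dimension(sample_events)
```

In a streaming setting the same counters would be updated per time window rather than per batch; that part is left out of the sketch.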
Experimentation & Analysis
What is This Phase Comprised of?
● Experiment design
● Experiment analysis to inform ramp decisions
● Learning from multiple experiments to identify what works and what doesn’t work
Past Challenges
Experiment design
● Interaction between experiments
Experiment analysis and ramp decision
● Manual analysis, extended time-to-decision
● Ramp decisions based on localized metrics
● Reruns sometimes needed due to undetected errors in setup
Worst of all, some ramps happened without A/B testing
● e.g. infrastructural changes
Experimentation Platform @ LinkedIn
● Company-wide platform for A/B testing, ramping, and advanced targeting needs
● Automated reporting and analysis capabilities
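The slides do not say which statistics the platform computes. As a minimal sketch of what automated experiment analysis can look like, the following compares a conversion rate between a control and a treatment group with a two-proportion z-test. The function name and the sample counts are assumptions made for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) comparing the conversion rates of two groups."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: control converts 1000 of 10000, treatment 1100 of 10000.
z_score, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

A ramp decision would typically weigh such a result across several metrics rather than a single localized one, which is the problem the "Past Challenges" slide points at.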
Tiering of Metrics
Metrics at different tiers have:
● Different review processes
● Different levels of visibility in dashboards and experiment scorecards
● Different computation priorities and SLAs in data pipelines
● Different life cycles
Backend Infrastructure for Tracking & Instrumentation
Tracking Data Records User Activity
InvitationClickEvent()
Scale fact: ~1,000 tracking event types, ~20 TB per day, hundreds of metrics & data products
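For a sense of what the scale fact implies, the daily volume can be converted into an average rate. The arithmetic below is a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes), a uniform rate over the day, and equally sized event types, none of which the slide states.

```python
# Back-of-the-envelope conversion of the slide's "~20 TB per day" figure.
# Assumes decimal units (1 TB = 10**12 bytes) and a uniform rate over the day.
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 20 * BYTES_PER_TB
avg_mb_per_second = daily_bytes / SECONDS_PER_DAY / 10**6

# Average daily volume per event type, assuming ~1000 types of equal size.
avg_gb_per_event_type = daily_bytes / 1000 / 10**9

print(f"{avg_mb_per_second:.0f} MB/s, {avg_gb_per_event_type:.0f} GB per event type per day")
```

Under these assumptions the average works out to roughly 231 MB/s; real traffic would peak well above the daily average.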
Tracking Data Lifecycle and Teams
Product teams: PMs, Developers, TestEng
Infra teams: Hadoop, Kafka, DWH, ...
Data teams: Analytics, Relevance Engineers, ...
Example: How Do We Track a Profile View?
PageViewEvent
Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {
      "string" : "LinkedIn"
    },
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {
    "vieweeID" : "23456",
    ...
  }
}
-- Keep only the page views that hit the profile page.
pageViews = LOAD '/data/tracking/PageViewEvent';
profileViews = FILTER pageViews BY header.pageKey == 'profile_page';
Example: How Do We Track a Profile View?
PageViewEvent
Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {
      "string" : "LinkedIn"
    },
    "pageKey" : "new_profile_page"
  },
  "trackingInfo" : {
    "vieweeID" : "23456",
    ...
  }
}
-- Every consumer script now has to list both the old and the new page key.
pageViews = LOAD '/data/tracking/PageViewEvent';
profileViews = FILTER pageViews BY
    header.pageKey == 'profile_page' OR
    header.pageKey == 'new_profile_page';
At Some Point It Becomes Unmaintainable ...
How Do We Handle Old and New?
[Figure: Producers and Consumers]
We had been working on something that could help...
DALI: A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to focus solely on the logical concerns
● Logical Tables + Views
● Logical FileSystem
DALI: Implementation Details in Context
[Figure: architecture diagram with the following components]
● Data Catalog + Discovery (DALI)
● DaliFileSystem Client
● Data Source (HDFS)
● Data Sink (HDFS)
● Processing Engine (MapReduce, Spark, Presto)
● DALI Datasets (Tables + Views)
● Query Layers (Hive, Pig, Spark)
● View Defs + UDFs (Artifactory, Git)
● Dataflow APIs (MR, Spark, Scalding)
● DALI CLI
Solving with DALI Views
[Figure: Producers and Consumers]
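The idea behind a view can be sketched in a few lines. The following is an illustration only, not DALI's actual API: `PROFILE_PAGE_KEYS` and `profile_views` are hypothetical names, `feed_page` is a made-up page key, and plain Python dictionaries stand in for tracking records. The point is that a producer-owned view absorbs the `pageKey` change, so consumer code stays the same when `new_profile_page` is introduced.

```python
# Hypothetical stand-in for a producer-maintained view definition.
# When a new page key ships, only this set changes.
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}

def profile_views(page_view_events):
    """Logical profile-view 'view': yield (viewer, viewee, time) tuples."""
    for event in page_view_events:
        header = event["header"]
        if header["pageKey"] in PROFILE_PAGE_KEYS:
            yield (header["memberId"], event["trackingInfo"]["vieweeID"], header["time"])

# Sample records shaped like the PageViewEvent example on the earlier slides.
events = [
    {"header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12345, "time": 1454745292999, "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "34567"}},
    {"header": {"memberId": 12345, "time": 1454745293100, "pageKey": "feed_page"},
     "trackingInfo": {}},
]

# Consumer code depends only on the view, not on individual page keys.
viewed = list(profile_views(events))
```

Compared with the Pig scripts earlier in the deck, the filter condition lives in one place instead of being copied into every consumer script.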
State of the World Today with DALI
~100 producer views
~200 consumer views
~80 unique tracking event data sources
What’s next?
● Views on streaming data
● Selective materialization and caching
● Open source
At the Core of “Data-Driven” is ....
Used to Be a Tug of War Between Speed and Quality
Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality
Cultural Aspects: Partnership Between Data Scientists and Engineers
Interesting Challenges
- Metric trade-offs, e.g. engagement vs. monetization
- Real-time everything?
- A/B testing in a social network
- Human judges for personalized search
- Value of an action
It Took a Village
Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!
https://engineering.linkedin.com/data
