A Small Overview of Big Data Products,
Analytics and Infrastructure at LinkedIn
Bhaskar Ghosh
Senior Director of Engineering
Data Infrastructure
LinkedIn Confidential ©2013 All Rights Reserved
Big Data Science
A Symposium in Honor of Martin Schultz
Yale University
26 Oct 2012
Outline
1. Martin and Me
2. Company and Mission
3. Products and Science
4. Data Infrastructure
5. P, S, DI: People You May Know
6. LinkedIn + Yale
7. Conclusion
Martin and Me
Thank you Martin! Best mentor.
Versatility, big-picture thinking and leadership.
Yale CS Ph.D. 1995 (Parallel Algorithms)
12y @ Informix & Oracle building parallel database systems
4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization
2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization
The World’s Largest Professional Network
175M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent with Opportunity. At scale…
..and a bunch of Data-Driven Products
Pandora Search for People
Events You May Be Interested In
Groups browse maps
The LinkedIn Mission.
Connect the world’s professionals to make them
more productive and successful
LinkedIn Product Philosophy
Goals
• Provide a uniquely personalized experience to members (professionals)
• Build an ecosystem to balance the interests of members and partners (companies)
Approach
• Launch Often and Early
• Data-Driven Experiment and Test
• Fail Fast
• Prepare for Virality and Scale
Two Product Families
[Diagram labels: Data; Data Infrastructure; Science and Analytics; Professionals; Companies; Connections; Profiles; Actions; Content]
For Members
• People You May Know
• Who’s Viewed My Profile
• Jobs You May Be Interested In
• News/Sharing
• Today
• Search
• Subscriptions
For Partners
• Hire
• Market
• Sell
The Big-Data Feedback Loop
[Diagram: a loop connecting Member, Product, Science, Data, and Infrastructure, with arrows labeled Value, Insights, Scale, Engagement, Virality, Signals, Refinement, and Analytics]
Member-Facing Products: Diversity at Scale
Product Family / Products / Science / Data Infra
Identity and Engagement
  Products: 1. Profile and Connections; 2. Activity Streams; 3. Messages (email); 4. Endorsements & Skills
  Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
Search and Analysis
  Products: 1. People Search; 2. Group Search; 3. Who Viewed My Profile
Recommendations
  Products: 1. People You May Know; 2. Jobs You May Be Interested In; 3. Events You May Be Interested In
  Science: Entity disambiguation and matching
Monetization
  Products: 1. Subscription Packages; 2. Sponsored Content
  Science: Response Prediction; Inventory Forecasting
Recommendations… Are Effective… And Drive
> 50% of connections
> 50% of job applications
> 50% of group joins
• Find data that is useful for Members
• Guiding Principle
• Provide Relevant Content
• Establish Social Connections
• In Appropriate Context
LinkedIn Recommendation Engine
[Diagram: a shared, dynamic, unified core service behind the recommendation products]
Recommendation Types: Behavior Analysis; Collaborative Filtering; Popularity
Recommendation Entities: People; Jobs; Groups; Ads; Companies; Searches; News; Events … and more
Products: SimilarProfiles; ReferralCenter; TalentMatch; People BrowseMap; Jobs BrowseMap; SimilarJobs; Jobs You May Be Interested In; GYML; Groups BrowseMap; SimilarGroups
Components: User Feedback; API; (R-T) Feature Extraction, Entity Resolution & Enrichment; (R-T) matching computations; A/B; Offline data munging (Hadoop)
Member-Facing Products: Diversity at Scale
Product Family / Products / Science / Data Infra
Identity and Engagement
  Products: 1. Profile and Connections; 2. Activity Streams; 3. Messages (email); 4. Endorsements & Skills
  Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
Search and Analysis
  Products: 1. People Search; 2. Group Search; 3. Who Viewed My Profile
Recommendations
  Products: 1. People You May Know; 2. Jobs You May Be Interested In; 3. Events You May Be Interested In
  Science: Entity disambiguation and matching
Monetization
  Products: 1. Subscription Packages; 2. Sponsored Content
  Science: Response prediction
Data Infra
• Scale
• Full text and secondary ind
• Real-time
• Faceted search
• Near RT index freshness
• Drill-down exploration
• Graph analysis
• Content serving
• Real-time tuning
LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]
Infrastructure: Online
Latency & Freshness Requirements: Activity that should be reflected immediately
Products:
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills
Infrastructure: Near-Line
Latency & Freshness Requirements: Activity that should be reflected soon
Products:
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages
Infrastructure: Offline
Latency & Freshness Requirements: Activity that can be reflected later
Products:
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…
LinkedIn Data Infrastructure: Sample Stack
Infra challenges in the 3-phase ecosystem are diverse, complex and specific.
Some systems are off-the-shelf. Significant investment in home-grown, deep and interesting platforms.
LinkedIn Data Infrastructure: Data Stores
Capabilities
• Transactions
• Rich structures (e.g. indexes)
• Change capture capability
• Key value / document storage
Systems
• Voldemort
Publications
• ICDE 2012 (Data Infra Overview)
• FAST 2012 (Voldemort for Serving)
LinkedIn Data Infrastructure: Specialized Indexes
Capabilities
• Search platform
• Distributed graph engine
Systems
• Zoie, Bobo, Sensei
• GraphDB
LinkedIn Data Infrastructure: Pipelines
Capabilities
• Messaging for site events, monitoring
• High throughput
• Change data capture stream
• Reliable, consistent, low latency pipe
Publications
• ACM SOCC 2012: “Databus”
• IEEE Data Eng. Bulletin 2012: “Kafka”
LinkedIn Data Infrastructure: Off-line Analysis
Capabilities
• ML, Ranking, Relevance
• Insights and Analytics
• ETL, Metadata and Pipes
• Business Source of Truth
LinkedIn Data Infrastructure: Cluster Management
Capabilities
• Generic framework for building distributed systems
• Cluster Management Primitives
Publications
• ACM SOCC 2012: Untangling Cluster Management with Helix
HELIX: Generalizing Cluster Management
[Diagram: replica state machine with states O, S, and M and transitions t1, t2, t3, t4]
STATE MACHINE: states O, S, M; transitions t1, t2, t3, t4
CONSTRAINTS: COUNT=2; COUNT=1; t1 ≤ 5
OBJECTIVE: minimize( max_{n_j ∈ N} S(n_j) ); minimize( max_{n_j ∈ N} M(n_j) )
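The state machine, constraints, and objectives on this slide can be illustrated with a small placement routine. This is a minimal sketch, not the Helix API: the function `assign`, the `STATE_COUNTS` table, and the reading of the constraints as one M replica and two S replicas per partition are assumptions made for illustration. The greedy rule approximates the objective by always choosing the node that currently holds the fewest replicas in the state being placed.

```python
from collections import Counter

# Hypothetical constraint table: replicas required per partition in each state.
STATE_COUNTS = {"M": 1, "S": 2}


def assign(partitions, nodes, state_counts=STATE_COUNTS):
    """Greedily place replicas so that no node holds two replicas of the same
    partition, picking the least-loaded node for the state being placed."""
    load = {state: Counter({n: 0 for n in nodes}) for state in state_counts}
    placement = {}
    for p in partitions:
        used = set()
        placement[p] = {}
        for state, count in state_counts.items():
            for _ in range(count):
                candidates = [n for n in nodes if n not in used]
                if not candidates:
                    raise ValueError(f"not enough nodes for partition {p!r}")
                # Ties are broken by node name so the result is deterministic.
                best = min(candidates, key=lambda n: (load[state][n], n))
                load[state][best] += 1
                used.add(best)
                placement[p].setdefault(state, []).append(best)
    return placement, load


placement, load = assign(range(6), ["n1", "n2", "n3"])
print(placement[0], dict(load["M"]), dict(load["S"]))
```

A real controller would also react to node failures (for example, by promoting an S replica to M) and try to limit migrations when rebalancing; both are left out here to keep the sketch short.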
Helix
• Declare distributed system behavior via {S, C, O}
• Enforce Partition constraints
• Fault detection and tolerance (e.g. promote S to M)
• Elasticity (e.g. Re-balance; Minimize migrations)
• Used in Espresso, Search, Databus
LinkedIn Data Infrastructure: A few take-aways
1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.
2. Balance open-source products with home-grown platforms (**)
3. Operability, Capacity Planning and On-line Multi-tenancy are hard
4. Data Movement: Pipes and Feedback Loops are critical (**)
5. Data Model and Integration e2e are key (*)
6. Few vs Many: Balance over-specialized (agile) vs generic (leverage-able) platform efforts (*)
7. Off-line Multi-Platform story is evolving.
Science and Infrastructure: Giving Back
Research Publications
• ACM SOCC 2012
• ACM RecSys 2012
• SIGIR 2012
• CIKM 2012
• VLDB 2012
• ICDE 2012
• FAST 2012
• NetDB 2011
• …
Open Source Projects
• Apache Helix (new)
• ParSeq (new)
• DataFu (new)
• Apache Kafka
• Sensei
• Azkaban
• Voldemort
A Recommendation Product: People You May Know (PYMK)
Probability that you may know someone else?
[Diagram: Bob, Alice, and Carol; the edge marked ?? is the connection to predict]
Known as “triangle closing”
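Triangle closing can be sketched as counting common connections between a member and each second-degree member. The example below is a minimal illustration, assuming Alice is the shared connection between Bob and Carol; the function name `common_connection_counts` and the in-memory adjacency dictionary are hypothetical and stand in for the much larger offline computation.

```python
from collections import Counter


def common_connection_counts(graph, member):
    """Count shared connections between `member` and every member who is
    exactly two hops away (a connection of a connection)."""
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the member themselves and anyone already connected.
            if candidate != member and candidate not in direct:
                counts[candidate] += 1
    return counts


# Assumed toy graph: Alice is connected to both Bob and Carol.
graph = {
    "Bob": {"Alice"},
    "Alice": {"Bob", "Carol"},
    "Carol": {"Alice"},
}
print(dict(common_connection_counts(graph, "Bob")))
```

Each count is the number of open triangles that a new edge would close, which is one natural input to the probability asked about on this slide.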
PYMK: Science, Members and Connections
1) Feature selection is key
• Common Connections
• Geo
• Company
• Age
2) ML and data model
• Traditional ML (e.g. matrix factorization) on O(n^2) pairs of 175M members tends not to scale easily
3) Interplay: Data Model + ML + Parallel Computation model
4) Adding edges: Why do it?
• Creates positive-feedback social loops for members
• More useful content and activity available to members
• Denser graph improves signal strength in science-driven products
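The O(n^2) concern can be made concrete with a little arithmetic. The sketch below counts unordered member pairs for the 175M figure quoted in the talk; the 4 bytes per pair used for the storage estimate is purely an assumption for illustration.

```python
def candidate_pairs(n):
    """Number of unordered member pairs among n members: n * (n - 1) / 2."""
    return n * (n - 1) // 2


members = 175_000_000  # member count quoted on the slides
pairs = candidate_pairs(members)

# Hypothetical storage cost if every pair carried a single 4-byte score.
petabytes = pairs * 4 / 1e15
print(f"{pairs:.3e} pairs, about {petabytes:.0f} PB at 4 bytes per pair")
```

That is on the order of 1.5 × 10^16 pairs, which is why candidate generation has to be restricted (for example, to members a few hops apart) before any scoring happens.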
The Feedback Loop
[Diagram: Member, Product, Science, and Data, with arrows labeled Virality, Value, Insights, and Signals]
PYMK: Off-line Model Build
• Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line.
• Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
• Small input and final look-up structure but large intermediate data (100’s of TB) due to MR. The problem (who you do not know) itself has an inherent blow-up.
• Special optimizations (e.g. Bloom Join to remove connected)
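The Bloom join idea can be sketched in a few lines: a compact Bloom filter over existing connections lets most candidate pairs skip the expensive exact join, because a Bloom miss is definitive. This is a single-process illustration only; the `BloomFilter` class, `pair_key`, `remove_connected`, and the filter sizes are hypothetical, and the in-memory `exact` set stands in for the full join that a Hadoop job would run only on Bloom hits.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def pair_key(a, b):
    """Order-independent key for an undirected connection."""
    return "|".join(sorted((a, b)))


def remove_connected(candidates, connections):
    """Drop candidate pairs that are already connected."""
    bloom, exact = BloomFilter(), set()
    for a, b in connections:
        bloom.add(pair_key(a, b))
        exact.add(pair_key(a, b))
    kept = []
    for a, b in candidates:
        key = pair_key(a, b)
        # Only Bloom hits need the exact (expensive) membership check.
        if key not in bloom or key not in exact:
            kept.append((a, b))
    return kept


connections = [("Bob", "Alice"), ("Alice", "Carol")]
candidates = [("Bob", "Carol"), ("Alice", "Bob")]
print(remove_connected(candidates, connections))
```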
PYMK: Off-line to Near-Line Serving
• Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
• Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
• Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
• Serving: Per-partition index in memory. PYMK blobs on disk.
• Retrieval ~msec. Decoration in App FE is more expensive.
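The bulk load, atomic swap, and fast rollback steps can be illustrated with a tiny in-memory model. This is a sketch under stated assumptions, not Voldemort's interface: the `VersionedStore` class and its method names are hypothetical, and plain dictionaries stand in for the index and blob files built on Hadoop.

```python
class VersionedStore:
    """Read-only store whose whole data set is replaced version by version."""

    def __init__(self):
        self._versions = {}   # version number -> complete data set
        self._history = []    # activation order, newest last
        self._current = {}

    def bulk_load(self, version, records):
        # A new version is built on the side; readers are not affected.
        self._versions[version] = dict(records)

    def swap(self, version):
        # One reference assignment makes the new version visible to readers.
        self._current = self._versions[version]
        self._history.append(version)

    def rollback(self):
        # The previous version is still loaded, so going back is cheap.
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        self._current = self._versions[self._history[-1]]

    def get(self, key, default=None):
        return self._current.get(key, default)


store = VersionedStore()
store.bulk_load(1, {"bob": ["carol"]})
store.swap(1)
store.bulk_load(2, {"bob": ["dave"]})
store.swap(2)
store.rollback()
print(store.get("bob"))
```

Keeping the previous version around costs extra space, which is the price paid for being able to roll back quickly if a freshly built data set turns out to be bad.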
PYMK: Science and Feedback Loop
• Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon.
• Very agile feature: Lots of on-line A/B testing and tweaking of features
• Huge Impact: > 50% of accepted invites are created by PYMK
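On-line A/B testing needs a stable way to decide which variant a member sees. One common approach, shown here only as an assumed illustration and not as LinkedIn's mechanism, is to hash the member ID together with an experiment name; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(member_id, experiment, variants=("control", "treatment")):
    """Deterministically map a member to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same member always lands in the same bucket for the same experiment.
print(assign_variant(42, "pymk-ranking-v2"))
```

Including the experiment name in the hash keeps bucket assignments independent across experiments, so tweaking one feature does not bias the measurement of another.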
PYMK: Tying It All Together
P(B knows C) depends on a large number of features:
• Distance
• Common connections
• Organizational Overlap
• Age
[Diagram: member graph (Bob, Alice, Carol, Dave, Eve); Offline Model, Near-Line Serving, PYMK Application, and User Interactions across the Offline and Near-Line phases]
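One simple way to turn such features into P(B knows C) is a logistic model. The sketch below is a stand-in chosen for illustration; the talk does not say which model PYMK uses, and the feature names, weights, and bias are hypothetical values rather than learned ones.

```python
import math

# Hypothetical weights; a real model would learn them from member responses.
WEIGHTS = {
    "common_connections": 0.8,
    "organizational_overlap": 1.2,
    "age_gap_years": -0.05,
}
BIAS = -3.0


def connection_probability(features):
    """Logistic score in (0, 1) from a dict of numeric feature values."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


few = {"common_connections": 0, "organizational_overlap": 1, "age_gap_years": 2}
many = {"common_connections": 5, "organizational_overlap": 1, "age_gap_years": 2}
print(connection_probability(few), connection_probability(many))
```

With these assumed weights, more common connections raise the score, which matches the triangle-closing intuition; the feedback loop would then adjust the weights as members accept or ignore suggestions.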
LinkedIn + Yale
Students
• What is my career path?
• How can I prepare?
• How do I get my first internship and first job?

• Where did my students go after they left the university?
• How is my school seeding the various industries with the best talent?
• How does my school compare with other institutions?

Wins based on data and insights
Students:
• Transformation of Careers
Yale:
• Get a data-driven view
• Uncover opportunities
Thank you colleagues for the beautiful slides!
David Henke, SVP Operations
Amy Tang, Sr. Program Manager
Sam Shah, Principal Engineer
Shirshanka Das, Principal Engineer
Kapil Surlaker, Principal Engineer
Anmol Bhasin, Sr. Engineering Manager
Daniel Tunkelang, Principal Data Scientist
Summary
1. E2E: The Big-Data feedback loop of social-network product design is cool
2. Infrastructure
   1. Data Infrastructure needs continuous innovation and iteration to keep pace for scale and cost.
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology and Science
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact.
Read more @ data.linkedin.com
Help us. Come Have Fun with Us!
Info: data.linkedin.com
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…
In Closing
bghosh@linkedin.com
Thank You!
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

  • 1. A Small Overview of Big Data Products, Analytics and Infrastructure at LinkedIn. Bhaskar Ghosh, Senior Director of Engineering, Data Infrastructure. Big Data Science: A Symposium in Honor of Martin Schultz, Yale University, 26 Oct 2012. LinkedIn Confidential ©2013 All Rights Reserved
  • 2. Outline: 1. Martin and Me 2. Company and Mission 3. Products and Science 4. Data Infrastructure 5. P, S, DI: People You May Know 6. LinkedIn + Yale 7. Conclusion
  • 3. Martin and Me: Thank you Martin! Best mentor. Versatility, big-picture thinking and leadership. Yale CS Ph.D. 1995 (Parallel Algorithms); 12y @ Informix & Oracle building parallel database systems; 4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization; 2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization.
  • 4. The World’s Largest Professional Network: 175M+ Members Worldwide; 2 new Members Per Second; 100M+ Monthly Unique Visitors; 2M+ Company Pages. Connecting Talent → Opportunity. At scale…
  • 5. ..and a bunch of Data-Driven Products: Pandora Search for People; Events You May Be Interested In; Groups browse maps
  • 6. The LinkedIn Mission. Connect the world’s professionals to make them more productive and successful
  • 7. LinkedIn Product Philosophy. Goals: provide a uniquely personalized experience to members (professionals); build an ecosystem to balance the interests of members and partners (companies). Approach: Launch Often and Early; Data-Driven Experiment and Test; Fail Fast; Prepare for Virality and Scale.
  • 8. Two Product Families (diagram labels: Data, Data Infrastructure, Science and Analytics; Professionals, Companies, Connections, Profiles, Actions, Content). For Members: People You May Know; Who’s Viewed My Profile; Jobs You May Be Interested In; News/Sharing; Today; Search; Subscriptions. For Partners: Hire; Market; Sell.
  • 9. The Big-Data Feedback Loop (diagram labels: Member, Product, Data, Science, Infrastructure, Analytics; arrows: Value, Insights, Scale, Engagement, Virality, Signals, Refinement)
  • 10. Member-Facing Products: Diversity at Scale (table columns: Product Family, Products, Science, Data Infra). Identity and Engagement: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills; science: blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings). Search and Analysis: 1. People Search 2. Group Search 3. Who Viewed My Profile. Recommendations: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In; science: entity disambiguation and matching. Monetization: 1. Subscription Packages 2. Sponsored Content; science: Response Prediction, Inventory Forecasting.
  • 11. Recommendations… Are Effective .. And Drive: > 50% of connections; > 50% of job applications; > 50% of group joins. Find data that is useful for Members. Guiding Principle: Provide Relevant Content; Establish Social Connections; In Appropriate Context.
  • 13. Member-Facing Products: Diversity at Scale (table columns: Product Family, Products, Science, Data Infra). Identity and Engagement: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills; science: blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings). Search and Analysis: 1. People Search 2. Group Search 3. Who Viewed My Profile. Recommendations: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In; science: entity disambiguation and matching. Monetization: 1. Subscription Packages 2. Sponsored Content; science: response prediction. Data Infra: Scale; Full text and secondary ind; Real-time; Faceted search; Near RT index freshness; Drill-down exploration; Graph analysis; Content serving; Real-time tuning.
  • 14. LinkedIn Data Infrastructure: Three-Phase Abstraction (diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra; table columns: Infrastructure, Latency & Freshness Requirements, Products). Online: activity that should be reflected immediately; Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills. Near-Line: activity that should be reflected soon; Activity Streams, Profile Standardization, News, Recommendations, Search, Messages. Offline: activity that can be reflected later; People You May Know, Connection Strength, News, Recommendations, Next best idea…
  • 15. LinkedIn Data Infrastructure: Sample Stack. Infra challenges in 3-phase ecosystem are diverse, complex and specific. Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.
  • 16. LinkedIn Data Infrastructure: Data Stores. Capabilities: Transactions; Rich structures (e.g. indexes); Change capture capability; Key value / document storage. Systems: Voldemort. References: ICDE 2012 (Data Infra Overview); FAST 2012 (Voldemort for Serving).
  • 17. LinkedIn Data Infrastructure: Specialized Indexes. Capabilities: Search platform; Distributed graph engine. Systems: Zoie, Bobo, Sensei, GraphDB.
  • 18. LinkedIn Data Infrastructure: Pipelines. Capabilities: Messaging for site events, monitoring; High throughput; Change data capture stream; Reliable, consistent, low latency pipe. References: ACM SOCC 2012: “Databus”; IEEE Data Eng. Bulletin 2012: “Kafka”.
  • 19. LinkedIn Data Infrastructure: Off-line Analysis. Capabilities: ML, Ranking, Relevance; Insights and Analytics; ETL, Metadata and Pipes; Business Source of Truth.
  • 20. LinkedIn Data Infrastructure: Cluster Management. Capabilities: Generic framework for building distributed systems; Cluster Management Primitives. Reference: ACM SOCC 2012: Untangling Cluster Management with Helix.
  • 21. HELIX: Generalizing Cluster Management. State machine: states O, S, M with transitions t1 to t4. Constraints: COUNT=2 for S, COUNT=1 for M, t1 ≤ 5. Objective: $\min \max_{n_j \in N} S(n_j)$ and $\min \max_{n_j \in N} M(n_j)$. Helix: declare distributed system behavior via {S, C, O}; enforce partition constraints; fault detection and tolerance (e.g. promote S to M); elasticity (e.g. re-balance; minimize migrations); used in Espresso, Search, Databus.
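The {S, C, O} idea on the Helix slide can be illustrated with a toy placement routine. This is a sketch only, not Helix's API: the names `STATE_COUNTS`, `assign` and `fail_node` are hypothetical, the state names MASTER and SLAVE are assumed expansions of M and S, and the greedy heuristic is an assumed stand-in for whatever optimizer Helix actually uses to pursue the min-max objective.

```python
from collections import defaultdict

# Hypothetical constraint table: replicas required per state for each partition
# (mirrors the slide's COUNT=1 for M and COUNT=2 for S).
STATE_COUNTS = {"MASTER": 1, "SLAVE": 2}


def assign(partitions, nodes, state_counts=STATE_COUNTS):
    """Greedily place replicas so that no node holds many more replicas of a
    given state than any other (a heuristic for the min-max objective)."""
    if len(nodes) < sum(state_counts.values()):
        raise ValueError("not enough nodes to keep replicas on distinct nodes")
    load = {state: defaultdict(int) for state in state_counts}
    placement = {}
    for partition in partitions:
        replicas = {}
        for state, count in state_counts.items():
            for _ in range(count):
                free = [n for n in nodes if n not in replicas]
                # Prefer the node with the fewest replicas in this state,
                # breaking ties by total replicas held.
                node = min(free, key=lambda n: (load[state][n],
                                                sum(l[n] for l in load.values())))
                replicas[node] = state
                load[state][node] += 1
        placement[partition] = replicas
    return placement


def fail_node(placement, dead):
    """Drop a dead node; wherever it held the master, promote one surviving slave."""
    for replicas in placement.values():
        state = replicas.pop(dead, None)
        if state == "MASTER":
            slaves = sorted(n for n, s in replicas.items() if s == "SLAVE")
            if slaves:
                replicas[slaves[0]] = "MASTER"
```

Calling `assign` with six partitions and three nodes spreads the masters two per node; `fail_node` then shows the "promote S to M" step from the slide. A real system would also re-create the lost replicas elsewhere, which this sketch omits.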
  • 22. LinkedIn Data Infrastructure: A few take-aways. 1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment. 2. Balance open-source products with home-grown platforms (**) 3. Operability, Capacity Planning and On-line Multi-tenancy are hard 4. Data Movement: Pipes and Feedback Loops are critical (**) 5. Data Model and Integration e2e are key (*) 6. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*) 7. Off-line Multi-Platform story is evolving.
  • 23. Science and Infrastructure: Giving Back. Research Publications: ACM SOCC 2012; ACM RecSys 2012; SIGIR 2012; CIKM 2012; VLDB 2012; ICDE 2012; FAST 2012; NetDB 2011; … Open Source Projects: Apache Helix (new); ParSeq (new); DataFu (new); Apache Kafka; Sensei; Azkaban; Voldemort.
  • 24. A Recommendation Product: People You May Know (PYMK)
  • 25. Probability that you may know someone else? Known as “triangle closing” (diagram: Bob, Alice, Carol, with “??” on the open edge).
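A minimal sketch of triangle closing, assuming the connection graph fits in an in-memory dictionary of sets (the talk's real pipeline runs on Hadoop). The function name `pymk_candidates` and the graph layout are hypothetical, chosen for illustration.

```python
from collections import Counter


def pymk_candidates(graph, member):
    """Return (candidate, common_connection_count) pairs for `member`:
    friends of friends who are not yet direct connections."""
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for fof in graph.get(friend, set()):
            if fof != member and fof not in direct:
                counts[fof] += 1  # one more shared connection closes one more triangle
    # Most common connections first; ties broken by name for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Bob and Carol both know Alice, so Carol is a candidate for Bob.
graph = {"Alice": {"Bob", "Carol"}, "Bob": {"Alice"}, "Carol": {"Alice"}}
print(pymk_candidates(graph, "Bob"))
```

The inner double loop is the source of the blow-up: every member with k connections contributes on the order of k² candidate pairs.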
  • 26. PYMK: Science, Members and Connections. 1) Feature selection is key: Common Connections; Geo; Company; Age. 2) ML and data model: traditional ML (e.g. matrix factorization) on O(n^2) of 175M tends not to scale easily. 3) Interplay: Data Model + ML + Parallel Computation model. 4) Adding edges: Why do it? Creates positive-feedback social loops for members; more useful content and activity available to members; denser graph improves signal strength in science-driven products. The Feedback Loop (diagram labels: Member, Product, Data, Science; arrows: Virality, Value, Insights, Signals).
  • 27. PYMK: Off-line Model Build. Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line. Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop. Small input and final look-up structure but large intermediate data (100’s of TB) due to MR. The problem (who you do not know) itself has an inherent blow-up. Special optimizations (e.g. Bloom Join to remove connected).
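The "Bloom Join to remove connected" optimization can be sketched as follows. This is an assumed reading of the slide, not the production job: a Bloom filter over existing connections lets most candidate pairs be kept without an exact join, because a miss in the filter is a guaranteed non-connection, while a hit only means "maybe connected" and still needs the exact check. `BloomFilter`, `edge_key` and `split_candidates` are hypothetical names, and the filter sizes are arbitrary.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive independent bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def edge_key(a, b):
    """Order-independent key for an undirected connection."""
    lo, hi = sorted((a, b))
    return f"{lo}|{hi}"


def split_candidates(candidate_pairs, connections, num_bits=1 << 16, num_hashes=4):
    """Split candidates into pairs that are certainly unconnected and pairs
    that hit the filter and still need an exact join against connections."""
    bloom = BloomFilter(num_bits, num_hashes)
    for a, b in connections:
        bloom.add(edge_key(a, b))
    definitely_new, maybe_connected = [], []
    for a, b in candidate_pairs:
        bucket = maybe_connected if edge_key(a, b) in bloom else definitely_new
        bucket.append((a, b))
    return definitely_new, maybe_connected
```

In a MapReduce setting the filter would typically be built once and shipped to every mapper, so that only the "maybe connected" pairs are shuffled to the exact join.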
  • 28. PYMK: Off-line to Near-Line Serving. Build serving structure on Hadoop. Scan versus Index compactness tradeoff. Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover. Bulk load for efficiency. Fast Rollback for safety. Atomic swap. Serving: Per-partition index in memory. PYMK blobs on disk. Retrieval ~msec. Decoration in App FE is more expensive.
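The bulk load, atomic swap and fast rollback pattern can be sketched with an in-memory stand-in. This is not Voldemort's API: `VersionedStore` and its methods are hypothetical, and a real read-only store would swap on-disk index and data files rather than Python dictionaries.

```python
import threading


class VersionedStore:
    """Read-only key-value store that serves one immutable version at a time."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._current = dict(initial or {})
        self._previous = None

    def get(self, key, default=None):
        # Readers see whichever complete version is current; never a partial load.
        return self._current.get(key, default)

    def swap(self, new_data):
        """Install a fully built data set in one step, keeping the old one."""
        built = dict(new_data)  # the bulk load happens off to the side
        with self._lock:
            self._previous, self._current = self._current, built

    def rollback(self):
        """Return to the previous version if the new one turns out to be bad."""
        with self._lock:
            if self._previous is None:
                raise RuntimeError("no previous version to roll back to")
            self._current, self._previous = self._previous, None


store = VersionedStore({"bob": ["carol"]})
store.swap({"bob": ["dave", "eve"]})
store.rollback()
print(store.get("bob"))
```

Keeping the previous version around costs storage but makes rollback a pointer flip rather than a rebuild, which matches the slide's "Fast Rollback for safety".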
  • 29. PYMK: Science and Feedback Loop. Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon. Very agile feature: Lots of on-line A/B testing and tweaking of features. Huge Impact: > 50% of accepted invites are created by PYMK.
  • 30. PYMK: Tying It All Together. P (B knows C) → large number of features: Distance; Common connections; Organizational Overlap; Age. Diagram: members Bob, Alice, Carol, Dave, Eve; Offline Model → Near-Line Serving → PYMK Application, with User Interactions feeding back.
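To make "P (B knows C) → large number of features" concrete, here is a sketch that turns a feature dictionary into a probability. The slides do not say which model is used; logistic regression is an assumed stand-in, and the feature names, weights and bias below are invented for illustration.

```python
import math

# Hypothetical weights; a real model would learn these from accepted invitations.
WEIGHTS = {
    "common_connections": 0.8,
    "same_company": 1.2,
    "same_geo": 0.5,
    "age_gap_years": -0.05,
}
BIAS = -3.0


def p_knows(features):
    """Logistic score for 'B knows C'; missing features count as 0."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


weak = p_knows({"common_connections": 1})
strong = p_knows({"common_connections": 5, "same_company": 1})
print(weak < strong)
```

In the flow on the slide, scores like these would be computed offline and the top candidates per member pushed to the near-line serving store.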
  • 31. LinkedIn + Yale. Students: What is my career path? How can I prepare? How do I get my first internship and first job? Where did my students go after they left the university? How is my school seeding the various industries with the best talent? How does my school compare with other institutions? Wins based on data and insights. Students: Transformation of Careers. Yale: Get a data-driven view; Uncover opportunities.
  • 32. Thank you colleagues for the beautiful slides! David Henke, SVP Operations; Amy Tang, Sr. Program Manager; Sam Shah, Principal Engineer; Shirshanka Das, Principal Engineer; Kapil Surlaker, Principal Engineer; Anmol Bhasin, Sr. Engineering Manager; Daniel Tunkelang, Principal Data Scientist.
  • 33. Summary. Read more @ data.linkedin.com. 1. E2E: The Big-Data feedback loop of social-network product design is cool. 2. Infrastructure: (1) Data Infrastructure needs continuous innovation and iteration to keep pace for scale and cost. (2) Fast moving, Big, Clean Data + Agile Metadata = Goodness. (3) Data-driven products need agile feedback infrastructure and measurement methodology. 3. Methodology and Science: (1) Data-Driven experimentation enables insights and agile products. (2) Recommendation-driven products have big impact.
  • 34. Help us. Come Have Fun with Us! Info: data.linkedin.com. 1. Science and Data Mining: Recommendation and Optimization Problems 2. Next-generation ad-hoc and OLAP query processing on Hadoop 3. Graph Computations: Off-line mining and On-line integration loops 4. nRT Data Streams in Near-line infrastructure 5. And much more…
  • 35. In Closing: Thank You! bghosh@linkedin.com