1. A Small Overview of Big Data Products,
Analytics and Infrastructure at LinkedIn
Bhaskar Ghosh
Senior Director of Engineering
Data Infrastructure
LinkedIn Confidential ©2013 All Rights Reserved
Big Data Science
A Symposium in Honor of Martin Schultz
Yale University
26 Oct 2012
2. Outline
1. Martin and Me
2. Company and Mission
3. Products and Science
4. Data Infrastructure
5. P, S, DI: People You May Know
6. LinkedIn + Yale
7. Conclusion
3. Martin and Me
Thank you Martin! Best mentor.
Versatility, big-picture thinking and leadership.
Yale CS Ph.D. 1995 (Parallel Algorithms)
12y @ Informix & Oracle building parallel database systems
4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization
2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization
4. The World’s Largest Professional Network
175M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent with Opportunity. At scale…
5. ..and a bunch of Data-Driven Products
Pandora Search for People
Events You May Be Interested In
Groups browse maps
7. LinkedIn Product Philosophy
Goals
Provide a uniquely personalized experience to members (professionals)
Build an ecosystem to balance the interests of members and partners (companies)
Approach
Launch Often and Early
Data-Driven Experiment and Test
Fail Fast
Prepare for Virality and Scale
8. Two Product Families
[Diagram labels: Data; Data Infrastructure; Science and Analytics; Professionals; Companies; Connections; Profiles; Actions; Content]
For Members: People You May Know; Who’s Viewed My Profile; Jobs You May Be Interested In; News/Sharing; Today; Search; Subscriptions
For Partners: Hire; Market; Sell
9. The Big-Data Feedback Loop
[Diagram: a feedback loop connecting Member, Product, Data and Science. Other labels: Value, Insights, Scale, Engagement, Virality, Signals, Refinement, Infrastructure, Analytics]
10. Member-Facing Products: Diversity at Scale
Product Family: Identity and Engagement
Products: 1. Profile and Connections; 2. Activity Streams; 3. Messages (email); 4. Endorsements & Skills
Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
Product Family: Search and Analysis
Products: 1. People Search; 2. Group Search; 3. Who Viewed My Profile
Product Family: Recommendations
Products: 1. People You May Know; 2. Jobs You May Be Interested In; 3. Events You May Be Interested In
Science: Entity disambiguation and matching
Product Family: Monetization
Products: 1. Subscription Packages; 2. Sponsored Content
Science: Response Prediction; Inventory Forecasting
Data Infra
11. Recommendations…Are Effective .. And Drive
> 50% of connections
> 50% of job applications
> 50% of group joins
• Find data that is useful for Members
• Guiding Principle
• Provide Relevant Content
• Establish Social Connections
• In Appropriate Context
13. Member-Facing Products: Diversity at Scale
Product Family: Identity and Engagement
Products: 1. Profile and Connections; 2. Activity Streams; 3. Messages (email); 4. Endorsements & Skills
Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
Product Family: Search and Analysis
Products: 1. People Search; 2. Group Search; 3. Who Viewed My Profile
Product Family: Recommendations
Products: 1. People You May Know; 2. Jobs You May Be Interested In; 3. Events You May Be Interested In
Science: Entity disambiguation and matching
Product Family: Monetization
Products: 1. Subscription Packages; 2. Sponsored Content
Science: Response prediction
Data Infra
• Scale
• Full text and secondary ind
• Real-time
• Faceted search
• Near RT index freshness
• Drill-down exploration
• Graph analysis
• Content serving
• Real-time tuning
14. LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram labels: Users; Application; Online Data Infra; Near-Line Infra; Offline Data Infra]
Infrastructure: Latency & Freshness Requirements; Products
Online: Activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills
Near-Line: Activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages
Offline: Activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…
15. LinkedIn Data Infrastructure: Sample Stack
Infra challenges in the 3-phase ecosystem are diverse, complex and specific.
Some off-the-shelf.
Significant investment in home-grown, deep and interesting platforms.
16. LinkedIn Data Infrastructure: Data Stores
Systems Capabilities
Transactions
Rich structures (e.g. indexes)
Change capture capability
Key value / document storage
Voldemort
ICDE 2012 (Data Infra Overview) FAST 2012 (Voldemort for Serving)
17. LinkedIn Data Infrastructure: Specialized Indexes
Systems Capabilities
Search platform: Zoie, Bobo, Sensei
Distributed graph engine: GraphDB
18. LinkedIn Data Infrastructure: Pipelines
Systems Capabilities
Messaging for site events, monitoring
High throughput
Change data capture stream
Reliable, consistent, low latency pipe
ACM SOCC 2012: “Databus” IEEE Data Eng. Bulletin 2012: “Kafka”
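To make the "change data capture stream" idea concrete, here is a minimal in-memory sketch: writers append ordered change events to a log, and each consumer tracks its own read offset so it can catch up at its own pace. The names `ChangeLog` and `Consumer` are hypothetical and chosen for illustration; this is not the Databus or Kafka API.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Append-only, ordered log of change events (in-memory stand-in for a pipe)."""
    events: list = field(default_factory=list)

    def append(self, key, value):
        seq = len(self.events)           # sequence number = position in the log
        self.events.append((seq, key, value))
        return seq

    def read_from(self, offset, max_events=100):
        return self.events[offset:offset + max_events]


class Consumer:
    """Applies changes in log order and remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0                  # next sequence number to read
        self.view = {}                   # derived state, e.g. a cache or index

    def poll(self):
        batch = self.log.read_from(self.offset)
        for seq, key, value in batch:
            self.view[key] = value       # later changes overwrite earlier ones
            self.offset = seq + 1
        return len(batch)
```

Because the consumer owns its offset, a slow or restarted consumer simply resumes from where it stopped, which is one common way to get the consistent, ordered delivery the slide asks for.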
19. LinkedIn Data Infrastructure: Off-line Analysis
Systems Capabilities
ML, Ranking, Relevance
Insights and Analytics
ETL, Metadata and Pipes
Business Source of Truth
20. LinkedIn Data Infrastructure: Cluster Management
Systems Capabilities
Generic framework for building distributed systems
Cluster Management Primitives
ACM SOCC 2012: Untangling Cluster Management with Helix
21. HELIX: Generalizing Cluster Management
[Diagram: replica STATE MACHINE over states O, S and M with transitions t1, t2, t3, t4]
CONSTRAINTS: COUNT(S) = 2; COUNT(M) = 1; $t_1 \le 5$
OBJECTIVE: $\text{minimize}\left(\max_{n_j \in N} S(n_j)\right)$; $\text{minimize}\left(\max_{n_j \in N} M(n_j)\right)$
Helix
Declare distributed system behavior via {S, C, O}
Enforce Partition constraints
Fault detection and tolerance (e.g. promote S to M)
Elasticity (e.g. Re-balance; Minimize migrations)
Used in Espresso, Search, Databus
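As a rough illustration of declaring behavior via {S, C, O}, the sketch below places one M replica and two S replicas per partition (the count constraints) while greedily keeping the per-node maximum of M and S replicas low (the objective), and promotes an S to M when a node fails. It assumes M and S stand for master and slave roles. The function names are hypothetical and the greedy placement is a stand-in; it is not Helix's API or its actual algorithm.

```python
def assign_replicas(partitions, nodes, slaves_per_partition=2):
    """Greedy placement: 1 M and N S replicas per partition, on distinct nodes."""
    if len(nodes) < slaves_per_partition + 1:
        raise ValueError("need at least one node per replica")
    masters = {n: 0 for n in nodes}      # M replicas hosted per node
    slaves = {n: 0 for n in nodes}       # S replicas hosted per node
    assignment = {}
    for p in partitions:
        # Objective: keep max M per node low, so pick the least-loaded node.
        m = min(nodes, key=lambda n: (masters[n], slaves[n], n))
        masters[m] += 1
        chosen = []
        for _ in range(slaves_per_partition):
            # Constraint: a partition's replicas live on distinct nodes.
            free = [n for n in nodes if n != m and n not in chosen]
            s = min(free, key=lambda n: (slaves[n], masters[n], n))
            slaves[s] += 1
            chosen.append(s)
        assignment[p] = {"M": m, "S": chosen}
    return assignment


def promote_on_failure(assignment, failed_node):
    """Fault tolerance: drop the failed node and promote an S to M if needed."""
    for roles in assignment.values():
        roles["S"] = [n for n in roles["S"] if n != failed_node]
        if roles["M"] == failed_node:
            roles["M"] = roles["S"].pop(0) if roles["S"] else None
    return assignment
```

A real controller would also re-create the lost replicas elsewhere and try to minimize migrations; that part is left out to keep the sketch short.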
22. LinkedIn Data Infrastructure: A few take-aways
1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.
2. Balance open-source products with home-grown platforms (**)
3. Operability, Capacity Planning and On-line Multi-tenancy are hard
4. Data Movement: Pipes and Feedback Loops are critical (**)
5. Data Model and Integration e2e are key (*)
6. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)
7. Off-line Multi-Platform story is evolving.
23. Science and Infrastructure: Giving Back
Research Publications
ACM SOCC 2012
ACM RecSys 2012
SIGIR 2012
CIKM 2012
VLDB 2012
ICDE 2012
FAST 2012
NetDB 2011
…
Open Source Projects
Apache Helix (new)
ParSeq (new)
DataFu (new)
Apache Kafka
Sensei
Azkaban
Voldemort
25. Probability that you may know someone else?
[Diagram: Bob, Alice and Carol; the edge between Bob and Carol is marked "??"]
Known as “triangle closing”
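A minimal sketch of triangle closing, assuming the member graph fits in a dictionary of adjacency sets: for a given member, count how many common connections each friend-of-a-friend shares with them. The function name and the toy graph are illustrative, not taken from the talk.

```python
from collections import Counter


def triangle_closing_candidates(graph, member):
    """Return (candidate, common_connection_count) pairs, most common first."""
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for fof in graph.get(friend, set()):
            # Skip the member themselves and people they already know.
            if fof != member and fof not in direct:
                counts[fof] += 1
    return counts.most_common()


# Toy graph: Alice knows both Bob and Carol; Bob and Carol are not connected.
graph = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice"},
    "Carol": {"Alice"},
}
candidates_for_bob = triangle_closing_candidates(graph, "Bob")
```

Here Carol is the only candidate for Bob, with Alice as the single common connection that would close the triangle.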
26. PYMK: Science, Members and Connections
1) Feature selection is key
Common Connections
Geo
Company
Age
2) ML and data model
• Traditional ML (e.g. matrix factorization) over the O(n^2) candidate pairs of 175M members tends not to scale easily
3) Interplay: Data Model + ML + Parallel Computation model
4) Adding edges: Why do it?
• Creates positive-feedback social loops for members
• More useful content and activity available to members
• Denser graph improves signal strength in science-driven products
[Diagram: The Feedback Loop, connecting Member, Product, Data and Science. Other labels: Virality, Value, Insights, Signals]
27. PYMK: Off-line Model Build
Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line.
Very complex workflow due to extraction and selection of a large number of features.
Built Azkaban for Hadoop.
Small input and final look-up structure, but large intermediate data (100s of TB) due to MapReduce (MR). The problem (who you do not know) itself has an inherent blow-up.
Special optimizations (e.g. Bloom Join to remove already-connected pairs).
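One way to picture the Bloom join: build a compact Bloom filter over existing connection pairs, then drop any candidate pair the filter reports as present. The sketch below is single-process and uses hypothetical names (`BloomFilter`, `pair_key`, `drop_connected`); the actual Hadoop job is not described in the slides. A Bloom filter can give false positives, so a few unconnected pairs may be dropped, but a connected pair is never kept.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def pair_key(a, b):
    """Order-independent key, since connections are undirected."""
    return "|".join(sorted((a, b)))


def drop_connected(candidates, connections):
    """Keep only candidate pairs that the filter says are not yet connected."""
    bloom = BloomFilter()
    for a, b in connections:
        bloom.add(pair_key(a, b))
    return [(a, b) for a, b in candidates if pair_key(a, b) not in bloom]
```

The filter is far smaller than the connection list itself, which is what makes it cheap to ship to every task that holds a slice of the candidate pairs.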
28. PYMK: Off-line to Near-Line Serving
Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
Serving: Per-partition index in memory. PYMK blobs on disk.
Retrieval ~msec. Decoration in App FE is more expensive.
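The bulk load, atomic swap and fast rollback pattern can be sketched as a store that keeps a few immutable versions and switches readers between them by reassigning one reference. `ReadOnlyStore` is a hypothetical in-memory class for illustration only, not Voldemort's interface.

```python
class ReadOnlyStore:
    """Serves one immutable snapshot at a time; keeps older ones for rollback."""

    def __init__(self, keep_versions=2):
        self.keep_versions = keep_versions
        self._versions = []              # (version_id, snapshot), oldest first
        self._current = None             # snapshot that reads are served from

    def bulk_load(self, version_id, data):
        snapshot = dict(data)            # built off to the side, then frozen
        self._versions.append((version_id, snapshot))
        self._versions = self._versions[-self.keep_versions:]
        self._current = snapshot         # the swap is one reference assignment

    def rollback(self):
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()             # discard the bad version
        self._current = self._versions[-1][1]

    @property
    def version(self):
        return self._versions[-1][0] if self._versions else None

    def get(self, key, default=None):
        if self._current is None:
            return default
        return self._current.get(key, default)
```

Readers never see a half-loaded data set: they are served either the old snapshot or the new one, and a bad push is undone by pointing back at the previous snapshot.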
29. PYMK: Science and Feedback Loop
Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). A “Social” phenomenon.
Very agile feature: Lots of on-line A/B testing and tweaking of features.
Huge Impact: > 50% of accepted invites are created by PYMK
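On-line A/B testing needs a stable way to split members into variants. A common approach, sketched here under the assumption that members have string or integer ids, is to hash the experiment name together with the member id. The function name and variant labels are hypothetical; the slides do not describe LinkedIn's experimentation system.

```python
import hashlib


def assign_variant(member_id, experiment, variants=("control", "treatment")):
    """Deterministically map a member to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same member always lands in the same bucket for the same experiment.
first = assign_variant(42, "pymk-ranking-tweak")
second = assign_variant(42, "pymk-ranking-tweak")
```

Including the experiment name in the hash keeps bucket assignments independent across experiments, so one tweak does not systematically share its treatment group with another.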
30. PYMK: Tying It All Together
P(B knows C): large number of features
Features: Distance; Common connections; Organizational Overlap; Age
[Diagram: member graph with Bob, Alice, Carol, Dave and Eve]
[Diagram labels: Offline Model; Near-Line Serving; Offline; Near-Line; User Interactions; PYMK Application]
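A minimal sketch of turning such features into P(B knows C), assuming a logistic model. The feature names mirror the slide, but the weights and bias are made-up illustrative values, not parameters from the talk; in practice they would be learned off-line from accepted and ignored suggestions.

```python
import math

# Illustrative weights only; real values would be fit from interaction data.
WEIGHTS = {
    "common_connections": 0.8,      # more shared connections, more likely
    "organizational_overlap": 1.2,  # e.g. years at the same company or school
    "graph_distance": -0.6,         # farther apart in the graph, less likely
    "age_gap_years": -0.05,         # larger age gap, slightly less likely
}
BIAS = -3.0


def p_knows(features):
    """Logistic score in (0, 1) from a dict of feature values."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


weak = p_knows({"common_connections": 1, "graph_distance": 2})
strong = p_knows({"common_connections": 10, "graph_distance": 2,
                  "organizational_overlap": 1})
```

Candidates would then be ranked by this score, and the top few per member written into the serving structure.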
31. LinkedIn + Yale
What is my career path?
How can I prepare?
How do I get my first internship and first job?
Students
Where did my students go after they left the university?
How is my school seeding the various industries with the best talent?
How does my school compare with other institutions?
Students: Transformation of Careers
Yale: Get a data-driven view; Uncover opportunities; Wins based on data and insights
32. Thank you colleagues for the beautiful slides!
David Henke, SVP Operations
Amy Tang, Sr. Program Manager
Sam Shah, Principal Engineer
Shirshanka Das, Principal Engineer
Kapil Surlaker, Principal Engineer
Anmol Bhasin, Sr. Engineering Manager
Daniel Tunkelang, Principal Data Scientist
33. Summary
Read more @ data.linkedin.com
1. E2E: The Big-Data feedback loop of social-network product design is cool
2. Infrastructure
   1. Data Infrastructure needs continuous innovation and iteration to keep pace with scale and cost.
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology and Science
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact.
34. Help us. Come Have Fun with Us!
Info: data.linkedin.com
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…