Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

An overview of LinkedIn's data-driven products and infrastructure, presented on 26 Oct 2012 at the big-data symposium held in honor of the retirement of my PhD advisor, Dr Martin H. Schultz.


  1. A Small Overview of Big Data Products, Analytics and Infrastructure at LinkedIn. Bhaskar Ghosh, Senior Director of Engineering, Data Infrastructure. Big Data Science: A Symposium in Honor of Martin Schultz, Yale University, 26 Oct 2012. LinkedIn Confidential ©2013 All Rights Reserved.
  2. Outline: (1) Martin and Me; (2) Company and Mission; (3) Products and Science; (4) Data Infrastructure; (5) P, S, DI: People You May Know; (6) LinkedIn + Yale; (7) Conclusion.
  3. Martin and Me: Thank you, Martin! Best mentor: versatility, big-picture thinking and leadership. Yale CS Ph.D. 1995 (Parallel Algorithms); 12 years at Informix and Oracle building parallel database systems; 4 years at Yahoo! building Ads systems and leading the Display Ads Exchange organization; 2+ years at LinkedIn building and leading the Data Infrastructure Engineering organization.
  4. The World's Largest Professional Network: 175M+ members worldwide; 2 new members per second; 100M+ monthly unique visitors; 2M+ company pages. Connecting talent with opportunity, at scale.
  5. ...and a bunch of Data-Driven Products: Pandora; Search for People; Events You May Be Interested In; Groups browse maps.
  6. The LinkedIn Mission: connect the world's professionals to make them more productive and successful.
  7. LinkedIn Product Philosophy.
      Goals: provide a uniquely personalized experience to members (professionals); build an ecosystem to balance the interests of members and partners (companies).
      Approach: launch often and early; data-driven experiment and test; fail fast; prepare for virality and scale.
  8. Two Product Families (diagram layers: Data; Data Infrastructure; Science and Analytics).
      Data: Professionals; Companies; Connections; Profiles; Actions; Content.
      For Members: People You May Know; Who's Viewed My Profile; Jobs You May Be Interested In; News/Sharing; Today; Search; Subscriptions.
      For Partners: Hire; Market; Sell.
  9. The Big-Data Feedback Loop (diagram). Nodes: Member, Product, Data, Science, Infrastructure, Analytics. Edge labels: Value, Insights, Scale, Engagement, Virality, Signals, Refinement.
  10. Member-Facing Products: Diversity at Scale (table with columns Product Family, Products, Science, Data Infra).
      Identity and Engagement: Profile and Connections; Activity Streams; Messages (email); Endorsements & Skills.
      Search and Analysis: People Search; Group Search; Who Viewed My Profile.
      Recommendations: People You May Know; Jobs You May Be Interested In; Events You May Be Interested In.
      Monetization: Subscription Packages; Sponsored Content.
      Science topics listed: blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings); entity disambiguation and matching; response prediction; inventory forecasting.
  11. Recommendations are effective and drive more than 50% of connections, more than 50% of job applications and more than 50% of group joins. Guiding principle: find data that is useful for members; provide relevant content; establish social connections; in appropriate context.
  12. LinkedIn Recommendation Engine (diagram).
      Recommendation types: Behavior Analysis; Collaborative Filtering; Popularity.
      Recommendation entities and products: People (Similar Profiles, Referral Center, Talent Match, People Browse Map); Jobs (Jobs Browse Map, Similar Jobs, Jobs You May Be Interested In); Groups (GYML, Groups Browse Map, Similar Groups); Ads; Companies; Searches; News; Events; and more.
      Shared, dynamic, unified core service: User Feedback; API; (R-T) feature extraction, entity resolution and enrichment; (R-T) matching computations; A/B; offline data munging (Hadoop).
  13. Member-Facing Products: Diversity at Scale (table with columns Product Family, Products, Science, Data Infra).
      Identity and Engagement: Profile and Connections; Activity Streams; Messages (email); Endorsements & Skills.
      Search and Analysis: People Search; Group Search; Who Viewed My Profile.
      Recommendations: People You May Know; Jobs You May Be Interested In; Events You May Be Interested In.
      Monetization: Subscription Packages; Sponsored Content.
      Science topics listed: blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings); entity disambiguation and matching; response prediction.
      Data Infra requirements: scale; full text and secondary ind; real-time; faceted search; near-RT index freshness; drill-down exploration; graph analysis; content serving; real-time tuning.
  14. LinkedIn Data Infrastructure: Three-Phase Abstraction (diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra; table with columns Infrastructure, Latency & Freshness Requirements, Products).
      Online (activity that should be reflected immediately): Member Profiles; Company Profiles; Connections; Messages; Endorsements; Skills.
      Near-Line (activity that should be reflected soon): Activity Streams; Profile Standardization; News; Recommendations; Search; Messages.
      Offline (activity that can be reflected later): People You May Know; Connection Strength; News; Recommendations; Next best idea…
  15. LinkedIn Data Infrastructure: Sample Stack. Infra challenges in the 3-phase ecosystem are diverse, complex and specific. Some components are off-the-shelf; there is significant investment in home-grown, deep and interesting platforms.
  16. LinkedIn Data Infrastructure: Data Stores. Capabilities: transactions; rich structures (e.g. indexes); change capture capability; key-value / document storage. Systems: Voldemort. References: ICDE 2012 (Data Infra Overview); FAST 2012 (Voldemort for Serving).
  17. LinkedIn Data Infrastructure: Specialized Indexes. Capabilities: search platform; distributed graph engine. Systems: Zoie; Bobo; Sensei; GraphDB.
  18. LinkedIn Data Infrastructure: Pipelines. Capabilities: messaging for site events and monitoring; high throughput; change data capture stream; reliable, consistent, low-latency pipe. References: ACM SOCC 2012, "Databus"; IEEE Data Eng. Bulletin 2012, "Kafka".
  19. LinkedIn Data Infrastructure: Off-line Analysis. Capabilities: ML, ranking, relevance; insights and analytics; ETL, metadata and pipes; business source of truth.
  20. LinkedIn Data Infrastructure: Cluster Management. Capabilities: generic framework for building distributed systems; cluster management primitives. Reference: ACM SOCC 2012, "Untangling Cluster Management with Helix".
  21. HELIX: Generalizing Cluster Management.
      State machine (diagram): states S, M and O, with transitions t1, t2, t3, t4.
      Constraints (diagram): COUNT=2; COUNT=1; t1 ≤ 5.
      Objective (diagram): minimize(max_{nj∈N} S(nj)); minimize(max_{nj∈N} M(nj)).
      Helix: declare distributed system behavior via {S, C, O}; enforce partition constraints; fault detection and tolerance (e.g. promote S to M); elasticity (e.g. re-balance; minimize migrations); used in Espresso, Search, Databus.
  22. LinkedIn Data Infrastructure: A few take-aways.
      1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.
      2. Balance open-source products with home-grown platforms (**)
      3. Operability, Capacity Planning and On-line Multi-tenancy are hard.
      4. Data Movement: Pipes and Feedback Loops are critical (**)
      5. Data Model and Integration e2e are key (*)
      6. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)
      7. Off-line Multi-Platform story is evolving.
  23. Science and Infrastructure: Giving Back.
      Research publications: ACM SOCC 2012; ACM RecSys 2012; SIGIR 2012; CIKM 2012; VLDB 2012; ICDE 2012; FAST 2012; NetDB 2011; …
      Open source projects: Apache Helix (new); ParSeq (new); DataFu (new); Apache Kafka; Sensei; Azkaban; Voldemort.
  24. A Recommendation Product: People You May Know (PYMK).
  25. Probability that you may know someone else? Diagram of Bob, Alice and Carol with one edge marked "??". Known as "triangle closing".
  26. PYMK: Science, Members and Connections.
      1) Feature selection is key: common connections; geo; company; age.
      2) ML and data model: traditional ML (e.g. matrix factorization) over the O(n^2) pairs of 175M members tends not to scale easily.
      3) Interplay: data model + ML + parallel computation model.
      4) Adding edges: why do it? It creates positive-feedback social loops for members; makes more useful content and activity available to members; and a denser graph improves signal strength in science-driven products.
      The Feedback Loop (diagram). Nodes: Member, Product, Data, Science. Edge labels: Virality, Value, Insights, Signals.
  27. PYMK: Off-line Model Build.
      Use generic off-line infra (Hadoop and Pig) to build recommendations off-line.
      Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
      Small input and final look-up structure, but large intermediate data (hundreds of TB) due to MapReduce. The problem itself (who you do not know) has an inherent blow-up.
      Special optimizations (e.g. Bloom Join to remove already-connected members).
  28. PYMK: Off-line to Near-Line Serving.
      Build serving structure on Hadoop. Scan versus index compactness tradeoff.
      Voldemort: partitioned k-v; load-balancing; pluggable storage layer; failover.
      Bulk load for efficiency. Fast rollback for safety. Atomic swap.
      Serving: per-partition index in memory. PYMK blobs on disk.
      Retrieval ~msec. Decoration in App FE is more expensive.
  29. PYMK: Science and Feedback Loop.
      Response vs latency: fast refresh helps user experience (e.g. showing connections of very recent connections). "Social" phenomenon.
      Very agile feature: lots of on-line A/B testing and tweaking of features.
      Huge impact: more than 50% of accepted invites are created by PYMK.
  30. PYMK: Tying It All Together (diagram). P(B knows C), from a large number of features: distance; common connections; organizational overlap; age. People shown: Bob, Alice, Carol, Dave, Eve. Flow: Offline Model; Near-Line Serving; User Interactions; PYMK Application.
  31. LinkedIn + Yale.
      Students: What is my career path? How can I prepare? How do I get my first internship and first job?
      Questions about "my students" and "my school": Where did my students go after they left the university? How is my school seeding the various industries with the best talent? How does my school compare with other institutions?
      Wins based on data and insights. Students: transformation of careers. Yale: get a data-driven view; uncover opportunities.
  32. Thank you colleagues for the beautiful slides! David Henke (SVP Operations); Amy Tang (Sr. Program Manager); Sam Shah (Principal Engineer); Shirshanka Das (Principal Engineer); Kapil Surlaker (Principal Engineer); Anmol Bhasin (Sr. Engineering Manager); Daniel Tunkelang (Principal Data Scientist).
  33. Summary. Read more @ data.linkedin.com
      1. E2E: The Big-Data feedback loop of social-network product design is cool.
      2. Infrastructure: (1) Data Infrastructure needs continuous innovation and iteration to keep pace for scale and cost. (2) Fast moving, Big, Clean Data + Agile Metadata = Goodness. (3) Data-driven products need agile feedback infrastructure and measurement methodology.
      3. Methodology and Science: (1) Data-driven experimentation enables insights and agile products. (2) Recommendation-driven products have big impact.
  34. Help us. Come Have Fun with Us! Info: data.linkedin.com
      1. Science and Data Mining: Recommendation and Optimization Problems
      2. Next-generation ad-hoc and OLAP query processing on Hadoop
      3. Graph Computations: Off-line mining and On-line integration loops
      4. nRT Data Streams in Near-line infrastructure
      5. And much more…
  35. In Closing: Thank You! bghosh@linkedin.com
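
The "triangle closing" idea from slides 25 and 26 can be illustrated with a small Python sketch. It is not LinkedIn's implementation: the function names, the toy graph, and the choice to rank purely by the number of common connections are assumptions made for illustration. The slides list common connections as only one of several features (alongside geo, company and age) and describe a Hadoop-based build rather than an in-memory one.

```python
from collections import Counter, defaultdict
from itertools import combinations


def common_connection_counts(connections):
    """Count common connections for every pair that is not yet connected.

    `connections` maps a member to the set of members they are connected to
    (undirected, so each edge appears in both directions).
    """
    counts = defaultdict(Counter)
    for middle, neighbours in connections.items():
        # Every pair of `middle`'s neighbours forms an open or closed triangle.
        for a, b in combinations(sorted(neighbours), 2):
            if b in connections.get(a, set()):
                continue  # triangle already closed: a and b know each other
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def suggest(connections, member, k=10):
    """Return up to k (candidate, common_connection_count) pairs for `member`."""
    counts = common_connection_counts(connections)
    ranked = sorted(counts[member].items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Hypothetical toy graph using the names from the slides.
graph = {
    "Alice": {"Bob", "Carol", "Dave"},
    "Bob": {"Alice", "Eve"},
    "Carol": {"Alice", "Eve"},
    "Dave": {"Alice"},
    "Eve": {"Bob", "Carol"},
}
print(suggest(graph, "Bob"))  # [('Carol', 2), ('Dave', 1)]
```

Enumerating every pair of neighbours per member is where the blow-up mentioned on slide 27 comes from: a member with d connections contributes on the order of d^2 candidate pairs, which is tolerable here but explains why the real workflow needs a parallel off-line build.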

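Slide 27 mentions a "Bloom Join" that removes already-connected members from the candidate set but gives no details. The following Python sketch shows one common way such a join is structured, under stated assumptions: the `BloomFilter` class, its sizing, the `pair_key` helper and the in-memory `exact` set (standing in for the expensive exact join) are all hypothetical, not taken from the slides.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def pair_key(a, b):
    """Order-independent key for an undirected pair of members."""
    return "|".join(sorted((a, b)))


def drop_connected(candidates, connected_pairs):
    """Keep only candidate pairs that are not already connected."""
    bloom = BloomFilter()
    exact = set()  # stand-in for the costly exact join side
    for a, b in connected_pairs:
        key = pair_key(a, b)
        bloom.add(key)
        exact.add(key)

    kept = []
    for a, b in candidates:
        key = pair_key(a, b)
        # A Bloom miss proves the pair is unconnected, so the exact lookup
        # is only needed for the (hopefully few) pairs that hit the filter.
        if key in bloom and key in exact:
            continue
        kept.append((a, b))
    return kept
```

The point of the design is that the compact filter can be shipped to every task, so most candidate pairs are resolved locally and only filter hits have to take part in the full join against the connection data.
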
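
Slide 28 lists "Bulk load for efficiency. Fast Rollback for safety. Atomic swap." for the serving store. The Python sketch below models that pattern with an in-memory toy; it does not use or imitate the Voldemort API, and the class and method names are assumptions chosen for illustration.

```python
from types import MappingProxyType


class VersionedStore:
    """Toy read-only key-value store with versioned bulk loads."""

    def __init__(self):
        self._versions = []   # every loaded snapshot, oldest first
        self._current = None  # index of the snapshot being served

    def bulk_load(self, records):
        """Stage a complete, immutable snapshot without affecting reads."""
        self._versions.append(MappingProxyType(dict(records)))
        return len(self._versions) - 1

    def swap(self, version):
        """Start serving `version` by changing a single reference."""
        if not 0 <= version < len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._current = version

    def rollback(self):
        """Return to the previous snapshot, e.g. after a bad model build."""
        if not self._current:  # covers both "nothing served" and version 0
            raise RuntimeError("no earlier version to roll back to")
        self._current -= 1

    def get(self, member, default=None):
        """Look up the precomputed recommendations for one member."""
        if self._current is None:
            return default
        return self._versions[self._current].get(member, default)


store = VersionedStore()
v0 = store.bulk_load({"Bob": ["Carol"]})
store.swap(v0)
v1 = store.bulk_load({"Bob": ["Dave"]})  # staged; reads still see v0
store.swap(v1)
store.rollback()                          # back to v0 without rebuilding
```

Because each snapshot is complete and immutable, a reader sees either the old data set or the new one, never a mixture, and rolling back only moves the pointer.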