A Small Overview of Big Data Products,
Analytics and Infrastructure at LinkedIn

Bhaskar Ghosh
Senior Director of Engineering, Data Infrastructure

Big Data Science: A Symposium in Honor of Martin Schultz
Yale University, 26 Oct 2012



LinkedIn Confidential ©2013 All Rights Reserved
Outline


           1.        Martin and Me
           2.        Company and Mission
           3.        Products and Science
           4.        Data Infrastructure
           5.        P, S, DI: People You May Know
           6.        LinkedIn + Yale
           7.        Conclusion




Martin and Me

                                                  Thank you Martin! Best mentor.
                                                  Versatility, big-picture thinking and leadership.
                                                  Yale CS Ph.D. 1995 (Parallel Algorithms)



                                                  12y @ Informix & Oracle building parallel
                                                  database systems


                                                  4y @ Yahoo! building Ads systems & leading
                                                  the Display Ads Exchange organization



                                                  2y+ @ LinkedIn building & leading the
                                                  Data Infrastructure Engineering Organization



The World’s Largest Professional Network
Connecting Talent with Opportunity. At scale…




175M+   Members Worldwide
2       New Members Per Second
100M+   Monthly Unique Visitors
2M+     Company Pages


..and a bunch of Data-Driven Products

    Pandora Search for People
    Groups browse maps
    Events You May Be Interested In




The LinkedIn Mission.
Connect the world’s professionals to make them
more productive and successful
LinkedIn Product Philosophy

    Goals
                                                   Provide a uniquely personalized experience to
                                                    members (professionals)
                                                   Build an ecosystem to balance the interests of
                                                    members and partners (companies)




      Approach
                                                   Launch Often and Early
                                                   Data-Driven Experiment and Test
                                                   Fail Fast
                                                   Prepare for Virality and Scale

Two Product Families

    For Members (Professionals)
        People You May Know
        Who’s Viewed My Profile
        Jobs You May Be Interested In
        News/Sharing
        Today
        Search
        Subscriptions

    For Partners (Companies)
        Hire
        Market
        Sell

    Underlying layers
        Science and Analytics
        Data Infrastructure
        Data: Profiles, Actions, Connections, Content

The Big-Data Feedback Loop

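[Figure: a loop linking Member, Product, Science, and Data, with Infrastructure underneath.
 Labels on the loop: Value ↑, Insights ↑, Signals ↑, Virality ↑.
 Labels around the outside: Engagement ↑, Refinement ↑, Analytics ↑, Scale ↑.]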

Member-Facing Products: Diversity at Scale
    Identity and Engagement
        Products:  1. Profile and Connections
                   2. Activity Streams
                   3. Messages (email)
                   4. Endorsements & Skills
        Science:   Blending and ranking of heterogeneous content
                   (e.g. Network Updates, Group Discussions, Job Postings)

    Search and Analysis
        Products:  1. People Search
                   2. Group Search
                   3. Who Viewed My Profile

    Recommendations
        Products:  1. People You May Know
                   2. Jobs You May Be Interested In
                   3. Events You May Be Interested In
        Science:   Entity disambiguation and matching

    Monetization
        Products:  1. Subscription Packages
                   2. Sponsored Content
        Science:   Response Prediction
                   Inventory Forecasting
Recommendations… Are Effective… And Drive

    > 50% of connections
    > 50% of job applications
    > 50% of group joins

    •   Find data that is useful for Members
    •   Guiding Principle
         • Provide Relevant Content
         • Establish Social Connections
         • In Appropriate Context




LinkedIn Recommendation Engine

[Diagram: layered view of the recommendation engine, top to bottom]

Recommendation Entities
    People, Jobs, Groups, … Ads, Companies, Searches, News, Events, … and more

Products
    People:  TalentMatch, People Browse Map, Similar Profiles, Referral Center
    Jobs:    Jobs You May be interested in, Jobs Browse Map, Similar Jobs
    Groups:  GYML, Groups Browse Map, Similar Groups

A/B
API

Recommendation Types
    Behavior Analysis, Collaborative Filtering, Popularity, User Feedback

Shared, Dynamic, Unified Core Service
    (R-T) Feature Extraction, Entity Resolution & Enrichment
    (R-T) matching computations
    Offline data munging (hadoop)
Member-Facing Products: Diversity at Scale
    Identity and Engagement
        Products:    1. Profile and Connections
                     2. Activity Streams
                     3. Messages (email)
                     4. Endorsements & Skills
        Science:     Blending and ranking of heterogeneous content
                     (e.g. Network Updates, Group Discussions, Job Postings)
        Data Infra:  • Scale
                     • Full text and secondary ind
                     • Real-time

    Search and Analysis
        Products:    1. People Search
                     2. Group Search
                     3. Who Viewed My Profile
        Data Infra:  • Faceted search
                     • Near RT index freshness
                     • Drill-down exploration

    Recommendations
        Products:    1. People You May Know
                     2. Jobs You May Be Interested In
                     3. Events You May Be Interested In
        Science:     Entity disambiguation and matching
        Data Infra:  • Graph analysis
                     • Content serving
                     • Real-time tuning

    Monetization
        Products:    1. Subscription Packages
                     2. Sponsored Content
        Science:     Response prediction

LinkedIn Data Infrastructure: Three-Phase Abstraction

[Diagram: Users, the Application, and three data tiers: Online Data Infra, Near-Line Infra, Offline Data Infra]




Infrastructure: latency & freshness requirements, and the products served

   Online:     Activity that should be reflected immediately
               Products: Member Profiles, Company Profiles, Connections,
                         Messages, Endorsements, Skills

   Near-Line:  Activity that should be reflected soon
               Products: Activity Streams, Profile Standardization, News,
                         Recommendations, Search, Messages

   Offline:    Activity that can be reflected later
               Products: People You May Know, Connection Strength, News,
                         Recommendations, Next best idea…
LinkedIn Data Infrastructure: Sample Stack




 Infra challenges in the 3-phase ecosystem are diverse, complex and specific.

 Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.
LinkedIn Data Infrastructure: Data Stores





                            ICDE 2012 (Data Infra Overview)           FAST 2012 (Voldemort for Serving)


                            Systems                                              Capabilities

                                                                    Transactions
                                                                    Rich structures (e.g. indexes)
                                                                    Change capture capability
                                                  Voldemort         Key value / document storage

LinkedIn Data Infrastructure: Specialized Indexes





                            Systems                                           Capabilities


                  Zoie                        Bobo   Sensei           Search platform

         GraphDB                                                      Distributed graph engine



LinkedIn Data Infrastructure: Pipelines





                                 ACM SOCC 2012: “Databus”         IEEE Data Eng. Bulletin 2012: “Kafka”


                            Systems                                            Capabilities
                                                     Messaging for site events, monitoring
                                                     High throughput

                                                     Change data capture stream
                                                     Reliable, consistent, low latency pipe
LinkedIn Data Infrastructure: Off-line Analysis





                            Systems                                   Capabilities


                                                               ML, Ranking, Relevance
                                                               Insights and Analytics
                                                               ETL, Metadata and Pipes
                                                               Business Source of Truth
LinkedIn Data Infrastructure: Cluster Management





                                 ACM SOCC 2012: Untangling Cluster Management with Helix


                            Systems                                          Capabilities

                                                              Generic framework for building
                                                               distributed systems
                                                              Cluster Management Primitives


HELIX: Generalizing Cluster Management

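        [Diagram: three inputs feed Helix]

        STATE MACHINE   States O, S, M; transitions t1 and t3 between O and S,
                        t2 and t4 between S and M
        CONSTRAINTS     COUNT=2 for S; COUNT=1 for M; t1 ≤ 5
        OBJECTIVE       minimize( max_{n_j ∈ N} S(n_j) )
                        minimize( max_{n_j ∈ N} M(n_j) )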



        Declare distributed system behavior via {S, C, O}
           Enforce Partition constraints
           Fault detection and tolerance (e.g. promote S to M)
           Elasticity (e.g. Re-balance; Minimize migrations)
        Used in Espresso, Search, Databus
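
The {S, C, O} idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Helix API: the names `TRANSITIONS`, `STATE_COUNTS`, `assign_replicas` and `promote_slave` are hypothetical, the direction of each transition is assumed, and the greedy "least-loaded node" rule is a simple stand-in heuristic for the min-max objective.

```python
from collections import defaultdict

# State machine (S): legal moves between Offline, Slave and Master.
# Which arrow corresponds to t1..t4 on the slide is an assumption.
TRANSITIONS = {("O", "S"), ("S", "M"), ("M", "S"), ("S", "O")}

# Constraints (C): replicas required per partition in each state.
STATE_COUNTS = {"M": 1, "S": 2}


def assign_replicas(partitions, nodes):
    """Place replicas so that each partition meets STATE_COUNTS.

    Objective (O), approximated greedily: each replica goes to the node that
    currently holds the fewest replicas in that state, which keeps the
    per-node maximum of M and of S low.
    """
    load = {state: defaultdict(int) for state in STATE_COUNTS}
    assignment = {}
    for partition in partitions:
        used = set()          # a node holds at most one replica of a partition
        placement = {}
        for state, count in STATE_COUNTS.items():
            for _ in range(count):
                candidates = [n for n in nodes if n not in used]
                if not candidates:
                    raise ValueError(f"not enough nodes for partition {partition!r}")
                node = min(candidates, key=lambda n: (load[state][n], n))
                load[state][node] += 1
                used.add(node)
                placement[node] = state
        assignment[partition] = placement
    return assignment, load


def promote_slave(placement, failed_node):
    """Fault tolerance: drop the failed node; if it was the master, promote a slave."""
    state = placement.pop(failed_node, None)
    slaves = sorted(n for n, s in placement.items() if s == "S")
    if state == "M" and slaves and ("S", "M") in TRANSITIONS:
        placement[slaves[0]] = "M"
    return placement


if __name__ == "__main__":
    assignment, load = assign_replicas(range(6), ["n1", "n2", "n3"])
    print(assignment[0], dict(load["M"]))
```

A real controller would also drive each replica through the state machine one transition at a time and rate-limit transitions (the `t1 ≤ 5` style of constraint); the sketch only shows the target placement.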

LinkedIn Data Infrastructure: A few take-aways

                                   1.         Infrastructure decisions matter and are hard to
                                              transform in a hyper-growth environment.
                                   2.         Balance open-source products with home-grown
                                              platforms (**)
                                   3.         Operability, Capacity Planning and On-line
                                              Multi-tenancy are hard
                                   4.         Data Movement: Pipes and Feedback Loops
                                              are critical (**)
                                   5.         Data Model and Integration e2e are key (*)
                                   6.         Few vs Many: Balance over-specialized (agile)
                                              efforts vs generic (leverage-able) platforms (*)
                                   7.         Off-line Multi-Platform story is evolving.


Science and Infrastructure: Giving Back

       Research Publications
           ACM SOCC 2012
           ACM RecSys 2012
           SIGIR 2012
           CIKM 2012
           VLDB 2012
           ICDE 2012
           FAST 2012
           NetDB 2011
           …

       Open Source Projects
           Apache Helix (new)
           ParSeq (new)
           DataFu (new)
           Apache Kafka
           Sensei
           Azkaban
           Voldemort




A Recommendation Product:

           People You May Know (PYMK)




Probability that you may know someone else?




                                                            Alice


                                                             ??

                          Bob                                                   Carol


                                                  Known as “triangle closing”
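
A minimal sketch of triangle closing, assuming the graph fits in an in-memory dictionary of adjacency sets (the function name and the toy graph are illustrative, not LinkedIn's implementation). For a given member, it counts how many common connections each not-yet-connected friend-of-friend shares with them.

```python
from collections import Counter


def triangle_closing_candidates(graph, member):
    """Return (candidate, common_connection_count) pairs, most common first.

    `graph` maps each member to the set of members they are connected to.
    A candidate is anyone reachable in two hops who is not already connected.
    """
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for friend_of_friend in graph.get(friend, set()):
            if friend_of_friend != member and friend_of_friend not in direct:
                counts[friend_of_friend] += 1
    return counts.most_common()


# Toy graph from the slide: Bob and Carol both know Alice.
graph = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice"},
    "Carol": {"Alice"},
}

if __name__ == "__main__":
    print(triangle_closing_candidates(graph, "Bob"))  # [('Carol', 1)]
```

The common-connection count is only one signal; it would normally be one feature among many in a scoring model.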

PYMK: Science, Members and Connections
[Figure: The Feedback Loop. Member, Product, Science, Data; labels: Value ↑, Insights ↑, Signals ↑, Virality ↑]

1)       Feature selection is key
             Common Connections
             Geo
             Company
             Age

2)       ML and data model
         •    Traditional ML (e.g. matrix factorization) over the O(n^2) pairs
              of 175M members tends not to scale easily
3)       Interplay: Data Model + ML + Parallel Computation model
4)       Adding edges: Why do it?
         •    Creates positive-feedback social loops for members
         •    More useful content and activity available to members
         •    Denser graph improves signal strength in science-driven
              products
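
To make the O(n^2) concern concrete, here is a back-of-the-envelope calculation. The member count is from the slides; the 4 bytes per pairwise score is an assumption chosen only to give a sense of scale.

```python
def unordered_pairs(n):
    """Number of distinct member pairs among n members: n * (n - 1) / 2."""
    return n * (n - 1) // 2


MEMBERS = 175_000_000
BYTES_PER_SCORE = 4  # assumption: one 32-bit score per pair

pairs = unordered_pairs(MEMBERS)
petabytes = pairs * BYTES_PER_SCORE / 1e15

if __name__ == "__main__":
    print(f"{pairs:,} pairs")  # 15,312,499,912,500,000 pairs
    print(f"about {petabytes:.0f} PB to hold one score per pair")
```

At roughly 1.5 × 10^16 pairs, any method that touches every pair is impractical, which is why candidate generation is restricted to a much smaller set such as friends-of-friends.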



PYMK: Off-line Model Build





            Use generic off-line infra (Hadoop and Pig) to build recommendations off-line.
            Very complex workflow due to extraction and selection of a large number of
             features. Built Azkaban (a workflow scheduler) for Hadoop.
            Small input and final look-up structure, but large intermediate data (100s of TB)
             due to MapReduce. The problem itself (who you do not know) has an inherent blow-up.
            Special optimizations (e.g. a Bloom join to remove already-connected pairs).
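
The Bloom-join optimization can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration (the `BloomFilter` class, its sizing, and `remove_connected` are not LinkedIn's code): a compact filter built from existing connections lets most candidate pairs be classified as "definitely not connected" without consulting the full connection data, and only "maybe connected" pairs need an exact check.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def pair_key(a, b):
    """Order-independent key for an undirected edge."""
    return "|".join(sorted((a, b)))


def remove_connected(candidates, connections):
    """Drop candidate pairs that are already connected.

    In a distributed job the small filter would be shipped to every worker,
    and the exact set would be the expensive join side; here both live in memory.
    """
    bloom = BloomFilter()
    exact = set()
    for a, b in connections:
        key = pair_key(a, b)
        bloom.add(key)
        exact.add(key)

    kept = []
    for a, b in candidates:
        key = pair_key(a, b)
        if not bloom.might_contain(key):
            kept.append((a, b))      # definitely not connected: cheap path
        elif key not in exact:
            kept.append((a, b))      # filter false positive: exact check
    return kept


if __name__ == "__main__":
    connections = [("Alice", "Bob"), ("Alice", "Carol")]
    candidates = [("Bob", "Carol"), ("Bob", "Alice")]
    print(remove_connected(candidates, connections))  # [('Bob', 'Carol')]
```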



PYMK: Off-line to Near-Line Serving





               Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
               Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
               Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
               Serving: Per-partition index in memory. PYMK blobs on disk.
               Retrieval ~msec. Decoration in App FE is more expensive.
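
The "bulk load, atomic swap, fast rollback" pattern above can be sketched as follows. This is not Voldemort's API; `ReadOnlyStore` and its methods are hypothetical names, and a plain dictionary stands in for the files built on Hadoop.

```python
class ReadOnlyStore:
    """Versioned read-only store: readers always see one complete version."""

    def __init__(self, keep_versions=2):
        if keep_versions < 1:
            raise ValueError("keep_versions must be at least 1")
        self.keep_versions = keep_versions
        self._versions = []      # (version_id, data) tuples, oldest first
        self._current = None     # the version readers are served from

    def bulk_load(self, version_id, data):
        """Build the new version fully, then make it visible in one step."""
        new_version = (version_id, dict(data))
        self._versions.append(new_version)
        self._current = new_version              # the swap: one reference assignment
        del self._versions[:-self.keep_versions]  # retire old versions

    def rollback(self):
        """Discard the current version and serve the previous one again."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        self._current = self._versions[-1]

    def get(self, key, default=None):
        if self._current is None:
            return default
        return self._current[1].get(key, default)

    @property
    def version(self):
        return None if self._current is None else self._current[0]


if __name__ == "__main__":
    store = ReadOnlyStore()
    store.bulk_load(1, {"bob": ["carol"]})
    store.bulk_load(2, {"bob": ["dave"]})
    store.rollback()
    print(store.version, store.get("bob"))  # 1 ['carol']
```

Keeping the previous version around is what makes rollback fast: nothing is rebuilt, the store just points back at data it already holds.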



PYMK: Science and Feedback Loop





            Response vs Latency: Fast refresh helps user experience (e.g. showing
             connections of very recent connections). A “social” phenomenon.
            Very agile feature: lots of on-line A/B testing and tweaking of features.
            Huge impact: > 50% of accepted invites are created by PYMK.
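
One common way to support this kind of on-line A/B testing is deterministic hash bucketing, sketched below. The function name, the experiment label, and the 50/50 split are assumptions for illustration; the slides do not describe how LinkedIn assigns members to experiments.

```python
import hashlib


def ab_bucket(member_id, experiment, treatment_fraction=0.5):
    """Assign a member to 'treatment' or 'control' for one experiment.

    Hashing (experiment, member_id) gives a stable assignment per member and
    independent assignments across experiments, with no state to store.
    """
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big")          # uniform in [0, 2**64)
    threshold = int(treatment_fraction * 2**64)
    return "treatment" if point < threshold else "control"


if __name__ == "__main__":
    for member_id in (101, 102, 103):
        print(member_id, ab_bucket(member_id, "pymk-ranking-v2"))
```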



PYMK: Tying It All Together
[Diagram: User Interactions, the PYMK Application, Near-Line Serving, and the Offline Model,
 split into Near-Line and Offline tiers. Example graph: Alice, Bob, Carol, Dave, Eve.]

P(B knows C) ∝ large number of features
    Common connections
    Organizational Overlap
    Age
    Distance
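
As a rough illustration of turning such features into a probability, the sketch below uses a logistic model. The slides do not say which model PYMK uses, so the model form, the feature encodings (for example treating "Age" as an age gap), and every weight below are assumptions made up for the example.

```python
import math

# Hypothetical weights; a real model would learn these from accepted invites.
WEIGHTS = {
    "common_connections": 0.8,
    "organizational_overlap": 1.2,   # e.g. 1 if both worked at the same company
    "age_gap_years": -0.05,
    "distance_km": -0.001,
}
BIAS = -3.0


def p_knows(features):
    """Logistic score in (0, 1) for 'B knows C' from a feature dictionary."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    bob_carol = {
        "common_connections": 3,
        "organizational_overlap": 1,
        "age_gap_years": 2,
        "distance_km": 10,
    }
    print(round(p_knows(bob_carol), 3))
```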
LinkedIn + Yale

     Students
         What is my career path?
         How can I prepare?
         How do I get my first internship and first job?

         Where did my students go after they left the university?
         How is my school seeding the various industries with the best talent?
         How does my school compare with other institutions?

     Students:
         Transformation of Careers
     Yale:
         Get a data-driven view
         Uncover opportunities

     Wins based on data and insights

Thank you colleagues for the beautiful slides!




      Amy Tang, Sr. Program Manager
      Anmol Bhasin, Sr. Engineering Manager
      Daniel Tunkelang, Principal Data Scientist
      David Henke, SVP Operations
      Kapil Surlaker, Principal Engineer
      Sam Shah, Principal Engineer
      Shirshanka Das, Principal Engineer


Summary

1.      E2E: The Big-Data feedback loop of social-network product design is cool
2.      Infrastructure
        1. Data Infrastructure needs continuous innovation and iteration to keep
             pace with scale and cost.
        2. Fast-moving, Big, Clean Data + Agile Metadata = Goodness
        3. Data-driven products need agile feedback infrastructure and
             measurement methodology.
3.      Methodology
        1. Data-Driven experimentation enables insights and agile products
        2. Recommendation-driven products have big impact.
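
The point about data-driven experimentation can be made concrete with a small A/B comparison. The following is a minimal sketch, not the method used at LinkedIn: it assumes an experiment with a control (A) and a variant (B), each with a count of exposed members and a count of conversions, and applies a standard two-proportion z-test. The function name and the numbers in the usage lines are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variant B against control A.

    Returns (lift, z, p_value), where lift is the absolute difference in
    conversion rate (B minus A) and p_value is two-sided.
    All inputs are illustrative; nothing here comes from the slides.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical experiment: 1000 members per arm.
lift, z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"lift={lift:.3f} z={z:.2f} p={p_value:.4f}")
```

A small p-value suggests the difference between the arms is unlikely to be noise. In practice the sample size and the metric would be fixed before the experiment starts.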



                              Read more @ data.linkedin.com

Help us. Come Have Fun with Us!



 1.         Science and Data Mining: Recommendation and Optimization Problems
 2.         Next-generation ad-hoc and OLAP query processing on Hadoop
 3.         Graph Computations: Off-line mining and On-line integration loops
 4.         Near-real-time (nRT) Data Streams in Near-line infrastructure
 5.         And much more…




                                                  Info: data.linkedin.com

In Closing




                                                    bghosh@linkedin.com




                                                  Thank You!

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

  • 1. A Small Overview of Big Data Products, Analytics and Infrastructure at Linkedin Bhaskar Ghosh Big Data Science A Symposium in Honor of Martin Schultz Senior Director of Engineering Yale University Data Infrastructure 26 Oct 2012 LinkedIn Confidential ©2013 All Rights Reserved
  • 2. Outline 1. Martin and Me 2. Company and Mission 3. Products and Science 4. Data Infrastructure 5. P, S, DI: People You May Know 6. Linkedin + Yale 7. Conclusion LinkedIn Confidential ©2013 All Rights Reserved 2
  • 3. Martin and Me Thank you Martin! Best mentor. Versatility, big-picture thinking and leadership. Yale CS Ph.D. 1995 (Parallel Algorithms) 12y @ Informix & Oracle building parallel database systems 4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization 2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization LinkedIn Confidential ©2013 All Rights Reserved 3
  • 4. The World’s Largest Professional Network Connecting Talent  Opportunity. At scale… 175M+ 2 new 100M+ 2M+ Members Worldwide Members Per Second Monthly Unique Visitors Company Pages LinkedIn Confidential ©2013 All Rights Reserved 4
  • 5. ..and a bunch of Data-Driven Products Pandora Search for People Groups browse maps Events You May Be Interested In LinkedIn Confidential ©2013 All Rights Reserved 5
  • 6. The LinkedIn Mission. Connect the world’s professionals to make them more productive and successful
  • 7. LinkedIn Product Philosophy
      Goals
      • Provide a uniquely personalized experience to members (professionals)
      • Build an ecosystem to balance the interests of members and partners (companies)
      Approach
      • Launch Often and Early
      • Data-Driven Experiment and Test
      • Fail Fast
      • Prepare for Virality and Scale
  • 8. Two Product Families
      • For Members: People You May Know; Who’s Viewed My Profile; Jobs You May Be Interested In; News/Sharing; Today; Search; Subscriptions
      • For Partners: Hire; Market; Sell (diagram labels: Professionals, Companies)
      • Supporting layers: Science and Analytics; Data Infrastructure
      • Data: Profiles, Actions, Connections, Content
  • 9. The Big-Data Feedback Loop (diagram labels): Engagement ↑, Refinement ↑, Value ↑; Member, Product; Virality ↑, Insights ↑, Signals ↑; Data, Science; Scale ↑, Analytics ↑; Infrastructure
  • 10. Member-Facing Products: Diversity at Scale (table columns: Product Family, Products, Science, Data Infra)
      • Identity and Engagement. Products: 1. Profile and Connections, 2. Activity Streams, 3. Messages (email), 4. Endorsements & Skills. Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings).
      • Search and Analysis. Products: 1. People Search, 2. Group Search, 3. Who Viewed My Profile.
      • Recommendations. Products: 1. People You May Know, 2. Jobs You May Be Interested In, 3. Events You May Be Interested In. Science: Entity disambiguation and matching.
      • Monetization. Products: 1. Subscription Packages, 2. Sponsored Content. Science: Response Prediction, Inventory Forecasting.
  • 11. Recommendations… Are Effective… And Drive
      • > 50% of connections
      • > 50% of job applications
      • > 50% of group joins
      Guiding Principle: Find data that is useful for Members
      • Provide Relevant Content
      • Establish Social Connections
      • In Appropriate Context
  • 12. LinkedIn Recommendation Engine (layered diagram)
      • Recommendation Entities: People, Jobs, Groups, Ads, Companies, Searches, …
      • Products: Referral Center, People Browse Map, Similar Profiles, Similar Groups, Jobs You May be interested in, Jobs Browse Map, TalentMatch, Similar Jobs, News, Groups Browse Map, GYML, Events, … and more
      • A/B, API
      • Recommendation Types: Behavior Analysis, Collaborative Filtering, Popularity, User Feedback
      • Core Service (Shared, Dynamic, Unified): (R-T) Feature Extraction, Entity Resolution & Enrichment; (R-T) matching computations; Offline data munging (hadoop)
  • 13. Member-Facing Products: Diversity at Scale (table columns: Product Family, Products, Science, Data Infra)
      • Identity and Engagement. Products: 1. Profile and Connections, 2. Activity Streams, 3. Messages (email), 4. Endorsements & Skills. Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings).
      • Search and Analysis. Products: 1. People Search, 2. Group Search, 3. Who Viewed My Profile.
      • Recommendations. Products: 1. People You May Know, 2. Jobs You May Be Interested In, 3. Events You May Be Interested In. Science: Entity disambiguation and matching.
      • Monetization. Products: 1. Subscription Packages, 2. Sponsored Content. Science: Response prediction.
      • Data Infra column, top to bottom: Scale; Full text and secondary ind; Real-time; Faceted search; Near RT index freshness; Drill-down exploration; Graph analysis; Content serving; Real-time tuning
  • 14. LinkedIn Data Infrastructure: Three-Phase Abstraction (diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra; table columns: Infrastructure, Latency & Freshness Requirements, Products)
      • Online. Activity that should be reflected immediately. Products: Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
      • Near-Line. Activity that should be reflected soon. Products: Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
      • Offline. Activity that can be reflected later. Products: People You May Know, Connection Strength, News, Recommendations, Next best idea…
  • 15. LinkedIn Data Infrastructure: Sample Stack
      • Infra challenges in 3-phase ecosystem are diverse, complex and specific
      • Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
  • 16. LinkedIn Data Infrastructure: Data Stores
      • Publications: ICDE 2012 (Data Infra Overview); FAST 2012 (Voldemort for Serving)
      • Systems and Capabilities: Transactions; Rich structures (e.g. indexes); Change capture capability; Voldemort: Key value / document storage
  • 17. LinkedIn Data Infrastructure: Specialized Indexes
      • Zoie, Bobo, Sensei: Search platform
      • GraphDB: Distributed graph engine
  • 18. LinkedIn Data Infrastructure: Pipelines
      • Publications: ACM SOCC 2012: “Databus”; IEEE Data Eng. Bulletin 2012: “Kafka”
      • Capabilities: Messaging for site events, monitoring; High throughput; Change data capture stream; Reliable, consistent, low latency pipe
  • 19. LinkedIn Data Infrastructure: Off-line Analysis
      • Capabilities: ML, Ranking, Relevance; Insights and Analytics; ETL, Metadata and Pipes; Business Source of Truth
  • 20. LinkedIn Data Infrastructure: Cluster Management
      • Publication: ACM SOCC 2012: Untangling Cluster Management with Helix
      • Capabilities: Generic framework for building distributed systems; Cluster Management Primitives
  • 21. HELIX: Generalizing Cluster Management
      • State machine: states O, S, M with transitions t1, t2, t3, t4
      • Constraints: COUNT=1 (M), COUNT=2 (S), t1 ≤ 5
      • Objective: $\text{minimize}(\max_{n_j \in N} S(n_j))$ and $\text{minimize}(\max_{n_j \in N} M(n_j))$
      • Declare distributed system behavior via {S, C, O}
      • Enforce Partition constraints
      • Fault detection and tolerance (e.g. promote S to M)
      • Elasticity (e.g. Re-balance; Minimize migrations)
      • Used in Espresso, Search, Databus
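To make the constraint and objective on this slide concrete, here is a minimal sketch in plain Python. It is not the Helix API: `assign_replicas` and all of its parameter names are hypothetical, and the greedy strategy is an assumption chosen for brevity. It gives every partition one M replica and two S replicas on distinct nodes, and places each replica on the currently least-loaded node so that the maximum M and S counts per node stay small.

```python
def assign_replicas(partitions, nodes, master_count=1, slave_count=2):
    """Greedy placement sketch (hypothetical helper, not Helix itself).

    Constraint: each partition gets `master_count` M replicas and
    `slave_count` S replicas, all on distinct nodes.
    Objective (approximated greedily): keep max M-per-node and
    max S-per-node as low as possible.
    """
    if len(nodes) < master_count + slave_count:
        raise ValueError("not enough nodes to place all replicas on distinct nodes")

    masters = {node: 0 for node in nodes}  # M replicas hosted per node
    slaves = {node: 0 for node in nodes}   # S replicas hosted per node
    assignment = {}

    for partition in partitions:
        # Pick the node(s) with the fewest masters; break ties by slave load, then name.
        by_master_load = sorted(nodes, key=lambda n: (masters[n], slaves[n], n))
        chosen_masters = by_master_load[:master_count]

        # Slaves go on the remaining nodes with the fewest slaves.
        remaining = [n for n in nodes if n not in chosen_masters]
        by_slave_load = sorted(remaining, key=lambda n: (slaves[n], masters[n], n))
        chosen_slaves = by_slave_load[:slave_count]

        for node in chosen_masters:
            masters[node] += 1
        for node in chosen_slaves:
            slaves[node] += 1
        assignment[partition] = {"M": chosen_masters, "S": chosen_slaves}

    return assignment, masters, slaves


if __name__ == "__main__":
    plan, m_load, s_load = assign_replicas(range(6), ["n1", "n2", "n3"])
    print(plan)
    print("masters per node:", m_load)
    print("slaves per node:", s_load)
```

A real controller would also react to node failures (promoting an S replica to M) and limit concurrent transitions, which this sketch leaves out.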
  • 22. LinkedIn Data Infrastructure: A few take-aways
      1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.
      2. Balance open-source products with home-grown platforms (**)
      3. Operability, Capacity Planning and On-line Multi-tenancy are hard
      4. Data Movement: Pipes and Feedback Loops are critical (**)
      5. Data Model and Integration e2e are key (*)
      6. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)
      7. Off-line Multi-Platform story is evolving.
  • 23. Science and Infrastructure: Giving Back
      • Research Publications: ACM SOCC 2012; ACM RecSys 2012; SIGIR 2012; CIKM 2012; VLDB 2012; ICDE 2012; FAST 2012; NetDB 2011; …
      • Open Source Projects: Apache Helix (new); ParSeq (new); DataFu (new); Apache Kafka; Sensei; Azkaban; Voldemort
  • 24. A Recommendation Product: People You May Know (PYMK)
  • 25. Probability that you may know someone else? (Diagram: Alice, Bob, Carol, with “??” marking the edge in question.) Known as “triangle closing”
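A minimal sketch of triangle closing, assuming the connection graph fits in memory as a list of undirected edges. The function name `triangle_closing_candidates` is hypothetical and this is only an illustration, not LinkedIn's implementation: for every member, each pair of that member's connections who are not yet connected to each other becomes a candidate, scored by how many common connections they share.

```python
from collections import defaultdict
from itertools import combinations


def triangle_closing_candidates(edges):
    """Return {(u, v): common_connection_count} for unconnected pairs.

    `edges` is an iterable of undirected (a, b) connections.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    common = defaultdict(int)
    for member, friends in neighbors.items():
        # Every pair of this member's connections forms an open or closed triangle.
        for u, v in combinations(sorted(friends), 2):
            if v not in neighbors[u]:  # skip pairs that are already connected
                common[(u, v)] += 1
    return dict(common)


if __name__ == "__main__":
    graph = [("Alice", "Bob"), ("Alice", "Carol"), ("Dave", "Bob"), ("Dave", "Carol")]
    print(triangle_closing_candidates(graph))
```

Enumerating all pairs of a member's connections is quadratic in that member's degree, which is one reason the intermediate data for this kind of computation grows so quickly.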
  • 26. PYMK: Science, Members and Connections
      1) Feature selection is key: Common Connections; Geo; Company; Age
      2) ML and data model: Traditional ML (e.g. matrix factorization) on O(n^2) of 175M tends not to scale easily
      3) Interplay: Data Model + ML + Parallel Computation model
      4) Adding edges: Why do it?
         • Creates positive-feedback social loops for members
         • More useful content and activity available to members
         • Denser graph improves signal strength in science-driven products
      The Feedback Loop (diagram labels): Value ↑; Member, Product; Virality ↑, Insights ↑, Signals ↑; Data, Science
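As a rough worked figure for the O(n^2) remark on this slide, assuming n = 175M members (the count quoted on slide 4) and counting unordered member pairs:

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2}
\approx \frac{(1.75 \times 10^{8})^{2}}{2}
\approx 1.53 \times 10^{16} \ \text{candidate pairs}
\]
```

Even at a single byte per pair that would be on the order of 15 PB, so any approach that materializes all pairs is impractical; this is an illustrative estimate, not a figure from the talk.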
  • 27. PYMK: Off-line Model Build
      • Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line.
      • Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
      • Small input and final look-up structure but large intermediate data (100s of TB) due to MR. The problem (who you do not know) itself has an inherent blow-up.
      • Special optimizations (e.g. Bloom Join to remove connected)
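The “Bloom Join to remove connected” optimization can be sketched as follows. This is a single-process illustration under stated assumptions, not the Hadoop or Pig job from the talk: the `BloomFilter` class, `pair_key` and `remove_connected` are hypothetical names, and the filter sizes are arbitrary. The idea is to summarize the existing-connections set in a compact Bloom filter and drop any candidate pair that the filter reports as present. A Bloom filter has no false negatives, so every truly connected pair is removed; it can have false positives, so a few valid candidates may be dropped too, which is usually tolerable for recommendations.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def pair_key(a, b):
    """Order-independent key for an undirected pair."""
    lo, hi = sorted((a, b))
    return f"{lo}|{hi}"


def remove_connected(candidates, connections, bloom=None):
    """Drop candidate pairs that (probably) already exist as connections."""
    if bloom is None:
        bloom = BloomFilter()
    for a, b in connections:
        bloom.add(pair_key(a, b))
    return [(a, b) for a, b in candidates if pair_key(a, b) not in bloom]


if __name__ == "__main__":
    existing = [("alice", "bob")]
    proposed = [("bob", "alice"), ("bob", "carol")]
    print(remove_connected(proposed, existing))
```

In a distributed join the benefit is that the small filter can be shipped to every task, so the large candidate set is pruned before it is shuffled.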
  • 28. PYMK: Off-line to Near-Line Serving
      • Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
      • Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
      • Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
      • Serving: Per-partition index in memory. PYMK blobs on disk.
      • Retrieval ~msec. Decoration in App FE is more expensive.
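The bulk load, atomic swap and fast rollback pattern on this slide can be illustrated with a small in-memory sketch. This is not the Voldemort API: `ReadOnlyStore` and its methods are hypothetical, and partitioning by CRC32 is an assumption made for the example. A new data version is staged completely off to the side, then published by replacing a single reference, while the previous version is retained so a bad push can be reverted immediately.

```python
import zlib


class ReadOnlyStore:
    """Partitioned, versioned key-value store sketch (hypothetical, in-memory)."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self._current = self._empty_version()
        self._previous = None

    def _empty_version(self):
        return [dict() for _ in range(self.num_partitions)]

    def _partition(self, key):
        # Stable hash so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def bulk_load(self, records):
        """Stage a full new version, then swap it in with one reference change."""
        staged = self._empty_version()
        for key, value in records.items():
            staged[self._partition(key)][key] = value
        # Readers only dereference self._current, so they see either the old
        # version or the new one, never a half-loaded mix.
        self._previous, self._current = self._current, staged

    def rollback(self):
        """Revert to the previously served version."""
        if self._previous is None:
            raise RuntimeError("no previous version to roll back to")
        self._current, self._previous = self._previous, self._current

    def get(self, key):
        return self._current[self._partition(key)].get(key)


if __name__ == "__main__":
    store = ReadOnlyStore()
    store.bulk_load({"member:1": ["member:7", "member:9"]})
    store.bulk_load({"member:1": ["member:12"]})
    print(store.get("member:1"))
    store.rollback()
    print(store.get("member:1"))
```

Keeping the previous version around costs extra storage but makes rollback a pointer swap rather than a reload.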
  • 29. PYMK: Science and Feedback Loop
      • Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon.
      • Very agile feature: Lots of on-line A/B testing and tweaking of features
      • Huge Impact: > 50% of accepted invites are created by PYMK
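The slides do not describe how A/B assignment was implemented. As one common approach, offered here only as an assumption, members can be bucketed deterministically by hashing the member id together with an experiment name, so that a member always sees the same variant and different experiments split independently. The function `ab_bucket` and the experiment name below are hypothetical.

```python
import hashlib


def ab_bucket(member_id, experiment, variants=("control", "treatment")):
    """Deterministically map a member to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


if __name__ == "__main__":
    for member in (101, 102, 103):
        print(member, ab_bucket(member, "pymk-ranking-v2"))
```

Because the assignment is a pure function of its inputs, no per-member state has to be stored to keep the experience consistent across visits.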
  • 30. PYMK: Tying It All Together (diagram)
      • Pipeline labels: PYMK Application; User Interactions; Near-Line Serving; Near-Line; Offline; Offline Model
      • P(B knows C) ∝ large number of features: Common connections; Organizational Overlap; Age Distance
      • Example graph: Alice, Bob, Carol, Dave, Eve
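The slide only says that P(B knows C) depends on a large number of features; it does not name the model. As an illustration, the sketch below swaps in a plain logistic model, which is an assumption, and every feature name, weight and bias in it is made up for the example.

```python
import math


def pymk_score(features, weights, bias=0.0):
    """Logistic score in (0, 1) from a dict of feature values.

    Features without a weight contribute nothing to the score.
    """
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical weights; a real model would learn these from accepted invites.
    weights = {"common_connections": 0.4, "organizational_overlap": 1.2, "age_distance": -0.1}
    bob_carol = {"common_connections": 5, "organizational_overlap": 1, "age_distance": 3}
    print(round(pymk_score(bob_carol, weights, bias=-3.0), 3))
```

Scoring each candidate pair this way yields a ranking, and the top-ranked pairs per member are what would be loaded into the serving store.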
  • 31. LinkedIn + Yale
      • Students: What is my career path? How can I prepare? How do I get my first internship and first job?
      • School: Where did my students go after they left the university? How is my school seeding the various industries with the best talent? How does my school compare with other institutions?
      • Wins. Students: Transformation of Careers. Yale: Get a data-driven view; Uncover opportunities based on data and insights.
  • 32. Thank you colleagues for the beautiful slides! Amy Tang (Sr. Program Manager); Anmol Bhasin (Sr. Engineering Manager); Daniel Tunkelang (Principal Data Scientist); David Henke (SVP Operations); Kapil Surlaker (Principal Engineer); Sam Shah (Principal Engineer); Shirshanka Das (Principal Engineer)
  • 33. Summary
      1. E2E: The Big-Data feedback loop of social-network product design is cool
      2. Infrastructure
         1. Data Infrastructure needs continuous innovation and iteration to keep pace for scale and cost.
         2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
         3. Data-driven products need agile feedback infrastructure and measurement methodology.
      3. Methodology
         1. Data-Driven experimentation enables insights and agile products
         2. Recommendation-driven products have big impact.
      Read more @ data.linkedin.com
  • 34. Help us. Come Have Fun with Us!
      1. Science and Data Mining: Recommendation and Optimization Problems
      2. Next-generation ad-hoc and OLAP query processing on Hadoop
      3. Graph Computations: Off-line mining and On-line integration loops
      4. nRT Data Streams in Near-line infrastructure
      5. And much more…
      Info: data.linkedin.com
  • 35. In Closing: bghosh@linkedin.com. Thank You!