Building Data Products using Hadoop at LinkedIn
Mitul Tiwari
Search, Network, and Analytics (SNA), LinkedIn

Who am I? [slide image]

What do I mean by Data Products? [slide image]

People You May Know [slide image]

Profile Stats: WVMP (Who Viewed My Profile) [slide image]

Viewers of this profile also ... [slide image]

Skills [slide image]

InMaps [slide image]

Data Products: Key Ideas

Recommendations
 People You May Know, Viewers of this profile ...

Analytics and Insight
 Profile Stats: Who Viewed My Profile, Skills

Visualization
 InMaps

Data Products: Challenges

 LinkedIn: 2nd largest social network

 120 million members on LinkedIn

 Billions of connections

 Billions of pageviews

 Terabytes of data to process

Outline
What do I mean by Data Products?

Systems and Tools we use

Let’s build “People You May Know”

Managing workflow

Serving data in production

Data Quality

Performance

Systems and Tools

Kafka (LinkedIn)
 publish-subscribe messaging system
 transfer data from production to HDFS

Hadoop (Apache)
 Java MapReduce and Pig
 process data

Azkaban (LinkedIn)
 Hadoop workflow management tool
 to manage hundreds of Hadoop jobs

Voldemort (LinkedIn)
 Key-value store
 store output of Hadoop jobs and serve in production

People You May Know
 How do people know each other?

 [diagram: Alice is connected to Bob and to Carol]

 Triangle closing
 Prob(Bob knows Carol) ~ the # of common connections

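The triangle-closing score can be sketched in a few lines of Python. This is only an in-memory illustration of "the # of common connections", not the production job; the helper names build_graph and common_connections are made up for this example.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (member, member) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def common_connections(graph, a, b):
    """Count the members connected to both a and b."""
    return len(graph[a] & graph[b])


# Alice knows Bob and Carol; Bob and Carol are not connected yet.
edges = [("Alice", "Bob"), ("Alice", "Carol")]
graph = build_graph(edges)
score = common_connections(graph, "Bob", "Carol")
```

Here Bob and Carol share one connection (Alice), so Carol becomes a candidate recommendation for Bob with score 1.
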
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage()
              AS (source_id, dest_id);
group_conn = GROUP connections BY source_id;
-- generatePair is a user-defined function (UDF); it is assumed here to
-- return a bag of connection pairs, which FLATTEN turns into rows
pairs = FOREACH group_conn GENERATE
        FLATTEN(generatePair(connections.dest_id)) AS (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              FLATTEN(group) AS (source_id, dest_id),
              COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();

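The slides do not show the body of the generatePair UDF. A minimal Python sketch of what it plausibly does, assuming it emits every ordered pair of distinct members from one member's connection list (the function name generate_pairs is hypothetical):

```python
from itertools import permutations


def generate_pairs(dest_ids):
    """Return every ordered pair of distinct members in one connection list."""
    return list(permutations(dest_ids, 2))


# Alice's connections are B and C, so both orderings of the pair are emitted.
pairs = generate_pairs(["B", "C"])
```

A member with d connections yields d * (d - 1) ordered pairs, so the output grows quadratically with the size of the connection list.
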
Pig Overview
Load: load data, specify format

Store: store data, specify format

Foreach, Generate: Projections, similar to select

Group by: group by column(s)

Join, Filter, Limit, Order, ...

User Defined Functions (UDFs)
Triangle Closing Example
 Alice is connected to Bob and to Carol; Bob and Carol are not connected.

 1. connections = LOAD 'connections' USING PigStorage();
    (A,B),(B,A),(A,C),(C,A)
 2. group_conn = GROUP connections BY source_id;
    (A,{B,C}),(B,{A}),(C,{A})
 3. pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
    (A,{B,C}),(A,{C,B})
 4. common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
    (B,C,1), (C,B,1)

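The four steps of the example can be traced in plain Python. This is a sketch that mirrors the Pig stages in memory, assuming the pair-generation step emits ordered pairs as the example output suggests; the function name triangle_closing is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def triangle_closing(connections):
    """Count common connections for every ordered pair of members."""
    # Step 2: group destination ids by source id.
    grouped = defaultdict(list)
    for source_id, dest_id in connections:
        grouped[source_id].append(dest_id)
    # Step 3: emit every ordered pair from each member's connection list.
    pairs = [
        pair
        for dest_ids in grouped.values()
        for pair in permutations(dest_ids, 2)
    ]
    # Step 4: group identical pairs and count them.
    return Counter(pairs)


# Step 1: connections stored in both directions.
connections = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
common_conn = triangle_closing(connections)
```

For this input the result holds the two entries from step 4: (B, C) with count 1 and (C, B) with count 1.
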
Our Workflow

 triangle-closing -> top-n -> push-to-prod

Our Workflow

 triangle-closing -> remove-connections -> top-n -> push-to-qa, push-to-prod

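The slides name the remove-connections and top-n steps without defining them. The sketch below assumes that remove-connections drops pairs who are already connected and that top-n keeps the n highest-scoring candidates per member; both function bodies are illustrative guesses, not the production jobs.

```python
import heapq
from collections import defaultdict


def remove_connections(common_conn, connections):
    """Drop candidate pairs that are already connected."""
    existing = set(connections)
    return {
        pair: count
        for pair, count in common_conn.items()
        if pair not in existing
    }


def top_n(common_conn, n):
    """Keep the n candidates with the most common connections per member."""
    by_source = defaultdict(list)
    for (source_id, dest_id), count in common_conn.items():
        by_source[source_id].append((count, dest_id))
    return {
        source_id: [dest for _, dest in heapq.nlargest(n, candidates)]
        for source_id, candidates in by_source.items()
    }


# Hypothetical scores: B already knows A, so that pair must not be recommended.
scores = {("B", "C"): 3, ("B", "D"): 5, ("B", "A"): 9, ("C", "B"): 3}
connections = [("A", "B"), ("B", "A")]
candidates = remove_connections(scores, connections)
recommendations = top_n(candidates, 1)
```

With n set to 1, member B is left with D as the single recommendation, and member C with B.
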
PYMK Workflow [slide image]

Workflow Requirements
Dependency management
Regular Scheduling
Monitoring
Diverse jobs: Java, Pig, Clojure
Configuration/Parameters
Resource control/locking
Restart/Stop/Retry
Visualization
History
Logs

Tool used: Azkaban

Sample Azkaban Job Spec
type=pig

pig.script=top-n.pig

dependencies=remove-connections

top.n.size=100
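
The dependencies line is what lets a scheduler order the jobs. The following Python sketch shows the idea with the standard-library graphlib module (Python 3.9+); it is not Azkaban's implementation, and the job texts other than the top-n properties shown above are hypothetical.

```python
from graphlib import TopologicalSorter


def parse_job(text):
    """Parse key=value lines from a job spec into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def run_order(jobs):
    """Return job names ordered so that dependencies run first."""
    graph = {}
    for name, text in jobs.items():
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical specs for the four-step workflow; only top-n is from the slide.
JOBS = {
    "push-to-prod": "dependencies=top-n",
    "top-n": "type=pig\npig.script=top-n.pig\n"
             "dependencies=remove-connections\ntop.n.size=100",
    "remove-connections": "dependencies=triangle-closing",
    "triangle-closing": "type=pig",
}
order = run_order(JOBS)
```

The resulting order starts with triangle-closing and ends with push-to-prod, regardless of the order in which the specs are listed.
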




Azkaban Workflow [three slide images]

Production Storage

Requirements
 Large amount of data/Scalable

 Quick lookup/low latency

 Versioning and Rollback

 Fault tolerance

 Offline index building

Voldemort Storage

Large amount of data/Scalable

Quick lookup/low latency

Versioning and Rollback

Fault tolerance through replication

Read only

Offline index building
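
The versioning and rollback requirement can be illustrated with a toy in-memory store. This sketch is not Voldemort's API; the class ReadOnlyStore and its methods are hypothetical and only show the pattern of building a full snapshot offline, swapping it in, and rolling back if it turns out to be bad.

```python
class ReadOnlyStore:
    """Serve the newest snapshot and keep older ones for rollback."""

    def __init__(self, keep_versions=2):
        if keep_versions < 1:
            raise ValueError("keep_versions must be at least 1")
        self._keep = keep_versions
        self._versions = []  # oldest first, newest last

    def swap(self, snapshot):
        """Start serving a freshly built snapshot; drop versions beyond the limit."""
        self._versions.append(dict(snapshot))
        del self._versions[:-self._keep]

    def rollback(self):
        """Discard the newest snapshot and serve the previous one again."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()

    def get(self, key, default=None):
        """Look up a key in the snapshot currently being served."""
        if not self._versions:
            return default
        return self._versions[-1].get(key, default)


store = ReadOnlyStore()
store.swap({"bob": ["carol"]})   # first build pushed by the workflow
store.swap({"bob": ["dave"]})    # second build replaces it
latest = store.get("bob")
store.rollback()                 # second build was bad, go back
restored = store.get("bob")
```

Because readers only ever see a complete snapshot, a bad build can be undone by a single rollback call without rebuilding anything.
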

Data Cycle [slide image]

Voldemort RO Store [slide image]

Data Quality

Verification

QA store with viewer

Explain

Versioning/Rollback

Unit tests
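
One possible form of the verification step is a set of invariant checks on the job output before it is pushed. The checks below are assumptions chosen for illustration (the slides do not list them), and the function name verify is hypothetical.

```python
def verify(common_conn, connections):
    """Return a list of problems found in the triangle-closing output."""
    existing = set(connections)
    problems = []
    for (a, b), count in common_conn.items():
        if a == b:
            problems.append(f"self pair: {a}")
        if (a, b) in existing:
            problems.append(f"already connected: {a}, {b}")
        if common_conn.get((b, a)) != count:
            problems.append(f"asymmetric count: {a}, {b}")
    return problems


connections = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
good = {("B", "C"): 1, ("C", "B"): 1}
bad = {("B", "C"): 1, ("A", "B"): 2}
good_problems = verify(good, connections)
bad_problems = verify(bad, connections)
```

The same function can back a unit test on a tiny graph and a sanity check on the full output before it reaches the QA store.
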

Performance

Symmetry
 If Bob knows Carol, then Carol knows Bob

Limit
 Ignore members with > k connections

Sampling
 Sample k connections

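A Python sketch of how the three ideas could be combined in the pair-counting step. It is an in-memory illustration under stated assumptions: each unordered pair is counted once (symmetry), and a member with more than k connections is either skipped (limit) or reduced to k randomly chosen connections (sampling). The function name count_pairs and the over_limit parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def count_pairs(connections, k, over_limit="skip", seed=0):
    """Count common connections once per unordered pair of members."""
    rng = random.Random(seed)
    grouped = defaultdict(set)
    for source_id, dest_id in connections:
        grouped[source_id].add(dest_id)
    counts = Counter()
    for dest_ids in grouped.values():
        dest_ids = sorted(dest_ids)
        if len(dest_ids) > k:
            if over_limit == "skip":
                continue  # Limit: ignore members with more than k connections
            dest_ids = sorted(rng.sample(dest_ids, k))  # Sampling
        # Symmetry: emit (id1, id2) with id1 < id2 only, never both orderings.
        for pair in combinations(dest_ids, 2):
            counts[pair] += 1
    return counts


edges = [("A", "B"), ("A", "C"), ("H", "B"), ("H", "C"), ("H", "D")]
connections = edges + [(d, s) for s, d in edges]
skipped = count_pairs(connections, k=2, over_limit="skip")
sampled = count_pairs(connections, k=2, over_limit="sample")
```

Member H has three connections, so with k set to 2 it contributes no pairs in skip mode and exactly one pair in sample mode.
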
Things Covered
What do I mean by Data Products?

Systems and Tools we use

Let’s build “People You May Know”

Managing workflow

Serving data in production

Data Quality

Performance

SNA Team


Thanks to SNA Team at LinkedIn

http://sna-projects.com

We are hiring!



Questions?




