Data Analysis at Facebook


                  Jeff Hammerbacher, Ding Zhou*
                  Facebook Inc.
Outline
• How does Facebook work
• Managing Big Data
• Data Analysis for Business Intelligence
• Data Analysis for “Artificial Intelligence”
• Questions
How does Facebook work?
Profile page - content generation portal
Newsfeed page - content consumption portal
Friends page - social graph portal
App page - social app platform
Facebook Data
▪   Social Graph Data
    ▪   The Nodes:
        ▪   100m+ users; 100+ dimensions per user (numerical, text, categorical);
        ▪   350k registrations daily;
    ▪   The Edges:
        ▪   200+ friends per user (median);
        ▪   20 categories of edges (fb friends, co-workers, family, etc.);

▪   Social Behavior Data
    ▪   Social Interactions: interactions among users, via 100+ interaction types;
    ▪   Social Actions: between users and 33k+ facebook apps, via 200+ action types;

▪   Social Content Data
    ▪   Content of Posts, Notes, Photos, Video, etc
Managing Big Data
▪   Data scale [backend]:
    ▪   Over 1.3 PB raw capacity in largest cluster;
    ▪   Nearly 2 TB uncompressed data per day;
    ▪   Over 20 TB read/write per day;
▪   Distributed Data management:
    ▪   HDFS/Hadoop (MapReduce in Java);
    ▪   MetaStore (MetaData management);
    ▪   Hive QL (Query language on Hadoop+MetaStore);
    ▪   Usage:
        ▪   at least 50 engineers have run Hadoop jobs
        ▪   3,514 jobs weekly
        ▪   821 projections, 152 joins, 800 aggregates, 600 loaders weekly
Hadoop - MapReduce in Java

[Figure: MapReduce Execution Flow [Dean, J and Ghemawat, S, 2004]. Word-count example: the input "facebook data team uses hadoop for data analysis" is split into three chunks ("facebook data team", "uses hadoop for", "data analysis"); mappers emit word:1 pairs, and the reducers output analysis:1, data:2, facebook:1, for:1, hadoop:1, team:1, uses:1.]
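The word-count flow in the figure can be sketched in a few lines. This is a minimal in-memory illustration in Python, not Hadoop code and not the Java implementation the slide refers to; the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical and chosen only to mirror the three stages.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# The three input splits shown on the slide.
splits = ["facebook data team", "uses hadoop for", "data analysis"]
counts = word_count(splits)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.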
Data Analysis for Business Intelligence
Data for Business Intelligence
▪   General Goal:
    ▪   support growth and monetization strategies, and product decisions
▪   User Behavior Studies
    ▪   NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive of engagement;
    ▪   Identity*: Unsupervised learning over user session data to identify common usage patterns. Techniques employed include K-Means, PageRank, and dimension reduction methods;
▪   Experimentation Platform
    ▪   Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
    ▪   Columbus++*: A/B testing for impact of site changes on site health metrics;
▪   Reporting System
    ▪   ad-hoc analysis done by Hive queries
* - projects that Ding Zhou participates in (underlined on the original slide);
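The slide does not say which statistical test the A/B platform uses. As one common option, a two-proportion z-test on a binary site-health metric could look like the sketch below. The function name and the idea of comparing conversion-style rates are assumptions for illustration, not a description of Columbus++.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Group A is the control, group B sees the site change.
    Returns (z statistic, two-sided p-value).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: users who returned the next day in each group.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

The drill-down by country, age, or gender mentioned for Columbus would amount to running such a comparison per user group, which raises multiple-comparison concerns that this sketch does not address.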
Columbus
[Figure: Columbus screenshots. Geographical bird's-eye view of growth by country; comparison between user groups.]
Data Analysis for “Artificial Intelligence”
                       -- predicting user social behavior
who the user will interact with

• predict interactions between friends

• features are user profile and browsing history

• tried linear models and tree models

• applied for search, newsfeed, etc
who the user hasn’t found yet

• missing edge prediction problem

• observations are friend/non-friend pairs

• features include profile and local graph info

• profile info more informative

• graph info supplemental if profile incomplete
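To make "profile and local graph info" concrete, the sketch below builds a feature dictionary for one candidate friend/non-friend pair. The specific features (common neighbors, Jaccard overlap, count of matching profile fields) are standard link-prediction choices picked for illustration; the slide does not list the features actually used, and all names here are hypothetical.

```python
def pair_features(u, v, friends, profiles):
    """Build features for a candidate pair (u, v).

    friends:  dict mapping user -> set of friend ids (local graph info)
    profiles: dict mapping user -> dict of profile fields (may be missing)
    """
    neighbors_u = friends.get(u, set())
    neighbors_v = friends.get(v, set())
    common = neighbors_u & neighbors_v
    union = neighbors_u | neighbors_v
    profile_u = profiles.get(u, {})
    profile_v = profiles.get(v, {})
    # Profile fields that both users filled in with the same value.
    matches = [k for k in profile_u if k in profile_v and profile_u[k] == profile_v[k]]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "profile_matches": len(matches),
        # Flag incomplete profiles so a model can lean on graph features instead.
        "profile_complete": bool(profile_u) and bool(profile_v),
    }

# Toy graph and profiles, invented for the example.
friends = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
profiles = {"a": {"school": "X", "city": "Y"}, "d": {"school": "X"}}
feats = pair_features("a", "d", friends, profiles)
```

A row like `feats` would be one observation; the label is whether the pair is actually connected.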
what applications the user may like*

• 33k apps, only 0.1% of them used;

• a different recommendation problem;

• prediction model not applicable, user preference unavailable;

• build a prediction model to infer “user ratings”;

• user-based + item-based recommendation

• how to combine profile, social graph, ratings?



                  * projects that Ding Zhou participates in;
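The item-based half of "user-based + item-based recommendation" can be sketched as follows, assuming the inferred "user ratings" are already available as numbers. The cosine similarity measure, the data layout, and the function names are assumptions for illustration; the slide does not specify them.

```python
import math
from collections import defaultdict

def item_vectors(ratings):
    """Transpose user -> {app: rating} into app -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, apps in ratings.items():
        for app, rating in apps.items():
            vectors[app][user] = rating
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a if u in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def recommend(user, ratings, top_n=3):
    """Rank apps the user has not used by similarity to the apps they have used."""
    vectors = item_vectors(ratings)
    seen = ratings.get(user, {})
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[app]) * rating
            for app, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical inferred ratings (1.0 = the model believes the user likes the app).
ratings = {
    "u1": {"quiz": 1.0, "chess": 1.0},
    "u2": {"quiz": 1.0, "chess": 1.0, "poker": 1.0},
    "u3": {"poker": 1.0, "farm": 1.0},
    "u4": {"quiz": 1.0},
}
suggestions = recommend("u4", ratings)
```

How to fold profile and social graph signals into these scores is the open question the slide raises; this sketch uses ratings only.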
what content is interesting*
• newsfeed as the main content distribution channel

• stories generated by 100s of social actions: on the site, platform, or the Web

• <0.1% of possible stories are shown

• predictions built on story features and user browsing history




                    * projects that Ding Zhou participates in;
Challenges in Data
- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. small app has no user data)
- prediction requirements
 ▪   models regularly applied for 10 billion novel samples
 ▪   models used on-the-fly for 100k samples in 50 ms
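The two prediction requirements imply a per-sample time budget that is easy to work out. The arithmetic below uses only the figures on the slide, plus one loudly labeled assumption: that batch scoring runs at the same per-sample rate on a single worker, which is purely illustrative.

```python
# Online requirement from the slide: 100k samples scored within 50 ms.
samples_online = 100_000
budget_seconds = 0.050

# Time available per sample, in microseconds.
per_sample_us = budget_seconds / samples_online * 1e6

# Equivalent throughput in samples per second.
throughput = samples_online / budget_seconds

# Batch requirement from the slide: 10 billion novel samples.
# ASSUMPTION: one worker scoring at the online rate; not stated in the source.
batch_samples = 10_000_000_000
batch_hours = batch_samples / throughput / 3600
```

Under that assumption the budget is about half a microsecond per sample, and a single worker would need well over an hour for the batch workload.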
Special Machine Learning Problems
- use machine learning to predict user behavior
 ▪   labels: insufficient; inferred implicitly; imbalanced;
 ▪   features: high-dimensional; strongly correlated; noisy;


- scale requires distributed algorithms
 ▪   in-house implementation of tree ensemble methods (bagging predictors)
 ▪   larger training sets grant performance improvements


- speed and accuracy improvements underway
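For readers unfamiliar with bagging predictors, here is a minimal single-machine sketch: each model is trained on a bootstrap resample and predictions are combined by majority vote. It uses one-level decision stumps on toy data to stay short. The in-house distributed tree ensemble mentioned on the slide is not described in the source, so nothing here should be read as its design.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Find the one-feature threshold split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[feature] <= threshold else 1 - left_label) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, left_label)
    return best[1:]

def predict_stump(stump, row):
    """Predict left_label at or below the threshold, the other class above it."""
    feature, threshold, left_label = stump
    return left_label if row[feature] <= threshold else 1 - left_label

def fit_bagging(rows, labels, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]

# Toy binary data, invented for the example: one feature, classes split near 0.5.
rows = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagging(rows, labels)
```

Because each model only needs its own resample, the training loop is straightforward to spread across machines, which is one reason bagging suits the scale the slide describes.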
tip of the iceberg

    Questions?
(c) 2004-2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

joint statistical meeting 2008
