Data Analysis at Facebook


                  Jeff Hammerbacher, Ding Zhou*
                  Facebook Inc.
Outline
• How does Facebook work?
• Managing Big Data
• Data Analysis for Business Intelligence
• Data Analysis for “Artificial Intelligence”
• Questions
How does Facebook work?
Profile page - content generation portal
Newsfeed page - content consumption portal
Friends page - social graph portal
App page - social app platform
Facebook Data
▪   Social Graph Data
    ▪   The Nodes:
        ▪   100M+ users; 100+ dimensions per user (numerical, text, categorical);
        ▪   350k registrations daily;
    ▪   The Edges:
        ▪   200+ friends per user (median);
        ▪   20 categories of edges (fb friends, co-workers, family, etc.);

▪   Social Behavior Data
    ▪   Social Interactions: interactions among users, via 100+ interaction types;
    ▪   Social Actions: between users and 33k+ Facebook apps, via 200+ action types;

▪   Social Content Data
    ▪   Content of Posts, Notes, Photos, Video, etc.
Managing Big Data
▪   Data scale [backend]:
    ▪   Over 1.3 PB raw capacity in largest cluster;
    ▪   Nearly 2 TB uncompressed data per day;
    ▪   Over 20 TB read/write per day;
▪   Distributed Data management:
    ▪   HDFS/Hadoop (MapReduce in Java);
    ▪   MetaStore (MetaData management);
    ▪   Hive QL (Query language on Hadoop+MetaStore);
    ▪   Usage:
        ▪   at least 50 engineers have run Hadoop jobs
        ▪   3,514 jobs weekly
        ▪   821 projections, 152 joins, 800 aggregates, 600 loaders weekly
Hadoop - MapReduce in Java

Word-count example:

    Input splits:   "facebook data team" | "uses hadoop for" | "data analysis"
    Map output:     facebook:1, data:1, team:1 | uses:1, hadoop:1, for:1 | data:1, analysis:1
    Shuffle/sort:   analysis:1, data:1, data:1, facebook:1 | for:1, hadoop:1, team:1, uses:1
    Reduce output:  analysis:1, data:2, facebook:1, for:1, hadoop:1, team:1, uses:1

Figure: MapReduce Execution Flow [Dean, J and Ghemawat, S, 2004]
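
The word-count flow on this slide can be sketched in a few lines of Python. This is a single-process illustration only: the slide names Java on Hadoop, and the function names here (map_phase, shuffle, reduce_phase, word_count) are made up for the sketch.

```python
from collections import defaultdict

# The three input splits shown on the slide.
SPLITS = ["facebook data team", "uses hadoop for", "data analysis"]


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    """Group the emitted counts by word, as the framework does between phases."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word, in sorted key order."""
    return {word: sum(counts) for word, counts in sorted(groups.items())}


def word_count(splits):
    """Run map, shuffle and reduce over all splits in one process."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    for word, total in word_count(SPLITS).items():
        print(f"{word}:{total}")
```

Run on the three splits, the sketch reproduces the reduce output of the slide, with "data" as the only word counted twice.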
Data Analysis for Business Intelligence
Data for Business Intelligence
▪   General Goal:
    ▪   support growth and monetization strategies, and product decisions
▪   User Behavior Studies
    ▪   NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive
        of engagement;
    ▪   Identity*: Unsupervised learning over user session data to identify common usage patterns.
        Techniques employed include K-Means, PageRank, dimension reduction methods;
▪   Experimentation Platform
    ▪   Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
    ▪   Columbus++*: A/B testing for impact of site change on site health metrics;

▪   Reporting System
    ▪   ad-hoc analysis done by Hive queries
                                                              * marks projects that Ding Zhou participates in;
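
One common way to read an A/B test on a binary site health metric is a two-proportion z-test. The slide does not say which test Columbus++ applies, so the sketch below is an assumption, and the function name and the counts in the usage note are hypothetical.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two rates.

    Group A is the control and group B is the variant with the site change.
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, two_proportion_z_test(100, 1000, 150, 1000) compares a 10% rate in control with a 15% rate in the variant; a small p-value suggests the change moved the metric.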
Columbus
[Screenshot: geographical bird's-eye view of growth by country]
[Screenshot: comparison between user groups]
Data Analysis for “Artificial Intelligence”
                       -- predicting user social behavior
Who the user will interact with

• predict interactions between friends

• features are user profile and browsing history

• tried linear models and tree models

• applied for search, newsfeed, etc
Who the user hasn’t found yet

• missing edge prediction problem

• observations are friend/non-friend pairs

• features include profile and local graph info

• profile info more informative

• graph info supplemental if profile incomplete
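
A minimal sketch of how features for one candidate pair might be built from profile and local graph info. The slide does not list the actual features; common-friend count, Jaccard overlap and matching profile fields are assumptions chosen for illustration, and pair_features is a hypothetical name.

```python
def pair_features(graph, profiles, u, v):
    """Build features for one (u, v) pair in a missing-edge prediction task.

    graph maps each user to the set of their friends.
    profiles maps each user to a dict of profile fields (None when missing).
    """
    friends_u = graph.get(u, set())
    friends_v = graph.get(v, set())
    common = friends_u & friends_v
    union = friends_u | friends_v
    # Count profile fields that are filled in for u and equal for both users.
    shared_fields = sum(
        1
        for field, value in profiles.get(u, {}).items()
        if value is not None and profiles.get(v, {}).get(field) == value
    )
    return {
        "common_friends": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "shared_profile_fields": shared_fields,
    }
```

When a profile is incomplete, shared_profile_fields carries little signal and the two graph features are what remains, which matches the "graph info supplemental" point above.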
What applications the user may like*

• 33k apps, only 0.1% of them used;

• a different recommendation problem;

• prediction model not applicable, user preference unavailable;

• build a prediction model to infer “user ratings”;

• user-based + item-based recommendation

• how to combine profile, social graph, ratings?



                  * projects that Ding Zhou participates in;
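
The item-based half of the recommendation step can be sketched as follows, assuming the inferred "user ratings" are already available as numbers. Cosine similarity and the function names are assumptions; the slide does not specify the similarity measure or how the user-based and item-based parts are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors keyed by user."""
    shared = set(a) & set(b)
    dot = sum(a[user] * b[user] for user in shared)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def item_based_scores(ratings, user):
    """Score every app the user has not rated.

    ratings maps app -> {user: inferred rating}. A candidate app scores
    higher when it is similar to apps the user already rates highly.
    """
    scores = {}
    for candidate, candidate_ratings in ratings.items():
        if user in candidate_ratings:
            continue  # already used by this user
        score = 0.0
        for app, app_ratings in ratings.items():
            if user in app_ratings:
                score += cosine(candidate_ratings, app_ratings) * app_ratings[user]
        scores[candidate] = score
    return scores
```

A user-based variant would transpose the structure and compare users instead of apps.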
What content is interesting*
• newsfeed as the main content distribution channel

• stories generated by 100s of social actions: on the site, platform, or the Web

• <0.1% of possible stories are shown

• predictions built on story features, and user browsing history




                    * projects that Ding Zhou participates in;
Challenges in Data
- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. small app has no user data)
- prediction requirements
 ▪   models regularly applied for 10 billion novel samples
 ▪   models used on-the-fly for 100k samples in 50 ms
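
The two prediction requirements imply a per-sample budget that is easy to work out. The sketch assumes a single sequential scorer and the same per-sample cost for the online and batch cases; neither assumption comes from the slide.

```python
ONLINE_SAMPLES = 100_000        # samples scored on-the-fly
ONLINE_BUDGET_S = 0.050         # 50 ms
BATCH_SAMPLES = 10_000_000_000  # 10 billion novel samples

# Time available per sample and the implied throughput.
per_sample_s = ONLINE_BUDGET_S / ONLINE_SAMPLES
throughput_per_s = ONLINE_SAMPLES / ONLINE_BUDGET_S

# Time for the batch job at that same rate, on one sequential scorer.
batch_hours = BATCH_SAMPLES * per_sample_s / 3600

if __name__ == "__main__":
    print(f"per sample: {per_sample_s * 1e6:.2f} microseconds")
    print(f"throughput: {throughput_per_s:,.0f} samples/s")
    print(f"batch time: {batch_hours:.2f} hours")
```

Under these assumptions the budget is about half a microsecond per sample, so model evaluation has to be very cheap or spread across many machines.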
Special Machine Learning Problems
- use machine learning to predict user behavior
 ▪   labels: insufficient; inferred implicitly; imbalanced;
 ▪   features: high-dimensional; strongly correlated; noisy;


- scale requires distributed algorithms
 ▪   in-house implementation of tree ensemble methods (bagging predictors)
 ▪   larger training sets grant performance improvements


- speed and accuracy improvements underway
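
For readers unfamiliar with bagging predictors, here is a single-machine sketch that bags one-split trees (stumps). It is not the in-house distributed implementation mentioned above; the stump learner, the function names and the default parameters are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Fit a one-split tree: pick the (feature, threshold) with fewest errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_bagging(rows, labels, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble


def predict_bagging(ensemble, row):
    """Predict by majority vote over the ensemble."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]
```

Because each stump is trained independently on its own bootstrap sample, the training loop is easy to parallelize, which is one reason bagging suits distributed settings.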
tip of the iceberg

    Questions?
(c) 2004-2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

joint statistical meeting 2008
