Hadoop and Pig @Twitter
              Kevin Weil -- @kevinweil
              Analytics Lead, Twitter








Friday, July 23, 2010
Agenda
           ‣     Hadoop Overview
           ‣     Pig: Rapid Learning Over Big Data
           ‣     Data-Driven Products
           ‣     Hadoop/Pig and Analytics




My Background
           ‣     Mathematics and Physics at Harvard, Physics at
                 Stanford
           ‣     Tropos Networks (city-wide wireless): mesh
                 routing algorithms, GBs of data
           ‣     Cooliris (web media): Hadoop and Pig for
                 analytics, TBs of data
           ‣     Twitter: Hadoop, Pig, HBase, Cassandra,
                 machine learning, visualization, social graph
                 analysis, soon to be PBs of data



Data is Getting Big
           ‣     NYSE: 1 TB/day
           ‣     Facebook: 20+ TB
                 compressed/day
           ‣     CERN/LHC: 40 TB/day
                 (15 PB/year)
           ‣     And growth is
                 accelerating
           ‣     Need multiple machines,
                 horizontal scalability


Hadoop
           ‣      Distributed file system (hard to store a PB)
           ‣      Fault-tolerant, handles replication, node failure,
                  etc
           ‣      MapReduce-based parallel computation
                  (even harder to process a PB)
           ‣      Generic key-value based computation interface
                  allows for wide applicability




Hadoop
           ‣      Open source: top-level Apache project
           ‣      Scalable: Yahoo! has a 4,000-node cluster
           ‣      Powerful: sorted a TB of random integers in 62
                  seconds


           ‣      Easy Packaging: Cloudera RPMs, DEBs




MapReduce Workflow
           [Diagram: Inputs → Map tasks → Shuffle/Sort → Reduce tasks → Outputs]
           ‣     Challenge: how many tweets per user, given tweets table?
           ‣     Input: key=row, value=tweet info
           ‣     Map: output key=user_id, value=1
           ‣     Shuffle: sort by user_id
           ‣     Reduce: for each user_id, sum
           ‣     Output: user_id, tweet count
           ‣     With 2x machines, runs 2x faster
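
The tweets-per-user workflow on this slide can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle/sort, and reduce steps, not Hadoop code; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: for every (row key, tweet info) input, emit (user_id, 1)."""
    for _row_key, tweet in rows:
        yield tweet["user_id"], 1


def shuffle_phase(pairs):
    """Shuffle/sort: group the emitted values by user_id, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: for each user_id, sum the emitted ones."""
    for user_id, ones in grouped:
        yield user_id, sum(ones)


def tweets_per_user(rows):
    """Run the three phases in sequence and collect user_id -> tweet count."""
    return dict(reduce_phase(shuffle_phase(map_phase(rows))))


# Hypothetical sample input: (row key, tweet info)
sample_rows = [
    (0, {"user_id": 12, "text": "hello"}),
    (1, {"user_id": 7, "text": "hadoop"}),
    (2, {"user_id": 12, "text": "pig"}),
]
counts = tweets_per_user(sample_rows)
print(counts)
```

On a real cluster the map and reduce calls run in parallel across machines, which is where the speedup from adding machines comes from; the sketch above runs them sequentially.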

But...
           ‣     Analysis typically in Java
           ‣     Single-input, two-stage
                 data flow is rigid
           ‣     Projections, filters:
                 custom code
           ‣     Joins are lengthy, error-prone
           ‣     Hard to manage n-stage jobs
           ‣     Exploration requires compilation!
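
To make the point about joins concrete, here is a minimal single-machine sketch of a hand-written reduce-side join in Python. It is not the talk's code: the record layouts, the "U"/"T" tags, and the function names are assumptions chosen for illustration. Even this toy version needs tagging, grouping, and pairing logic that a higher-level language would hide.

```python
from collections import defaultdict


def map_tagged(users, tweets):
    """Map: emit (user_id, tagged record) so the reducer can tell sources apart."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, text in tweets:
        yield user_id, ("T", text)


def reduce_join(tagged_pairs):
    """Reduce: group by user_id, then pair every user record with every tweet."""
    groups = defaultdict(list)
    for key, tagged in tagged_pairs:
        groups[key].append(tagged)
    for user_id in sorted(groups):
        names = [value for tag, value in groups[user_id] if tag == "U"]
        texts = [value for tag, value in groups[user_id] if tag == "T"]
        for name in names:
            for text in texts:
                yield user_id, name, text


# Hypothetical inputs: (user_id, name) and (user_id, tweet text)
users = [(7, "alice"), (12, "bob")]
tweets = [(12, "pig"), (7, "hadoop"), (12, "hello")]
joined = list(reduce_join(map_tagged(users, tweets)))
print(joined)
```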



Enter Pig
          ‣      High level language
          ‣      Transformations on
                 sets of records
          ‣      Process data one step at a time
          ‣      Easier than SQL?


          ‣      Top-level Apache project



Why Pig?
             ‣      Because I bet you can read the following script.




A Real Pig Script




Now, just for fun...
             ‣      The same calculation in vanilla MapReduce




No, seriously.




Pig Democratizes Large-Scale Data Analysis
           ‣     The Pig version is:
           ‣            5% of the code
           ‣            5% of the development time
           ‣            Within 25% of the execution time
           ‣            Readable, reusable




One Thing I’ve Learned
           ‣     It’s easy to answer questions
           ‣     It’s hard to ask the right questions


           ‣     Value the system that promotes innovation and
                 iteration




MySQL, MySQL, MySQL
           ‣     We all start there.
           ‣     But MySQL is not built for analysis.
           ‣     select count(*) from users? Maybe.
           ‣     select count(*) from tweets? Uh...
           ‣     Imagine joining them.
           ‣     And grouping.
           ‣     Then sorting.
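
For reference, the join, group, and sort that this slide asks you to imagine has roughly the following shape. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table layout and sample rows are hypothetical. It runs instantly on three rows; the slide's point is that the same query over a full tweets table is where a transactional database starts to struggle.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tweets (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
    INSERT INTO users VALUES (7, 'alice'), (12, 'bob');
    INSERT INTO tweets VALUES (1, 12, 'hello'), (2, 7, 'hadoop'), (3, 12, 'pig');
    """
)

# Join users to tweets, group per user, then sort by tweet count.
query = """
    SELECT u.name, COUNT(*) AS tweet_count
    FROM users u
    JOIN tweets t ON t.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY tweet_count DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
```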



Non-Pig Hadoop at Twitter
           ‣     Data Sink via Scribe
           ‣     Distributed Grep
           ‣     A few performance-critical, simple jobs
           ‣     People Search




People Search?
           ‣     First real product built with Hadoop
           ‣     “Find People”
           ‣     Old version: offline process on
                 a single node
           ‣     New version: complex graph
                 calculations, hit internal network
                 services, custom indexing
           ‣     Faster, more reliable,
                 more observable
People Search
           ‣     Import user data into HBase
           ‣     Periodic MapReduce job reading from HBase
           ‣      Hits FlockDB, other internal services in
                 mapper
           ‣            Custom partitioning
           ‣     Data sucked across to sharded, replicated,
                 horizontally scalable, in-memory, low-latency
                 Scala service
           ‣       Build a trie, do case folding/normalization,
                 suggestions, etc
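
The last step (a trie with case folding and prefix suggestions) can be sketched in a few lines of Python. This is only an illustration of the idea, not Twitter's Scala service; the class names, the normalization rule, and the sample names are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.names = []     # original-cased names whose normalized form ends here


class NameTrie:
    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def normalize(text):
        """Case-fold and collapse whitespace so lookups ignore case and spacing."""
        return " ".join(text.casefold().split())

    def insert(self, name):
        node = self.root
        for ch in self.normalize(name):
            node = node.children.setdefault(ch, TrieNode())
        node.names.append(name)

    def suggest(self, prefix, limit=10):
        """Return up to `limit` stored names whose normalized form starts with prefix."""
        node = self.root
        for ch in self.normalize(prefix):
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            # Push children in reverse order so smaller characters are visited first.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]


# Hypothetical sample data
trie = NameTrie()
for full_name in ["Kevin Weil", "kevin rose", "Biz Stone"]:
    trie.insert(full_name)
print(trie.suggest("KEV"))
```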
Order of Operations

          ‣      Counting



          ‣      Correlating



          ‣      Research/
                 Algorithmic
                 Learning

Counting
           ‣     How many requests per day?
           ‣     What’s the average latency? 95th-percentile latency?
           ‣     What’s the response code distribution?
           ‣     How many searches per day? Unique users?
           ‣     What’s the geographic breakdown of requests?
           ‣     How many tweets? From what clients?
           ‣     How many signups? Profile completeness?
           ‣     How many SMS notifications did we send?
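
A few of these counting questions (average latency, 95th-percentile latency, response code distribution) can be sketched on a small in-memory sample in Python. The request records below are hypothetical, and the percentile uses the simple nearest-rank definition, which is one of several common conventions.

```python
import math
from collections import Counter


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request log: (latency in ms, HTTP response code)
requests = [
    (12, 200), (15, 200), (11, 200), (250, 500), (14, 200),
    (13, 304), (16, 200), (18, 200), (17, 404), (19, 200),
]

latencies = [latency for latency, _code in requests]
avg_latency = average(latencies)
p95_latency = percentile(latencies, 95)
code_distribution = Counter(code for _latency, code in requests)

print(avg_latency, p95_latency, code_distribution)
```

The single slow request barely registers in the median but dominates the 95th percentile, which is why the slide asks for both the average and a tail latency.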


Correlating
           ‣     How does usage differ for mobile users?
           ‣     ... for desktop client users (TweetDeck, etc.)?
           ‣     Cohort analyses
           ‣     What services fail at the same time?
           ‣     What features get users hooked?
           ‣     What do successful users do often?
           ‣     How does tweet volume change over time?



Research
           ‣     What can we infer from a user’s tweets?
           ‣     ... from the tweets of their followers? followees?
           ‣     What features tend to get a tweet retweeted?
           ‣     ... and what influences the retweet tree depth?
           ‣     Duplicate detection, language detection
           ‣     What graph structures lead to increased usage?
           ‣     Sentiment analysis, entity extraction
           ‣     User reputation


If We Had More Time...
           ‣     HBase
           ‣     LZO compression and Hadoop
           ‣     Protocol buffers
           ‣     Our open source: hadoop-lzo, elephant-bird
           ‣     Analytics and Cassandra




Questions?
                   Follow me at
                   twitter.com/kevinweil







Friday, July 23, 2010

Weitere ähnliche Inhalte

Andere mochten auch

Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongFastly
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)mortardata
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 

Andere mochten auch (20)

Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Un introduction à Pig
Un introduction à PigUn introduction à Pig
Un introduction à Pig
 
RainBird
RainBirdRainBird
RainBird
 
Pig workshop
Pig workshopPig workshop
Pig workshop
 
Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Pig statements
Pig statementsPig statements
Pig statements
 
Apache pig
Apache pigApache pig
Apache pig
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 

Kürzlich hochgeladen

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 

Kürzlich hochgeladen (20)

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Hadoop and pig at twitter (oscon 2010)

  • 1. Hadoop and Pig @Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter TM Friday, July 23, 2010
  • 2. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs data Friday, July 23, 2010
  • 4. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 5. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability Friday, July 23, 2010
  • 6. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability Friday, July 23, 2010
  • 7. Hadoop ‣ Open source: top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds ‣ Easy Packaging: Cloudera RPMs, DEBs Friday, July 23, 2010
  • 8. MapReduce Workflow Inputs ‣ Challenge: how many tweets per Map Shuffle/Sort user, given tweets table? Map ‣ Input: key=row, value=tweet info Outputs Map Reduce ‣ Map: output key=user_id, Map Reduce value=1 Map Reduce ‣ Shuffle: sort by user_id Map ‣ Reduce: for each user_id, sum Map ‣ Output: user_id, tweet count ‣ With 2x machines, runs 2x faster Friday, July 23, 2010
  • 15. But... ‣ Analysis typically in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins are lengthy, error-prone ‣ Hard to manage n-stage jobs ‣ Exploration requires compilation!
  • 16. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics
  • 17. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL? ‣ Top-level Apache project
  • 18. Why Pig? ‣ Because I bet you can read the following script.
  • 19. A Real Pig Script
  • 20. Now, just for fun... ‣ The same calculation in vanilla MapReduce
  • 22. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the development time ‣ Within 25% of the execution time ‣ Readable, reusable
  • 23. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration
  • 24. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics
  • 25. MySQL, MySQL, MySQL ‣ We all start there. ‣ But MySQL is not built for analysis. ‣ select count(*) from users? Maybe. ‣ select count(*) from tweets? Uh... ‣ Imagine joining them. ‣ And grouping. ‣ Then sorting.
  • 26. Non-Pig Hadoop at Twitter ‣ Data Sink via Scribe ‣ Distributed Grep ‣ A few performance-critical, simple jobs ‣ People Search
  • 27. People Search? ‣ First real product built with Hadoop ‣ “Find People” ‣ Old version: offline process on a single node ‣ New version: complex graph calculations, hit internal network services, custom indexing ‣ Faster, more reliable, more observable
  • 28. People Search ‣ Import user data into HBase ‣ Periodic MapReduce job reading from HBase ‣ Hits FlockDB, other internal services in mapper ‣ Custom partitioning ‣ Data sucked across to sharded, replicated, horizontally scalable, in-memory, low-latency Scala service ‣ Build a trie, do case folding/normalization, suggestions, etc
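
The last step on this slide (build a trie, fold case, return suggestions) can be illustrated with a small in-memory sketch. The service described in the talk is written in Scala and its internals are not shown in the slides, so everything below is an assumption for illustration: the `NameTrie` class, its `normalize` rule (case folding plus whitespace collapsing), and the sample names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.names = []     # original names whose normalized form ends here


class NameTrie:
    """Prefix index over names, matched case-insensitively."""

    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def normalize(text):
        # Assumed normalization: case folding and collapsing runs of whitespace.
        return " ".join(text.casefold().split())

    def insert(self, name):
        node = self.root
        for ch in self.normalize(name):
            node = node.children.setdefault(ch, TrieNode())
        node.names.append(name)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored names whose normalized form starts with prefix."""
        node = self.root
        for ch in self.normalize(prefix):
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            # Push children in reverse order so the smallest character is visited first.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]


if __name__ == "__main__":
    index = NameTrie()
    for name in ["Kevin Weil", "kevin smith", "Ada Lovelace"]:
        index.insert(name)
    print(index.suggest("KEV"))
```

Storing the original spelling at the terminal node lets the index match on the folded form while still returning names as users wrote them.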
  • 29. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics
  • 30. Order of Operations ‣ Counting ‣ Correlating ‣ Research/ Algorithmic Learning
  • 31. Counting ‣ How many requests per day? ‣ What’s the average latency? 95% latency? ‣ What’s the response code distribution? ‣ How many searches per day? Unique users? ‣ What’s the geographic breakdown of requests? ‣ How many tweets? From what clients? ‣ How many signups? Profile completeness? ‣ How many SMS notifications did we send?
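
Several of the counting questions above (request volume, average latency, 95% latency, response code distribution) reduce to simple aggregates over a request log. The sketch below computes them for an in-memory list. The record fields `latency_ms` and `status` are hypothetical, and the nearest-rank definition of the 95th percentile is one common choice, not necessarily the one used at Twitter.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Compute basic counting metrics for a non-empty list of request records."""
    latencies = [r["latency_ms"] for r in requests]
    return {
        "requests": len(requests),
        "avg_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "status_codes": Counter(r["status"] for r in requests),
    }


if __name__ == "__main__":
    # Hypothetical log: 20 requests with latencies 10, 20, ..., 200 ms.
    log = [
        {"latency_ms": 10 * (i + 1), "status": 200 if i < 18 else 500}
        for i in range(20)
    ]
    print(summarize(log))
```

At the data volumes described in the talk the same aggregates would be computed as grouped operations in Pig or MapReduce rather than over a Python list.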
  • 32. Correlating ‣ How does usage differ for mobile users? ‣ ... for desktop client users (Tweetdeck, etc)? ‣ Cohort analyses ‣ What services fail at the same time? ‣ What features get users hooked? ‣ What do successful users do often? ‣ How does tweet volume change over time?
  • 33. Research ‣ What can we infer from a user’s tweets? ‣ ... from the tweets of their followers? followees? ‣ What features tend to get a tweet retweeted? ‣ ... and what influences the retweet tree depth? ‣ Duplicate detection, language detection ‣ What graph structures lead to increased usage? ‣ Sentiment analysis, entity extraction ‣ User reputation
  • 34. If We Had More Time... ‣ HBase ‣ LZO compression and Hadoop ‣ Protocol buffers ‣ Our open source: hadoop-lzo, elephant-bird ‣ Analytics and Cassandra
  • 35. Questions? Follow me at twitter.com/kevinweil