30 Billion Events a Day with Hadoop




Michael Brown, CTO, comScore, Inc.
May 10th, 2012
comScore is a Global Leader in Measuring the Digital World

  NASDAQ            SCOR
  Clients           1860+ worldwide
  Employees         1000+
  Headquarters      Reston, VA
  Global Coverage   170+ countries under measurement; 43 markets reported
  Local Presence    32 locations in 23 countries




                © comScore, Inc.   Proprietary.             2                                      V1011
Some of our Clients
 Media   Agencies   Telecom/Mobile            Financial   Retail   Travel   CPG   Pharma   Technology




The Trusted Source for Digital Intelligence Across Vertical Markets


§  9 out of the top 10 INVESTMENT BANKS
§  4 out of the top 4 WIRELESS CARRIERS
§  47 out of the top 50 ONLINE PROPERTIES
§  45 out of the top 50 ADVERTISING AGENCIES
§  9 out of the top 10 MAJOR MEDIA COMPANIES
§  9 out of the top 10 AUTO INSURERS
§  11 out of the top 12 INTERNET SERVICE PROVIDERS
§  14 out of the top 15 PHARMACEUTICAL COMPANIES
§  11 out of the top 12 CONSUMER FINANCE COMPANIES
§  8 out of the top 10 CPG COMPANIES


Unified Digital Measurement™ (UDM) Establishes Platform For
Panel + Census Data Integration


§  Global PERSON Measurement: PANEL
§  Global DEVICE Measurement: CENSUS

§  Unified Digital Measurement (UDM)
  –  Patent-Pending Methodology
  –  Adopted by 90% of Top 100 U.S. Media Properties


Beacon Heat Map




Worldwide Tags per Month

[Figure: Monthly Records Collection. Y-axis: # of records (0 to 1,000,000,000,000). X-axis: months from Jul 2009 to May 2012. Series: Panel Records, Beacon Records.]
Our Event Volume in Perspective

  Property          Page Views (MM)
  FACEBOOK.COM      472,814
  Google Sites      302,802
  Yahoo! Sites       90,448
  Total             866,064




Source: comScore MediaMetrix Worldwide April 2012




Growth Slides

[Figure: growth chart with a y-axis from 0 to 1,600,000,000,000 and a fitted trend, R² = 0.9335.]
The Project:
Census Web Agg




The Problem Statement

§  Calculate the number of events and unique cookies for each key
§  Key takeaways
  –  Input data is sessionized daily
  –  Need to process all data for a month
  –  Need to calculate values for Total Internet and for each site under measurement




Counting Uniques from a Time Ordered Log File



  Input (time ordered): A, D, B, C, B, A, A

§  Major Downsides:
  –  Need to keep all key elements in memory.
  –  Constrained to one machine for final aggregation.
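
As a rough illustration of the memory downside, here is a minimal Python sketch of counting events and unique cookies per key from a log in arrival order. The `(key, cookie)` record shape, the function name, and the sample data are assumptions made for this example; they are not taken from the slides.

```python
from collections import defaultdict

def count_time_ordered(records):
    """Count events and unique cookies per key from records in arrival order.

    `records` is an iterable of (key, cookie) pairs (hypothetical shape).
    """
    events = defaultdict(int)
    seen = defaultdict(set)  # every distinct cookie of every key stays in memory
    for key, cookie in records:
        events[key] += 1
        seen[key].add(cookie)
    return {key: (events[key], len(seen[key])) for key in events}

# Hypothetical sample: keys arrive in the order A, D, B, C, B, A, A.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = count_time_ordered(log)
print(result)
```

The `seen` dictionary is the problem: nothing can be discarded until the whole log has been read, so memory grows with the number of distinct (key, cookie) pairs.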


Counting Uniques from a Key Ordered Log File



  Input (key ordered): A, A, A, B, B, C, D

§  Major Downsides:
  –  Need to sort data in advance.
  –  The sort time increases as volume grows.
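
For contrast, a minimal sketch of the key-ordered case, assuming the input is already sorted by `(key, cookie)`. The record shape and names are again hypothetical. Only the previous cookie has to be remembered, so memory use stays constant regardless of volume.

```python
from itertools import groupby

def count_key_ordered(sorted_records):
    """Yield (key, events, unique_cookies) from records sorted by (key, cookie)."""
    for key, group in groupby(sorted_records, key=lambda record: record[0]):
        events = 0
        uniques = 0
        previous = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous:  # a new cookie starts a new run
                uniques += 1
                previous = cookie
        yield key, events, uniques

# Hypothetical sample, pre-sorted by (key, cookie).
sorted_log = [("A", "c1"), ("A", "c1"), ("A", "c2"),
              ("B", "c1"), ("B", "c1"), ("C", "c3"), ("D", "c2")]
rows = list(count_key_ordered(sorted_log))
print(rows)
```

The cost has moved from memory to the sort that must run before this pass.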


Scaling Issue

§  As our volume has grown we have the following stats:
  –  Over 900 billion events per month
  –  Over 150 billion sessions per month
  –  Over 5,000 reportable sites
  –  Over 50 countries
  –  We see 15 billion distinct cookies in a month
  –  5 sites have over 1 billion cookies in a month
  –  The sum of all distinct cookies is 377 billion
  –  We only need to output 15 million rows
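
A quick back-of-the-envelope check relates these monthly figures to the "30 billion events a day" in the title. It assumes a 30-day month and treats the "over N" figures as exact, so the results are lower bounds rather than measured values.

```python
DAYS_PER_MONTH = 30  # assumption: a 30-day month

events_per_month = 900_000_000_000    # "over 900 billion events per month"
sessions_per_month = 150_000_000_000  # "over 150 billion sessions per month"

events_per_day = events_per_month // DAYS_PER_MONTH
sessions_per_day = sessions_per_month // DAYS_PER_MONTH
events_per_session = events_per_month / sessions_per_month

print(f"{events_per_day:,} events/day")
print(f"{sessions_per_day:,} sessions/day")
print(f"{events_per_session:.1f} events/session on average")
```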




Counting Uniques from a Key Ordered Log File




Windows v1 (Single Server)

§  Time to process data for first few months
       Month       Wall Time (hours)
       Jul 2009     8
       Aug 2009    10
       Sep 2009    11
       Oct 2009    16
       Nov 2009    37




§  V1 processed sessions at roughly 250K rows/sec


§  Problems with this version:
  –  Slow
  –  Not Scalable
  –  Dedicated Server
  –  Bottleneck for delivering production


Counting Uniques from Sharded Key Ordered Log Files




Windows v2

§  Features of this version
  –  Distributed (32 servers)
  –  Multithreaded
  –  Data Localization
  –  Very low network data transfer
  –  Handling the data growth

§  The V2 code processed data at over 8 million rows/sec
  –  1 hour for Dec 2009; 5 hours for April 2012

§  Issues
  –  Data is distributed by ID into 64 parts
  –  Skew in the distribution key is possible, which hurts performance and causes high disk usage on a node
  –  Data replication and recovery are both manual
  –  Results cannot be calculated if any node is down
  –  Adding new servers or changing the number of parts takes a lot of effort
  –  Overhead to maintain framework to run distributed jobs
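
The sharded approach can be sketched as follows, assuming that "ID" means the cookie ID and using CRC32 as a stand-in hash (the slides do not say which function was used). Because each cookie lands in exactly one part, per-part unique counts can simply be added up per key. All function names and the sample data are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTS = 64  # part count from the slide

def part_for(cookie_id: str) -> int:
    # CRC32 is an assumption; any stable hash of the ID would do.
    return zlib.crc32(cookie_id.encode("utf-8")) % NUM_PARTS

def shard(records):
    """Group (key, cookie) records by the part their cookie hashes to."""
    parts = defaultdict(list)
    for key, cookie in records:
        parts[part_for(cookie)].append((key, cookie))
    return parts

def uniques_in_part(records):
    """Unique cookies per key within a single part."""
    seen = defaultdict(set)
    for key, cookie in records:
        seen[key].add(cookie)
    return {key: len(cookies) for key, cookies in seen.items()}

def merge_counts(per_part_counts):
    """Sum per-part unique counts; valid because parts never share a cookie."""
    totals = defaultdict(int)
    for counts in per_part_counts:
        for key, n in counts.items():
            totals[key] += n
    return dict(totals)

records = [("A", "c1"), ("A", "c2"), ("B", "c1"), ("A", "c1"), ("B", "c3")]
parts = shard(records)
totals = merge_counts(uniques_in_part(p) for p in parts.values())
print(totals)
```

The same property explains the skew issue: if a few IDs carry a disproportionate share of the rows, the parts they hash to become much larger than the rest.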




Enter the Elephant

§  Why Hadoop?
 –  Scalable
 –  Low risk of data loss, thanks to replication
 –  Runs on a shared production cluster
 –  No overhead to maintain framework
 –  Easy job submission and management




Basic Approach

§  Leverage Pig for POC
  –  Pig Latin is easy for developers and data analysts to learn
  –  Rapid application development vs. M/R applications (e.g. 1 line of Pig Latin = 20 lines of Java Map/Reduce)
  –  Extensible via UDFs




Performance of Basic Approach on Various Samples

[Figure: Aggregation Performance. Y-axis: Time (minutes), 0 to 80. X-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%).]

Note: Target data size is over 10 TB
M/R Data Flow


  Input records (across three mappers):   B C   A B C A
  Map:                                    Mapper   Mapper   Mapper
  Grouped by key for the reducers:        A A   B B   C C
  Reduce:                                 Reduce   Reduce   Reduce
  Output:                                 A   B   C
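
The flow in this diagram can be imitated in a few lines of plain Python. This is a toy, single-process sketch and not Hadoop code. The split boundaries are an assumption, since the slide only shows the records B C and A B C A feeding three mappers.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (key, 1) pair for every record in one input split."""
    return [(record, 1) for record in split]

def shuffle(mapped_outputs):
    """Group all emitted values by key, as the shuffle does between map and reduce."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the values for one key."""
    return key, sum(values)

# Assumed split boundaries for the six records on the slide.
splits = [["B", "C"], ["A", "B"], ["C", "A"]]
mapped = [map_phase(split) for split in splits]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
shuffled_records = sum(len(values) for values in grouped.values())
print(counts, shuffled_records)
```

Every mapped record crosses the shuffle here, which is the cost that grows with input volume.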




Basic Approach Retrospective

§  On a sample of the input data, processing speed is not scaling to our needs
§  Diagnosis
  –  Most aggregations could not take significant advantage of combiners. This is not a Pig issue.
  –  Large shuffles caused poor job performance. In some cases, large aggregations ran slower on the Hadoop cluster than on the current architecture


§  Conclusion
  –  A new approach is required to reduce the shuffle
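
One way to see why combiners help little with unique counts: partial distinct counts from different input splits cannot be added, because the same cookie may appear in more than one split. The small sketch below uses made-up cookie IDs for a single key spread over two splits.

```python
def distinct_in_split(split):
    """What a per-split pre-aggregation could compute on its own."""
    return len(set(split))

# Hypothetical cookies seen for one key in two different input splits.
split_1 = ["c1", "c2", "c1"]
split_2 = ["c2", "c3"]

# Adding per-split distinct counts double-counts c2.
summed_partials = distinct_in_split(split_1) + distinct_in_split(split_2)
true_distinct = len(set(split_1) | set(split_2))

# Plain event counts, by contrast, add up correctly.
summed_events = len(split_1) + len(split_2)

print(summed_partials, true_distinct, summed_events)
```

A combiner can still drop duplicates within its own split, but the surviving cookie IDs must all be shuffled to the reducer, so the shuffle stays large when most cookies are distinct.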




Solution to reduce the shuffle

§  The Problem:
  –  Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues

§  The Idea:
  –  Partition and sort data on a daily basis
  –  Create a custom input format to merge daily partitions for monthly aggregations
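
The idea can be sketched in plain Python, with `heapq.merge` standing in for the custom input format. This is not the Hadoop implementation, and the record shape and sample data are hypothetical. Each daily partition is assumed to be sorted by `(key, cookie)`, so a streaming merge yields the month in key order and the uniques can be counted in one pass without a large shuffle.

```python
import heapq
from itertools import groupby

def monthly_uniques(daily_partitions):
    """Merge per-day lists sorted by (key, cookie); yield (key, events, uniques)."""
    merged = heapq.merge(*daily_partitions)  # lazy k-way merge of sorted inputs
    for key, group in groupby(merged, key=lambda record: record[0]):
        events = 0
        uniques = 0
        previous = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous:
                uniques += 1
                previous = cookie
        yield key, events, uniques

# Three hypothetical daily partitions, each already sorted by (key, cookie).
day_1 = [("A", "c1"), ("A", "c2"), ("B", "c1")]
day_2 = [("A", "c1"), ("B", "c3"), ("C", "c2")]
day_3 = [("A", "c3"), ("C", "c2")]
monthly = list(monthly_uniques([day_1, day_2, day_3]))
print(monthly)
```

The daily sort is paid once per day on a small slice of data, and the monthly job only merges.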




Custom Input Format with Map Side Aggregation


  Input records:     B C   A B C A
  Map:               A Mapper   B Mapper   C Mapper
  Combiner output:   A   B   C
  Reduce output:     A   B   C

Performance of v2 on Various Samples

[Figure: Aggregation Performance. Y-axis: Time (minutes), 0 to 120. X-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%), 10304 GB (100%). Series: Pig, Custom Input Format.]
Partitioning Summary

§  Benefits:
  –  A large portion of the aggregation can be completed in the map phase
  –  Applications can now take advantage of combiners
  –  Shuffle sizes are minimal

§  Risks:
  –  Data locality loss
  –  Map failures might result in long run times, depending on the size of the partitions




Full Sample Performance

§  Analysis of the full data set
  –  10 TB of input data
  –  150 billion session rows


§  Total Time
  –  1 hour, 45 minutes
  –  Over 23,000,000 rows/sec
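
The throughput figure follows directly from the row count and the run time. The short check below reproduces it and also derives the implied input bandwidth, assuming decimal terabytes (10 TB = 10 × 10^12 bytes).

```python
rows = 150_000_000_000            # 150 billion session rows
input_bytes = 10 * 10**12         # 10 TB, assuming decimal terabytes
elapsed_seconds = 1 * 3600 + 45 * 60  # 1 hour, 45 minutes

rows_per_second = rows / elapsed_seconds
gigabytes_per_second = input_bytes / elapsed_seconds / 10**9

print(f"{rows_per_second:,.0f} rows/sec")
print(f"{gigabytes_per_second:.2f} GB/sec")
```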




Future Ideas

§  HBase
  –  Unique cookie calculations are free as data is more organized
  –  How will data loading fare?


§  Data Locality
  –  Ideally, we would be able to provide additional hints about how the data should be stored
  –  Not sure whether this will be included in Hadoop


§  Connection to an MPP DB
  –  We also leverage Greenplum DB; we could connect to each sharded instance




Hadoop Cluster

§  Production Hadoop Cluster
  –  80 nodes: Mix of Dell R710 and R510
  –  Each R510 has 12x2TB drives, 64GB RAM, and 24 cores
  –  1768 total CPUs
  –  4.7TB total memory
  –  1200TB total disk space
  –  Our distro is MapR M5 1.2.7




Useful Factoids
  Colorful, bite-sized graphical representations of the best discoveries we unearth.




    Visit www.comscoredatamine.com or follow @datagems for the latest gems.


Thank You!


 Michael Brown
 CTO
 comScore, Inc.


 mbrown@comscore.com




Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 

Kürzlich hochgeladen (20)

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 

30B Events a Day with Hadoop

  • 1. 30 Billion Events a Day with Hadoop Michael Brown, CTO, comScore, Inc. May 10th, 2012
  • 2. comScore is a Global Leader in Measuring the Digital World
      NASDAQ:           SCOR
      Clients:          1860+ worldwide
      Employees:        1000+
      Headquarters:     Reston, VA
      Global Coverage:  170+ countries under measurement; 43 markets reported
      Local Presence:   32 locations in 23 countries
      © comScore, Inc. Proprietary.
  • 3. Some of our Clients: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology
  • 4. The Trusted Source for Digital Intelligence Across Vertical Markets
      –  9 out of the top 10 investment banks
      –  9 out of the top 10 auto insurers
      –  4 out of the top 4 wireless carriers
      –  11 out of the top 12 internet service providers
      –  47 out of the top 50 online properties
      –  14 out of the top 15 pharmaceutical companies
      –  45 out of the top 50 advertising agencies
      –  11 out of the top 12 consumer finance companies
      –  9 out of the top 10 major media companies
      –  8 out of the top 10 CPG companies
  • 5. Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
      –  Panel: global person measurement
      –  Census: global device measurement
      –  Unified Digital Measurement (UDM): patent-pending methodology, adopted by 90% of Top 100 U.S. Media Properties
  • 6. Beacon Heat Map
  • 7. Worldwide Tags per Month
      [Chart: "Monthly Records Collection". Y-axis: # of records, 0 to 1,000,000,000,000. X-axis: months, 2009 to 2012. Series: Panel Records, Beacon Records.]
  • 8. Our Event Volume in Perspective
      Property        Page Views (MM)
      FACEBOOK.COM    472,814
      Google Sites    302,802
      Yahoo! Sites     90,448
      Total           866,064
      Source: comScore MediaMetrix Worldwide, April 2012
  • 9. Growth Slides
      [Chart: fitted trend line, R² = 0.9335. Y-axis: 0 to 1,600,000,000,000.]
  • 10. The Project: Census Web Agg
  • 11. The Problem Statement
      §  Calculate the number of events and unique cookies for each key
      §  Key takeaways
          –  Data on input will be sessionized daily
          –  Need to process all data for a month
          –  Need to calculate values for Total Internet and for each site under measurement
  • 12. Counting Uniques from a Time Ordered Log File
      [Diagram: keys arrive in time order: A, D, B, C, B, A, A]
      Major downsides:
          –  Need to keep all key elements in memory.
          –  Constrained to one machine for final aggregation.
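The time-ordered approach on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code comScore ran: the `(key, cookie)` record shape, the function name, and the sample cookie IDs are assumptions made for the example. It shows why memory is the limiting factor: every distinct cookie per key has to stay in a set until the end of the input.

```python
from collections import defaultdict


def count_uniques_time_ordered(events):
    """Count events and unique cookies per key from (key, cookie) pairs in arrival order."""
    event_counts = defaultdict(int)
    # One set per key: this is the state that grows with every distinct cookie.
    cookies_seen = defaultdict(set)
    for key, cookie in events:
        event_counts[key] += 1
        cookies_seen[key].add(cookie)
    return {key: (event_counts[key], len(cookies_seen[key])) for key in event_counts}


# Hypothetical log in the same key order as the slide's diagram: A, D, B, C, B, A, A
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = count_uniques_time_ordered(log)
```

Here `result["A"]` is `(3, 2)`: three events from two distinct cookies. Because the sets live in one process, the final aggregation is tied to one machine, which is the second downside the slide lists.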
  • 13. Counting Uniques from a Key Ordered Log File
      [Diagram: keys arrive sorted: A, A, A, B, B, C, D]
      Major downsides:
          –  Need to sort data in advance.
          –  The sort time increases as volume grows.
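A sketch of the key-ordered variant, under the assumption that records are `(key, cookie)` pairs sorted by key and then by cookie (the slide only says "key ordered"; the secondary sort on cookie is an assumption that makes streaming distinct counts possible). The function name and sample data are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def count_uniques_key_ordered(sorted_events):
    """Yield (key, event_count, unique_cookie_count) from pairs sorted by (key, cookie)."""
    for key, group in groupby(sorted_events, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous_cookie = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            # In sorted input, a cookie is new exactly when it differs from the previous one.
            if cookie != previous_cookie:
                uniques += 1
                previous_cookie = cookie
        yield key, events, uniques


log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
# The up-front sort is the cost the slide calls out.
sorted_counts = list(count_uniques_key_ordered(sorted(log)))
```

Only the current key's counters and one previous cookie are held in memory, so memory use no longer depends on the number of distinct cookies. The price is the sort, which grows with volume.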
  • 14. Scaling Issue
      §  As our volume has grown we have the following stats:
          –  Over 900 billion events per month
          –  Over 150 billion sessions per month
          –  Over 5,000 reportable sites
          –  Over 50 countries
          –  We see 15 billion distinct cookies in a month
          –  5 sites have over 1 billion cookies in a month
          –  The sum of all distinct cookies is 377 billion
          –  We only need to output 15 million rows
  • 15. Counting Uniques from a Key Ordered Log File
  • 16. Windows v1 (Single Server)
      §  Time to process data for first few months
          Month       Wall Time (hours)
          Jul 2009     8
          Aug 2009    10
          Sep 2009    11
          Oct 2009    16
          Nov 2009    37
      §  V1 processed sessions at roughly 250K rows/sec
      §  Problems with this version:
          –  Slow
          –  Not scalable
          –  Dedicated server
          –  Bottleneck for delivering production
  • 17. Counting Uniques from Sharded Key Ordered Log Files
  • 18. Windows v2
      §  Features of this version
          –  Distributed (32 servers)
          –  Multithreaded
          –  Data localization
          –  Very low network data transfer
          –  Handling the data growth
      §  The V2 code processed data at over 8 million rows/sec
          –  1 hour for Dec 2009; 5 hours for April 2012
      §  Issues
          –  Data is distributed by ID into 64 parts
          –  Possible skew in the distribution key, which hurts performance and causes high disk usage on a node
          –  All data replication is manual, along with recovery
          –  Results cannot be calculated if any node is down
          –  Adding new servers or changing the number of parts is a ton of effort
          –  Overhead to maintain framework to run distributed jobs
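The sharding idea behind v2 can be illustrated as follows. The slide states only that data is distributed by ID into 64 parts; the CRC32 hash, the function names, and the sample records below are assumptions for the sketch. Because every cookie ID lands in exactly one part, per-part distinct counts can simply be added, which is what keeps network transfer low.

```python
import zlib
from collections import defaultdict

NUM_PARTS = 64  # part count from the slide; the hash below is an assumed choice


def part_for(cookie_id):
    """Map a cookie ID to one of NUM_PARTS parts with a stable hash."""
    return zlib.crc32(cookie_id.encode("utf-8")) % NUM_PARTS


def shard(events):
    """Group (key, cookie) events by the part their cookie ID belongs to."""
    parts = defaultdict(list)
    for key, cookie in events:
        parts[part_for(cookie)].append((key, cookie))
    return parts


def uniques_in_part(part_events):
    """Distinct cookies per key within a single part (one server's work)."""
    seen = defaultdict(set)
    for key, cookie in part_events:
        seen[key].add(cookie)
    return {key: len(cookies) for key, cookies in seen.items()}


def total_uniques(events):
    """Sum per-part distinct counts; valid because parts hold disjoint cookie IDs."""
    totals = defaultdict(int)
    for part_events in shard(events).values():
        for key, count in uniques_in_part(part_events).items():
            totals[key] += count
    return dict(totals)


log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
sharded_totals = total_uniques(log)
```

The same structure also shows where the listed issues come from: a skewed ID distribution makes one part much larger than the others, and a missing part makes the sum incomplete.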
  • 19. Enter the Elephant
      §  Why Hadoop?
          –  Scalable
          –  Low risk to lose data due to replication
          –  Run on a shared production cluster
          –  No overhead to maintain framework
          –  Easy job submission and management
  • 20. Basic Approach
      §  Leverage Pig for POC
          –  Pig Latin is easy for developers and data analysts to learn
          –  Rapid application development vs. M/R applications (i.e. 1 line of Pig Latin = 20 lines in Java Map/Reduce)
          –  Extendable via UDFs
  • 21. Performance of Basic Approach on Various Samples
      [Chart: "Aggregation Performance". Y-axis: Time (minutes), 0 to 80. X-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%).]
      Note: Target data size is over 10 TB
  • 22. M/R Data Flow
      [Diagram: unsorted input keys (B, C, A, B, C, A) go to three mappers; the shuffle groups them by key (A A, B B, C C); three reducers emit A, B, C.]
  • 23. Basic Approach Retrospective
      §  Processing speed is not scaling to our needs on a sample of the input data
      §  Diagnosis
          –  Most aggregations could not take significant advantage of combiners. Not a Pig issue.
          –  Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster compared to the current architecture
      §  Diagnosis
          –  A new approach is required to reduce the shuffle
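The slide does not spell out why combiners did not help. One plausible reading, sketched here with made-up cookie IDs and hypothetical function names, is that event counts combine by simple addition while distinct counts do not: two map outputs can share cookies, so partial unique counts cannot be summed.

```python
def combine_event_counts(partial_counts):
    """Event counts are additive, so a combiner can collapse them to one number."""
    return sum(partial_counts)


def combine_unique_counts_naively(partial_unique_counts):
    """Summing partial distinct counts double-counts cookies seen in several splits."""
    return sum(partial_unique_counts)


# Two hypothetical input splits for the same key, with overlapping cookies.
split_1 = ["c1", "c2", "c3"]
split_2 = ["c2", "c3", "c4"]

total_events = combine_event_counts([len(split_1), len(split_2)])
naive_uniques = combine_unique_counts_naively([len(set(split_1)), len(set(split_2))])
true_uniques = len(set(split_1) | set(split_2))
```

In this example `total_events` is 6 and correct, `naive_uniques` is also 6 but wrong, and `true_uniques` is 4. To stay correct, a combiner would have to pass the distinct cookie IDs themselves to the reducer, so the shuffle stays close to its original size.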
  • 24. Solution to reduce the shuffle
      §  The Problem:
          –  Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
      §  The Idea:
          –  Partition and sort data on a daily basis
          –  Create a custom input format to merge daily partitions for monthly aggregations
  • 25. Custom Input Format with Map Side Aggregation
      [Diagram: unsorted input keys (B, C, A, B, C, A) are partitioned and sorted by key (A, B, C); each partition feeds its own mapper, then a combiner, then a reducer; output A, B, C.]
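The merge step can be illustrated in plain Python. This is not a Hadoop `InputFormat` and not the author's implementation; it is a sketch of the underlying idea, assuming each daily partition holds `(key, cookie)` pairs already sorted by key and then cookie. The function names and sample days are hypothetical. A k-way merge of sorted daily data yields a sorted monthly stream, so counts can be finished in one pass on the map side.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_daily_partitions(daily_partitions):
    """Lazily k-way merge daily partitions that are each sorted by (key, cookie)."""
    return heapq.merge(*daily_partitions)


def monthly_uniques(daily_partitions):
    """Yield (key, event_count, unique_cookie_count) for the merged month."""
    merged = merge_daily_partitions(daily_partitions)
    for key, group in groupby(merged, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous_cookie = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous_cookie:
                uniques += 1
                previous_cookie = cookie
        yield key, events, uniques


# Three hypothetical days of one partition, each sorted when it was written.
day_1 = [("A", "c1"), ("A", "c2"), ("B", "c1")]
day_2 = [("A", "c1"), ("B", "c1"), ("C", "c3")]
day_3 = [("A", "c3"), ("D", "c2")]
monthly = list(monthly_uniques([day_1, day_2, day_3]))
```

The sort cost is paid once per day rather than once per month, and the merge never holds more than one record per daily partition in memory. Only the small per-key results would then need to be shuffled.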
  • 26. Performance of v2 on Various Samples
      [Chart: "Aggregation Performance". Y-axis: Time (minutes), 0 to 120. X-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%), 10304 GB (100%). Series: Pig, Custom Input Format.]
  • 27. Partitioning Summary
      §  Benefits:
          –  A large portion of the aggregation can be completed in the map phase
          –  Applications can now take advantage of combiners
          –  Shuffle sizes are minimal
      §  Risks:
          –  Data locality loss
          –  Map failures might result in long run times. This is dependent on the size of the partitions
  • 28. Full Sample Performance
      §  Full set of data analysis
          –  10 TB of input data
          –  150 billion session rows
      §  Total Time
          –  1 hour, 45 minutes
          –  Over 23,000,000 rows/sec
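The throughput figure follows directly from the two numbers on the slide. The short calculation below reproduces it; the helper name is made up for the example, and the comparison uses the 8 million rows/sec quoted earlier for Windows v2.

```python
def rows_per_second(rows, hours=0, minutes=0):
    """Average throughput for a job that processed `rows` in the given wall time."""
    seconds = hours * 3600 + minutes * 60
    return rows / seconds


# 150 billion session rows in 1 hour 45 minutes (6,300 seconds).
hadoop_rate = rows_per_second(150_000_000_000, hours=1, minutes=45)

# Ratio against the "over 8 million rows/sec" quoted for Windows v2.
speedup_vs_v2 = hadoop_rate / 8_000_000

print(f"{hadoop_rate:,.0f} rows/sec, about {speedup_vs_v2:.1f}x the v2 rate")
```

The result is roughly 23.8 million rows/sec, consistent with the slide's "over 23,000,000", and about three times the v2 rate. The two systems ran on different hardware (32 servers versus an 80-node cluster), so the ratio is not a per-node comparison.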
  • 29. Future Ideas
      §  HBase
          –  Unique cookie calculations are free as data is more organized
          –  How will data loading fare?
      §  Data Locality
          –  Ideally it would be great to provide additional clues to the storage of the data
          –  Not sure if it will be included in Hadoop
      §  Connection to an MPP DB
          –  We also leverage Greenplum DB; we could connect to each sharded instance
  • 30. Hadoop Cluster
      §  Production Hadoop Cluster
          –  80 nodes: mix of Dell R710 and R510
          –  Each R510 has 12x2TB drives, 64GB RAM, 24 cores
          –  1768 total CPUs
          –  4.7TB total memory
          –  1200TB total disk space
          –  Our distro is MapR M5 1.2.7
  • 31. Useful Factoids: Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
  • 32. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com