SlideShare a Scribd company logo
1 of 40
Download to read offline
Facebook’s Petabyte Scale Data
       Warehouse using Hive and Hadoop




Wednesday, January 27, 2010
Why Another Data Warehousing System?




                               Data, data and more data
                                    200GB per day in March 2008
                              12+TB(compressed) raw data per day today




Wednesday, January 27, 2010
Trends Leading to More Data




Wednesday, January 27, 2010
Trends Leading to More Data




                              Free or low cost of user services




Wednesday, January 27, 2010
Trends Leading to More Data




                              Free or low cost of user services



                Realization that more insights are derived from
                simple algorithms on more data




Wednesday, January 27, 2010
Deficiencies of Existing Technologies




Wednesday, January 27, 2010
Deficiencies of Existing Technologies



         Cost of Analysis and Storage on proprietary systems
         does not support trends towards more data




Wednesday, January 27, 2010
Deficiencies of Existing Technologies



         Cost of Analysis and Storage on proprietary systems
         does not support trends towards more data



                   Limited Scalability does not support trends
                   towards more data




Wednesday, January 27, 2010
Deficiencies of Existing Technologies



         Cost of Analysis and Storage on proprietary systems
         does not support trends towards more data



                   Limited Scalability does not support trends
                   towards more data


                              Closed and Proprietary Systems



Wednesday, January 27, 2010
Lets try Hadoop…

   Pros
        – Superior in availability/scalability/manageability
        – Efficiency not that great, but throw more hardware
        – Partial Availability/resilience/scale more important than ACID


   Cons: Programmability and Metadata
        – Map-reduce hard to program (users know sql/bash/python)
        – Need to publish data in well known schemas

   Solution: HIVE




Wednesday, January 27, 2010
What is HIVE?

   A system for managing and querying structured data built
    on top of Hadoop
        – Map-Reduce for execution
        – HDFS for storage
        – Metadata in an RDBMS


   Key Building Principles:
        –   SQL as a familiar data warehousing tool
        –   Extensibility – Types, Functions, Formats, Scripts
        –   Scalability and Performance
        –   Interoperability




Wednesday, January 27, 2010
Why SQL on Hadoop?
  hive> select key, count(1) from kv1 where key > 100 group by
      key;

  vs.

  $ cat > /tmp/reducer.sh
  uniq -c | awk '{print $2"t"$1}‘
  $ cat > /tmp/map.sh
  awk -F '001' '{if($1 > 100) print $1}‘
  $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -
      mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/
      largekey -numReduceTasks 1
  $ bin/hadoop dfs –cat /tmp/largekey/part*




Wednesday, January 27, 2010
Data Flow Architecture at Facebook




Wednesday, January 27, 2010
Data Flow Architecture at Facebook




        Web Servers




Wednesday, January 27, 2010
Data Flow Architecture at Facebook




        Web Servers           Scribe MidTier




Wednesday, January 27, 2010
Data Flow Architecture at Facebook


                                                     Filers

        Web Servers           Scribe MidTier




                                               Scribe-Hadoop
                                               Cluster




Wednesday, January 27, 2010
Data Flow Architecture at Facebook


                                                     Filers

        Web Servers           Scribe MidTier




                                               Scribe-Hadoop
                                               Cluster




                                                Federated MySQL

Wednesday, January 27, 2010
Data Flow Architecture at Facebook


                                                                     Filers

        Web Servers                     Scribe MidTier




                                                               Scribe-Hadoop
                                                               Cluster




                              Production Hive-Hadoop Cluster
                                                                Federated MySQL

Wednesday, January 27, 2010
Data Flow Architecture at Facebook


                                                                     Filers

        Web Servers                     Scribe MidTier




                                                               Scribe-Hadoop
                                                               Cluster




                              Production Hive-Hadoop Cluster
          Oracle RAC                                            Federated MySQL

Wednesday, January 27, 2010
Data Flow Architecture at Facebook


                                                                      Filers

        Web Servers                      Scribe MidTier




                        Hive
                        replication                             Scribe-Hadoop
                                                                Cluster




                               Production Hive-Hadoop Cluster
          Oracle RAC                                             Federated MySQL

Wednesday, January 27, 2010
Data Flow Architecture at Facebook


                                                                      Filers

        Web Servers                      Scribe MidTier




                        Hive
                        replication                             Scribe-Hadoop
                                                                Cluster
 Adhoc Hive-Hadoop Cluster




                               Production Hive-Hadoop Cluster
          Oracle RAC                                             Federated MySQL

Wednesday, January 27, 2010
Scribe & Hadoop Clusters @ Facebook

   Used to log data from web servers

   Clusters collocated with the web servers

   Network is the biggest bottleneck

   Typical cluster has about 50 nodes.

   Stats:
        – ~ 25TB/day of raw data logged
        – 99% of the time data is available within 20 seconds



Wednesday, January 27, 2010
Hadoop & Hive Cluster @ Facebook

   Hadoop/Hive cluster
        –   8400 cores
        –   Raw Storage capacity ~ 12.5PB
        –   8 cores + 12 TB per node
        –   32 GB RAM per node
        –   Two level network topology
               1 Gbit/sec from node to rack switch
               4 Gbit/sec to top level rack switch


   2 clusters
        – One for adhoc users
        – One for strict SLA jobs




Wednesday, January 27, 2010
Hive & Hadoop Usage @ Facebook

   Statistics per day:
     – 12 TB of compressed new data added per day
     – 135TB of compressed data scanned per day
     – 7500+ Hive jobs per day
     – 80K compute hours per day

   Hive simplifies Hadoop:
    – New engineers go though a Hive training session
    – ~200 people/month run jobs on Hadoop/Hive
    – Analysts (non-engineers) use Hadoop through Hive
    – Most of jobs are Hive Jobs



Wednesday, January 27, 2010
Hive & Hadoop Usage @ Facebook

   Types of Applications:
        – Reporting
               Eg: Daily/Weekly aggregations of impression/click counts
               Measures of user engagement
               Microstrategy reports

        – Ad hoc Analysis
               Eg: how many group admins broken down by state/country

        – Machine Learning (Assembling training data)
               Ad Optimization
               Eg: User Engagement as a function of user attributes

        – Many others


Wednesday, January 27, 2010
More about HIVE




Wednesday, January 27, 2010
Data Model


                              Name                        HDFS Directory



   Table                      pvs                         /wh/pvs



   Partition                  ds = 20090801, ctry = US    /wh/pvs/ds=20090801/ctry=US




                                                          /wh/pvs/ds=20090801/ctry=US/
   Bucket                     user into 32 buckets
                                                          part-00000
                              HDFS file for user hash 0




Wednesday, January 27, 2010
Hive Query Language
   SQL
        –   Sub-queries in from clause
        –   Equi-joins (including Outer joins)
        –   Multi-table Insert
        –   Multi-group-by
        –   Embedding Custom Map/Reduce in SQL
   Sampling
   Primitive Types
        – integer types, float, string, boolean
   Nestable Collections
        – array<any-type> and map<primitive-type, any-type>
   User-defined types
        – Structures with attributes which can be of any-type


Wednesday, January 27, 2010
Optimizations
 Joins try to reduce the number of map/reduce jobs needed.
 Memory efficient joins by streaming largest tables.
 Map Joins
     – User specified small tables stored in hash tables on the mapper
     – No reducer needed

 Map side partial aggregations
     – Hash-based aggregates
     – Serialized key/values in hash tables
     – 90% speed improvement on Query
             SELECT count(1) FROM t;
 Load balancing for data skew



Wednesday, January 27, 2010
Hive: Open & Extensible

   Different on-disk storage(file) formats
        – Text File, Sequence File, …
   Different serialization formats and data types
        – LazySimpleSerDe, ThriftSerDe …
   User-provided map/reduce scripts
        – In any language, use stdin/stdout to transfer data …
   User-defined Functions
        – Substr, Trim, From_unixtime …
   User-defined Aggregation Functions
        – Sum, Average …
   User-define Table Functions
        – Explode …



Wednesday, January 27, 2010
Existing File Formats
                              TEXTFILE     SEQUENCEFILE   RCFILE

     Data type                text only    text/binary    text/binary

     Internal
                              Row-based    Row-based      Column-based
     Storage order

     Compression              File-based   Block-based    Block-based

     Splitable*               YES          YES            YES

     Splitable* after
                              NO           YES            YES
     compression

 * Splitable: Capable of splitting the file so that a single huge
      file can be processed by multiple mappers in parallel.


Wednesday, January 27, 2010
Map/Reduce Scripts Examples

     add file page_url_to_id.py;
     add file my_python_session_cutter.py;
     FROM
        (MAP uhash, page_url, unix_time
           USING 'page_url_to_id.py'
           AS (uhash, page_id, unix_time)
         FROM mylog
         DISTRIBUTE BY uhash
         SORT BY uhash, unix_time) mylog2
      REDUCE uhash, page_id, unix_time
        USING 'my_python_session_cutter.py'
        AS (uhash, session_info);




Wednesday, January 27, 2010
UDF Example
      add jar build/ql/test/test-udfs.jar;
      CREATE TEMPORARY FUNCTION testlength AS
       'org.apache.hadoop.hive.ql.udf.UDFTestLength';
      SELECT testlength(page_url) FROM mylog;
      DROP TEMPORARY FUNCTION testlength;


   UDFTestLength.java:
  package org.apache.hadoop.hive.ql.udf;
  public class UDFTestLength extends UDF {
    public Integer evaluate(String s) {
      if (s == null) {
        return null;
      }
      return s.length();
    }
  }

Wednesday, January 27, 2010
Comparison of UDF/UDAF/UDTF v.s. M/R scripts

                              UDF/UDAF/UDTF        M/R scripts

     language                 Java                 any language

     data format              in-memory objects    serialized streams

     1/1 input/output         supported via UDF    supported

     n/1 input/output         supported via UDAF   supported

     1/n input/output         supported via UDTF   supported

     Speed                    faster               slower




Wednesday, January 27, 2010
Interoperability: Interfaces

   JDBC
        – Enables integration with JDBC based SQL clients
   ODBC
        – Enables integration with Microstrategy
   Thrift
        – Enables writing cross language clients
        – Main form of integration with php based Web UI




Wednesday, January 27, 2010
Interoperability: Microstrategy

   Beta integration with version 8
   Free form SQL support
   Periodically pre-compute the cube




Wednesday, January 27, 2010
Operational Aspects on Adhoc cluster

   Data Discovery
        – coHive
               Discover tables
               Talk to expert users of a table
               Browse table lineage


   Monitoring
        – Resource utilization by individual, project, group
        – SLA monitoring etc.
        – Bad user reports etc.




Wednesday, January 27, 2010
HiPal & CoHive (Not open source)




Wednesday, January 27, 2010
Open Source Community

   Released Hive-0.4 on 10/13/2009
   50 contributors and growing
   11 committers
        – 3 external to Facebook
   Available as a sub project in Hadoop
        -   http://wiki.apache.org/hadoop/Hive (wiki)
        -   http://hadoop.apache.org/hive (home page)
        -   http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
        -   ##hive (IRC)
        -   Works with hadoop-0.17, 0.18, 0.19, 0.20
   Mailing Lists:
        – hive-{user,dev,commits}@hadoop.apache.org



Wednesday, January 27, 2010
Powered by Hive




Wednesday, January 27, 2010

More Related Content

What's hot

Power BI Architecture
Power BI ArchitecturePower BI Architecture
Power BI ArchitectureArthur Graus
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Data science capabilities
Data science capabilitiesData science capabilities
Data science capabilitiesYann Lecourt
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Data Analytics Strategy
Data Analytics StrategyData Analytics Strategy
Data Analytics StrategyeHealthCareers
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache SqoopAvkash Chauhan
 
Sap bw 4 hana vs sap bw on hana
Sap bw 4 hana vs sap bw on hanaSap bw 4 hana vs sap bw on hana
Sap bw 4 hana vs sap bw on hanaJasbir Khanuja
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...Databricks
 
Hadoop Hbase - Introduction
Hadoop Hbase - IntroductionHadoop Hbase - Introduction
Hadoop Hbase - IntroductionBlandine Larbret
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Large Enterprise - Best Practices - Developing a CoE (1).pdf
Large Enterprise - Best Practices - Developing a CoE (1).pdfLarge Enterprise - Best Practices - Developing a CoE (1).pdf
Large Enterprise - Best Practices - Developing a CoE (1).pdfZiadAlsukairy
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 

What's hot (20)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Sap bw4 hana
Sap bw4 hanaSap bw4 hana
Sap bw4 hana
 
Sqoop
SqoopSqoop
Sqoop
 
Power BI Architecture
Power BI ArchitecturePower BI Architecture
Power BI Architecture
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Data science capabilities
Data science capabilitiesData science capabilities
Data science capabilities
 
Big Data Proof of Concept
Big Data Proof of ConceptBig Data Proof of Concept
Big Data Proof of Concept
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Data Analytics Strategy
Data Analytics StrategyData Analytics Strategy
Data Analytics Strategy
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
Sap bw 4 hana vs sap bw on hana
Sap bw 4 hana vs sap bw on hanaSap bw 4 hana vs sap bw on hana
Sap bw 4 hana vs sap bw on hana
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...
 
Hadoop Hbase - Introduction
Hadoop Hbase - IntroductionHadoop Hbase - Introduction
Hadoop Hbase - Introduction
 
Big Data Integration
Big Data IntegrationBig Data Integration
Big Data Integration
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Large Enterprise - Best Practices - Developing a CoE (1).pdf
Large Enterprise - Best Practices - Developing a CoE (1).pdfLarge Enterprise - Best Practices - Developing a CoE (1).pdf
Large Enterprise - Best Practices - Developing a CoE (1).pdf
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 

Viewers also liked

Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 

Viewers also liked (16)

Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 

Similar to Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summitdrewz lin
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Facebook hadoop-summit
Facebook hadoop-summitFacebook hadoop-summit
Facebook hadoop-summitS S
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlKhanderao Kand
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 

Similar to Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop (20)

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summit
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Facebook hadoop-summit
Facebook hadoop-summitFacebook hadoop-summit
Facebook hadoop-summit
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 

More from royans

Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processingroyans
 
Web20expo Filesystems
Web20expo FilesystemsWeb20expo Filesystems
Web20expo Filesystemsroyans
 
Flickr Services
Flickr ServicesFlickr Services
Flickr Servicesroyans
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archroyans
 
Flickr Php
Flickr PhpFlickr Php
Flickr Phproyans
 
Grid – Distributed Computing at Scale
Grid – Distributed Computing at ScaleGrid – Distributed Computing at Scale
Grid – Distributed Computing at Scaleroyans
 
How Typepad changed their architecture without taking down the service
How Typepad changed their architecture without taking down the serviceHow Typepad changed their architecture without taking down the service
How Typepad changed their architecture without taking down the serviceroyans
 
Dmk Bo2 K7 Web
Dmk Bo2 K7 WebDmk Bo2 K7 Web
Dmk Bo2 K7 Webroyans
 
21 Www Web Services
21 Www Web Services21 Www Web Services
21 Www Web Servicesroyans
 
Web20expo Filesystems
Web20expo FilesystemsWeb20expo Filesystems
Web20expo Filesystemsroyans
 
Web Design World Flickr
Web Design World FlickrWeb Design World Flickr
Web Design World Flickrroyans
 
Flickr Services
Flickr ServicesFlickr Services
Flickr Servicesroyans
 
Filesystems
FilesystemsFilesystems
Filesystemsroyans
 
Scalable Web Arch
Scalable Web ArchScalable Web Arch
Scalable Web Archroyans
 
Web 2.0 Summit Flickr
Web 2.0 Summit FlickrWeb 2.0 Summit Flickr
Web 2.0 Summit Flickrroyans
 
Web20expo Filesystems
Web20expo FilesystemsWeb20expo Filesystems
Web20expo Filesystemsroyans
 
Etech2005
Etech2005Etech2005
Etech2005royans
 

More from royans (17)

Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processing
 
Web20expo Filesystems
Web20expo FilesystemsWeb20expo Filesystems
Web20expo Filesystems
 
Flickr Services
Flickr ServicesFlickr Services
Flickr Services
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Flickr Php
Flickr PhpFlickr Php
Flickr Php
 
Grid – Distributed Computing at Scale
Grid – Distributed Computing at ScaleGrid – Distributed Computing at Scale
Grid – Distributed Computing at Scale
 
How Typepad changed their architecture without taking down the service
How Typepad changed their architecture without taking down the serviceHow Typepad changed their architecture without taking down the service
How Typepad changed their architecture without taking down the service
 
Dmk Bo2 K7 Web
Dmk Bo2 K7 WebDmk Bo2 K7 Web
Dmk Bo2 K7 Web
 
21 Www Web Services
21 Www Web Services21 Www Web Services
21 Www Web Services
 
Web20expo Filesystems
Web20expo FilesystemsWeb20expo Filesystems
Web20expo Filesystems
 
Web Design World Flickr
Web Design World FlickrWeb Design World Flickr
Web Design World Flickr
 
Flickr Services
Flickr ServicesFlickr Services
Flickr Services
 
Filesystems
FilesystemsFilesystems
Filesystems
 
Scalable Web Arch
Scalable Web ArchScalable Web Arch
Scalable Web Arch
 
Web 2.0 Summit Flickr
Web 2.0 Summit FlickrWeb 2.0 Summit Flickr
Web 2.0 Summit Flickr
 
Web20expo Filesystems
Web20expo FilesystemsWeb20expo Filesystems
Web20expo Filesystems
 
Etech2005
Etech2005Etech2005
Etech2005
 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop

  • 1. Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop Wednesday, January 27, 2010
  • 2. Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today Wednesday, January 27, 2010
  • 3. Trends Leading to More Data Wednesday, January 27, 2010
  • 4. Trends Leading to More Data Free or low cost of user services Wednesday, January 27, 2010
  • 5. Trends Leading to More Data Free or low cost of user services Realization that more insights are derived from simple algorithms on more data Wednesday, January 27, 2010
  • 6. Deficiencies of Existing Technologies Wednesday, January 27, 2010
  • 7. Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Wednesday, January 27, 2010
  • 8. Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Limited Scalability does not support trends towards more data Wednesday, January 27, 2010
  • 9. Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Limited Scalability does not support trends towards more data Closed and Proprietary Systems Wednesday, January 27, 2010
  • 10. Lets try Hadoop…  Pros – Superior in availability/scalability/manageability – Efficiency not that great, but throw more hardware – Partial Availability/resilience/scale more important than ACID  Cons: Programmability and Metadata – Map-reduce hard to program (users know sql/bash/python) – Need to publish data in well known schemas  Solution: HIVE Wednesday, January 27, 2010
  • 11. What is HIVE?  A system for managing and querying structured data built on top of Hadoop – Map-Reduce for execution – HDFS for storage – Metadata in an RDBMS  Key Building Principles: – SQL as a familiar data warehousing tool – Extensibility – Types, Functions, Formats, Scripts – Scalability and Performance – Interoperability Wednesday, January 27, 2010
  • 12. Why SQL on Hadoop? hive> select key, count(1) from kv1 where key > 100 group by key; vs. $ cat > /tmp/reducer.sh uniq -c | awk '{print $2"t"$1}‘ $ cat > /tmp/map.sh awk -F '001' '{if($1 > 100) print $1}‘ $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 - mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/ largekey -numReduceTasks 1 $ bin/hadoop dfs –cat /tmp/largekey/part* Wednesday, January 27, 2010
  • 13. Data Flow Architecture at Facebook Wednesday, January 27, 2010
  • 14. Data Flow Architecture at Facebook Web Servers Wednesday, January 27, 2010
  • 15. Data Flow Architecture at Facebook Web Servers Scribe MidTier Wednesday, January 27, 2010
  • 16. Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Wednesday, January 27, 2010
  • 17. Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Federated MySQL Wednesday, January 27, 2010
  • 18. Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Production Hive-Hadoop Cluster Federated MySQL Wednesday, January 27, 2010
  • 19. Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Scribe-Hadoop Cluster Production Hive-Hadoop Cluster Oracle RAC Federated MySQL Wednesday, January 27, 2010
  • 20. Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Hive replication Scribe-Hadoop Cluster Production Hive-Hadoop Cluster Oracle RAC Federated MySQL Wednesday, January 27, 2010
  • 21. Data Flow Architecture at Facebook Filers Web Servers Scribe MidTier Hive replication Scribe-Hadoop Cluster Adhoc Hive-Hadoop Cluster Production Hive-Hadoop Cluster Oracle RAC Federated MySQL Wednesday, January 27, 2010
  • 22. Scribe & Hadoop Clusters @ Facebook  Used to log data from web servers  Clusters collocated with the web servers  Network is the biggest bottleneck  Typical cluster has about 50 nodes.  Stats: – ~ 25TB/day of raw data logged – 99% of the time data is available within 20 seconds Wednesday, January 27, 2010
  • 23. Hadoop & Hive Cluster @ Facebook  Hadoop/Hive cluster – 8400 cores – Raw Storage capacity ~ 12.5PB – 8 cores + 12 TB per node – 32 GB RAM per node – Two level network topology  1 Gbit/sec from node to rack switch  4 Gbit/sec to top level rack switch  2 clusters – One for adhoc users – One for strict SLA jobs Wednesday, January 27, 2010
  • 24. Hive & Hadoop Usage @ Facebook  Statistics per day: – 12 TB of compressed new data added per day – 135TB of compressed data scanned per day – 7500+ Hive jobs per day – 80K compute hours per day  Hive simplifies Hadoop: – New engineers go though a Hive training session – ~200 people/month run jobs on Hadoop/Hive – Analysts (non-engineers) use Hadoop through Hive – Most of jobs are Hive Jobs Wednesday, January 27, 2010
  • 25. Hive & Hadoop Usage @ Facebook  Types of Applications: – Reporting  Eg: Daily/Weekly aggregations of impression/click counts  Measures of user engagement  Microstrategy reports – Ad hoc Analysis  Eg: how many group admins broken down by state/country – Machine Learning (Assembling training data)  Ad Optimization  Eg: User Engagement as a function of user attributes – Many others Wednesday, January 27, 2010
  • 26. More about HIVE Wednesday, January 27, 2010
  • 27. Data Model Name HDFS Directory Table pvs /wh/pvs Partition ds = 20090801, ctry = US /wh/pvs/ds=20090801/ctry=US /wh/pvs/ds=20090801/ctry=US/ Bucket user into 32 buckets part-00000 HDFS file for user hash 0 Wednesday, January 27, 2010
  • 28. Hive Query Language  SQL – Sub-queries in from clause – Equi-joins (including Outer joins) – Multi-table Insert – Multi-group-by – Embedding Custom Map/Reduce in SQL  Sampling  Primitive Types – integer types, float, string, boolean  Nestable Collections – array<any-type> and map<primitive-type, any-type>  User-defined types – Structures with attributes which can be of any-type Wednesday, January 27, 2010
  • 29. Optimizations  Joins try to reduce the number of map/reduce jobs needed.  Memory efficient joins by streaming largest tables.  Map Joins – User specified small tables stored in hash tables on the mapper – No reducer needed  Map side partial aggregations – Hash-based aggregates – Serialized key/values in hash tables – 90% speed improvement on Query  SELECT count(1) FROM t;  Load balancing for data skew Wednesday, January 27, 2010
  • 30. Hive: Open & Extensible  Different on-disk storage(file) formats – Text File, Sequence File, …  Different serialization formats and data types – LazySimpleSerDe, ThriftSerDe …  User-provided map/reduce scripts – In any language, use stdin/stdout to transfer data …  User-defined Functions – Substr, Trim, From_unixtime …  User-defined Aggregation Functions – Sum, Average …  User-define Table Functions – Explode … Wednesday, January 27, 2010
  • 31. Existing File Formats TEXTFILE SEQUENCEFILE RCFILE Data type text only text/binary text/binary Internal Row-based Row-based Column-based Storage order Compression File-based Block-based Block-based Splitable* YES YES YES Splitable* after NO YES YES compression * Splitable: Capable of splitting the file so that a single huge file can be processed by multiple mappers in parallel. Wednesday, January 27, 2010
  • 32. Map/Reduce Scripts Examples  add file page_url_to_id.py;  add file my_python_session_cutter.py;  FROM (MAP uhash, page_url, unix_time USING 'page_url_to_id.py' AS (uhash, page_id, unix_time) FROM mylog DISTRIBUTE BY uhash SORT BY uhash, unix_time) mylog2 REDUCE uhash, page_id, unix_time USING 'my_python_session_cutter.py' AS (uhash, session_info); Wednesday, January 27, 2010
  • 33. UDF Example  add jar build/ql/test/test-udfs.jar;  CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';  SELECT testlength(page_url) FROM mylog;  DROP TEMPORARY FUNCTION testlength;  UDFTestLength.java: package org.apache.hadoop.hive.ql.udf; public class UDFTestLength extends UDF { public Integer evaluate(String s) { if (s == null) { return null; } return s.length(); } } Wednesday, January 27, 2010
  • 34. Comparison of UDF/UDAF/UDTF v.s. M/R scripts UDF/UDAF/UDTF M/R scripts language Java any language data format in-memory objects serialized streams 1/1 input/output supported via UDF supported n/1 input/output supported via UDAF supported 1/n input/output supported via UDTF supported Speed faster slower Wednesday, January 27, 2010
  • 35. Interoperability: Interfaces  JDBC – Enables integration with JDBC based SQL clients  ODBC – Enables integration with Microstrategy  Thrift – Enables writing cross language clients – Main form of integration with php based Web UI Wednesday, January 27, 2010
  • 36. Interoperability: Microstrategy  Beta integration with version 8  Free form SQL support  Periodically pre-compute the cube Wednesday, January 27, 2010
  • 37. Operational Aspects on Adhoc cluster  Data Discovery – coHive  Discover tables  Talk to expert users of a table  Browse table lineage  Monitoring – Resource utilization by individual, project, group – SLA monitoring etc. – Bad user reports etc. Wednesday, January 27, 2010
  • 38. HiPal & CoHive (Not open source) Wednesday, January 27, 2010
  • 39. Open Source Community  Released Hive-0.4 on 10/13/2009  50 contributors and growing  11 committers – 3 external to Facebook  Available as a sub project in Hadoop - http://wiki.apache.org/hadoop/Hive (wiki) - http://hadoop.apache.org/hive (home page) - http://svn.apache.org/repos/asf/hadoop/hive (SVN repo) - ##hive (IRC) - Works with hadoop-0.17, 0.18, 0.19, 0.20  Mailing Lists: – hive-{user,dev,commits}@hadoop.apache.org Wednesday, January 27, 2010
  • 40. Powered by Hive Wednesday, January 27, 2010