SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
Hadoop and Hive Development at
   Facebook

Dhruba Borthakur Zheng Shao
{dhruba, zshao}@facebook.com
Presented at Hadoop World, New York
October 2, 2009
Hadoop @ Facebook
Who generates this data?

  Lots of data is generated on Facebook
    –  300+ million active users
    –  30 million users update their statuses at least once each
       day
    –  More than 1 billion photos uploaded each month
    –  More than 10 million videos uploaded each month
    –  More than 1 billion pieces of content (web links, news
       stories, blog posts, notes, photos, etc.) shared each
       week
Data Usage

  Statistics per day:
    –  4 TB of compressed new data added per day
    –  135TB of compressed data scanned per day
    –  7500+ Hive jobs on production cluster per day
    –  80K compute hours per day
  Barrier to entry is significantly reduced:
    –  New engineers go though a Hive training session
    –  ~200 people/month run jobs on Hadoop/Hive
    –  Analysts (non-engineers) use Hadoop through Hive
Where is this data stored?

  Hadoop/Hive Warehouse
  –  4800 cores, 5.5 PetaBytes
  –  12 TB per node
  –  Two level network topology
       1 Gbit/sec from node to rack switch
       4 Gbit/sec to top level rack switch
Data Flow into Hadoop Cloud
                                                     Network

                                                     Storage

                                                     and

                                                     Servers

    Web
Servers

                        Scribe
MidTier





     Oracle
RAC
   Hadoop
Hive
Warehouse
   MySQL

Hadoop Scribe: Avoid Costly Filers
                                                          Scribe
Writers


                                                                      RealBme

                                                                      Hadoop

                                                                      Cluster

    Web
Servers
               Scribe
MidTier





     Oracle
RAC
         Hadoop
Hive
Warehouse
                    MySQL

    http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
HDFS Raid

  Start the same: triplicate
   every data block
                                                 A               B                C
  Background encoding
   –  Combine third replica of                   A               B                C
      blocks from a single file to
      create parity block                        A               B                C
   –  Remove third replica
   –  Apache JIRA HDFS-503                                     A+B+C
  DiskReduce from CMU
   –  Garth Gibson research                     A file with three blocks A, B and C

   http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
Archival: Move old data to cheap storage

Hadoop
Warehouse


                                                 NFS




               Hadoop
Archive
Node
                     Cheap
NAS




                                 Hadoop
Archival
Cluster



                    hEp://issues.apache.org/jira/browse/HDFS‐220

 Hive
Query

Dynamic-size MapReduce Clusters

  Why multiple compute clouds in Facebook?
    –  Users unaware of resources needed by job
    –  Absence of flexible Job Isolation techniques
    –  Provide adequate SLAs for jobs
  Dynamically move nodes between clusters
    –  Based on load and configured policies
    –  Apache Jira MAPREDUCE-1044
Resource Aware Scheduling (Fair Share
Scheduler)
  We use the Hadoop Fair Share Scheduler
    –  Scheduler unaware of memory needed by job
  Memory and CPU aware scheduling
    –  RealTime gathering of CPU and memory usage
    –  Scheduler analyzes memory consumption in realtime
    –  Scheduler fair-shares memory usage among jobs
    –  Slot-less scheduling of tasks (in future)
    –  Apache Jira MAPREDUCE-961
Hive – Data Warehouse

  Efficient SQL to Map-Reduce Compiler

  Mar 2008: Started at Facebook
  May 2009: Release 0.3.0 available
  Now: Preparing for release 0.4.0


  Countable for 95%+ of Hadoop jobs @ Facebook
  Used by ~200 engineers and business analysts at Facebook
   every month
Hive Architecture

  Web UI + Hive CLI + JDBC/                Map Reduce       HDFS
            ODBC                       User-defined
     Browse, Query, DDL              Map-reduce Scripts

  MetaStore               Hive QL              UDF/UDAF
                                                 substr
                     Parser                       sum
  Thrift API                                    average
                    Planner    Execution
                                                          FileFormats
                                                 SerDe
                   Optimizer                                TextFile
                                                  CSV     SequenceFile
                                                 Thrift      RCFile
                                                 Regex
Hive DDL

  DDL
   –  Complex columns
   –  Partitions
   –  Buckets
  Example
   –  CREATE TABLE sales (
        id INT,
        items ARRAY<STRUCT<id:INT, name:STRING>>,
        extra MAP<STRING, STRING>
      ) PARTITIONED BY (ds STRING)
      CLUSTERED BY (id) INTO 32 BUCKETS;
Hive Query Language

  SQL
   –    Where
   –    Group By
   –    Equi-Join
   –    Sub query in from clause
  Example
   –  SELECT r.*, s.*
      FROM r JOIN (
        SELECT key, count(1) as count
        FROM s
        GROUP BY key) s
      ON r.key = s.key
      WHERE s.count > 100;
Group By

  4 different plans based on:
  –  Does data have skew?
  –  partial aggregation
  Map-side hash aggregation
  –  In-memory hash table in mapper to do partial
     aggregations
  2-map-reduce aggregation
  –  For distinct queries with skew and large cardinality
Join

  Normal map-reduce Join
  –  Mapper sends all rows with the same key to a single
     reducer
  –  Reducer does the join
  Map-side Join
  –  Mapper loads the whole small table and a portion of big
     table
  –  Mapper does the join
  –  Much faster than map-reduce join
Sampling

  Efficient sampling
   –  Table can be bucketed
   –  Each bucket is a file
   –  Sampling can choose some buckets
  Example
   –  SELECT product_id, sum(price)
      FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
      GROUP BY product_id
Multi-table Group-By/Insert

FROM users
INSERT INTO TABLE pv_gender_sum
   SELECT gender, count(DISTINCT userid)
   GROUP BY gender
INSERT INTO
   DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
   SELECT age, count(DISTINCT userid)
   GROUP BY age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
   SELECT country, gender, count(DISTINCT userid)
   GROUP BY country, gender;
File Formats

  TextFile:
   –  Easy for other applications to write/read
   –  Gzip text files are not splittable
  SequenceFile:
   –  Only hadoop can read it
   –  Support splittable compression
  RCFile: Block-based columnar storage
   –    Use SequenceFile block format
   –    Columnar storage inside a block
   –    25% smaller compressed size
   –    On-par or better query performance depending on the query
SerDe

  Serialization/Deserialization
  Row Format
   –    CSV (LazySimpleSerDe)
   –    Thrift (ThriftSerDe)
   –    Regex (RegexSerDe)
   –    Hive Binary Format (LazyBinarySerDe)
  LazySimpleSerDe and LazyBinarySerDe
   –  Deserialize the field when needed
   –  Reuse objects across different rows
   –  Text and Binary format
UDF/UDAF

  Features:
   –    Use either Java or Hadoop Objects (int, Integer, IntWritable)
   –    Overloading
   –    Variable-length arguments
   –    Partial aggregation for UDAF
  Example UDF:
   –  public class UDFExampleAdd extends UDF {
        public int evaluate(int a, int b) {
          return a + b;
        }
      }
Hive – Performance

  Date      SVN Revision        Major Changes            Query A   Query B   Query C
2/22/2009      746906      Before Lazy Deserialization    83 sec    98 sec   183 sec
2/23/2009      747293         Lazy Deserialization        40 sec    66 sec   185 sec
3/6/2009       751166        Map-side Aggregation         22 sec    67 sec   182 sec
4/29/2009      770074             Object Reuse            21 sec    49 sec   130 sec
6/3/2009       781633           Map-side Join *           21 sec    48 sec   132 sec
8/5/2009       801497        Lazy Binary Format *         21 sec    48 sec   132 sec

        QueryA: SELECT count(1) FROM t;
        QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
        QueryC: SELECT * FROM t;
        map-side time only (incl. GzipCodec for comp/decompression)
        * These two features need to be tested with other queries.
Hive – Future Works

    Indexes
    Create table as select
    Views / variables
    Explode operator
    In/Exists sub queries
    Leverage sort/bucket information in Join

Weitere ähnliche Inhalte

Was ist angesagt?

report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivesiddharthboora
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSpraveen bhat
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizationsscottcrespo
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 

Was ist angesagt? (17)

report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
01 hbase
01 hbase01 hbase
01 hbase
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 

Ähnlich wie Hadoop and Hive Development at Facebook

Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 

Ähnlich wie Hadoop and Hive Development at Facebook (20)

Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
מיכאל
מיכאלמיכאל
מיכאל
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Apache drill
Apache drillApache drill
Apache drill
 

Kürzlich hochgeladen

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Hadoop and Hive Development at Facebook

  • 1. Hadoop and Hive Development at Facebook Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009
  • 3. Who generates this data?   Lots of data is generated on Facebook –  300+ million active users –  30 million users update their statuses at least once each day –  More than 1 billion photos uploaded each month –  More than 10 million videos uploaded each month –  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  • 4. Data Usage   Statistics per day: –  4 TB of compressed new data added per day –  135TB of compressed data scanned per day –  7500+ Hive jobs on production cluster per day –  80K compute hours per day   Barrier to entry is significantly reduced: –  New engineers go though a Hive training session –  ~200 people/month run jobs on Hadoop/Hive –  Analysts (non-engineers) use Hadoop through Hive
  • 5. Where is this data stored?   Hadoop/Hive Warehouse –  4800 cores, 5.5 PetaBytes –  12 TB per node –  Two level network topology   1 Gbit/sec from node to rack switch   4 Gbit/sec to top level rack switch
  • 6. Data Flow into Hadoop Cloud Network
 Storage
 and
 Servers
 Web
Servers
 Scribe
MidTier
 Oracle
RAC
 Hadoop
Hive
Warehouse
 MySQL

  • 7. Hadoop Scribe: Avoid Costly Filers Scribe
Writers
 RealBme
 Hadoop
 Cluster
 Web
Servers
 Scribe
MidTier
 Oracle
RAC
 Hadoop
Hive
Warehouse
 MySQL
 http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
  • 8. HDFS Raid   Start the same: triplicate every data block A B C   Background encoding –  Combine third replica of A B C blocks from a single file to create parity block A B C –  Remove third replica –  Apache JIRA HDFS-503 A+B+C   DiskReduce from CMU –  Garth Gibson research A file with three blocks A, B and C http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
  • 9. Archival: Move old data to cheap storage Hadoop
Warehouse
 NFS
 Hadoop
Archive
Node
 Cheap
NAS
 Hadoop
Archival
Cluster
 hEp://issues.apache.org/jira/browse/HDFS‐220
 Hive
Query

  • 10. Dynamic-size MapReduce Clusters   Why multiple compute clouds in Facebook? –  Users unaware of resources needed by job –  Absence of flexible Job Isolation techniques –  Provide adequate SLAs for jobs   Dynamically move nodes between clusters –  Based on load and configured policies –  Apache Jira MAPREDUCE-1044
  • 11. Resource Aware Scheduling (Fair Share Scheduler)   We use the Hadoop Fair Share Scheduler –  Scheduler unaware of memory needed by job   Memory and CPU aware scheduling –  RealTime gathering of CPU and memory usage –  Scheduler analyzes memory consumption in realtime –  Scheduler fair-shares memory usage among jobs –  Slot-less scheduling of tasks (in future) –  Apache Jira MAPREDUCE-961
  • 12. Hive – Data Warehouse   Efficient SQL to Map-Reduce Compiler   Mar 2008: Started at Facebook   May 2009: Release 0.3.0 available   Now: Preparing for release 0.4.0   Countable for 95%+ of Hadoop jobs @ Facebook   Used by ~200 engineers and business analysts at Facebook every month
  • 13. Hive Architecture Web UI + Hive CLI + JDBC/ Map Reduce HDFS ODBC User-defined Browse, Query, DDL Map-reduce Scripts MetaStore Hive QL UDF/UDAF substr Parser sum Thrift API average Planner Execution FileFormats SerDe Optimizer TextFile CSV SequenceFile Thrift RCFile Regex
  • 14. Hive DDL   DDL –  Complex columns –  Partitions –  Buckets   Example –  CREATE TABLE sales ( id INT, items ARRAY<STRUCT<id:INT, name:STRING>>, extra MAP<STRING, STRING> ) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
  • 15. Hive Query Language   SQL –  Where –  Group By –  Equi-Join –  Sub query in from clause   Example –  SELECT r.*, s.* FROM r JOIN ( SELECT key, count(1) as count FROM s GROUP BY key) s ON r.key = s.key WHERE s.count > 100;
  • 16. Group By   4 different plans based on: –  Does data have skew? –  partial aggregation   Map-side hash aggregation –  In-memory hash table in mapper to do partial aggregations   2-map-reduce aggregation –  For distinct queries with skew and large cardinality
  • 17. Join   Normal map-reduce Join –  Mapper sends all rows with the same key to a single reducer –  Reducer does the join   Map-side Join –  Mapper loads the whole small table and a portion of big table –  Mapper does the join –  Much faster than map-reduce join
  • 18. Sampling   Efficient sampling –  Table can be bucketed –  Each bucket is a file –  Sampling can choose some buckets   Example –  SELECT product_id, sum(price) FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) GROUP BY product_id
  • 19. Multi-table Group-By/Insert FROM users INSERT INTO TABLE pv_gender_sum SELECT gender, count(DISTINCT userid) GROUP BY gender INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir' SELECT age, count(DISTINCT userid) GROUP BY age INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir' SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
  • 20. File Formats   TextFile: –  Easy for other applications to write/read –  Gzip text files are not splittable   SequenceFile: –  Only hadoop can read it –  Support splittable compression   RCFile: Block-based columnar storage –  Use SequenceFile block format –  Columnar storage inside a block –  25% smaller compressed size –  On-par or better query performance depending on the query
  • 21. SerDe   Serialization/Deserialization   Row Format –  CSV (LazySimpleSerDe) –  Thrift (ThriftSerDe) –  Regex (RegexSerDe) –  Hive Binary Format (LazyBinarySerDe)   LazySimpleSerDe and LazyBinarySerDe –  Deserialize the field when needed –  Reuse objects across different rows –  Text and Binary format
  • 22. UDF/UDAF   Features: –  Use either Java or Hadoop Objects (int, Integer, IntWritable) –  Overloading –  Variable-length arguments –  Partial aggregation for UDAF   Example UDF: –  public class UDFExampleAdd extends UDF { public int evaluate(int a, int b) { return a + b; } }
  • 23. Hive – Performance Date SVN Revision Major Changes Query A Query B Query C 2/22/2009 746906 Before Lazy Deserialization 83 sec 98 sec 183 sec 2/23/2009 747293 Lazy Deserialization 40 sec 66 sec 185 sec 3/6/2009 751166 Map-side Aggregation 22 sec 67 sec 182 sec 4/29/2009 770074 Object Reuse 21 sec 49 sec 130 sec 6/3/2009 781633 Map-side Join * 21 sec 48 sec 132 sec 8/5/2009 801497 Lazy Binary Format * 21 sec 48 sec 132 sec   QueryA: SELECT count(1) FROM t;   QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;   QueryC: SELECT * FROM t;   map-side time only (incl. GzipCodec for comp/decompression)   * These two features need to be tested with other queries.
  • 24. Hive – Future Works   Indexes   Create table as select   Views / variables   Explode operator   In/Exists sub queries   Leverage sort/bucket information in Join