SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Hive -  Data Warehousing &   Analytics on Hadoop Wednesday, June 10, 2009  Santa Clara Marriott Namit Jain, Zheng Shao Facebook
Agenda ,[object Object],[object Object],[object Object],[object Object],Facebook
[object Object],Facebook
Why Another Data Warehousing System? ,[object Object],[object Object],[object Object],Facebook
 
Lets try Hadoop… ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Lets try Hadoop… (continued) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
What is HIVE? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Simplifying Hadoop ,[object Object],[object Object],[object Object],Facebook
[object Object],Facebook
Data Warehousing at Facebook Today Facebook Web Servers Scribe Servers Filers Hive on  Hadoop Cluster Oracle RAC Federated MySQL
Hive/Hadoop Usage @ Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Hadoop Usage @ Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
[object Object],Facebook
[object Object],Facebook
Data Model Facebook Logical Partitioning Hash Partitioning clicks HDFS MetaStore /hive/clicks /hive/clicks/ds=2008-03-25 /hive/clicks/ds=2008-03-25/0 … Tables Metastore DB Data Location Bucketing Info Partitioning Cols
HIVE: Components Facebook HDFS Hive CLI DDL Queries Browsing Map Reduce MetaStore Thrift API SerDe Thrift CSV JSON.. Execution Parser Planner Web UI Optimizer DB
Hive Query Language ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Hive Query Language (continued) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Pluggable Map-Reduce Scripts Facebook
Map Reduce Example Facebook Machine 2 Machine 1 <k1, v1> <k2, v2> <k3, v3> <k4, v4> <k5, v5> <k6, v6> <nk1, nv1> <nk2, nv2> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk1, nv6> Local Map <nk2, nv4> <nk2, nv5> <nk2, nv2> <nk1, nv1> <nk3, nv3> <nk1, nv6> Global Shuffle <nk1, nv1> <nk1, nv6> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk2, nv2> Local Sort <nk2, 3> <nk1, 2> <nk3, 1> Local Reduce
Hive QL – Join ,[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Hive QL – Join in Map Reduce Facebook page_view user pv_users Map Reduce key value 111 < 1, 1> 111 < 1, 2> 222 < 1, 1> pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male key value 111 < 2, 25> 222 < 2, 32> key value 111 < 1, 1> 111 < 1, 2> 111 < 2, 25> key value 222 < 1, 1> 222 < 2, 32> Shuffle Sort Pageid age 1 25 2 25 pageid age 1 32
Join Optimizations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Hive QL – Map Join Facebook page_view user Hash table pv_users key value 111 <1,2> 222 <2> pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male Pageid age 1 25 2 25 1 32
Hive QL – Group By ,[object Object],[object Object],[object Object],Facebook
Hive QL – Group By in Map Reduce Facebook pv_users Map Reduce pageid age 1 25 1 25 pageid age count 1 25 3 pageid age 2 32 1 25 key value <1,25> 2 key value <1,25> 1 <2,32> 1 key value <1,25> 2 <1,25> 1 key value <2,32> 1 Shuffle Sort pageid age count 2 32 1
Group by Optimizations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Columnar Storage ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Speed Improvements over Time Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],Date SVN Revision Major Changes Query A Query B Query C 2/22/2009 746906 Before Lazy Deserialization 83 sec 98 sec 183 sec 2/23/2009 747293 Lazy Deserialization 40 sec 66 sec 185 sec 3/6/2009 751166 Map-side Aggregation 22 sec 67 sec 182 sec 4/29/2009 770074 Object Reuse  21 sec 49 sec 130 sec 6/3/2009 781633 Map-side Join * 21 sec 48 sec 132 sec
Overcoming Java Overhead ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Generic UDF and UDAF ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
HQL Optimizations ,[object Object],[object Object],[object Object],Facebook
[object Object],Facebook
Open Source Community ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Deployment Options ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Information ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook
Contributors ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],Facebook

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Yahoo Developer Network
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
ragho
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
 

Was ist angesagt? (18)

Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 

Andere mochten auch

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
mapr-academy
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
Reza Ameri
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
Cloudera, Inc.
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
alanfgates
 

Andere mochten auch (20)

Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Replacing Telco DB/DW to Hadoop and Hive
Replacing Telco DB/DW to Hadoop and HiveReplacing Telco DB/DW to Hadoop and Hive
Replacing Telco DB/DW to Hadoop and Hive
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Data Engineering with Spring, Hadoop and Hive
Data Engineering with Spring, Hadoop and Hive	Data Engineering with Spring, Hadoop and Hive
Data Engineering with Spring, Hadoop and Hive
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)
 
Hive ppt (1)
Hive ppt (1)Hive ppt (1)
Hive ppt (1)
 

Ähnlich wie Hadoop Summit 2009 Hive

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
ivascucristian
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Cloudera, Inc.
 
An introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysisAn introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysis
Abhijit Sharma
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
nzhang
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 

Ähnlich wie Hadoop Summit 2009 Hive (20)

Hadoop and Hive
Hadoop and HiveHadoop and Hive
Hadoop and Hive
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hive
HiveHive
Hive
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
 
Pig latin
Pig latinPig latin
Pig latin
 
Hadoop institutes in hyderabad
Hadoop institutes in hyderabadHadoop institutes in hyderabad
Hadoop institutes in hyderabad
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud
 
An introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysisAn introduction to Hadoop for large scale data analysis
An introduction to Hadoop for large scale data analysis
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
 
Tableau training course
Tableau training courseTableau training course
Tableau training course
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Hadoop Summit 2009 Hive

  • 1. Hive - Data Warehousing & Analytics on Hadoop Wednesday, June 10, 2009 Santa Clara Marriott Namit Jain, Zheng Shao Facebook
  • 2.
  • 3.
  • 4.
  • 5.  
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Data Warehousing at Facebook Today Facebook Web Servers Scribe Servers Filers Hive on Hadoop Cluster Oracle RAC Federated MySQL
  • 12.
  • 13.
  • 14.
  • 15.
  • 16. Data Model Facebook Logical Partitioning Hash Partitioning clicks HDFS MetaStore /hive/clicks /hive/clicks/ds=2008-03-25 /hive/clicks/ds=2008-03-25/0 … Tables Metastore DB Data Location Bucketing Info Partitioning Cols
  • 17. HIVE: Components Facebook HDFS Hive CLI DDL Queries Browsing Map Reduce MetaStore Thrift API SerDe Thrift CSV JSON.. Execution Parser Planner Web UI Optimizer DB
  • 18.
  • 19.
  • 20.
  • 21.
  • 22. Map Reduce Example Facebook Machine 2 Machine 1 <k1, v1> <k2, v2> <k3, v3> <k4, v4> <k5, v5> <k6, v6> <nk1, nv1> <nk2, nv2> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk1, nv6> Local Map <nk2, nv4> <nk2, nv5> <nk2, nv2> <nk1, nv1> <nk3, nv3> <nk1, nv6> Global Shuffle <nk1, nv1> <nk1, nv6> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk2, nv2> Local Sort <nk2, 3> <nk1, 2> <nk3, 1> Local Reduce
  • 23.
  • 24. Hive QL – Join in Map Reduce Facebook page_view user pv_users Map Reduce key value 111 < 1, 1> 111 < 1, 2> 222 < 1, 1> pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male key value 111 < 2, 25> 222 < 2, 32> key value 111 < 1, 1> 111 < 1, 2> 111 < 2, 25> key value 222 < 1, 1> 222 < 2, 32> Shuffle Sort Pageid age 1 25 2 25 pageid age 1 32
  • 25.
  • 26. Hive QL – Map Join Facebook page_view user Hash table pv_users key value 111 <1,2> 222 <2> pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male Pageid age 1 25 2 25 1 32
  • 27.
  • 28. Hive QL – Group By in Map Reduce Facebook pv_users Map Reduce pageid age 1 25 1 25 pageid age count 1 25 3 pageid age 2 32 1 25 key value <1,25> 2 key value <1,25> 1 <2,32> 1 key value <1,25> 2 <1,25> 1 key value <2,32> 1 Shuffle Sort pageid age count 2 32 1
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.

Hinweis der Redaktion

  1. What is this? This is huge amount of data. Along with the fast growth of active users on Facebook, the size of our data is exploding. In the last 12 months, the amount of data increased by 500%. These data are very valuable. They can be used to understand the user behavior, measure the impact of a new product, and make data-based decisions. Traditionally people store data in data warehouse solutions on top of Oracle and MySQL. In the recent years, we are also seeing new proprietary solutions like AsterData and Netezza. However, these solutions either do not scale to the amount of data that we have, or they are very inflexible that cannot satisfy our data analysis requirements. In order to provide the capability to analyze the huge amount of data that we have, we started the Hive project. Hive is based on Hadoop but does much more than Hadoop. We will show the details in the following slides. ============