Hive Evolution Hadoop India Summit February 2011 Namit Jain (Facebook)
Agenda Hive Overview Version 0.6 (released!) Version 0.7 (under development) Hive is now a TLP! Roadmaps
What is Hive? A Hadoop-based system for querying and managing structured data Uses Map/Reduce for execution Uses Hadoop Distributed File System (HDFS) for storage
Hive Origins Data explosion at Facebook Traditional DBMS technology could not keep up with the growth Hadoop to the rescue! Incubation with ASF, then became a Hadoop sub-project Now a top-level ASF project
SQL vs MapReduce
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive Evolution Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs Now more and more: A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
Intended Usage Web-scale Big Data 100’s of terabytes Large Hadoop cluster 100’s of nodes (heterogeneous OK) Data has a schema Batch jobs for both loads and queries
So Don’t Use Hive If… Your data is measured in GB You don’t want to impose a schema You need responses in seconds A “conventional” analytic DBMS can already do the job (and you can afford it) You don’t have a lot of time and smart people
Scaling Up
Facebook warehouse, Jan 2011: 2750 nodes, 30 petabytes disk space
Data access per day: ~40 terabytes added (compressed), 25000 map/reduce jobs
300-400 users/month
Facebook Deployment (diagram): Web Servers, Scribe MidTier, Scribe-Hadoop Clusters, Hive Replication, Production Hive-Hadoop Cluster, Archival Hive-Hadoop Cluster, Adhoc Hive-Hadoop Cluster, Sharded MySQL
System Architecture
Data Model
Column Data Types
Primitive Types: integer types, float, string, boolean
Nest-able Collections: array<any-type>, map<primitive-type, any-type>
User-defined types: structures with attributes which can be of any-type
Hive Query Language
DDL: {create/alter/drop} {table/view/partition}, create table as select
DML: Insert overwrite
QL: Sub-queries in from clause, Equi-joins (including Outer joins), Multi-table Insert, Sampling, Lateral Views
Interfaces: JDBC/ODBC/Thrift
Query Translation Example SELECT url, count(*) FROM page_views GROUP BY url Map tasks compute partial counts for each URL in a hash table “map side” pre-aggregation map outputs are partitioned by URL and shipped to corresponding reducers Reduce tasks tally up partial counts to produce final results
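The three stages on this slide can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not Hive's execution code: the input splits, the `partition_for` helper and the reducer count are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed value, for illustration only


def map_task(urls):
    """Map-side pre-aggregation: partial count per URL in a hash table."""
    return Counter(urls)


def partition_for(url, num_reducers=NUM_REDUCERS):
    """Hypothetical partitioner: the same URL always goes to the same reducer."""
    return sum(url.encode("utf-8")) % num_reducers


def shuffle(partials, num_reducers=NUM_REDUCERS):
    """Route each (url, partial_count) pair to the reducer that owns the URL."""
    buckets = defaultdict(list)
    for partial in partials:
        for url, count in partial.items():
            buckets[partition_for(url, num_reducers)].append((url, count))
    return buckets


def reduce_task(pairs):
    """Tally the partial counts into final counts."""
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return dict(totals)


def count_by_url(splits):
    """Run map, shuffle and reduce over a list of input splits."""
    partials = [map_task(split) for split in splits]
    result = {}
    for pairs in shuffle(partials).values():
        result.update(reduce_task(pairs))
    return result


splits = [["/a", "/b", "/a"], ["/b", "/c", "/a"]]
print(count_by_url(splits))
```

Because each map task ships one pair per distinct URL rather than one pair per row, the shuffle carries far less data when a few URLs dominate the input.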
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid and a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school
It Gets Quite Complicated!
Behavior Extensibility
TRANSFORM scripts (any language): serialization + IPC overhead
User defined functions (Java): in-process, lazy object evaluation
Pre/Post Hooks (Java): statement validation/execution. Example uses: auditing, replication, authorization, multiple clusters
Map/Reduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM
  (SELECT TRANSFORM(user_id, page_url, unix_time)
     USING 'page_url_to_id.py'
     AS (user_id, page_id, unix_time)
   FROM mylog
   DISTRIBUTE BY user_id
   SORT BY user_id, unix_time) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
  USING 'my_python_session_cutter.py'
  AS (user_id, session_info);
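The slide does not show the scripts themselves. As a rough idea of what a script such as `my_python_session_cutter.py` could look like, here is a hypothetical sketch. It relies on the fact that TRANSFORM feeds rows to the script as tab-separated lines on standard input and reads tab-separated lines back. The 30-minute inactivity gap and the `start:end:pages` encoding of `session_info` are assumptions, not taken from the talk.

```python
import sys

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold


def cut_sessions(lines, gap=SESSION_GAP_SECONDS):
    """Yield (user_id, session_info) for rows sorted by user_id, unix_time."""
    current_user, start, last, pages = None, None, None, 0
    for line in lines:
        user_id, _page_id, unix_time = line.rstrip("\n").split("\t")
        ts = int(unix_time)
        # A new user, or a long pause for the same user, starts a new session.
        if user_id != current_user or ts - last > gap:
            if current_user is not None:
                yield current_user, f"{start}:{last}:{pages}"
            current_user, start, pages = user_id, ts, 0
        last = ts
        pages += 1
    if current_user is not None:
        yield current_user, f"{start}:{last}:{pages}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a TRANSFORM script would call: stdin rows in, stdout rows out."""
    for user_id, session_info in cut_sessions(stdin):
        stdout.write(f"{user_id}\t{session_info}\n")


rows = ["u1\t10\t100\n", "u1\t11\t200\n", "u1\t12\t5000\n", "u2\t10\t50\n"]
for user_id, info in cut_sessions(rows):
    print(user_id, info)
```

The script only works because the inner query uses DISTRIBUTE BY user_id and SORT BY user_id, unix_time: every row of a user arrives at the same process, in time order. A file used with TRANSFORM would end with a call to `main()` instead of the in-memory demo.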
UDF vs UDAF vs UDTF
User Defined Function: one-to-one row mapping, e.g. Concat('foo', 'bar')
User Defined Aggregate Function: many-to-one row mapping, e.g. Sum(num_ads)
User Defined Table Function: one-to-many row mapping, e.g. Explode([1,2,3])
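The difference between the three kinds of function is only the shape of the mapping from input rows to output rows. A toy Python analogy, using hypothetical stand-ins rather than Hive's Java interfaces, makes the shapes explicit:

```python
def udf_concat(a, b):
    """UDF: one input row -> one output value."""
    return a + b


def udaf_sum(values):
    """UDAF: many input rows -> one output value."""
    total = 0
    for v in values:
        total += v
    return total


def udtf_explode(items):
    """UDTF: one input row -> many output rows."""
    for item in items:
        yield (item,)


print(udf_concat("foo", "bar"))       # foobar
print(udaf_sum([3, 4, 5]))            # 12
print(list(udtf_explode([1, 2, 3])))  # [(1,), (2,), (3,)]
```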
UDF Example
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(src.value) FROM src;
DROP TEMPORARY FUNCTION testlength;

UDFTestLength.java:
package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Returns the length of the input string, or null for null input.
public class UDFTestLength extends UDF {
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
Storage Extensibility
Input/OutputFormat (file formats): SequenceFile, RCFile, TextFile, …
SerDe (row formats): Thrift, JSON, ProtocolBuffer, …
Storage Handlers (new in 0.6): integrate foreign metadata, e.g. HBase
Indexing: under development in 0.7
Release 0.6 October 2010 Views Multiple Databases Dynamic Partitioning Automatic Merge New Join Strategies Storage Handlers
Dynamic Partitions
Automatically create partitions based on distinct values in columns
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
  FROM page_view_stg pvs
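In the statement on this slide, `dt` is a static partition value and `country` is dynamic: one partition is created per distinct country seen in the data. The following sketch shows that routing step in plain Python, assuming the usual `column=value` partition directory naming. The `dynamic_partition` helper and the sample rows are made up for illustration.

```python
from collections import defaultdict


def dynamic_partition(rows, static_spec, dynamic_col):
    """Group rows into partition directories named from static and dynamic values."""
    static_path = "/".join(f"{k}={v}" for k, v in static_spec.items())
    partitions = defaultdict(list)
    for row in rows:
        row = dict(row)
        # The partition column lives in the directory name, not in the data file.
        value = row.pop(dynamic_col)
        partitions[f"{static_path}/{dynamic_col}={value}"].append(row)
    return dict(partitions)


staged = [
    {"userid": 1, "page_url": "/home", "country": "US"},
    {"userid": 2, "page_url": "/home", "country": "IN"},
    {"userid": 3, "page_url": "/about", "country": "US"},
]
parts = dynamic_partition(staged, {"dt": "2008-06-08"}, "country")
for path in sorted(parts):
    print(path, len(parts[path]))
```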
Automatic merge
Jobs can produce many files. Why is this bad? Namenode pressure, and downstream jobs have to deal with file processing overhead.
So, clean up by merging results into a few large files (configurable), using a conditional map-only task to do this.
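One way to picture the "conditional" part is a check that runs after the job finishes and only schedules the merge when the output files are small on average. The sketch below is an assumption about how such a check could be written. The two thresholds are hypothetical parameters, not Hive configuration names, and the greedy packing is only one possible grouping policy.

```python
def plan_merge(file_sizes, avg_size_threshold, target_size):
    """Return groups of files to merge, or [] when no merge is needed."""
    if not file_sizes:
        return []
    average = sum(file_sizes) / len(file_sizes)
    if average >= avg_size_threshold:
        return []  # condition not met: skip the extra merge task
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current group once adding this file would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    groups.append(current)
    return groups


print(plan_merge([10, 20, 30, 40], avg_size_threshold=50, target_size=60))
print(plan_merge([100, 200], avg_size_threshold=50, target_size=60))
```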
Join Strategies
Old join strategies: Map-reduce and Map Join
Bucketed map-join: allows “small” table to be much bigger
Sort Merge Map Join
Deal with skew in map/reduce join: conditional plan step for skewed keys
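The core of a sort merge join is a single forward pass over two inputs that are already sorted on the join key. The sketch below shows that merge step for one pair of sorted inputs. It is a generic textbook version, not Hive's implementation, and the `(key, value)` tuple layout is an assumption.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row carrying the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(sort_merge_join(left, right))
```

Neither side has to be loaded into a hash table, which is why this strategy does not need a "small" table at all.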
Storage Handler Syntax
HBase Example
CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
Release 0.7
Deployed in Facebook: Stats Functions, Indexes, Local Mode, Automatic Map Join, Multiple DISTINCTs, Archiving
In development: Concurrency Control, Stats Collection, J/ODBC Enhancements, Authorization, RCFile2, Partitioned Views, Security Enhancements
Statistical Functions Stats 101 Stddev, var, covar Percentile_approx Data Mining Ngrams, sentences (text analysis) Histogram_numeric SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
Histogram query results
“In a relationship” peaks at 20
“Engaged” peaks at 25
Married peaks in early 30s
More married than single at 28
Only teenagers use widowed?
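A numeric histogram over a huge column has to be built in one pass with bounded memory. One common way to do that is a streaming histogram that keeps a fixed number of (center, count) bins and merges the two closest bins whenever the limit is exceeded. The sketch below illustrates that general technique under stated assumptions. It is not presented as the code behind `histogram_numeric`, and `add_point`, `histogram` and `max_bins` are names invented here.

```python
def add_point(bins, x, max_bins):
    """Insert x into a sorted list of [center, count] bins, keeping at most max_bins."""
    bins.append([float(x), 1])
    bins.sort(key=lambda b: b[0])
    if len(bins) > max_bins:
        # Merge the two adjacent bins whose centers are closest.
        k = min(range(len(bins) - 1), key=lambda i: bins[i + 1][0] - bins[i][0])
        (c1, n1), (c2, n2) = bins[k], bins[k + 1]
        bins[k:k + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
    return bins


def histogram(values, max_bins):
    """Build an approximate histogram of values in a single pass."""
    bins = []
    for v in values:
        add_point(bins, v, max_bins)
    return bins


print(histogram([1980, 1981, 1990, 1991], max_bins=2))
```

Merged bins use the count-weighted mean of their centers, so the total count is always preserved even though individual values are not.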
Local Mode Execution
Avoids map/reduce cluster job latency
Good for jobs which process small amounts of data
Let Hive decide when to use it:
set hive.exec.mode.local.auto=true;
Or force its usage:
set mapred.job.tracker=local;
Automatic Map Join
Map-Join if small table fits in memory; if it can’t, fall back to reduce join
Optimize hash table data structures
Use distributed cache to push out pre-filtered lookup table, to avoid swamping HDFS with reads from thousands of mappers
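The choice between the two strategies can be sketched as follows. The map join builds an in-memory hash table from the small side and streams the big side past it. The fallback groups both sides by key, as a shuffle would. The row-count threshold `max_small_rows` is a stand-in assumption for "fits in memory", and all function names are hypothetical.

```python
from collections import defaultdict


def map_join(big_rows, small_rows):
    """Hash join: build a lookup table from the small side, stream the big side."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    return [(k, v, sv) for k, v in big_rows for sv in lookup.get(k, [])]


def reduce_join(big_rows, small_rows):
    """Fallback: group both sides by key (as a shuffle would), then pair them up."""
    groups = defaultdict(lambda: ([], []))
    for k, v in big_rows:
        groups[k][0].append(v)
    for k, v in small_rows:
        groups[k][1].append(v)
    return [(k, bv, sv)
            for k in sorted(groups)
            for bv in groups[k][0]
            for sv in groups[k][1]]


def auto_join(big_rows, small_rows, max_small_rows):
    """Pick the map join when the small side is under the assumed size limit."""
    if len(small_rows) <= max_small_rows:
        return "map-join", map_join(big_rows, small_rows)
    return "reduce-join", reduce_join(big_rows, small_rows)


big = [(1, "a"), (2, "b"), (1, "c")]
small = [(1, "x"), (3, "y")]
print(auto_join(big, small, max_small_rows=10))
print(auto_join(big, small, max_small_rows=1))
```

Both paths return the same joined rows. Only the cost differs: the map join skips the shuffle entirely.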
Multiple DISTINCT Aggs
Example
SELECT
  view_date,
  COUNT(DISTINCT userid),
  COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
Archiving
Use HAR (Hadoop archive format) to combine many files into a few
Relieves namenode memory
ALTER TABLE page_views {ARCHIVE|UNARCHIVE} PARTITION (ds='2010-10-30')
Concurrency Control
Pluggable distributed lock manager; default is Zookeeper-based
Simple read/write locking, table-level and partition-level
Implicit locking (statement level), deadlock-free via lock ordering
Explicit LOCK TABLE (global)
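"Deadlock-free via lock ordering" means every statement acquires the locks it needs in one agreed global order, so two statements can never each hold a lock the other is waiting for. The sketch below demonstrates the idea with an in-process toy. `LockManager` and `run_statement` are hypothetical names. It uses plain exclusive locks instead of read/write locks and does not involve ZooKeeper.

```python
import threading


class LockManager:
    """Toy in-process lock manager that acquires locks in sorted name order."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def acquire_all(self, names):
        """Acquire locks in a single global (sorted) order to avoid deadlock."""
        ordered = sorted(set(names))
        for name in ordered:
            self._lock_for(name).acquire()
        return ordered

    def release_all(self, ordered):
        for name in reversed(ordered):
            self._lock_for(name).release()


def run_statement(manager, objects, log):
    """Implicit statement-level locking: lock everything, work, unlock."""
    held = manager.acquire_all(objects)
    try:
        log.append(tuple(held))
    finally:
        manager.release_all(held)


manager, log = LockManager(), []
# The two statements name the same tables in opposite order.
t1 = threading.Thread(target=run_statement, args=(manager, ["page_views", "users"], log))
t2 = threading.Thread(target=run_statement, args=(manager, ["users", "page_views"], log))
t1.start()
t2.start()
t1.join()
t2.join()
print(log)
```

Without the sort in `acquire_all`, the two threads could each grab their first table and wait forever for the second.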
Statistics Collection Implicit metastore update during load Or explicit via ANALYZE TABLE Table/partition-level Number of rows Number of files Size in bytes
Hive is now a TLP
PMC: Namit Jain (chair), John Sichi, Zheng Shao, Edward Capriolo, Raghotham Murthy
Committers: Amareshwari Sriramadasu, Carl Steinbach, Paul Yang, He Yongqiang, Prasad Chakka, Joydeep Sen Sarma, Ashish Thusoo, Ning Zhang
Developer Diversity Recent Contributors Facebook, Yahoo, Cloudera Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems Numerous research projects Many many more… Monthly San Francisco bay area contributor meetups India meetups ?  

Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain

  • 1. Hive Evolution Hadoop India Summit February 2011 Namit Jain (Facebook)
  • 2. Agenda Hive Overview Version 0.6 (released!) Version 0.7 (under development) Hive is now a TLP! Roadmaps
  • 3. What is Hive? A Hadoop-based system for querying and managing structured data Uses Map/Reduce for execution Uses Hadoop Distributed File System (HDFS) for storage
  • 4. Hive Origins Data explosion at Facebook Traditional DBMS technology could not keep up with the growth Hadoop to the rescue! Incubation with ASF, then became a Hadoop sub-project Now a top-level ASF project
  • 5. SQL vs MapReduce hive> select key, count(1) from kv1 where key > 100 group by key; vs. $ cat > /tmp/reducer.sh uniq -c | awk '{print $2""$1}‘ $ cat > /tmp/map.sh awk -F '01' '{if($1 > 100) print $1}‘ $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1 $ bin/hadoop dfs –cat /tmp/largekey/part*
  • 6. Hive Evolution Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs Now more and more: A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
  • 7. Intended Usage Web-scale Big Data 100’s of terabytes Large Hadoop cluster 100’s of nodes (heterogeneous OK) Data has a schema Batch jobs for both loads and queries
  • 8. So Don’t Use Hive If… Your data is measured in GB You don’t want to impose a schema You need responses in seconds A “conventional” analytic DBMS can already do the job (and you can afford it) You don’t have a lot of time and smart people
  • 9. Scaling Up Facebook warehouse, Jan 2011: 2750 nodes 30 petabytes disk space Data access per day: ~40 terabytes added (compressed) 25000 map/reduce jobs 300-400 users/month
  • 10. Facebook Deployment Web Servers Scribe MidTier Scribe-Hadoop Clusters Hive Replication Production Hive-Hadoop Cluster Archival Hive-Hadoop Cluster Adhoc Hive-Hadoop Cluster Sharded MySQL
  • 13. Column Data Types. Primitive types: integer types, float, string, boolean. Nest-able collections: array<any-type>, map<primitive-type, any-type>. User-defined types: structures with attributes which can be of any-type.
  • 14. Hive Query Language. DDL: {create/alter/drop} {table/view/partition}; create table as select. DML: insert overwrite. QL: sub-queries in from clause; equi-joins (including outer joins); multi-table insert; sampling; lateral views. Interfaces: JDBC/ODBC/Thrift.
  • 15. Query Translation Example: SELECT url, count(*) FROM page_views GROUP BY url. Map tasks compute partial counts for each URL in a hash table ("map side" pre-aggregation); map outputs are partitioned by URL and shipped to the corresponding reducers. Reduce tasks tally up the partial counts to produce the final results.
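The plan on this slide can be imitated in a few lines of Python. This is an illustrative sketch only, not Hive's implementation: the function names (`map_task`, `partition`, `shuffle`, `reduce_task`, `count_by_url`), the CRC-based partitioner and the default of two reducers are all assumptions made for the example.

```python
import zlib
from collections import Counter, defaultdict

def map_task(urls):
    """Map side: pre-aggregate partial counts per URL in a hash table."""
    return Counter(urls)

def partition(url, num_reducers):
    """Pick a reducer for a URL; every mapper must agree on the target."""
    return zlib.crc32(url.encode("utf-8")) % num_reducers

def shuffle(partials, num_reducers):
    """Group (url, partial_count) pairs by destination reducer."""
    buckets = defaultdict(list)
    for partial in partials:
        for url, count in partial.items():
            buckets[partition(url, num_reducers)].append((url, count))
    return buckets

def reduce_task(pairs):
    """Reduce side: tally the partial counts into final counts."""
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return totals

def count_by_url(splits, num_reducers=2):
    """Run one map task per input split, then shuffle and reduce."""
    partials = [map_task(split) for split in splits]
    result = {}
    for pairs in shuffle(partials, num_reducers).values():
        result.update(reduce_task(pairs))
    return result
```

Because each map task ships one pair per distinct URL rather than one per row, the shuffle volume shrinks when URLs repeat within a split.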
  • 16. FROM (SELECT a.status, b.school, b.gender FROM status_updates a JOIN profiles b ON (a.userid = b.userid and a.ds='2009-03-20' ) ) subq1 INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20') SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20') SELECT subq1.school, COUNT(1) GROUP BY subq1.school
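The point of the multi-table insert above is that the joined subquery is scanned once and feeds both summaries. A rough Python analogue, assuming the join result is already available as `(status, school, gender)` tuples (the function name `multi_insert` is made up for this sketch):

```python
from collections import Counter

def multi_insert(joined_rows):
    """Single pass over the joined rows, updating both aggregates at once."""
    gender_summary = Counter()
    school_summary = Counter()
    for _status, school, gender in joined_rows:
        gender_summary[gender] += 1   # SELECT gender, COUNT(1) GROUP BY gender
        school_summary[school] += 1   # SELECT school, COUNT(1) GROUP BY school
    return dict(gender_summary), dict(school_summary)
```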
  • 17. It Gets Quite Complicated!
  • 18. Behavior Extensibility. TRANSFORM scripts (any language): serialization + IPC overhead. User defined functions (Java): in-process, lazy object evaluation. Pre/Post Hooks (Java): statement validation/execution. Example uses: auditing, replication, authorization, multiple clusters.
  • 19. Map/Reduce Scripts Examples add file page_url_to_id.py; add file my_python_session_cutter.py; FROM (SELECT TRANSFORM(user_id, page_url, unix_time) USING 'page_url_to_id.py' AS (user_id, page_id, unix_time) FROM mylog DISTRIBUTE BY user_id SORT BY user_id, unix_time) mylog2 SELECT TRANSFORM(user_id, page_id, unix_time) USING 'my_python_session_cutter.py' AS (user_id, session_info);
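A TRANSFORM script receives rows on standard input as tab-separated fields, one row per line, and writes rows back in the same shape on standard output. The slide does not show the body of `page_url_to_id.py`, so the following is only a guess at what such a script might look like; the `PAGE_IDS` table and the `UNKNOWN_PAGE_ID` fallback are hypothetical.

```python
# Hypothetical lookup table; a real script would load its own mapping.
PAGE_IDS = {"/home": 1, "/profile": 2}
UNKNOWN_PAGE_ID = 0

def transform(lines):
    """Turn 'user_id<TAB>page_url<TAB>unix_time' lines into
    'user_id<TAB>page_id<TAB>unix_time' lines."""
    for line in lines:
        user_id, page_url, unix_time = line.rstrip("\n").split("\t")
        page_id = PAGE_IDS.get(page_url, UNKNOWN_PAGE_ID)
        yield "\t".join([user_id, str(page_id), unix_time]) + "\n"

# Used as a script, the body would be roughly:
#   import sys
#   sys.stdout.writelines(transform(sys.stdin))
```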
  • 20. UDF vs UDAF vs UDTF. User Defined Function: one-to-one row mapping, e.g. Concat('foo', 'bar'). User Defined Aggregate Function: many-to-one row mapping, e.g. Sum(num_ads). User Defined Table Function: one-to-many row mapping, e.g. Explode([1,2,3]).
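The three row mappings can be shown with plain Python functions. These are stand-ins that mirror the shape of each function kind, not Hive's Java interfaces; all names below are chosen for this sketch.

```python
def concat(a, b):
    """UDF shape: one input row in, one value out."""
    return a + b

def sum_agg(values):
    """UDAF shape: many input rows in, one value out."""
    return sum(values)

def explode(array):
    """UDTF shape: one input row in, many output rows out."""
    for element in array:
        yield (element,)

def apply_udf(rows, fn):
    """Apply a one-to-one function to every row; row count is preserved."""
    return [fn(*row) for row in rows]
```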
  • 21. UDF Example: add jar build/ql/test/test-udfs.jar; CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength'; SELECT testlength(src.value) FROM src; DROP TEMPORARY FUNCTION testlength; UDFTestLength.java: package org.apache.hadoop.hive.ql.udf; import org.apache.hadoop.hive.ql.exec.UDF; public class UDFTestLength extends UDF { public Integer evaluate(String s) { if (s == null) { return null; } return s.length(); } }
  • 22. Storage Extensibility. Input/OutputFormat: file formats (SequenceFile, RCFile, TextFile, …). SerDe: row formats (Thrift, JSON, ProtocolBuffer, …). Storage Handlers (new in 0.6): integrate foreign metadata, e.g. HBase. Indexing: under development in 0.7.
  • 23. Release 0.6 (October 2010): Views, Multiple Databases, Dynamic Partitioning, Automatic Merge, New Join Strategies, Storage Handlers.
  • 24. Dynamic Partitions Automatically create partitions based on distinct values in columns INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country FROM page_view_stg pvs
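In the query above, `dt` is given a fixed value while `country` is left open, so its value is taken from the last selected column of each row. A small sketch of that routing step follows; the function name and the directory-style partition paths are illustrative assumptions, not Hive internals.

```python
from collections import defaultdict

def dynamic_partition_insert(rows, static_spec, dynamic_cols):
    """Route each row to a partition named after its trailing column values.

    rows         -- tuples whose last len(dynamic_cols) fields are partition values
    static_spec  -- list of (column, value) pairs fixed in the statement
    dynamic_cols -- names of the partition columns filled in per row (at least one)
    """
    partitions = defaultdict(list)
    n = len(dynamic_cols)
    for row in rows:
        data, dyn_values = row[:-n], row[-n:]
        spec = list(static_spec) + list(zip(dynamic_cols, dyn_values))
        path = "/".join(f"{col}={val}" for col, val in spec)
        partitions[path].append(data)
    return dict(partitions)
```

Each distinct `country` value seen in the input yields its own partition without having to be listed in the statement.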
  • 25. Automatic merge. Jobs can produce many files. Why is this bad? Namenode pressure; downstream jobs have to deal with file processing overhead. So, clean up by merging results into a few large files (configurable). Use a conditional map-only task to do this.
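The "conditional" part can be read as: inspect the job's output files and merge only when they are small. The sketch below assumes an average-size trigger and a greedy packing into target-size groups; both thresholds and both function names are hypothetical and stand in for whatever the configurable settings actually control.

```python
def needs_merge(file_sizes, avg_size_threshold):
    """Merge only when there are several files and they are small on average."""
    if len(file_sizes) <= 1:
        return False
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold

def plan_merge(file_sizes, target_size):
    """Greedily pack files, in order, into groups of at most target_size bytes.

    A single file larger than target_size still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```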
  • 26. Join Strategies. Old join strategies: map-reduce join and map join. New: bucketed map-join (allows the "small" table to be much bigger); sort merge map join; dealing with skew in map/reduce join (conditional plan step for skewed keys).
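The reason a bucketed map-join lets the "small" table grow is that each mapper only needs the matching bucket of that table in memory, not the whole table. A simplified sketch, assuming integer join keys and the same bucket count on both sides (all names are made up for the example):

```python
from collections import defaultdict

def bucketize(rows, num_buckets):
    """Split (key, value) rows into buckets by key modulo num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets

def bucketed_map_join(big_rows, small_rows, num_buckets):
    """Join bucket i of the big table against bucket i of the small table."""
    big_buckets = bucketize(big_rows, num_buckets)
    small_buckets = bucketize(small_rows, num_buckets)
    joined = []
    for big_bucket, small_bucket in zip(big_buckets, small_buckets):
        # One "mapper": hash only the matching small-table bucket.
        lookup = defaultdict(list)
        for key, value in small_bucket:
            lookup[key].append(value)
        for key, big_value in big_bucket:
            for small_value in lookup.get(key, []):
                joined.append((key, big_value, small_value))
    return joined
```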
  • 27. Storage Handler Syntax, HBase Example: CREATE TABLE users( userid int, name string, email string, notes string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "small:name,small:email,large:notes") TBLPROPERTIES ( "hbase.table.name" = "user_list");
  • 28. Release 0.7. Deployed in Facebook: Stats Functions, Indexes, Local Mode, Automatic Map Join, Multiple DISTINCTs, Archiving. In development: Concurrency Control, Stats Collection, J/ODBC Enhancements, Authorization, RCFile2, Partitioned Views, Security Enhancements.
  • 29. Statistical Functions. Stats 101: Stddev, var, covar, Percentile_approx. Data Mining: Ngrams, sentences (text analysis), Histogram_numeric. Example: SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
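An approximate numeric histogram can be built in one pass with a bounded number of bins by merging the two closest bins whenever the limit is exceeded. The sketch below illustrates that general idea under its own assumptions (bin representation, merge rule, function names); it is not claimed to match what `histogram_numeric` does internally.

```python
def add_to_histogram(bins, value, max_bins):
    """Add one value to a list of [center, count] bins and return the new list.

    If the bin limit is exceeded, the two adjacent bins with the smallest
    gap are replaced by their count-weighted average.
    """
    bins = sorted(bins + [[float(value), 1]])
    while len(bins) > max_bins:
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
    return bins

def histogram(values, max_bins):
    """Stream all values through add_to_histogram."""
    bins = []
    for value in values:
        bins = add_to_histogram(bins, value, max_bins)
    return bins
```

The total count is preserved by every merge, so the bins always account for every value seen.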
  • 33. Married peaks in early 30s
  • 34. More married than single at 28
  • 36. Local Mode Execution. Avoids map/reduce cluster job latency. Good for jobs which process small amounts of data. Let Hive decide when to use it: set hive.exec.mode.local.auto=true; Or force its usage: set mapred.job.tracker=local;
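The automatic choice amounts to a size check on the job's input. The decision rule below is a sketch under assumed criteria: the slide does not say which limits Hive applies, so the byte and file-count thresholds and their default values here are hypothetical.

```python
def choose_execution_mode(input_bytes, num_input_files,
                          max_bytes=128 * 1024 * 1024, max_files=4):
    """Return "local" for small inputs, otherwise "cluster".

    max_bytes and max_files are illustrative limits, not Hive defaults.
    """
    if input_bytes <= max_bytes and num_input_files <= max_files:
        return "local"
    return "cluster"
```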
  • 37. Automatic Map Join. Map-join if the small table fits in memory; if it can't, fall back to reduce join. Optimize hash table data structures. Use the distributed cache to push out a pre-filtered lookup table, to avoid swamping HDFS with reads from thousands of mappers.
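The "try a map-join, otherwise fall back" rule can be sketched as follows. Row count stands in for the memory check, and `max_build_rows` is a hypothetical limit; the function names are likewise invented for this example.

```python
from collections import defaultdict

def map_join(build_rows, probe_rows, build_is_left):
    """Hash the smaller side in memory and stream the other side past it."""
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    out = []
    for key, probe_value in probe_rows:
        for build_value in table.get(key, []):
            if build_is_left:
                out.append((key, build_value, probe_value))
            else:
                out.append((key, probe_value, build_value))
    return out

def reduce_join(left_rows, right_rows):
    """Fallback: group both sides by key, as a reducer would see them."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left_rows:
        groups[key][0].append(value)
    for key, value in right_rows:
        groups[key][1].append(value)
    return [(key, l, r) for key, (ls, rs) in groups.items() for l in ls for r in rs]

def auto_join(left_rows, right_rows, max_build_rows):
    """Return (strategy, rows); rows are (key, left_value, right_value)."""
    left_is_smaller = len(left_rows) <= len(right_rows)
    build, probe = (left_rows, right_rows) if left_is_smaller else (right_rows, left_rows)
    if len(build) <= max_build_rows:
        return "map-join", map_join(build, probe, left_is_smaller)
    return "reduce-join", reduce_join(left_rows, right_rows)
```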
  • 38. Multiple DISTINCT Aggs Example SELECT view_date, COUNT(DISTINCT userid), COUNT(DISTINCT page_url) FROM page_views GROUP BY view_date
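What the query above computes is easy to state in Python: per `view_date`, one set of users and one set of URLs, reported by size. This shows the semantics only, not how Hive plans the query; the function name is an assumption.

```python
from collections import defaultdict

def count_distincts(rows):
    """rows are (view_date, userid, page_url) tuples.

    Returns {view_date: (distinct_users, distinct_urls)}.
    """
    users = defaultdict(set)
    urls = defaultdict(set)
    for view_date, userid, page_url in rows:
        users[view_date].add(userid)
        urls[view_date].add(page_url)
    return {d: (len(users[d]), len(urls[d])) for d in users}
```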
  • 39. Archiving. Use HAR (Hadoop archive format) to combine many files into a few. Relieves namenode memory. ALTER TABLE page_views {ARCHIVE|UNARCHIVE} PARTITION (ds='2010-10-30')
  • 40. Concurrency Control. Pluggable distributed lock manager (default is Zookeeper-based). Simple read/write locking, table-level and partition-level. Implicit locking (statement level), deadlock-free via lock ordering. Explicit LOCK TABLE (global).
  • 41. Statistics Collection. Implicit metastore update during load, or explicit via ANALYZE TABLE. Table/partition-level: number of rows, number of files, size in bytes.
  • 42. Hive is now a TLP. PMC: Namit Jain (chair), John Sichi, Zheng Shao, Edward Capriolo, Raghotham Murthy. Committers: Amareshwari Sriramadasu, Carl Steinbach, Paul Yang, He Yongqiang, Prasad Chakka, Joydeep Sen Sarma, Ashish Thusoo, Ning Zhang.
  • 43. Developer Diversity. Recent contributors: Facebook, Yahoo, Cloudera, Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems, numerous research projects, and many more. Monthly San Francisco Bay Area contributor meetups. India meetups?
  • 44. Roadmap: Heavy-Duty Tests. Unit tests are insufficient. What is needed: real-world schemas/queries; non-toy data scales; scripted setup and configuration matrix; correctness/performance verification; automatic reports (throughput, latency, profiles, coverage, perf counters…).
  • 45. Roadmap: Shared Test Site. Nightly runs, regression alerting. Performance trending. Synthetic workload (e.g. TPC-H). Real-world workload (anonymized?). This is critical for non-subjective commit criteria and release quality.
  • 46. Roadmap: New Features. Hive Server: stability/deployment. File concatenation: reduce number of files. Performance: Bloom filters, push down filters. Cost based optimizer: column level statistics; plan should be based on statistics.