Hive Anatomy



Data Infrastructure Team, Facebook
Part of Apache Hadoop Hive Project
Overview

• Conceptual level architecture
• (Pseudo-)code level architecture
   • Parser
   • Semantic analyzer
   • Execution
• Example: adding a new Semijoin Operator

Conceptual Level Architecture

• Hive Components:
   – Parser (antlr): HiveQL → Abstract Syntax Tree (AST)
   – Semantic Analyzer: AST → DAG of MapReduce Tasks
      • Logical Plan Generator: AST → operator trees
      • Optimizer (logical rewrite): operator trees → operator trees
      • Physical Plan Generator: operator trees → MapReduce Tasks
   – Execution Libraries:
      • Operator implementations, UDF/UDAF/UDTF
      • SerDe & ObjectInspector, Metastore
      • FileFormat & RecordReader

Hive User/Application Interfaces

• CliDriver: entry point for the Hive shell script and HiPal*
• HiveServer: entry point for ODBC clients
• JDBCDriver (inheriting Driver)
• All of these drive the common Driver class.

*HiPal is a Web-based Hive client developed internally at Facebook.

Parser

• ANTLR is a parser generator.
• The grammar lives in $HIVE_SRC/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
• Hive.g defines keywords, tokens, and translations from HiveQL to AST (ASTNode.java); ParseDriver drives the parse on behalf of Driver.
• Every time Hive.g is changed, you need to ‘ant clean’ first and rebuild using ‘ant package’.

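The parser's output is a tree of ASTNode objects. As a rough illustration, the sketch below uses a simplified stand-in class (not Hive's actual ASTNode) to build the shape of tree that a query like `SELECT key FROM src` roughly produces, using Hive-style TOK_* token names:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Hive's ASTNode: a token plus child nodes.
class Ast {
    final String token;
    final List<Ast> children = new ArrayList<>();

    Ast(String token) { this.token = token; }

    Ast add(Ast child) { children.add(child); return this; }

    // Pretty-print the tree in the LISP-like style of Hive's AST dump.
    String dump() {
        if (children.isEmpty()) return token;
        StringBuilder sb = new StringBuilder("(").append(token);
        for (Ast c : children) sb.append(' ').append(c.dump());
        return sb.append(')').toString();
    }
}

public class AstDemo {
    public static void main(String[] args) {
        // Rough shape for "SELECT key FROM src"
        Ast root = new Ast("TOK_QUERY")
            .add(new Ast("TOK_FROM").add(new Ast("TOK_TABREF").add(new Ast("src"))))
            .add(new Ast("TOK_INSERT")
                .add(new Ast("TOK_SELECT").add(new Ast("TOK_SELEXPR").add(new Ast("key")))));
        System.out.println(root.dump());
        // (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_SELECT (TOK_SELEXPR key))))
    }
}
```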
Semantic Analyzer

• BaseSemanticAnalyzer is the base class for DDLSemanticAnalyzer and SemanticAnalyzer; the Driver invokes it on the AST.
   – SemanticAnalyzer handles queries, DML, and some DDL (create-table)
   – DDLSemanticAnalyzer handles alter table etc.

Logical Plan Generation

• SemanticAnalyzer.analyzeInternal() is the main function.
   – doPhase1(): recursively traverses the AST, checks for semantic errors, and gathers metadata, which is put into QB and QBParseInfo.
   – getMetaData(): queries the metastore for metadata about the data sources and puts it into QB and QBParseInfo.
   – genPlan(): takes the QB/QBParseInfo and the AST and generates an operator tree.

Logical Plan Generation (cont.)

• genPlan() is called recursively for each subquery (QB) and outputs the root of the operator tree.
• For each subquery, genPlan() creates operators “bottom-up”*, starting from
   FROM → WHERE → GROUPBY → ORDERBY → SELECT
• For the FROM clause, it generates a TableScanOperator for each source table, then calls genLateralView() and genJoinPlan().

   *Hive code actually names each leaf operator as “root” and its downstream operators as children.

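The bottom-up wiring can be sketched with stand-in classes (simplified, not Hive's actual Operator hierarchy): the leaf TableScan is the "root", and each operator forwards rows to its downstream children, here for something like `FROM src WHERE x > 1`:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for Hive's Operator: each operator forwards rows to its children.
abstract class Op {
    final List<Op> children = new ArrayList<>();
    Op then(Op child) { children.add(child); return child; }  // wire downstream op
    void forward(Object row) { for (Op c : children) c.process(row); }
    abstract void process(Object row);
}

class TableScan extends Op { void process(Object row) { forward(row); } }

class Filter extends Op {
    final java.util.function.Predicate<Object> pred;
    Filter(java.util.function.Predicate<Object> p) { pred = p; }
    void process(Object row) { if (pred.test(row)) forward(row); }
}

class Collect extends Op {
    final List<Object> rows = new ArrayList<>();
    void process(Object row) { rows.add(row); }
}

public class PlanDemo {
    public static void main(String[] args) {
        // FROM src WHERE x > 1: the leaf TableScan is the "root";
        // downstream operators are its children.
        TableScan scan = new TableScan();
        Collect sink = new Collect();
        scan.then(new Filter(r -> (Integer) r > 1)).then(sink);
        for (int x : new int[]{0, 1, 2, 3}) scan.process(x);
        System.out.println(sink.rows);  // [2, 3]
    }
}
```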
Logical Plan Generation (cont.)

• genBodyPlan() is then called to handle the WHERE-GROUPBY-ORDERBY-SELECT clauses:
   – genFilterPlan() for the WHERE clause
   – genGroupByPlanMapAgg1MR/2MR() for map-side partial aggregation
   – genGroupByPlan1MR/2MR() for reduce-side aggregation
   – genSelectPlan() for the SELECT clause
   – genReduceSink() for marking the boundary between map/reduce phases
   – genFileSink() to store intermediate results

Optimizer

• The resulting operator tree, along with other parsing info, is stored in ParseContext and passed to the Optimizer.
• The Optimizer is a set of Transformation rules on the operator tree.
• The transformation rules are specified by a regexp pattern on the tree and a Worker/Dispatcher framework.

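The Worker/Dispatcher idea can be sketched as rules keyed by a regex over the path of visited operator names, dispatched to a worker on match. The rule string and worker below are made-up stand-ins for illustration, not Hive's actual rule definitions:

```java
import java.util.*;
import java.util.regex.Pattern;

// Sketch of a rule-dispatch loop: match each rule's regex against the path of
// operator names visited so far, and invoke the matching worker.
public class DispatcherDemo {
    interface Worker { String work(String path); }

    static List<String> dispatch(Map<Pattern, Worker> rules, String path) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<Pattern, Worker> r : rules.entrySet())
            if (r.getKey().matcher(path).matches())
                results.add(r.getValue().work(path));
        return results;
    }

    public static void main(String[] args) {
        Map<Pattern, Worker> rules = new LinkedHashMap<>();
        // hypothetical rule: a FilterOperator right after a TableScanOperator
        rules.put(Pattern.compile(".*TS%FIL%"), p -> "consider pushdown at " + p);
        System.out.println(dispatch(rules, "TS%FIL%"));
    }
}
```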
Optimizer (cont.)

• Current rewrite rules include:
   – ColumnPruner
   – PredicatePushDown
   – PartitionPruner
   – GroupByOptimizer
   – SamplePruner
   – MapJoinProcessor
   – UnionProcessor
   – JoinReorder

Physical Plan Generation

• genMapRedWorks() takes the QB/QBParseInfo and the operator tree and generates a DAG of MapReduce tasks.
• The generation is also based on the Worker/Dispatcher framework while traversing the operator tree.
• Different task types: MapRedTask, ConditionalTask, FetchTask, MoveTask, DDLTask, CounterTask.
• validate() on the physical plan is called at the end of Driver.compile().

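Running a DAG of tasks means executing each task once all of its parents have finished. The sketch below is a simplified stand-in for Hive's Task graph (made-up task names, not the actual classes), showing a dependency-ordered run:

```java
import java.util.*;

// Sketch of running a DAG of tasks in dependency order.
public class TaskDagDemo {
    // children: task -> tasks that depend on it; returns an execution order.
    static List<String> runOrder(Map<String, List<String>> children) {
        Map<String, Integer> pending = new HashMap<>();   // count of unfinished parents
        children.forEach((p, cs) -> cs.forEach(c -> pending.merge(c, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        for (String t : children.keySet()) if (!pending.containsKey(t)) ready.add(t);

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String t = ready.poll();
            order.add(t);                                  // "run" the task
            for (String c : children.get(t))
                if (pending.merge(c, -1, Integer::sum) == 0) ready.add(c);
        }
        return order;
    }

    public static void main(String[] args) {
        // hypothetical DAG: two MapRedTasks feeding a final MoveTask
        Map<String, List<String>> dag = new LinkedHashMap<>();
        dag.put("map-red-1", List.of("map-red-2", "map-red-3"));
        dag.put("map-red-2", List.of("move-1"));
        dag.put("map-red-3", List.of("move-1"));
        dag.put("move-1", List.of());
        System.out.println(runOrder(dag));  // [map-red-1, map-red-2, map-red-3, move-1]
    }
}
```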
Preparing Execution

• Driver.execute() takes the output of Driver.compile() and prepares the hadoop command line (in local mode) or calls ExecDriver.execute() (in remote mode). It then:
   – Starts a session
   – Executes PreExecutionHooks
   – Creates a Runnable for each Task that can be executed in parallel, and launches threads within a certain limit
   – Monitors thread status and updates the session
   – Executes PostExecutionHooks

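The "launch threads within a certain limit" step can be sketched with a bounded thread pool (a simplified illustration, not Hive's actual launch code):

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch: run independent tasks in parallel, but never more than `limit`
// threads at once, then collect results as they complete.
public class LaunchDemo {
    public static void main(String[] args) throws Exception {
        int limit = 4;  // the thread limit
        ExecutorService pool = Executors.newFixedThreadPool(limit);
        List<Callable<String>> tasks = List.of(
            () -> "task-1 done", () -> "task-2 done", () -> "task-3 done");
        for (Future<String> f : pool.invokeAll(tasks)) {
            System.out.println(f.get());  // monitor status of each task
        }
        pool.shutdown();
    }
}
```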
Preparing Execution (cont.)

• Hadoop jobs are started from MapRedTask.execute(), which:
   – Gathers info on all needed JAR files, with ExecDriver as the starting class
   – Serializes the physical plan (MapRedTask) to an XML file
   – Gathers other info, such as the Hadoop version, and prepares the hadoop command line
   – Executes the hadoop command line in a separate process

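The XML serialization is bean-based (java.beans.XMLEncoder), which only persists properties exposed through public getters and setters — the reason for the setter/getter rule on descriptor classes noted later. A minimal round-trip sketch with a hypothetical stand-in descriptor:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Sketch of plan serialization via java.beans.XMLEncoder: only bean
// properties with public getters/setters survive the round trip.
public class PlanXmlDemo {
    public static class Desc {               // public class, no-arg constructor
        private int numReducers;
        public int getNumReducers() { return numReducers; }
        public void setNumReducers(int n) { numReducers = n; }
    }

    static Desc roundTrip(Desc d) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(buf)) { enc.writeObject(d); }
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(buf.toByteArray()))) {
            return (Desc) dec.readObject();
        }
    }

    public static void main(String[] args) {
        Desc d = new Desc();
        d.setNumReducers(32);
        System.out.println(roundTrip(d).getNumReducers());  // 32
    }
}
```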
Starting Hadoop Jobs

• ExecDriver deserializes the plan from the XML file and calls execute().
• execute() sets up the number of reducers, the job scratch dir, the starting mapper class (ExecMapper), the starting reducer class (ExecReducer), and other info in the JobConf, then submits the job through hadoop.mapred.JobClient.
• The query plan is serialized again into a file and put into the DistributedCache, to be sent out to mappers/reducers before the job is started.

Operator

• ExecMapper creates a MapOperator as the parent of all root operators in the query plan and starts executing on the operator tree.
• Each Operator class comes with a descriptor class, which carries metadata from compilation to execution.
   – Any metadata variable that needs to be passed should have a public setter & getter in order for the XML serializer/deserializer to work.
• The operator interface contains:
   – initialize(): called once per operator lifetime
   – startGroup()*: called once for each group (groupby/join)
   – process()*: called once for each input row
   – endGroup()*: called once for each group (groupby/join)
   – close(): called once per operator lifetime

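The lifecycle above can be sketched with a toy operator that counts rows per group — a simplified stand-in, not Hive's actual Operator class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the operator lifecycle: initialize once, then startGroup/process/
// endGroup per group, then close once.
public class LifecycleDemo {
    static class CountPerGroup {
        final List<Integer> counts = new ArrayList<>();
        int current;
        void initialize() { counts.clear(); }     // once per operator lifetime
        void startGroup() { current = 0; }        // once per group
        void process(Object row) { current++; }   // once per input row
        void endGroup() { counts.add(current); }  // once per group
        void close() { /* release resources */ }  // once per operator lifetime
    }

    public static void main(String[] args) {
        CountPerGroup op = new CountPerGroup();
        op.initialize();
        for (int[] group : new int[][]{{1, 2, 3}, {4, 5}}) {
            op.startGroup();
            for (int row : group) op.process(row);
            op.endGroup();
        }
        op.close();
        System.out.println(op.counts);  // [3, 2]
    }
}
```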
Example: Adding a Semijoin Operator

• Left semijoin is similar to inner join, except that only the left-hand-side table is output, and no duplicated join values are produced even if there are duplicate keys in the RHS table.
   – IN/EXISTS semantics
   – SELECT * FROM S LEFT SEMI JOIN T ON S.KEY = T.KEY AND S.VALUE = T.VALUE
   – Outputs all columns of table S if its (key, value) matches at least one (key, value) pair in T

Semijoin Implementation

• Parser: add the SEMI keyword.
• SemanticAnalyzer:
   – doPhase1(): keep a mapping of the RHS table name and its join key columns in QBParseInfo.
   – genJoinTree(): set the new join type in joinDesc.joinCond.
   – genJoinPlan():
      • Generate a map-side partial group-by operator right after the TableScanOperator for the RHS table. The input & output columns of the group-by operator are the RHS join keys.

Semijoin Implementation (cont.)

• SemanticAnalyzer:
   – genJoinOperator(): generate a JoinOperator (left-semi type) and set the output fields to the LHS table's fields.
• Execution:
   – In CommonJoinOperator, implement left semi join with an early-exit optimization: as soon as the RHS table of the left semi join is non-null, return the row from the LHS table.

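The semantics and the early-exit idea can be sketched in miniature: collect the RHS join keys into a set (duplicates collapse, as with the map-side partial group-by) and emit each LHS row at most once on its first match. This is a toy illustration, not Hive's CommonJoinOperator code:

```java
import java.util.*;

// Sketch of left semi join with early exit: one RHS match suffices, and
// duplicate RHS keys never produce duplicate output rows.
public class SemiJoinDemo {
    static List<String[]> leftSemiJoin(List<String[]> lhs, Collection<String> rhsKeys) {
        Set<String> keys = new HashSet<>(rhsKeys);   // duplicate RHS keys collapse here
        List<String[]> out = new ArrayList<>();
        for (String[] row : lhs)
            if (keys.contains(row[0]))               // early exit: emit row once, move on
                out.add(row);
        return out;
    }

    public static void main(String[] args) {
        List<String[]> s = List.of(new String[]{"a", "1"}, new String[]{"b", "2"},
                                   new String[]{"c", "3"});
        List<String> t = List.of("a", "a", "c");     // RHS with duplicate keys
        for (String[] row : leftSemiJoin(s, t))
            System.out.println(Arrays.toString(row));  // [a, 1] then [c, 3]
    }
}
```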
Debugging

• Debugging compile-time code (Driver through ExecDriver) is relatively easy since it runs in the JVM on the client side.
• Debugging execution-time code (after ExecDriver calls hadoop) needs some configuration. See the wiki: http://wiki.apache.org/hadoop/Hive/DeveloperGuide#Debugging_Hive_code

Questions?

The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Kürzlich hochgeladen (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Hive Anatomy

  • 1. Hive Anatomy. Data Infrastructure Team, Facebook. Part of the Apache Hadoop Hive Project.
  • 2. Overview • Conceptual level architecture • (Pseudo-)code level architecture • Parser • Semantic analyzer • Execution • Example: adding a new Semijoin Operator
  • 3. Conceptual Level Architecture • Hive Components: – Parser (antlr): HiveQL → Abstract Syntax Tree (AST) – Semantic Analyzer: AST → DAG of MapReduce Tasks • Logical Plan Generator: AST → operator trees • Optimizer (logical rewrite): operator trees → operator trees • Physical Plan Generator: operator trees → MapReduce Tasks – Execution Libraries: • Operator implementations, UDF/UDAF/UDTF • SerDe & ObjectInspector, Metastore • FileFormat & RecordReader
  • 4.
  • 5. Hive User/Application Interfaces (diagram): CliDriver, Hive shell script, HiPal*, Driver, HiveServer, ODBC, JDBCDriver (inheriting Driver). *HiPal is a Web-based Hive client developed internally at Facebook.
  • 6. Parser • ANTLR is a parser generator. • $HIVE_SRC/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g • Hive.g defines keywords, tokens, and translations from HiveQL to AST (ASTNode.java) • Every time Hive.g is changed, you need to 'ant clean' first and rebuild using 'ant package'
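The HiveQL → AST step can be pictured with a toy sketch. This is plain Python, not Hive's ANTLR-generated parser; the token names loosely mimic Hive's `TOK_*` conventions but everything here is illustrative:

```python
# Toy illustration of the parse stage: a query string becomes a nested
# AST, analogous to Hive's ANTLR-generated parser producing ASTNode trees.
# This is NOT Hive code -- just a sketch of the HiveQL -> AST idea.

def parse(query):
    """Parse a trivial 'SELECT <cols> FROM <table>' query into an AST."""
    tokens = query.replace(",", " , ").split()
    assert tokens[0].upper() == "SELECT", "only SELECT is sketched here"
    from_idx = [t.upper() for t in tokens].index("FROM")
    cols = [t for t in tokens[1:from_idx] if t != ","]
    table = tokens[from_idx + 1]
    # Mirror Hive's AST shape loosely: (TOK_QUERY (TOK_FROM ...) (TOK_SELECT ...))
    return ("TOK_QUERY",
            ("TOK_FROM", ("TOK_TABREF", table)),
            ("TOK_SELECT", *[("TOK_SELEXPR", c) for c in cols]))

print(parse("SELECT key, value FROM src"))
```

The real grammar, of course, handles the full HiveQL surface and is regenerated by ANTLR whenever Hive.g changes, which is why the clean rebuild is required.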
  • 7. Semantic Analyzer • BaseSemanticAnalyzer is the base class for DDLSemanticAnalyzer and SemanticAnalyzer – SemanticAnalyzer handles queries, DML, and some DDL (create-table) – DDLSemanticAnalyzer handles alter table etc.
  • 8. Logical Plan Generation • SemanticAnalyzer.analyzeInternal() is the main function – doPhase1(): recursively traverses the AST, checks for semantic errors, and gathers metadata, which is put into QB and QBParseInfo. – getMetaData(): queries the metastore for metadata about the data sources and puts it into QB and QBParseInfo. – genPlan(): takes the QB/QBParseInfo and the AST and generates an operator tree.
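The three phases above can be sketched in miniature. The `QB` class and the functions below are simplified stand-ins for Hive's QB/QBParseInfo and the analyzer methods, not the real API:

```python
# Sketch of SemanticAnalyzer.analyzeInternal()'s three phases, reduced to
# plain Python. QB is a stand-in for Hive's QB/QBParseInfo; the metastore
# is a plain dict. Nothing here is real Hive API.

class QB:
    def __init__(self):
        self.tables = []      # filled by phase 1 (AST walk)
        self.metadata = {}    # filled by get_metadata (metastore lookup)

def do_phase1(ast, qb):
    """Walk the AST and collect referenced tables (stand-in for the
    semantic checks and metadata gathering doPhase1() performs)."""
    kind, *children = ast
    if kind == "TOK_TABREF":
        qb.tables.append(children[0])
    for child in children:
        if isinstance(child, tuple):
            do_phase1(child, qb)

def get_metadata(qb, metastore):
    for t in qb.tables:
        qb.metadata[t] = metastore[t]   # would raise on an unknown table

def gen_plan(ast, qb):
    """Return a toy 'operator tree': a TableScan op per source table."""
    return [("TS", t, qb.metadata[t]) for t in qb.tables]

ast = ("TOK_QUERY", ("TOK_FROM", ("TOK_TABREF", "src")), ("TOK_SELECT",))
qb = QB()
do_phase1(ast, qb)
get_metadata(qb, {"src": ["key", "value"]})
print(gen_plan(ast, qb))
```

The point of the shape: phase 1 only walks the AST, metadata lookup is a separate pass against the metastore, and only then does plan generation run with both in hand.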
  • 9. Logical Plan Generation (cont.) • genPlan() is called recursively for each subquery (QB) and outputs the root of the operator tree. • For each subquery, genPlan creates operators "bottom-up"* starting from FROM → WHERE → GROUPBY → ORDERBY → SELECT. • In the FROM clause, generate a TableScanOperator for each source table, then genLateralView() and genJoinPlan(). *Hive code actually names each leaf operator as "root" and its downstream operators as children.
  • 10. Logical Plan Generation (cont.) • genBodyPlan() is then called to handle the WHERE-GROUPBY-ORDERBY-SELECT clauses. – genFilterPlan() for the WHERE clause – genGroupByPlanMapAgg1MR/2MR() for map-side partial aggregation – genGroupByPlan1MR/2MR() for reduce-side aggregation – genSelectPlan() for the SELECT clause – genReduceSink() for marking the boundary between map/reduce phases – genFileSink() to store intermediate results
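The clause-by-clause build order can be sketched as a fold that wires each clause's operator on top of the previous one, capped by a file sink. The function and clause names below are toy stand-ins for genFilterPlan/genGroupByPlan/genSelectPlan/genFileSink, not real Hive methods:

```python
# Toy sketch of genBodyPlan()'s build order: each clause present in the
# query adds one operator whose input is the tree built so far, and
# genFileSink() caps the result. Stand-in names, not Hive code.

def gen_body_plan(source_op, clauses):
    op = source_op
    for clause in ("WHERE", "GROUPBY", "ORDERBY", "SELECT"):
        if clause in clauses:
            op = (clause, op)         # new operator, old tree as its input
    return ("FS", op)                 # file sink stores the result

plan = gen_body_plan(("TS", "src"), {"WHERE", "SELECT"})
print(plan)
```

Reading the nesting inside-out gives the data-flow order: rows leave the table scan, pass the filter, then the select, then the sink, which matches the bottom-up construction described on the previous slide.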
  • 11. Optimizer • The resulting operator tree, along with other parsing info, is stored in ParseContext and passed to the Optimizer. • The Optimizer is a set of transformation rules on the operator tree. • The transformation rules are specified by a regexp pattern on the tree and a Worker/Dispatcher framework.
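The pattern-plus-dispatcher idea can be sketched as a tree walker that fires the first rule whose pattern matches the visited operator's name. This is a cartoon of the mechanism, not Hive's actual Rule/Dispatcher API, and the "predicate pushdown" rule below is deliberately simplistic:

```python
# Sketch of the optimizer's rule/walker idea: each rewrite rule is keyed
# by a pattern over operator names; a dispatcher fires the matching rule
# as the walker visits the tree. Simplified stand-in, not Hive code.
import re

def walk(op, rules):
    """Post-order walk; apply the first rule whose pattern matches."""
    name, children = op
    children = [walk(c, rules) for c in children]
    for pattern, transform in rules:
        if re.fullmatch(pattern, name):
            name, children = transform(name, children)
            break
    return (name, children)

# Example rule: fold a FIL (filter) sitting directly above a TS (table
# scan) into the scan itself -- a cartoon of predicate pushdown.
def push_filter(name, children):
    if len(children) == 1 and children[0][0] == "TS":
        return ("TS+FIL", children[0][1])
    return (name, children)

tree = ("SEL", [("FIL", [("TS", [])])])
print(walk(tree, [(r"FIL", push_filter)]))
```

Each of the rewrite rules on the next slide plugs into the framework the same way: a pattern selects where in the tree the rule applies, and the worker performs the local rewrite.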
  • 12. Optimizer (cont.) • Current rewrite rules include: – ColumnPruner – PredicatePushDown – PartitionPruner – GroupByOptimizer – SamplePruner – MapJoinProcessor – UnionProcessor – JoinReorder
  • 13. Physical Plan Generation • genMapRedWorks() takes the QB/QBParseInfo and the operator tree and generates a DAG of MapReduce Tasks. • The generation is also based on the Worker/Dispatcher framework while traversing the operator tree. • Different task types: MapRedTask, ConditionalTask, FetchTask, MoveTask, DDLTask, CounterTask • validate() on the physical plan is called at the end of Driver.compile().
  • 14. Preparing Execution • Driver.execute() takes the output from Driver.compile() and prepares the hadoop command line (in local mode) or calls ExecDriver.execute() (in remote mode): – Start a session – Execute PreExecutionHooks – Create a Runnable for each Task that can be executed in parallel and launch threads within a certain limit – Monitor thread status and update the session – Execute PostExecutionHooks
  • 15. Preparing Execution (cont.) • Hadoop jobs are started from MapRedTask.execute(): – Get info on all needed JAR files, with ExecDriver as the starting class – Serialize the physical plan (MapRedTask) to an XML file – Gather other info such as the Hadoop version and prepare the hadoop command line – Execute the hadoop command line in a separate process.
  • 16. Starting Hadoop Jobs • ExecDriver deserializes the plan from the XML file and calls execute(). • execute() sets up the number of reducers, the job scratch dir, the starting mapper class (ExecMapper), the starting reducer class (ExecReducer), and other info in JobConf, and submits the job through hadoop.mapred.JobClient. • The query plan is again serialized into a file and put into the DistributedCache to be sent out to mappers/reducers before the job is started.
  • 17. Operator • ExecMapper creates a MapOperator as the parent of all root operators in the query plan and starts executing on the operator tree. • Each Operator class comes with a descriptor class, which contains metadata passed from compilation to execution. – Any metadata variable that needs to be passed should have a public setter & getter in order for the XML serializer/deserializer to work. • The operator's interface contains: – initialize(): called once per operator lifetime – startGroup()*: called once for each group (groupby/join) – process()*: called once for each input row – endGroup()*: called once for each group (groupby/join) – close(): called once per operator lifetime
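The lifecycle above can be sketched with a minimal operator tree. This is a Python stand-in for Hive's Java operators, omitting descriptors, ObjectInspectors, and the group methods; the FilterOperator and CollectOperator classes are invented for the illustration:

```python
# Sketch of the Operator lifecycle: initialize() once, process() per row
# (forwarding to children), close() once. Toy stand-in for Hive's Java
# operators; descriptor classes and ObjectInspectors are omitted.

class Operator:
    def __init__(self, children=None):
        self.children = children or []
    def initialize(self):            # once per operator lifetime
        for c in self.children:
            c.initialize()
    def process(self, row):          # once per input row
        self.forward(row)
    def forward(self, row):
        for c in self.children:
            c.process(row)
    def close(self):                 # once per operator lifetime
        for c in self.children:
            c.close()

class FilterOperator(Operator):
    def __init__(self, predicate, children=None):
        super().__init__(children)
        self.predicate = predicate   # stand-in for descriptor metadata
    def process(self, row):
        if self.predicate(row):      # drop rows failing the predicate
            self.forward(row)

class CollectOperator(Operator):
    def initialize(self):
        self.rows = []
    def process(self, row):
        self.rows.append(row)

sink = CollectOperator()
top = FilterOperator(lambda r: r["key"] > 10, [sink])
top.initialize()
for row in [{"key": 5}, {"key": 42}]:
    top.process(row)
top.close()
print(sink.rows)
```

The setter/getter requirement on descriptors exists because the whole tree, metadata included, must round-trip through XML serialization on its way from the compiling client to the mappers and reducers.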
  • 18. Example: adding a Semijoin operator • Left semijoin is similar to inner join, except that only the left-hand-side table is output, and there are no duplicated join values if there are duplicated keys in the RHS table. – IN/EXISTS semantics – SELECT * FROM S LEFT SEMI JOIN T ON S.KEY = T.KEY AND S.VALUE = T.VALUE – Output all columns in table S if its (key, value) matches at least one (key, value) pair in T
  • 19. Semijoin Implementation • Parser: adding the SEMI keyword • SemanticAnalyzer: – doPhase1(): keep a mapping of the RHS table name and its join key columns in QBParseInfo. – genJoinTree(): set the new join type in joinDesc.joinCond – genJoinPlan(): • generate a map-side partial group-by operator right after the TableScanOperator for the RHS table. The input & output columns of the group-by operator are the RHS join keys.
  • 20. Semijoin Implementation (cont.) • SemanticAnalyzer – genJoinOperator(): generate a JoinOperator (left-semi type) and set the output fields to the LHS table's fields • Execution – In CommonJoinOperator, implement left semi join with an early-exit optimization: as long as the RHS side of the left semi join is non-null, return the row from the LHS table.
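The semantics being implemented can be condensed into a few lines. This is a sketch of left-semi-join behavior, not Hive's streaming CommonJoinOperator: the set built over the RHS plays the role of the map-side group-by on the RHS join keys, and the single membership test per LHS row plays the role of the early exit:

```python
# Sketch of left semi join semantics: emit each LHS row whose join key
# appears at least once on the RHS, with no duplicates caused by repeated
# RHS keys. Plain Python illustration, not Hive's operator code.

def left_semi_join(lhs_rows, rhs_rows, key):
    # Deduplication here mirrors the map-side partial group-by that the
    # analyzer inserts over the RHS join keys.
    rhs_keys = {key(r) for r in rhs_rows}
    # Early exit: one membership test per LHS row, regardless of how many
    # RHS rows share the key.
    return [r for r in lhs_rows if key(r) in rhs_keys]

S = [("a", 1), ("b", 2), ("c", 3)]
T = [("a", 1), ("a", 1), ("b", 9)]            # duplicate keys on the RHS
print(left_semi_join(S, T, key=lambda r: r))  # joins on (key, value)
```

Note how the duplicated `("a", 1)` rows in T still yield a single output row for S, which is exactly the IN/EXISTS semantics the slide describes.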
  • 21. Debugging • Debugging compile-time code (Driver till ExecDriver) is relatively easy, since it runs in the JVM on the client side. • Debugging execution-time code (after ExecDriver calls hadoop) needs some configuration. See the wiki: http://wiki.apache.org/hadoop/Hive/DeveloperGuide#Debugging_Hive_code