
Hadoop from Hive with Stinger to Tez


Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements that the Stinger initiative brought to Hive.


  1. Hadoop: From Hive with Stinger to Tez. Jan Pieter Posthuma, March 5, 2015 (www.rubicon.nl)
  2. Introduction
     - Jan Pieter Posthuma
     - Microsoft Data Consultant
     - Rubicon, local consultancy firm in the Netherlands
     - Architect role at multiple projects
     - Analysis Services, Reporting Services, Big Data, HDInsight, Cloud BI, Power BI
     - http://twitter.com/jppp, http://linkedin.com/jpposthuma, jp.posthuma@rubicon.nl
  3. Agenda: Hadoop, Hive, Stinger, Tez
  4. Hadoop
     - Hadoop is a collection of software for building a data-intensive distributed cluster on commodity hardware: ‘store and process the data on the Internet in a simple, scalable and economically feasible way’
     - Widely accepted by database vendors as a solution for unstructured data
     - Microsoft partners with Hortonworks and delivers the Hortonworks Data Platform as Microsoft HDInsight (now on Windows and Linux)
     - Available on premises and as an Azure service
     - Hortonworks Data Platform (HDP) is 100% open source
  5. Why SQL on Hadoop?
     - "Hadoop is great for cost, but MapReduce is too difficult."
     - "SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer."
     - "I’m deleting important data because it’s too expensive to store it."
  6. Hive
     - Facebook developed Hive to address traditional RDBMS limitations.
     - 300+ PB of data under management.
     - 600+ TB of data loaded daily.
     - 60,000+ Hive queries per day.
     - More than 1,000 users per day.
     - Initial Apache release in April 2009.
     - Problem: Hive is bound to MapReduce, which adds latency; it needs higher performance.
  7. Stinger: ‘Making Apache Hive 100 Times Faster’ (Hortonworks blog, February 2013)
     - SQL engine: vectorized SQL engine
     - Columnar storage: ORCFile
     - Distributed execution: Apache Tez
     - Combined: 100X+
  8. ORCFiles
     - Started by Hortonworks to optimize existing RCFiles, with input from Microsoft to cooperate with QE and Tez
     - Two goals:
       - Improve query speed
       - Improve storage efficiency
     - CREATE TABLE … STORED AS ORC
  9. YARN
  10. Tez
  11. Stinger TPC-DS Benchmark at 30 Terabyte Scale
     - Sample of 50 queries from TPC-DS at 30 terabyte scale.
     - Average 52x query speedup, maximum 160x query speedup.
     - Total benchmark time decreased from 7.8 days to 9.3 hours. (3)
     - Cost-based optimizer added in Hive 0.14 gave an additional 2.5x speedup.
  12. Stinger.Next (in 3 phases)
     - Transactions with ACID semantics: allow users to easily modify data with inserts, updates and deletes. This extends Hive from the traditional write-once, read-often system to support analytics over changing data.
     - Sub-second queries: allow users to deploy Hive for interactive dashboards and explorative analytics that have more demanding response-time requirements. Emergence of LLAP (Live Long and Process) and Hive on Spark.
     - SQL:2011 Analytics: allows rich reporting to be deployed on Hive faster, more simply and reliably using standard SQL. A powerful cost-based optimizer ensures complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.
  13. Apache Hive: Modern Architecture (diagram; legend: Historical, Current, In Development)
     - Storage: columnar storage (ORCFile, Parquet); unstructured data (JSON, CSV, Text, Avro, Custom, Weblog)
     - Engine: SQL engines (Row Engine, Vector Engine)
     - SQL: SQL support (SQL:2011), Optimizer, HCatalog, HiveServer2
     - Cache: Block Cache (Linux Cache), Vector Cache (LLAP)
     - Distributed execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez, Spark), Persistent Server
  14. Questions?
  15. Links
     - Microsoft Big Data: http://www.microsoft.com/bigdata
     - Hortonworks: http://www.hortonworks.com
     - Try it yourself via Windows Azure HDInsight: http://azure.com/hdinsight
  16. Useful resources
     - http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final/
     - http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/
     - http://hortonworks.com/labs/stinger/
     - http://hortonworks.com/blog/100x-faster-hive/
     - http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance?qid=2cd74ce1-e863-436c-a1ab-52a513c61a27&v=default&b=&from_search=10
     - http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-
     - http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit
     - http://hortonworks.com/blog/microsofts-contributions-to-the-stinger-initiative-and-apache-hive/
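
As a rough check on the benchmark figures in slide 11, the sketch below converts the quoted total run times to a common unit and derives the overall wall-clock speedup. Only the 7.8 days and 9.3 hours come from the slide; the variable names are illustrative.

```python
# Figures quoted on the TPC-DS benchmark slide (30 TB scale).
HOURS_PER_DAY = 24
baseline_days = 7.8   # total benchmark time before Stinger
stinger_hours = 9.3   # total benchmark time after Stinger

# Convert the baseline to hours so both values share a unit.
baseline_hours = baseline_days * HOURS_PER_DAY
overall_speedup = baseline_hours / stinger_hours

print(f"baseline: {baseline_hours:.1f} h")            # baseline: 187.2 h
print(f"overall speedup: {overall_speedup:.1f}x")     # overall speedup: 20.1x
```

The overall speedup of roughly 20x is lower than the 52x average per-query speedup. The two numbers measure different things: an average of per-query ratios weights every query equally, while total run time is dominated by the longest-running queries.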

Editor's notes

  • Hadoop: based on Google’s academic papers (2003 and 2004) on distributed storage (GFS, the model for HDFS) and distributed data processing (MapReduce).
    Hadoop started in 2005.
  • Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
    Launched: February 2013; Delivered: April 2014.
    Delivered in 100% Apache Open Source.

    The baseline was Hive 0.10; the work was delivered in roughly 14 months (February 2013 to April 2014) in three phases:
    1. Introducing ORC Files (Optimized Row Columnar)
    2. Vectorized Query Engine
    3. Hive and Tez

  • ORC File structure:
    Default stripe size is 250 MB
    File footer:
      - Stripe locations
      - Row count per stripe and data type of each column
      - Statistics of each column (min, max, count and sum)
      - Compression parameters
    Stripe index:
      - Min and max values for each column
      - Row index for position in file (default every 10,000 rows)
    Stripe footer:
      - Column stream locations (such as Row Data, Nullable and Dictionaries)
      - Column encoding
  • YARN (Yet Another Resource Negotiator).
    In Hadoop 1, MapReduce is both the data processor and the cluster resource manager, centrally managed by the JobTracker on the head node. This is inefficient.
    YARN splits the JobTracker's responsibilities between a ResourceManager (head node) and a per-application ApplicationMaster, with a NodeManager on each worker node. Each NodeManager reports the status of its node to the ResourceManager.
  • Tez Sessions
    – Hot containers ready for immediate use
    – Removes task and job launch overhead (~5s – 30s)
    – Session launch/shutdown in background (seamless, user not aware)
    – Submits query plan directly to Tez Session

    - Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data.
    - Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules.
    - YARN manages resources in a Hadoop cluster, based on cluster capacity and load.
    In short: Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf.
  • Stinger.Next: started in the second half of 2014. Phase 1 has been delivered, so ACID transactions are now possible via a delta mechanism.
    The next two phases are scheduled for release in 2015 (first and second half).
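
The Tez notes above describe a job as a dataflow graph whose vertices hold application logic and whose edges carry data, with each vertex composed of Input, Processor and Output parts. The sketch below illustrates that idea only. It is not the Tez API (which is Java), and the three-vertex plan, the `make_vertex` and `run_dag` helpers and the sample rows are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def make_vertex(processor):
    """Build a vertex as Input -> Processor -> Output (illustrative only)."""
    def run(upstream_outputs):
        # Input: merge the rows arriving on all incoming edges.
        rows = [row for part in upstream_outputs for row in part]
        # Processor: apply the vertex logic; the return value is the Output.
        return processor(rows)
    return run

# Hypothetical plan: scan -> filter -> aggregate.
vertices = {
    "scan": make_vertex(lambda rows: [3, 8, 15, 4]),  # source, ignores input
    "filter": make_vertex(lambda rows: [r for r in rows if r > 3]),
    "aggregate": make_vertex(lambda rows: [sum(rows)]),
}
# Edges, stored as {vertex: set of upstream vertices}.
edges = {"scan": set(), "filter": {"scan"}, "aggregate": {"filter"}}

def run_dag(vertices, edges):
    """Run every vertex after all of its upstream vertices have finished."""
    outputs = {}
    for name in TopologicalSorter(edges).static_order():
        upstream = [outputs[dep] for dep in sorted(edges[name])]
        outputs[name] = vertices[name](upstream)
    return outputs

result = run_dag(vertices, edges)
print(result["aggregate"])  # [27]
```

Running the vertices in topological order stands in for the scheduling that Tez and YARN perform on a real cluster, where each vertex is split into tasks that run in containers.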