sql on hadoop

SQL engines for big data

Published in: Technology

  1. Choosing the Right Tool for the Right Job: Overview of Cloudera’s SQL-on-Hadoop Technologies. Jianwei Li, Jarred@cloudera.com. © Cloudera, Inc. All rights reserved.
  2. Hadoop Ecosystem (architecture diagram)
     • Operations: Cloudera Manager, Cloudera Director
     • Data management: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
     • Integrate: structured data (Sqoop); unstructured data (Kafka, Flume)
     • Process, analyze, serve: batch (Spark, Hive, Pig, MapReduce); stream (Spark); SQL (Impala); search (Solr); SDK (Kite)
     • Unified services: resource management (YARN); security (Sentry, RecordService)
     • Store: filesystem (HDFS); relational (Kudu); NoSQL (HBase)
  3. Choosing the Right SQL Engine: Know Your Audience, Know Your Use Case (Hive, Spark SQL, or Impala)
  4. Hive: SQL on Hadoop
  5. What is Hive?
     • Data warehouse system for Hadoop
     • Enables Extract/Transform/Load (ETL)
     • Associates structure with a variety of data formats
     • Integrates with HDFS, HBase, MongoDB, etc.
     • Query execution in MapReduce
  6. Hive Architecture
  7. Hive features
     • Create table, create view, create index (DDL)
     • Select, where clause, group by, order by, joins
     • Pluggable User Defined Functions: UDFs (e.g. from_unixtime)
     • Pluggable User Defined Aggregate Functions: UDAFs (e.g. count, avg)
     • Pluggable User Defined Table Generating Functions: UDTFs (e.g. explode)
     • Pluggable custom Input/Output format
     • Pluggable Serialization/Deserialization libraries (SerDes)
     • Pluggable custom map and reduce scripts
  8. What Hive does NOT support
     • OLTP workloads (low latency)
     • Not very performant on small amounts of data
     • How much data do you need to call it “Big Data”?
  9. The Future of Hive
     • Hive is designed for, and great at, batch processing
     • Hive is not architected for low-latency, multi-user interactive queries
     • Hive-on-DAG (Spark) provides incrementally faster batch processing
  10. Spark SQL: SQL on Hadoop
  11. DataFrames
     • Distributed collection of data organized as named, typed columns
     • Like RDDs, they consist of partitions, can be cached, and have fault tolerance via lineage
     • Can be constructed from:
       • Structured data files: JSON, Avro, Parquet, etc.
       • Tables in Hive
       • Tables in an RDBMS
       • Existing RDDs, by programmatically applying a schema
  12. Spark SQL
     • SQL statements to process DataFrames
     • Embed SQL statements in your Scala, Java, or Python Spark application
     • Queries can also be issued via JDBC/ODBC
  13. Spark SQL Performance
     SQL processed by Query Optimizer → Automatic Optimizations
     • Compressed memory format (as opposed to Java-serialized objects in RDDs)
     • Predicate pushdown (read less data to reduce I/O)
     • Optimal pipelining of operations
     • Cost-based optimizer
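  14. SparkSQL Example
     import org.apache.spark.sql.functions.count
     // Movie and Rating are assumed to be case classes built from one line of text each
     val movies  = sc.textFile("movies.txt").map(Movie(_)).toDF()
     val ratings = sc.textFile("ratings.txt").map(Rating(_)).toDF()
     // Join on titleId, keep November rows, count ratings per title
     movies.join(ratings, "titleId")
       .filter("month = 'Nov'")
       .groupBy(movies("title"))
       .agg(count(ratings("rating")))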
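The same join, filter, group, and count can also be written as a single SQL statement, which is the form Spark SQL accepts through its SQL interface. The sketch below runs an equivalent query with Python's built-in sqlite3 module, purely as a local stand-in so that it executes without a cluster. It is not Spark, and the table layouts and sample rows are hypothetical, since the slide does not show the contents of movies.txt or ratings.txt.

```python
import sqlite3

# Hypothetical schemas and rows; the slide does not show the real files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies  (titleId INTEGER, title TEXT);
    CREATE TABLE ratings (titleId INTEGER, month TEXT, rating INTEGER);
""")
conn.executemany("INSERT INTO movies VALUES (?, ?)",
                 [(1, "Alpha"), (2, "Beta")])
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)",
                 [(1, "Nov", 5), (1, "Nov", 3), (1, "Oct", 4), (2, "Nov", 2)])

# Join on titleId, keep November ratings, count ratings per title.
rows = conn.execute("""
    SELECT m.title, COUNT(r.rating) AS n_ratings
    FROM movies m JOIN ratings r ON m.titleId = r.titleId
    WHERE r.month = 'Nov'
    GROUP BY m.title
    ORDER BY m.title
""").fetchall()

print(rows)  # [('Alpha', 2), ('Beta', 1)]
```

The DataFrame chain and the SQL text describe the same logical plan, which is why an optimizer can treat them alike.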
  15. Mixing SparkSQL and Machine Learning
  16. Why Spark SQL
     • Ease of embedding SQL into Java, Scala, or Python applications
     • Easy language for common operations (e.g. aggregations, filters, samples)
     • Seamlessly mix SQL and Spark code within a single application
     • Improved performance with automatic optimizations (Intelligent Query Engine)
  17. Impala: SQL on Hadoop
  18. What’s Impala?
     • Interactive SQL
       • Typically 5-70x faster than the latest Hive
       • Responses in seconds instead of minutes (sometimes sub-second)
     • ANSI-92 standard SQL queries with HiveQL
       • Compatible SQL interface for existing Hadoop/CDH applications
       • Based on industry standard SQL
     • Natively on Hadoop/HBase storage and metadata
       • Flexibility, scale, and cost advantages of Hadoop
       • No duplication/synchronization of data and metadata
       • Local processing to avoid network bottlenecks
     • Separate runtime from batch processing
       • Hive, Pig, and MapReduce are designed for, and great at, batch
       • Impala is purpose-built for low-latency SQL queries on Hadoop
  19. Business Intelligence with Impala (repeats the Hadoop Ecosystem diagram from slide 2)
  20. Where does the Performance Come From?
     • No MapReduce; no JVM; all native
     • In-memory data transfers
     • Optimized file format (e.g. Parquet)
     • In-memory HDFS caching
     • Cost-based join order optimization (frees the user from having to guess the correct join order)
  21. Impala Architecture
     • Impala daemon (impalad)
       • One Impala daemon on each node with data
       • Handles external client requests and all internal requests related to query execution
     • State store daemon (statestored)
       • Checks the health of the impalad instances; one process per cluster
       • Not part of the query execution path
     • Catalog service (catalogd)
       • Relays metadata changes to the impalad instances on the DataNodes
       • One process per cluster
  22. Impala Architecture
  23. Impala Architecture: Query Execution Phases
     • Client SQL arrives via ODBC/JDBC/Hue GUI/Shell
     • Planner turns the request into collections of plan fragments
     • Coordinator initiates execution on the impalad instances local to the data
     • During execution:
       • Intermediate results are streamed between executors
       • Query results are streamed back to the client
  24. Impala Architecture: Planner
     • Example: query with join and aggregation
       SELECT state, SUM(revenue)
       FROM HdfsTbl h JOIN HbaseTbl b ON (...)
       GROUP BY 1 ORDER BY 2 desc LIMIT 10
     (Diagram: the single-node plan, with an HDFS scan and an HBase scan feeding a hash join, then Agg, then TopN, shown next to its distributed form, whose fragments run at the coordinator, at DataNodes, and at region servers and are connected by exchange (Exch) nodes.)
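The distributed plan on this slide contains two aggregation steps separated by an exchange. One common way to evaluate a plan of that shape is to pre-aggregate on each node, merge the partial sums at the coordinator, and only then apply the top-N. The sketch below illustrates that idea in plain Python. It is not Impala code; the function names and the sample rows are hypothetical.

```python
from collections import Counter
import heapq

def partial_aggregate(rows):
    """Per-node pre-aggregation: SUM(revenue) grouped by state."""
    sums = Counter()
    for state, revenue in rows:
        sums[state] += revenue
    return sums

def merge_and_top_n(partials, n):
    """Coordinator side: merge partial sums, then ORDER BY sum DESC LIMIT n."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values per key
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])

# Hypothetical joined rows (state, revenue) as produced on two nodes.
node_a = [("CA", 10), ("NY", 5), ("CA", 7)]
node_b = [("NY", 20), ("TX", 3)]

partials = [partial_aggregate(node_a), partial_aggregate(node_b)]
result = merge_and_top_n(partials, n=2)
print(result)  # [('NY', 25), ('CA', 17)]
```

Only one small dictionary per node crosses the network instead of every joined row, which is the point of placing an aggregation below the exchange.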
  25. Impala Architecture: Query Execution
     • Request arrives via ODBC/JDBC/Hue GUI/Shell
     (Diagram: a SQL app connects over ODBC and sends a SQL request; three nodes each run a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase; Hive Metastore, HDFS NN, and Statestore are cluster-wide services.)
  26. Impala Architecture: Query Execution
     • Planner turns the request into collections of plan fragments
     • Coordinator initiates execution on the impalad instances local to the data
     (Diagram: same cluster layout as on slide 25.)
  27. Impala Architecture: Query Execution
     • Intermediate results are streamed between impalad instances
     • Query results are streamed back to the client
     (Diagram: same cluster layout as on slide 25, with query results returned to the SQL app.)
  28. Impala and Hive
     • Everything client-facing is shared with Hive:
       • Metadata (table definitions)
       • ODBC/JDBC drivers
       • Hue GUI
       • SQL syntax (HiveQL)
       • Flexible file formats
       • Machine pool
     • Internal improvements:
       • Purpose-built query engine directly on HDFS and HBase
       • No JVM startup and no MapReduce
       • In-memory data transfers
       • Native distributed relational query engine
  29. Parquet Overview
     • State-of-the-art, open source columnar file format that’s available for (most) Hadoop processing frameworks:
       • Impala, Hive, Pig, MapReduce, Spark, Cascading, Crunch, Drill, Tajo, …
     • Offers both high compression and high scan efficiency
     • Co-developed by Twitter and Cloudera
       • Contributions from Criteo, Stripe, Berkeley AMPlab, LinkedIn
     • Top-Level Apache Project
  30. Columnar storage
     • Tweet_id: {25059873, 22309487, 23059861, 23010982}
     • User_name: {newsycbot, RideImpala, fastly, llvmorg}
     • Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
     • text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
  31. Columnar storage
     • Tweet_id (1GB): {25059873, 22309487, 23059861, 23010982}
     • User_name (2GB): {newsycbot, RideImpala, fastly, llvmorg}
     • Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
     • text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
     SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
     Only read 1 column
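A back-of-the-envelope sketch of why this query is cheap on a columnar layout: the filter touches only user_name, so only that column has to be scanned. The snippet assumes the four sizes on the slide (1GB, 2GB, 1GB, 200GB) belong to the columns in the order shown; that mapping and the helper name are assumptions, and the truncated text column is left out of the sample data.

```python
# Column sizes in GB as printed on the slide; the mapping to column names
# (in slide order) is an assumption.
column_size_gb = {"tweet_id": 1, "user_name": 2, "created_at": 1, "text": 200}

# A tiny column-oriented table: one list per column, aligned by row index.
tweets = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
}

def count_where(table, column, value):
    """COUNT(*) with an equality filter touches only the filtered column."""
    return sum(1 for v in table[column] if v == value)

matches = count_where(tweets, "user_name", "newsycbot")
row_store_scan_gb = sum(column_size_gb.values())    # a row layout reads every column
column_store_scan_gb = column_size_gb["user_name"]  # a columnar layout reads one

print(matches, row_store_scan_gb, column_store_scan_gb)  # 1 204 2
```

Under these assumptions the columnar scan reads 2GB instead of 204GB, roughly a hundredfold reduction in I/O before any compression is applied.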
  32. Columnar compression
     Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
       Created_at      Diff(created_at)
       1442865158      n/a
       1442828307      -36851
       1442865156      36849
       1442865155      -1
       64 bits each    17 bits each
     • Many columns can compress to a few bits per row!
     • Especially:
       • Timestamps
       • Time series values
       • Low-cardinality strings
     • Massive space savings and throughput increase!
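The Diff(created_at) column is a delta encoding: keep the first value and store each later value as the difference from its predecessor. The sketch below reproduces the slide's numbers and counts bits as magnitude plus one sign bit, which is one way to arrive at the 17 bits shown. The function names are illustrative, and the encodings Parquet actually uses are more elaborate than this.

```python
created_at = [1442865158, 1442828307, 1442865156, 1442865155]

def delta_encode(values):
    """Keep the first value, then store each value minus its predecessor."""
    return values[0], [b - a for a, b in zip(values, values[1:])]

def delta_decode(first, deltas):
    """Rebuild the original values by running-summing the deltas."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

def signed_bits(deltas):
    """Bits for the largest magnitude plus one sign bit."""
    return max(abs(d) for d in deltas).bit_length() + 1

first, deltas = delta_encode(created_at)
print(deltas)               # [-36851, 36849, -1]
print(signed_bits(deltas))  # 17
assert delta_decode(first, deltas) == created_at  # the encoding is lossless
```

Because neighbouring timestamps are close together, the differences need far fewer bits than the 64-bit originals.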
  33. Impala Performance
     • Impala’s latest milestone:
       • Comparable commercial MPP DBMS speed
       • Natively on Hadoop
     • Three result sets:
       • Impala vs Hive (Impala 6-70x faster)
       • Impala vs “DBMS-Y” (Impala average of 2x faster)
       • Impala scalability (Impala achieves linear scale)
     • Background:
       • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
       • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
       • Methodical testing (multiple runs, reviewed fairness for competition, etc.)
     • Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
  34. Impala vs Hive (chart; lower bars are better)
  35. Impala vs “DBMS-Y” (chart; lower bars are better)
  36. Impala Scalability: 2x the Hardware (Expectation: Cut response times in half)
  37. Impala Scalability: 2x the Hardware and 2x Users/Data (Expectation: Constant response times)
     (Charts: 2x the Users, 2x the Hardware; 2x the Data, 2x the Hardware)
  38. Impala Roadmap
     • 2H 2015
       • SQL Support & Usability: nested structures; Kudu updates (beta)
       • Management & Security: record reader service (beta); finer-grained security (Sentry)
       • Integration: Isilon support; Python interface (Ibis)
       • Performance & Scale: improved predictability under concurrency
     • 1H 2016
       • Performance & Scale: continued scalability and concurrency; initial perf/scale improvements
       • Management & Security: improved admission control; resource utilization and showback
       • SQL Support & Usability: dynamic partitioning; improved timestamp compatibility
     • 2016
       • Performance & Scale: >20x performance; multi-threaded joins/aggregations; continued scale work
       • Management & Security: improved YARN integration; automated metadata
       • Integration: S3 support
       • SQL Support & Usability: nested types with Avro; date type; added SQL extensions
  39. Typical Usage Scenarios: SQL on Hadoop
  40. Apache Hive: Batch Processing
     • User:
       • SQL-based ETL developers
     • Designed for:
       • Handful of concurrent, very long-running batch jobs
     • Strengths:
       • Custom file formats
       • Very long-running ETL, data preparation, or batch processing
       • Massive ETL sorts with joins
       • Existing Hive jobs
  41. Apache Impala: BI and Analytics
     • User:
       • Data Analysts
       • BI Users
     • Designed for:
       • Interactive SQL for a large number of BI users and analysts
     • Strengths:
       • Multi-user scale
       • Interactive latency
       • Compatibility (BI tools, ANSI SQL, and vendor-specific SQL)
       • Usability
  42. Apache Spark SQL: Machine Learning Applications
     • User:
       • Data Engineers
       • Data Scientists
     • Designed for:
       • Ease of development for Spark developers
       • Handful of concurrent Spark jobs
     • Strengths:
       • Ease of embedding SQL into Java or Scala applications
       • SQL for common functionality in developer flow (e.g. aggregations, filters, samples)
  43. SQL-on-Hadoop Benchmark: Impala, Spark SQL, Hive-on-Tez
     • Versions:
       • Impala 2.3
       • Hive 2.0 on Tez 0.5.2 (aka “Stinger”)
       • Spark SQL 1.5 with Tungsten
     • Benchmark details:
       • Based on industry standards (TPC)
       • Repeatable (https://github.com/cloudera/impala-tpcds-kit)
       • Methodical testing with multiple runs on same hardware
       • Help competing software do well
         • Run on optimal file formats for each
         • Tune query engines appropriately
     • Full details: http://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
  44. Impala Multi-User Performance: Over 7x Faster with Just 10 Users
     (Chart: Single User vs 10 User Response Time, in seconds, with Impala's times-faster factor; lower bars = better)
     • Impala: single user 4, 10 users 12.8
     • Spark SQL (with Tungsten): single user 32 (7.2x), 10 users 97 (7.6x)
     • Hive-on-Tez: single user 59 (13.4x), 10 users 210 (16.4x)
  45. Impala Enables Nearly 7x Throughput: More Work Done in Less Time
     (Chart: Query Throughput in queries per hour, with Impala's times-faster factor; higher bars = better)
     • Bars: 2045, 302, 136.0
     • Impala throughput advantage: 6.8x, 15x
     • Series: Impala, Hive-on-Tez, Spark SQL (with Tungsten)
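The multipliers on the two benchmark charts can be recomputed from the bar values as a quick sanity check. The sketch below does so for the 10-user response times (12.8, 97, and 210 seconds) and for the throughput bars (2045, 302, and 136 queries per hour). The helper name is illustrative, and the code deliberately does not say which engine each non-Impala bar belongs to. The single-user factors (7.2x, 13.4x) are not recomputed because the Impala label of 4 seconds is presumably rounded.

```python
def times_faster(baseline, other):
    """Ratio of two measurements, rounded to one decimal as on the charts."""
    return round(other / baseline, 1)

# 10-user response times in seconds; lower is better.
impala_10_users = 12.8
slower_10_users = [97, 210]
response_factors = [times_faster(impala_10_users, t) for t in slower_10_users]

# Queries per hour; higher is better, so the ratio is inverted.
impala_qph = 2045
other_qph = [302, 136.0]
throughput_factors = [times_faster(q, impala_qph) for q in other_qph]

print(response_factors)    # [7.6, 16.4]
print(throughput_factors)  # [6.8, 15.0]
```

Both sets of factors match the labels printed on the charts.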
  46. Performance Benchmark Takeaways
     • Impala unlocks BI usage directly on Hadoop
       • Meets BI low-latency and multi-user requirements
       • Advantage expands when going from a single user to just 10 users
     • Hive is designed (and still great) for batch processing
       • Most Impala customers use Hive for data preparation
       • Hive is the most commonly used ETL framework
     • Spark SQL enables easier Spark application development
       • Enables mixed procedural Spark (Java/Scala) and SQL job development
     • Mid-term trends will further favor Impala’s design approach for latency and concurrency
       • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
       • CPU efficiency will increase in importance
       • Native code enables easy optimizations for CPU instruction sets
       • Intel joint roadmap supports these opportunities
  47. IBM Research Validation
     • VLDB academic paper compares Impala and Hive (both MR and Tez) for SQL-on-Hadoop
       • http://www.vldb.org/pvldb/vol7/p1295-floratou.pdf
     • Impala is significantly more efficient than Hive/Tez or Hive/MR
       • Impala’s lead is due to CPU efficiency, its I/O manager, and an overall architecture that resembles a shared-nothing parallel database
     • Parquet is more efficient than ORC
     • Additional notes:
       • Impala 1.4 and higher is significantly faster on selective joins than the Impala 1.2.2 used in the paper
       • Impala 2.0 has disk-based joins and aggregations
       • Paper compares single-user only. Multi-user would perform even better
     • “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce or Tez-based runtime”
     • “The Parquet format skips data more efficiently than ORC, which tends to pre-fetch unnecessary data, especially when a table contains a large number of columns”
  48. Choosing the Right SQL Engine: Know Your Audience, Know Your Use Case
       Tool         Use Case                  User
       Hive         Batch Processing          SQL-Based ETL Developer
       Impala       BI and SQL Analytics      Data Analyst
       Spark SQL    Procedural Development    Data Engineer/Data Scientist
  49. Thank You
