Bay Area Impala User Group Meetup (Sept 16 2014)

Presentations from the Bay Area Impala User Group meetup on Sept 16 2014.

Published in: Software

  1. Impala Product Update. Justin Erickson | Director, Product Management. September 2014. ©2014 Cloudera, Inc. All Rights Reserved.
  2. Agenda
       • Impala releases
       • Impala roadmap
       • Perf update
  3. Key Milestones and Features
       • Impala 1.0
           • ~SQL-92 (minus correlated subqueries)
           • Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
           • Enterprise-readiness (authentication, ODBC/JDBC drivers, etc.)
           • Service-level resource isolation with other Hadoop frameworks
       • Impala 1.1
           • Fine-grained, role-based authorization via Apache Sentry
           • Auditing (Impala 1.1.1 and CM 4.7+)
       • Impala 1.2
           • Custom language extensibility (UDFs, UDAFs)
           • Cost-based join-order optimization
           • On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
       • Impala 1.3 / CDH 5.0 (also has version for CDH 4.x)
           • Resource management
  4. Just Released: Impala 1.4 / CDH 5.1 (also with version for CDH 4.x)
       • Additional SQL:
           • DECIMAL data type
           • Additional built-in functions from EDW
           • ORDER BY without LIMIT
       • Continued performance gains:
           • HDFS caching support (CDH 5 only)
           • Faster selective joins
           • Faster COMPUTE STATS
  5. Impala near-term roadmap
       • Targeted for Impala 2.0 (fall 2014):
           • Additional SQL:
               • Analytic/window functions
               • Subqueries in the WHERE clause
               • Additional data types (VARCHAR, CHAR)
           • Disk-based joins and aggregations
           • GRANT/REVOKE
       • Considerations for Impala 2.x (priority and inclusion based on your feedback):
           • Nested/complex types (next highest priority)
           • Navigator Lineage
           • Updates via MERGE
           • Incremental stats
           • Additional SQL functions (GROUPING, ROLLUP, CUBE, MINUS, INTERSECT built-ins, etc.)
           • UDTFs
           • Intra-node parallel joins and aggregations
           • Even faster performance
           • S3 integration
  6. SQL-on-Hadoop benchmark: Impala, Presto, Stinger, Spark SQL
       • Upcoming benchmarks on latest versions of:
           • Impala (1.4.0)
           • Presto (0.74)
           • Stinger final phase (phase 3), also known as Hive 0.13.0
           • Spark SQL (1.1)
       • Published with smaller memory configuration (64 GB / node)
           • Demonstrates that leadership is independent of memory size
       • Dropped Shark given its retirement in favor of Hive-on-Spark
       • As always, our public benchmarks are:
           • Based on industry standards (TPC)
           • Repeatable (https://github.com/cloudera/impala-tpcds-kit)
           • Methodical: multiple runs on the same hardware
           • Designed to help competing software put its best foot forward
               • SQL-92 join style for engines without CBO
               • JVM tuning for Presto
               • Run on optimal file formats for each
  7. Impala’s multi-user performance is over 10x faster: gap widening compared to May’s update
  8. Faster = more work in less time: Impala enables over 8.7x throughput
  9. Performance Takeaways
       • Impala’s advantage expands from 5x single-user to >10x with just 10 users
       • Performance gap has been widening since May
           • Single-user advantage over Presto went from 5x before to 7.5x now
           • Single-user advantage over Hive/Tez went from 5x before to 9x now
       • Mid-term trends will further favor Impala’s design approach
           • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
           • CPU efficiency will increase in importance
           • Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
           • The Intel joint roadmap helps support these opportunities
  10. Try It Out!
       • 100% Apache-licensed open source
       • Downloads on http://impala.io/:
           • Live online
           • VM
           • Installation
       • Questions/comments?
           • Community: http://impala.io/community
           • Email: impala-user@cloudera.org
  12. Real Time Audience Dashboard. September 2014
  13. Introduction. Tubular Labs: SaaS platform for online video audience development (e.g. Big Data for YouTube videos). David Koblas, VP Engineering, Tubular Labs
  14. Overview. This presentation covers the work Tubular Labs has done to use Impala as one of the core components of our SaaS platform. We'll go through the pipeline for getting data into the system, how we've distributed responsibility across AWS instances, and other tips and tricks for getting real-time responses to our end-user queries over billions of data points.
  15. User Story: Audience Also Watches. For any YouTube video, can we figure out who the audience is and what other videos and channels they are watching? We also want the ability to slice the audience by demographic information, and to have it all run interactively from a web SaaS platform.
  16. Tubular App
  17. Technology Options
       • Pre-compute (e.g. Map/Reduce)
       • MySQL or similar
       • Data Warehouse
       • Impala or Redshift
       • Homebrew
  18. Impala 0.7: Now we have a technology … make it interactive … and make a bet on Cloudera
  19. Now We Have a Technology: Time to Make It Fast and Economical (Source: Tubular Labs)
  20. Pipeline
       • Loading
           • Sqoop - collect data from MySQL
           • Hive - preprocess data
       • Query
           • Impala - interactive display
           • Python - REST endpoint
  21. AWS EC2: Node types
       • m1.xlarge - 1.6 TB of instance storage - slow IO
       • hi1.4xlarge - 2 TB of SSD - expensive
       • Note: this would be an i2.4xlarge instance today
  22. Managing costs
       • Problem
           • hi1.4xlarge - expensive
           • m1.xlarge - slow IO
       • Solution: HDFS rack replication for separation
           • One copy of data on both racks
           • Hive creates tables on m1.xlarge instances
           • Impala queries on hi1.4xlarge instances
  23. Interactive Performance
       • Problem
           • Large tables take time to scan
           • No indexes
           • Need to deliver results in < 1 second
       • Solution: partitioning (duh!)
           • Partitions are targeted to be between 100 and 200 MB
           • The query log is your friend
  24. Tubular App
  25. Summary: Impala can back your SaaS application
       • We’re now running version 1.3
       • We’re “spinning” 10 TB of data
       • Delivering queries in < 2 seconds
       • We’re hiring, but you already knew that.
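
The partition-size target on the Interactive Performance slide (partitions of roughly 100 to 200 MB) can be turned into a quick sizing check. The following is a minimal sketch, not something from the talk: the function names, the 150 MB midpoint, and the binary definition of a megabyte are all assumptions made for illustration.

```python
# Hypothetical helpers for sanity-checking partition sizes against the
# 100 to 200 MB target mentioned in the slides. Names and defaults are
# illustrative assumptions, not part of Impala or of Tubular Labs' code.

MB = 1024 ** 2  # assumption: binary megabytes

LOW = 100 * MB   # lower bound of the target range
HIGH = 200 * MB  # upper bound of the target range


def partitions_needed(table_bytes, target_bytes=150 * MB):
    """Number of partitions that keeps each one near target_bytes."""
    if table_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return -(-table_bytes // target_bytes)


def out_of_range(partition_sizes, low=LOW, high=HIGH):
    """Return names of partitions whose size falls outside [low, high]."""
    return sorted(
        name for name, size in partition_sizes.items()
        if not low <= size <= high
    )


if __name__ == "__main__":
    # Made-up sizes, purely to show how the helpers are called.
    sizes = {"day=01": 120 * MB, "day=02": 40 * MB, "day=03": 512 * MB}
    print(partitions_needed(3 * 1024 * MB))
    print(out_of_range(sizes))
```

A check like this only flags candidates for repartitioning; which column to partition on still depends on the filters that actually show up in the query log.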