Impala: A Modern SQL Engine for Hadoop
Henry Robinson | Software Engineer
henry@cloudera.com | @henryr




Wednesday, 16 January 2013
Agenda
                   • Part 1:
                      • Low-latency and Hadoop: a missing piece of the puzzle
                      • Impala: Goals, non-goals and features
                      • Demo
                      • Q+A
                   • Part 2:
                      • Impala Internals
                      • Comparing Impala to other systems
                      • Q+A

About Me

                   • Hi!
                   • Software Engineer at Cloudera since 2009
                      • Apache ZooKeeper
                      • First version of Flume
                      • Cloudera Enterprise
                      • Working on Impala since the beginning of 2012

Part 1: Why Impala?



The Hadoop Landscape

                   • Hadoop MapReduce is a batch processing system
                   • Ideally suited to long-running, high-latency data processing workloads
                   • But not as suitable for interactive queries, data exploration or iterative query refinement
                      • All of which are keystones of data warehousing

Bringing Low-Latency to Hadoop
                   • HDFS and HBase make data storage cheap and flexible
                   • SQL / ODBC are industry standards
                      • Analyst familiarity
                      • BI tool integration
                      • Legacy systems
                   • Can we get the advantages of both?
                      • With acceptable performance?

Impala Overview: Goals
                   • General-purpose SQL query engine
                      • should work both for analytical and transactional workloads
                      • will support queries that take from milliseconds to hours
                   • Runs directly within Hadoop:
                      • reads widely used Hadoop file formats
                      • talks to widely used Hadoop storage managers like HDFS and HBase
                      • runs on same nodes that run Hadoop processes
                   • High performance
                      • C++ instead of Java
                      • runtime code generation via LLVM
                      • completely new execution engine that doesn’t build on MapReduce

User View of Impala
                   • Runs as a distributed service in cluster: one Impala daemon on each node with data
                   • User submits query via ODBC/Beeswax Thrift API to any daemon
                   • Query is distributed to all nodes with relevant data
                   • If any node fails, the query fails
                   • Impala uses Hive’s metadata interface
                   • Supported file formats:
                      • text files (GA: with compression, including lzo)
                      • sequence files with snappy / gzip compression
                      • GA: Avro data files / columnar format (more on this later)

User View of Impala: SQL

                   • SQL support:
                      • patterned after Hive’s version of SQL
                      • limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert
                      • only equi-joins, no non-equi-joins, no cross products
                      • ORDER BY only with LIMIT
                      • GA: DDL support (CREATE, ALTER)
                   • Functional limitations
                      • no custom UDFs, file formats, Hive SerDes
                      • only hash joins: joined table has to fit in memory of a single node (beta) / aggregate memory of all executing nodes (GA)
                      • join order = FROM clause order

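The hash-join limitation on the SQL slide can be illustrated with a minimal Python sketch. This is not Impala's implementation: the function `hash_join` and the sample rows are hypothetical, and the sketch assumes the right-hand (joined) table is the one loaded into an in-memory hash table. Under that assumption it shows why the joined table has to fit in memory and why the order of tables in the FROM clause matters.

```python
from collections import defaultdict

def hash_join(left_rows, right_rows, left_key, right_key):
    """Equi-join two lists of dicts; the right side is held in memory."""
    # Build phase: the whole right-hand (joined) table goes into a hash table.
    table = defaultdict(list)
    for row in right_rows:
        table[row[right_key]].append(row)
    # Probe phase: stream the left-hand table past the hash table.
    joined = []
    for row in left_rows:
        for match in table.get(row[left_key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical sample data, loosely named after the TPC-DS tables.
store_sales = [
    {"ss_item_sk": 1, "ss_quantity": 3},
    {"ss_item_sk": 2, "ss_quantity": 5},
    {"ss_item_sk": 1, "ss_quantity": 7},
]
item = [{"i_item_sk": 1, "i_item_id": "A"}, {"i_item_sk": 2, "i_item_id": "B"}]

rows = hash_join(store_sales, item, "ss_item_sk", "i_item_sk")
```

Only the right-hand input is materialised; the left-hand input is streamed. That is why a large fact table would normally come first in the FROM clause and the smaller tables after it.
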
User View of Impala: HBase

                   • HBase functionality
                      • uses Hive’s mapping of HBase table into metastore table
                      • predicates on rowkey columns are mapped into start / stop row
                      • predicates on other columns are mapped into SingleColumnValueFilters
                   • HBase functional limitations
                      • no nested-loop joins
                      • all data stored as text

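As a rough illustration of the predicate mapping on the HBase slide, the Python sketch below splits a list of predicates into a row range and a list of column filters. The function `plan_hbase_scan` and the tuple representation of predicates are hypothetical. The sketch assumes HBase's usual scan convention (start row inclusive, stop row exclusive), and the column filters are plain tuples standing in for SingleColumnValueFilters.

```python
def plan_hbase_scan(predicates, rowkey_column):
    """Split (column, op, value) predicates into a row range and column filters."""
    start_row, stop_row = b"", b""      # empty means "unbounded" in this sketch
    column_filters = []
    for column, op, value in predicates:
        # Anything that is not a range predicate on the rowkey stays a filter.
        if column != rowkey_column or op not in ("=", "<", "<=", ">", ">="):
            column_filters.append((column, op, value))
            continue
        if op in ("=", ">=", ">"):
            # Appending a zero byte gives the smallest key strictly above value.
            lower = value + b"\x00" if op == ">" else value
            start_row = max(start_row, lower)
        if op in ("=", "<=", "<"):
            # The stop row is exclusive, so inclusive bounds need the zero byte.
            upper = value if op == "<" else value + b"\x00"
            stop_row = upper if stop_row == b"" else min(stop_row, upper)
    return start_row, stop_row, column_filters

range_scan = plan_hbase_scan(
    [("id", ">=", b"user100"), ("id", "<", b"user200"), ("age", "=", b"42")],
    rowkey_column="id",
)
point_scan = plan_hbase_scan([("id", "=", b"k")], rowkey_column="id")
```

Values are byte strings here because, per the slide, all data is stored as text.
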
Demo



TPC-DS

                   • TPC-DS is a benchmark dataset designed to model decision support systems
                   • We generated 500MB data (not a lot, but enough to be illustrative!)
                   • Let’s run a sample query against Hive 0.9, and against Impala 0.3
                   • Single node (VM! - caveat emptor), so we’re testing execution engine speeds

TPC-DS Sample Query
                SELECT
                   i_item_id,
                   s_state,
                   avg(ss_quantity) agg1,
                   avg(ss_list_price) agg2,
                   avg(ss_coupon_amt) agg3,
                   avg(ss_sales_price) agg4
                FROM store_sales
                JOIN date_dim ON (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
                JOIN item ON (store_sales.ss_item_sk = item.i_item_sk)
                JOIN customer_demographics ON (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
                JOIN store ON (store_sales.ss_store_sk = store.s_store_sk)
                WHERE
                   cd_gender = 'M' AND
                   cd_marital_status = 'S' AND
                   cd_education_status = 'College' AND
                   d_year = 2002 AND
                   s_state IN ('TN','SD', 'SD', 'SD', 'SD', 'SD')
                GROUP BY
                   i_item_id,
                   s_state
                ORDER BY
                   i_item_id,
                   s_state
                LIMIT 100;

Impala is much faster

                   • Why?
                      • No materialisation of intermediate data - less I/O
                      • No multi-phase queries - much smaller startup / teardown overhead
                      • Faster execution engine: generates fast code for each individual query

Part 2: Impala Internals / Roadmap

Impala Architecture
                   • Two binaries: impalad and statestored
                   • Impala daemon (impalad)
                      • handles client requests and all internal requests related to query execution over Thrift
                      • runs on every datanode
                   • Statestore daemon (statestored)
                      • provides membership information and metadata distribution
                      • only one per cluster

Query Execution
                   • Query execution phases:
                      • Request arrives via Thrift API (perhaps from ODBC, or shell)
                      • Planner turns request into collections of plan fragments
                      • ‘Coordinator’ initiates execution on remote impalad daemons
                   • During execution:
                      • Intermediate results are streamed between impalad daemons
                      • Query results are streamed to client

Query Execution

                   • Request arrives via Thrift API:

[Diagram: a SQL App sends a SQL request through ODBC to the Impala nodes. Three nodes are shown, each with a Query Planner, Query Coordinator and Query Executor on top of an HDFS DN and HBase. Hive Metastore, HDFS NN and Statestore appear as separate cluster services.]

Query Execution

                   • Planner turns request into collections of plan fragments

[Diagram: SQL App with ODBC; three Impala nodes, each with a Query Planner, Query Coordinator and Query Executor on top of an HDFS DN and HBase; Hive Metastore, HDFS NN and Statestore as separate cluster services.]

Query Execution
                   • Intermediate results are streamed between impalad daemons. Query results are streamed back to client.

[Diagram: query results flow back to the SQL App through ODBC. Three Impala nodes are shown, each with a Query Planner, Query Coordinator and Query Executor on top of an HDFS DN and HBase; Hive Metastore, HDFS NN and Statestore appear as separate cluster services.]

The Planner

                   • Two-phase planning process:
                      • single-node plan: left-deep tree of plan operators
                      • plan partitioning: partition single-node plan to maximise scan locality, minimise data movement
                   • Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
                   • Distributed aggregation: pre-aggregate in individual nodes, merge aggregation at root
                   • GA: rudimentary cost-based optimiser

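The distributed aggregation bullet can be sketched in a few lines of Python. This is an illustration of the general pre-aggregate / merge pattern, not Impala's code: the function names and the sample rows are hypothetical. It uses AVG as the example because an average cannot be merged directly; each node has to ship a partial (sum, count) state that the root combines.

```python
from collections import defaultdict

def pre_aggregate(rows):
    """Runs on each node: partial (sum, count) state per group key."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)

def merge_aggregate(partials):
    """Runs at the root: merge the partial states, then finalise AVG."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

# Hypothetical (s_state, ss_quantity) rows scanned on two different nodes.
node_a = [("TN", 10.0), ("SD", 4.0), ("TN", 20.0)]
node_b = [("TN", 30.0), ("SD", 6.0)]

averages = merge_aggregate([pre_aggregate(node_a), pre_aggregate(node_b)])
```

Each node sends one small state per group rather than every row, which is the point of pre-aggregating before the merge.
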
Plan Partitioning

                   • Example: query with join and aggregation
                             SELECT state, SUM(revenue)
                             FROM HdfsTbl h JOIN HbaseTbl b ON (...)
                             GROUP BY 1 ORDER BY 2 desc LIMIT 10

[Diagram: the single-node plan is TopN over Agg over Hash Join, with Hdfs Scan and Hbase Scan as the join inputs. The partitioned plan has three fragments: TopN, Agg and Exch at the coordinator; Agg, Hash Join, Hdfs Scan and Exch at the DataNodes; Hbase Scan at the region servers.]

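To make the fragment placement concrete, here is a small Python sketch of the partitioned plan for the example query. Everything in it is an assumption for illustration: the join condition is elided on the slide, so the sketch supposes the HBase table maps an id to a state and the HDFS table holds (id, revenue) rows, and the function names are made up. Passing lists between functions stands in for the Exchange operators.

```python
import heapq
from collections import Counter

def datanode_fragment(hdfs_rows, hbase_rows):
    """Hdfs Scan + Hash Join + pre-aggregation; hbase_rows arrive via an exchange."""
    state_by_id = dict(hbase_rows)              # build side of the hash join
    partial = Counter()
    for key, revenue in hdfs_rows:              # probe side: the local HDFS scan
        if key in state_by_id:
            partial[state_by_id[key]] += revenue
    return partial

def coordinator_fragment(partials, limit):
    """Merge aggregation followed by TopN (ORDER BY 2 DESC LIMIT n)."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)                  # Counter.update adds the sums
    return heapq.nlargest(limit, merged.items(), key=lambda kv: kv[1])

hbase_tbl = [(1, "CA"), (2, "NY"), (3, "TX")]   # output of the region-server scan
node_1 = [(1, 100), (2, 50), (1, 25)]
node_2 = [(2, 70), (3, 10), (4, 999)]           # id 4 has no join partner

top = coordinator_fragment(
    [datanode_fragment(node_1, hbase_tbl), datanode_fragment(node_2, hbase_tbl)],
    limit=2,
)
```
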
Catalog Metadata

                   • Metadata Handling
                      • Uses Hive’s metastore
                      • Caches metadata between queries: no synchronous metastore API calls during query execution
                      • Beta: Changes in metadata require manual refresh
                      • GA: Metadata distributed through statestore

Execution Engine

                   • Heavy-lifting component of each impalad
                      • Written in C++
                      • Runtime code generation for “big-loops”
                      • Internal in-memory tuple format puts fixed-width data at fixed offsets
                      • Hand-optimised assembly where needed

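The idea of a tuple format with fixed-width data at fixed offsets can be sketched with Python's `struct` module. The layout below (one null-indicator byte, an int64 and a float64) is hypothetical and is not Impala's actual tuple layout. It only shows why fixed offsets are useful: a field can be read straight from a known byte position without parsing the rest of the tuple.

```python
import struct

# Hypothetical layout: null bits (1 byte) | int64 quantity | float64 price.
# "<" disables padding, so quantity sits at offset 1 and price at offset 9.
LAYOUT = struct.Struct("<Bqd")
PRICE_OFFSET = 9

def write_tuple(buf, index, quantity, price):
    """Store one tuple in slot `index`; None is recorded in the null bits."""
    null_bits = (quantity is None) | ((price is None) << 1)
    LAYOUT.pack_into(buf, index * LAYOUT.size, null_bits, quantity or 0, price or 0.0)

def read_price(buf, index):
    """Read only the price field, using its fixed offset inside the tuple."""
    base = index * LAYOUT.size
    if buf[base] & 0b10:                        # bit 1 marks a NULL price
        return None
    return struct.unpack_from("<d", buf, base + PRICE_OFFSET)[0]

batch = bytearray(LAYOUT.size * 2)              # room for a batch of two tuples
write_tuple(batch, 0, 3, 9.5)
write_tuple(batch, 1, 7, None)
```

Variable-length data such as strings is left out of this sketch.
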
More on Code Generation
                   • For example: Inserting tuples into a hash-table
                      • We know ahead of time the maximum number of tuples (in a batch), the tuple layout, what fields might be null and so on.
                      • Pre-bake all this information into an unrolled loop that avoids branches and dead code
                      • Function calls are inlined at compile-time
                   • Result: significant speedup in real queries

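As a loose analogy for the code-generation idea, the Python sketch below compares a generic hashing routine that consults metadata for every row with a function generated at runtime for one specific layout. Impala generates native code via LLVM; this sketch only generates Python source and compiles it with `exec`, and every name in it is hypothetical. The generated function has the loop unrolled and contains a null check only for columns that can actually be null.

```python
def interpret_hash(row, key_columns, nullable):
    """Generic path: loops over metadata and branches on nullability per row."""
    h = 17
    for col in key_columns:
        value = row[col]
        if nullable[col] and value is None:
            value = 0
        h = (h * 31 + value) & 0xFFFFFFFF
    return h

def generate_hash(key_columns, nullable):
    """Emit and compile a straight-line function for one specific layout."""
    lines = ["def hash_row(row):", "    h = 17"]
    for col in key_columns:                     # unrolled: one statement per column
        expr = f"row[{col}]"
        if nullable[col]:                       # null check only where it is needed
            expr = f"(0 if {expr} is None else {expr})"
        lines.append(f"    h = (h * 31 + {expr}) & 0xFFFFFFFF")
    lines.append("    return h")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["hash_row"]

key_columns = [0, 2]
nullable = {0: False, 2: True}
hash_row = generate_hash(key_columns, nullable)

rows = [(5, "x", None), (5, "x", 7)]
hashes = [hash_row(r) for r in rows]
```
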
Statestore
                   • Central system state repository
                      • Membership / failure-detection
                      • GA: metadata
                      • GA: diagnostics, scheduling information
                   • Soft-state
                      • All data can be reconstructed from the rest of the system
                      • Impala continues to run when statestore fails, but per-node state becomes increasingly stale
                   • Sends periodic heartbeats
                      • Pushes new data
                      • Checks for liveness

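The heartbeat behaviour described on the Statestore slide can be sketched as a toy push-based registry in Python. This is not the statestored protocol: the class, its method names and the `max_missed` threshold are hypothetical, subscribers are plain callbacks rather than Thrift endpoints, and `heartbeat()` is called by hand where a real service would use a timer. The sketch shows one heartbeat doing both jobs from the slide: pushing new data and checking liveness.

```python
class Statestore:
    """Toy soft-state registry: pushes updates and tracks subscriber liveness."""

    def __init__(self, max_missed=3):
        self.subscribers = {}     # name -> callback taking a dict of updates
        self.missed = {}          # name -> consecutive failed heartbeats
        self.pending = {}         # name -> updates not yet delivered
        self.max_missed = max_missed

    def register(self, name, callback):
        self.subscribers[name] = callback
        self.missed[name] = 0
        self.pending[name] = {}

    def publish(self, key, value):
        for updates in self.pending.values():
            updates[key] = value

    def heartbeat(self):
        """One heartbeat round: push pending data, drop unresponsive nodes."""
        for name in list(self.subscribers):
            try:
                self.subscribers[name](dict(self.pending[name]))
                self.pending[name].clear()
                self.missed[name] = 0
            except ConnectionError:
                self.missed[name] += 1
                if self.missed[name] >= self.max_missed:
                    for table in (self.subscribers, self.missed, self.pending):
                        del table[name]

received = []

def healthy(updates):
    received.append(updates)

def dead(updates):
    raise ConnectionError("node unreachable")

store = Statestore(max_missed=3)
store.register("impalad-1", healthy)
store.register("impalad-2", dead)
store.publish("membership", ["impalad-1", "impalad-2"])
for _ in range(3):
    store.heartbeat()
```

Nothing here is persisted, which matches the soft-state point on the slide: a restarted registry would simply be repopulated as nodes register again.
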
Why not ZooKeeper?
                   • Apache ZooKeeper is not a good publish-
                     subscribe system
                         • API is awkward, and requires a lot of client logic
                         • Multiple round-trips required to get data for changes to a node’s children
                         • Push model is more natural for our use case
                   •      Don’t need all the guarantees ZK provides
                         • Serializability
                         • Persistence
                         • Avoid complexity where possible!
                   •      ZK is bad at the things we care about, and
                          good at the things we don’t
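To make the round-trip argument concrete, here is a toy comparison of a watch-then-read design with a push design. It assumes a change notification carries no data, so the client must re-list and then read each new child. `PollingDirectory`, `refresh_after_notification` and `push_update` are hypothetical names, and none of this is the ZooKeeper client API.

```python
from typing import Dict, List


class PollingDirectory:
    """Toy server that only answers reads; every call costs one round trip."""

    def __init__(self) -> None:
        self.children: Dict[str, str] = {}
        self.round_trips = 0

    def list_children(self) -> List[str]:
        self.round_trips += 1
        return sorted(self.children)

    def read(self, name: str) -> str:
        self.round_trips += 1
        return self.children[name]


def refresh_after_notification(server: PollingDirectory,
                               known: Dict[str, str]) -> None:
    """Watch-style client: re-list the children, then fetch each new one."""
    for name in server.list_children():
        if name not in known:
            known[name] = server.read(name)


def push_update(delta: Dict[str, str], known: Dict[str, str]) -> int:
    """Push-style design: the server sends the changed entries in one message."""
    known.update(delta)
    return 1  # one message carries the whole delta
```

With three new children, the watch-style client in this model makes four round trips (one listing plus three reads), while the push-style update arrives in a single message.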
Comparing Impala to Dremel

                   • Google’s Dremel
                       • Columnar storage for data with nested
                          structures
                       • Distributed scalable aggregation on top
                          of that
                   • Columnar storage coming to Hadoop via
                     joint project between Cloudera and Twitter
                   • Impala plus columnar format: a superset of
                     the published version of Dremel (which had
                     no joins)
Comparing Impala to Hive
                   •         Hive: MapReduce as an execution engine
                              •  High latency, low throughput queries
                              •  Fault-tolerance based on MapReduce’s on-
                                 disk checkpointing: materialises all
                                 intermediate results
                              •  Java runtime allows for extensibility: file
                                 formats and UDFs
                   •         Impala:
                              •   Direct, process-to-process data exchange
                              •   No fault tolerance
                              •   Designed for low runtime overhead
                              •   Not nearly as extensible
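The difference between materialising intermediate results and exchanging them directly can be sketched with a two-stage query. This is an illustration only: the stage functions, the `bytes` field and the temporary JSON-lines file are assumptions, and neither function reflects how Hive or Impala is actually implemented.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]


def filter_stage(rows: Iterable[Row]) -> Iterator[Row]:
    """Stage 1: keep only large records."""
    for row in rows:
        if row["bytes"] > 100:
            yield row


def sum_stage(rows: Iterable[Row]) -> int:
    """Stage 2: aggregate what stage 1 produced."""
    return sum(row["bytes"] for row in rows)


def run_materialised(rows: List[Row]) -> int:
    """Checkpoint style: stage 1 output is written to disk before stage 2 runs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.jsonl")
        with open(path, "w") as out:
            for row in filter_stage(rows):
                out.write(json.dumps(row) + "\n")
        with open(path) as src:
            return sum_stage(json.loads(line) for line in src)


def run_streaming(rows: List[Row]) -> int:
    """Direct exchange: stage 2 consumes rows as stage 1 produces them."""
    return sum_stage(filter_stage(rows))
```

Both functions return the same answer. The materialised version could restart stage 2 from the file after a failure but pays for the extra write and read; the streaming version avoids that overhead and would have to rerun the whole query.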
Impala and Hive: Performance

                   • No published benchmarks yet, but from the
                     development process:

                      • Impala can get full disk throughput; I/O-bound
                        workloads faster by 3-4x.
                      • Multiple phase Hive queries see larger
                         speedup in Impala
                      • Queries against in-memory data can be
                         up to 100x faster


Impala Roadmap to GA
                   •      GA planned for second-quarter 2013
                   •      New data formats
                             • LZO-compressed text
                             • Avro
                             • Columnar format
                   •      Better metadata handling through statestore
                   •      JDBC support
                   •      Improved query execution, e.g. partitioned joins
                   •      Production deployment guidelines
                         • Load-balancing across Impalad daemons
                         • Resource isolation within Hadoop cluster
                   •      More packages: RHEL 5.7, Ubuntu, Debian
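For readers unfamiliar with the "partitioned joins" item above, the general technique can be sketched as follows: both inputs are hash-partitioned on the join key so that matching rows land in the same partition, and each partition is joined locally. This shows the textbook idea under assumed names (`partition`, `local_hash_join`, `partitioned_join`); it does not describe how Impala plans to implement it.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def partition(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Assign each row to one of n partitions by hashing its join key."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def local_hash_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Build a hash table on the right side, then probe it with the left side."""
    table: Dict[Any, List[Row]] = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def partitioned_join(left: List[Row], right: List[Row],
                     key: str, n: int = 4) -> List[Row]:
    """Join partition i of the left input with partition i of the right input."""
    out: List[Row] = []
    for lpart, rpart in zip(partition(left, key, n), partition(right, key, n)):
        out.extend(local_hash_join(lpart, rpart, key))
    return out
```

In a cluster, each of the `n` partitions would be handled by a different node, so no single node has to hold the whole build side in memory.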
Impala Roadmap: Beyond GA
                   •      Coming in 2013
                   •      Improved HBase support
                         • Composite keys, Avro data in columns
                         • Indexed nested-loop joins
                         • INSERT / UPDATE / DELETE
                   •      Additional SQL
                         • UDFs
                         • SQL authorisation and DDL
                         • ORDER BY without LIMIT
                         • Window functions
                         • Support for structured data types
                   •      Runtime optimisations
                         • Straggler handling
                         • Join order optimisation
                         • Improved cache management
                         • Data co-location for improved join performance
Impala Roadmap: 2013


                   • Resource management
                      • Cluster-wide quotas
                      • “User X can never have more than 5
                        concurrent queries running”
                      • Goal: run exploratory and production
                        workloads in same cluster without
                             affecting production jobs
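A quota such as the one quoted above could be enforced with a small admission check in front of query execution. The sketch below is hypothetical: `AdmissionController`, `QuotaExceeded` and the choice to reject over-quota queries (rather than queue them) are assumptions, not a description of Impala's planned resource manager.

```python
from collections import Counter


class QuotaExceeded(Exception):
    """Raised when a user is already at their concurrent-query limit."""


class AdmissionController:
    """Tracks running queries per user and rejects anything over the limit."""

    def __init__(self, max_concurrent_per_user: int = 5) -> None:
        self.max_concurrent_per_user = max_concurrent_per_user
        self.running: Counter = Counter()

    def admit(self, user: str) -> None:
        """Call before starting a query; raises if the user is at the limit."""
        if self.running[user] >= self.max_concurrent_per_user:
            raise QuotaExceeded(f"{user} already has "
                                f"{self.running[user]} queries running")
        self.running[user] += 1

    def finish(self, user: str) -> None:
        """Call when a query completes, fails, or is cancelled."""
        if self.running[user] > 0:
            self.running[user] -= 1
```

A caller would wrap each query in `admit` and `finish` (ideally in a `try`/`finally`) so that a failed query still releases its slot.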


Try it out!


                   • Beta version available since October 2012
                   • Get started at www.cloudera.com/impala
                   • Questions / comments?
                      • impala-user@cloudera.org
                      • henry@cloudera.com

Thank you!
                               Questions?
                             @henryr / henry@cloudera.com




Simple Matrix Factorization for Recommendation in Mahout
 
Going Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook UsersGoing Real-Time with Mahout, Predicting gender of Facebook Users
Going Real-Time with Mahout, Predicting gender of Facebook Users
 
Practical Magic with Incanter
Practical Magic with IncanterPractical Magic with Incanter
Practical Magic with Incanter
 
Investigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists ToolboxInvestigative Analytics- What's in a Data Scientists Toolbox
Investigative Analytics- What's in a Data Scientists Toolbox
 
Understanding Cause & Effect in Customer Behaviour
Understanding Cause & Effect in Customer BehaviourUnderstanding Cause & Effect in Customer Behaviour
Understanding Cause & Effect in Customer Behaviour
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Impala Brings Low-Latency SQL to Hadoop

  • 1. Impala: A Modern SQL Engine for Hadoop Henry Robinson | Software Engineer henry@cloudera.com | @henryr Wednesday, 16 January 2013
  • 4. Agenda • Part 1: • Low-latency and Hadoop: a missing piece of the puzzle • Impala: Goals, non-goals and features • Demo • Q+A • Part 2: • Impala Internals • Comparing Impala to other systems • Q+A
  • 7. About Me • Hi! • Software Engineer at Cloudera since 2009 • Apache ZooKeeper • First version of Flume • Cloudera Enterprise • Working on Impala since the beginning of 2012
  • 8. Part 1: Why Impala?
  • 12. The Hadoop Landscape • Hadoop MapReduce is a batch processing system • Ideally suited to long-running, high-latency data processing workloads • But not as suitable for interactive queries, data exploration or iterative query refinement • All of which are keystones of data warehousing
  • 16. Bringing Low-Latency to Hadoop • HDFS and HBase make data storage cheap and flexible • SQL / ODBC are industry standards • Analyst familiarity • BI tool integration • Legacy systems • Can we get the advantages of both? • With acceptable performance?
  • 20. Impala Overview: Goals • General-purpose SQL query engine • should work both for analytical and transactional workloads • will support queries that take from milliseconds to hours • Runs directly within Hadoop: • reads widely used Hadoop file formats • talks to widely used Hadoop storage managers like HDFS and HBase • runs on same nodes that run Hadoop processes • High performance • C++ instead of Java • runtime code generation via LLVM • completely new execution engine that doesn’t build on MapReduce
  • 27. User View of Impala • Runs as a distributed service in cluster: one Impala daemon on each node with data • User submits query via ODBC/Beeswax Thrift API to any daemon • Query is distributed to all nodes with relevant data • If any node fails, the query fails • Impala uses Hive’s metadata interface • Supported file formats: • text files (GA: with compression, including lzo) • sequence files with snappy / gzip compression • GA: Avro data files / columnar format (more on this later)
  • 30. User View of Impala: SQL • SQL support: • patterned after Hive’s version of SQL • limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert • only equi-joins, no non-equi-joins, no cross products • ORDER BY only with LIMIT • GA: DDL support (CREATE, ALTER) • Functional Limitations • no custom UDFs, file formats, Hive SerDes • only hash joins: joined table has to fit in memory of a single node (beta) / aggregate memory of all executing nodes (GA) • join order = FROM clause order
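
The hash-join restriction on this slide (the joined table has to fit in memory) can be illustrated with a minimal Python sketch. This is not Impala's implementation; the function name `hash_join` and the sample rows are assumptions made for illustration, with column names borrowed from the TPC-DS query later in the deck.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dicts using an in-memory hash table."""
    # Build phase: the whole build side is held in memory, which is
    # why the joined table must fit in RAM.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: stream the other side and look up matching rows.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data.
stores = [{"s_store_sk": 1, "s_state": "TN"}, {"s_store_sk": 2, "s_state": "SD"}]
sales = [
    {"ss_store_sk": 1, "ss_quantity": 5},
    {"ss_store_sk": 2, "ss_quantity": 3},
    {"ss_store_sk": 1, "ss_quantity": 7},
    {"ss_store_sk": 9, "ss_quantity": 1},  # no matching store, dropped by the join
]
result = hash_join(stores, sales, "s_store_sk", "ss_store_sk")
```

Only equality lookups are possible against the hash table, which is consistent with the slide's "only equi-joins" restriction.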
  • 33. User View of Impala: HBase • HBase functionality • uses Hive’s mapping of HBase table into metastore table • predicates on rowkey columns are mapped into start / stop row • predicates on other columns are mapped into SingleColumnValueFilters • HBase functional limitations • no nested-loop joins • all data stored as text
  • 39. TPC-DS • TPC-DS is a benchmark dataset designed to model decision support systems • We generated 500MB data (not a lot, but enough to be illustrative!) • Let’s run a sample query against Hive 0.9, and against Impala 0.3 • Single node (VM! - caveat emptor), so we’re testing execution engine speeds
  • 40. TPC-DS Sample Query select i_item_id, s_state, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 FROM store_sales JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk) JOIN item on (store_sales.ss_item_sk = item.i_item_sk) JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk) JOIN store on (store_sales.ss_store_sk = store.s_store_sk) where cd_gender = 'M' and cd_marital_status = 'S' and cd_education_status = 'College' and d_year = 2002 and s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD') group by i_item_id, s_state order by i_item_id, s_state limit 100;
  • 42. Impala is much faster • Why? • No materialisation of intermediate data - less I/O • No multi-phase queries - much smaller startup / teardown overhead • Faster execution engine: generates fast code for each individual query
  • 43. Part 2: Impala Internals / Roadmap
  • 47. Impala Architecture • Two binaries: impalad and statestored • Impala daemon (impalad) • handles client requests and all internal requests related to query execution over Thrift • runs on every datanode • Statestore daemon (statestored) • provides membership information and metadata distribution • only one per cluster
  • 50. Query Execution • Query execution phases: • Request arrives via Thrift API (perhaps from ODBC, or shell) • Planner turns request into collections of plan fragments • ‘Coordinator’ initiates execution on remote impalad daemons • During execution: • Intermediate results are streamed between impalad daemons • Query results are streamed to client
  • 51. Query Execution • Request arrives via Thrift API [Diagram: a SQL App sends a SQL request over ODBC to one of three impalad nodes; each node has a Query Planner, Query Coordinator and Query Executor alongside an HDFS DN and HBase; the Hive Metastore, HDFS NN and Statestore are shown separately]
  • 52. Query Execution • Planner turns request into collections of plan fragments [Diagram: same cluster layout]
  • 53. Query Execution • Intermediate results are streamed between impalad daemons. Query results are streamed back to client. [Diagram: same cluster layout, with query results returned to the SQL App over ODBC]
  • 58. The Planner • Two-phase planning process: • single-node plan: left-deep tree of plan operators • plan partitioning: partition single-node plan to maximise scan locality, minimise data movement • Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange • Distributed aggregation: pre-aggregate in individual nodes, merge aggregation at root • GA: rudimentary cost-based optimiser
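
The "pre-aggregate in individual nodes, merge aggregation at root" idea can be sketched in a few lines of Python. This is an illustrative sketch only, not Impala code: the function names `pre_aggregate` and `merge` and the per-node data are assumptions. It computes an average, which is a useful case because partial averages cannot simply be averaged again; each node has to ship a (sum, count) pair instead.

```python
def pre_aggregate(rows):
    """Run on each node: reduce local (state, revenue) rows to partial sums and counts."""
    partial = {}
    for state, revenue in rows:
        total, count = partial.get(state, (0.0, 0))
        partial[state] = (total + revenue, count + 1)
    return partial

def merge(partials):
    """Run at the root: combine the per-node partials and finish the average."""
    merged = {}
    for partial in partials:
        for state, (total, count) in partial.items():
            t, c = merged.get(state, (0.0, 0))
            merged[state] = (t + total, c + count)
    return {state: total / count for state, (total, count) in merged.items()}

# Hypothetical rows held by two different nodes.
node_a = [("TN", 10.0), ("SD", 4.0), ("TN", 20.0)]
node_b = [("TN", 30.0), ("SD", 6.0)]
averages = merge([pre_aggregate(node_a), pre_aggregate(node_b)])
```

Each node sends one small record per group rather than every input row, which is the data-movement saving the planner is after.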
  • 59. Plan Partitioning • Example: query with join and aggregation SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 [Diagram: the single-node plan (TopN over Agg over Hash Join over an Hdfs Scan and an Hbase Scan) is partitioned into fragments that run at the coordinator, at DataNodes and at region servers, connected by Exch (exchange) operators]
  • 61. Catalog Metadata • Metadata Handling • Uses Hive’s metastore • Caches metadata between queries: no synchronous metastore API calls during query execution • Beta: Changes in metadata require manual refresh • GA: Metadata distributed through statestore
  • 63. Execution Engine • Heavy-lifting component of each impalad • Written in C++ • runtime code generation for “big loops” • Internal in-memory tuple format puts fixed-width data at fixed offsets • Hand-optimised assembly where needed
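
A tuple format that "puts fixed-width data at fixed offsets" can be illustrated with Python's `struct` module. The layout below (a one-byte null bitmap, an int32 at offset 4 and a float64 at offset 8) is a made-up example, not Impala's actual tuple layout, and the helper names are assumptions.

```python
import struct

# Hypothetical 16-byte layout: null bitmap (1 byte), 3 pad bytes,
# int32 quantity at offset 4, float64 price at offset 8.
TUPLE_FORMAT = struct.Struct("<B3xid")
PRICE_OFFSET = 8

def pack_tuple(quantity, price):
    """Serialise one row; bit 0 marks a null quantity, bit 1 a null price."""
    null_bits = (quantity is None) | ((price is None) << 1)
    return TUPLE_FORMAT.pack(null_bits, quantity or 0, price or 0.0)

def read_price(buf, index):
    """Read the price of tuple `index` without decoding any other field."""
    base = index * TUPLE_FORMAT.size
    if buf[base] & 2:  # null bit for price is set
        return None
    return struct.unpack_from("<d", buf, base + PRICE_OFFSET)[0]

batch = b"".join([pack_tuple(5, 9.5), pack_tuple(3, None), pack_tuple(None, 1.25)])
prices = [read_price(batch, i) for i in range(3)]
```

Because every field sits at a known offset, locating a value is a multiplication and an addition, with no per-row parsing.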
  • 66. More on Code Generation • For example: Inserting tuples into a hash-table • We know ahead of time the maximum number of tuples (in a batch), the tuple layout, what fields might be null and so on. • Pre-bake all this information into an unrolled loop that avoids branches and dead code • Function calls are inlined at compile-time • Result: significant speedup in real queries
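
As a loose analogy for pre-baking known information into generated code, the Python sketch below builds the source of a function specialised to one tuple layout and compiles it at runtime. Impala does this with LLVM on native code; the use of `exec`, the function name `generate_sum_fn` and the column names here are assumptions chosen only to show the idea.

```python
def generate_sum_fn(columns, nullable):
    """Generate a row-summing function specialised for a known tuple layout.

    The per-column loop is unrolled into a single expression, and a null
    check is emitted only for columns that can actually be null.
    """
    terms = []
    for col in columns:
        if col in nullable:
            terms.append(f"(row[{col!r}] or 0)")  # null check needed
        else:
            terms.append(f"row[{col!r}]")         # no branch emitted
    source = "def specialised(row):\n    return " + " + ".join(terms) + "\n"
    namespace = {}
    exec(source, namespace)  # compile the generated source at runtime
    return namespace["specialised"], source

# Hypothetical layout: three numeric columns, only one of them nullable.
sum_fn, generated_source = generate_sum_fn(
    ["ss_quantity", "ss_list_price", "ss_coupon_amt"], {"ss_coupon_amt"}
)
total = sum_fn({"ss_quantity": 2, "ss_list_price": 10, "ss_coupon_amt": None})
```

Printing `generated_source` shows a function with no loop and a single null check, which is the kind of dead-code removal the slide describes.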
  • 70. Statestore • Central system state repository • Membership / failure-detection • GA: metadata • GA: diagnostics, scheduling information • Soft-state • All data can be reconstructed from the rest of the system • Impala continues to run when statestore fails, but per-node state becomes increasingly stale • Sends periodic heartbeats • Pushes new data • Checks for liveness
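
The heartbeat behaviour on this slide (push new data, check liveness) can be sketched as a small in-process Python class. This is a toy model under stated assumptions, not the statestored protocol: the class and method names, the `max_missed` threshold and the use of `ConnectionError` to signal an unreachable subscriber are all hypothetical.

```python
class Statestore:
    """Toy push-based state repository with heartbeat failure detection."""

    def __init__(self, max_missed=3):
        self.subscribers = {}  # subscriber name -> callback
        self.missed = {}       # subscriber name -> consecutive failed heartbeats
        self.max_missed = max_missed
        self.state = {}        # soft state: only ever held in memory
        self.version = 0

    def register(self, name, callback):
        self.subscribers[name] = callback
        self.missed[name] = 0

    def publish(self, key, value):
        self.state[key] = value
        self.version += 1

    def heartbeat(self):
        """Push the current state to every subscriber and track liveness."""
        for name, callback in list(self.subscribers.items()):
            try:
                callback(self.version, dict(self.state))
                self.missed[name] = 0
            except ConnectionError:
                self.missed[name] += 1
                if self.missed[name] >= self.max_missed:
                    # Too many missed heartbeats: drop from membership.
                    del self.subscribers[name]
                    del self.missed[name]

    def members(self):
        return sorted(self.subscribers)

received = []

def healthy(version, state):
    received.append((version, state))

def unreachable(version, state):
    raise ConnectionError("subscriber did not respond")

store = Statestore(max_missed=2)
store.register("impalad-1", healthy)
store.register("impalad-2", unreachable)
store.publish("catalog_version", 7)
store.heartbeat()
store.heartbeat()
```

The same call both delivers updates and detects failures, so subscribers need no polling logic of their own.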
  • 74. Why not ZooKeeper? • Apache ZooKeeper is not a good publish-subscribe system • API is awkward, and requires a lot of client logic • Multiple round-trips required to get data for changes to node’s children • Push model is more natural for our use case • Don’t need all the guarantees ZK provides • Serializability • Persistence • Avoid complexity where possible! • ZK is bad at the things we care about, and good at the things we don’t
  • 78. Comparing Impala to Dremel • Google’s Dremel • Columnar storage for data with nested structures • Distributed scalable aggregation on top of that • Columnar storage coming to Hadoop via joint project between Cloudera and Twitter • Impala plus columnar format: a superset of the published version of Dremel (which had no joins)
  • 81. Comparing Impala to Hive • Hive: MapReduce as an execution engine • High latency, low throughput queries • Fault-tolerance based on MapReduce’s on-disk checkpointing: materialises all intermediate results • Java runtime allows for extensibility: file formats and UDFs • Impala: • Direct, process-to-process data exchange • No fault tolerance • Designed for low runtime overhead • Not nearly as extensible
  • 83. Impala and Hive: Performance • No published benchmarks yet, but from the development process: • Impala can get full disk throughput, I/O-bound workloads faster by 3-4x • Multi-phase Hive queries see a larger speedup in Impala • Queries against in-memory data can be up to 100x faster
  • 91. Impala Roadmap to GA • GA planned for second-quarter 2013 • New data formats • LZO-compressed text • Avro • Columnar format • Better metadata handling through statestore • JDBC support • Improved query execution, e.g. partitioned joins • Production deployment guidelines • Load-balancing across impalad daemons • Resource isolation within Hadoop cluster • More packages: RHEL 5.7, Ubuntu, Debian
  • 92. Impala Roadmap: Beyond GA 32 Wednesday, 16 January 2013
  • 93. Impala Roadmap: Beyond GA • Coming in 2013 32 Wednesday, 16 January 2013
  • 94. Impala Roadmap: Beyond GA • Coming in 2013 • Improved HBase support • Composite keys, Avro data in columns • Indexed nested-loop joins • INSERT / UPDATE / DELETE 32 Wednesday, 16 January 2013
  • 95. Impala Roadmap: Beyond GA • Coming in 2013 • Improved HBase support • Composite keys, Avro data in columns • Indexed nested-loop joins • INSERT / UPDATE / DELETE • Additional SQL • UDFs • SQL authorisation and DDL • ORDER BY without LIMIT • Window functions • Support for structured data types 32 Wednesday, 16 January 2013
  • 96. Impala Roadmap: Beyond GA • Coming in 2013 • Improved HBase support • Composite keys, Avro data in columns • Indexed nested-loop joins • INSERT / UPDATE / DELETE • Additional SQL • UDFs • SQL authorisation and DDL • ORDER BY without LIMIT • Window functions • Support for structured data types • Runtime optimisations • Straggler handling • Join order optimisation • Improved cache management • Data co-location for improved join performance 32 Wednesday, 16 January 2013
  • 97. Impala Roadmap: 2013 33 Wednesday, 16 January 2013
  • 98. Impala Roadmap: 2013 • Resource management • Cluster-wide quotas • “User X can never have more than 5 concurrent queries running” • Goal: run exploratory and production workloads in same cluster without affecting production jobs 33 Wednesday, 16 January 2013
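The quota example on this slide ("User X can never have more than 5 concurrent queries running") can be sketched as a simple admission check. This is a minimal illustration only: the class name `ConcurrencyQuota` and its methods are hypothetical, it runs in a single process rather than cluster-wide, and it is not Impala's resource management design, which the slide lists as roadmap work.

```python
from collections import defaultdict


class ConcurrencyQuota:
    """Hypothetical per-user cap on the number of concurrently running queries."""

    def __init__(self, max_concurrent=5):
        self.max_concurrent = max_concurrent
        self.running = defaultdict(int)  # user -> queries currently running

    def try_admit(self, user):
        # Reject the query if the user is already at the cap.
        if self.running[user] >= self.max_concurrent:
            return False
        self.running[user] += 1
        return True

    def release(self, user):
        # Call when one of the user's queries finishes.
        if self.running[user] > 0:
            self.running[user] -= 1


quota = ConcurrencyQuota(max_concurrent=5)
# Six back-to-back submissions from the same user: the sixth is rejected.
admitted = [quota.try_admit("user_x") for _ in range(6)]
```

A real cluster-wide quota would need the running counts to be shared across all daemons that accept queries; this sketch leaves that coordination out.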
  • 99. Try it out! 34 Wednesday, 16 January 2013
  • 100. Try it out! • Beta version available since October 2012 34 Wednesday, 16 January 2013
  • 101. Try it out! • Beta version available since October 2012 • Get started at www.cloudera.com/impala 34 Wednesday, 16 January 2013
  • 102. Try it out! • Beta version available since October 2012 • Get started at www.cloudera.com/impala • Questions / comments? • impala-user@cloudera.org • henry@cloudera.com 34 Wednesday, 16 January 2013
  • 103. Thank you! Questions? @henryr / henry@cloudera.com Wednesday, 16 January 2013