Cloudera Impala - HUG Karlsruhe, July 04, 2013

Low latency data processing with Impala. Impala provides fast, interactive SQL queries directly on Apache Hadoop data stored in HDFS or HBase. In addition to sharing the same storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), JDBC driver and user interface (Hue Beeswax) as Apache Hive. This gives batch-oriented and real-time queries a familiar, unified platform.

Cloudera Impala
Real Time Query for HDFS and HBase
Alexander Alten-Lorenz, Cloudera, Inc.
Tuesday, July 2, 13

Beyond Batch
What is Impala
Capability
Architecture
Demo

Beyond Batch
For some things MapReduce is just too slow
Apache Hive:
MapReduce execution engine
High-latency, low throughput
High runtime overhead
Google realized this early on

Analysts wanted fast, interactive results

Dremel
Google paper (2010)
“scalable, interactive ad-hoc query system for analysis of read-only nested data”
Columnar storage format
Distributed scalable aggregation
“capable of running aggregation queries over trillion-row tables in seconds”
http://research.google.com/pubs/pub36632.html
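
The "columnar storage format" bullet can be illustrated with a toy comparison. This is only a sketch under simplifying assumptions: the records are flat (Dremel handles nested data), and the field names and the helper `to_columns` are invented for illustration, not taken from the paper.

```python
# Toy comparison of row-oriented and column-oriented layouts (illustrative only).
rows = [
    {"user": "a", "country": "DE", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "DE", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited in full even though only "bytes" is needed.
total_from_rows = sum(record["bytes"] for record in rows)

# Column layout: the aggregation reads a single list and ignores the other columns.
total_from_columns = sum(columns["bytes"])
```

Both totals are the same; the difference is how much data an aggregation over one field has to touch, which is the property a columnar format exploits.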

Impala: Goals
General-purpose SQL query engine for Hadoop
For analytical and transactional workloads
Support queries that take milliseconds to hours
Run directly with Hadoop
Collocated daemons
Same file formats
Same storage managers (NN, metastore)

Impala: Goals
High performance
C++
runtime code generation (LLVM)
direct access to data (no MapReduce)
Retain user experience

easy for Hive users to migrate
100% open-source

Impala: Capability
HiveQL (subset of SQL92)
select, project, join, union, subqueries,
aggregation, insert, order by (with limit)
DDL
Directly queries data in HDFS & HBase
Text files (compressed)
Sequence files (snappy/gzip)
Avro & Trevni
GA features
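
The query shape listed on this slide (select, join, aggregation, order by with limit) can be sketched as follows. This is a hedged illustration, not an Impala session: Python's built-in `sqlite3` stands in as the engine so the example runs anywhere, and the tables `page_views` and `users` are hypothetical.

```python
import sqlite3

# In-memory stand-in database; table names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, country TEXT);
CREATE TABLE page_views (user_id INTEGER, url TEXT);
INSERT INTO users VALUES (1, 'DE'), (2, 'US'), (3, 'DE');
INSERT INTO page_views VALUES
    (1, '/a'), (1, '/b'), (2, '/a'), (3, '/c'), (3, '/a'), (3, '/b');
""")

# Join + aggregation + ORDER BY paired with LIMIT, as the slide lists them.
QUERY = """
SELECT u.country, COUNT(*) AS views
FROM page_views v
JOIN users u ON v.user_id = u.user_id
GROUP BY u.country
ORDER BY views DESC
LIMIT 10
"""


def top_countries(connection):
    """Run the sample query and return (country, views) tuples."""
    return connection.execute(QUERY).fetchall()


result = top_countries(conn)
```

The slide lists order by together with a limit, so the sample query pairs the two rather than sorting an unbounded result.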

Impala: Capability
Familiar and unified platform
Uses Hive’s metastore
Submit queries via ODBC | Beeswax Thrift API
Query is distributed to nodes with relevant data
Process-to-process data exchange
Kerberos authentication
No fault tolerance

Impala: Performance
Greater disk throughput
~100MB/sec/disk
I/O-bound workloads faster by 3-4x
Queries that require multiple MapReduce phases in Hive are significantly faster in Impala (up to 45x)
Queries that run against in-memory cached data see a significant speedup (up to 90x)
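
To put the ~100MB/sec/disk figure in perspective, a back-of-the-envelope estimate of a full table scan may help. The cluster size (10 nodes, 12 disks each) and the 1 TB table are assumptions chosen for illustration, and the function `scan_seconds` is hypothetical; it gives an idealized lower bound that ignores CPU, network and skew.

```python
def scan_seconds(table_bytes, nodes, disks_per_node, mb_per_sec_per_disk=100.0):
    """Idealized scan time if every disk streams in parallel at the given rate."""
    aggregate_bytes_per_sec = nodes * disks_per_node * mb_per_sec_per_disk * 1_000_000
    return table_bytes / aggregate_bytes_per_sec


one_tb = 1_000_000_000_000  # 1 TB in bytes (decimal)

# Assumed cluster: 10 nodes with 12 disks each, all disks busy for the whole scan.
estimate = scan_seconds(one_tb, nodes=10, disks_per_node=12)
```

Under these assumptions the aggregate rate is 12 GB/sec, so a 1 TB scan takes roughly 83 seconds; doubling the number of disks halves the estimate.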

Impala: Architecture
impalad
runs on every node
handles client requests (ODBC, thrift)
handles query planning & execution
statestored
provides name service
metadata distribution
used for finding data
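
The division of labour between the two daemons can be sketched with a toy model. Everything here is an assumption for illustration: the class `StateStore`, the function `plan_scan`, the host names and the block map are invented, and the real daemons communicate over Thrift rather than through in-process calls. The sketch only shows the idea of a membership registry plus a planner that sends work to nodes holding the relevant data.

```python
class StateStore:
    """Toy membership registry standing in for statestored's name service."""

    def __init__(self):
        self._members = set()

    def register(self, host):
        self._members.add(host)

    def unregister(self, host):
        self._members.discard(host)

    def members(self):
        return set(self._members)


def plan_scan(block_locations, statestore):
    """Assign each block to a live daemon that holds a replica of it."""
    live = statestore.members()
    assignment = {}
    for block, replicas in block_locations.items():
        local = sorted(host for host in replicas if host in live)
        if not local:
            # Simplification: remote reads are not modeled in this sketch.
            raise RuntimeError(f"no live daemon holds {block}")
        # Prefer the local host with the fewest blocks so work spreads out.
        chosen = min(local, key=lambda h: (len(assignment.get(h, [])), h))
        assignment.setdefault(chosen, []).append(block)
    return assignment


store = StateStore()
for host in ("node1", "node2", "node3"):
    store.register(host)

# Hypothetical block-to-replica map, as a planner might obtain from the NameNode.
blocks = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
}
plan = plan_scan(blocks, store)
```

With all three hosts registered, each block is scanned on a different node; if a host is unregistered, its blocks fall to the remaining replicas.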

Current limitations
1.0.1 (available since May 2013)
No SerDes
No user-defined functions (UDFs)
impalad daemons read statestored metadata only at startup

Futures
DDL support (CREATE)
Rudimentary cost-based optimizer (CBO)
metadata distribution through statestored
Doug Cutting’s Trevni
Columnar storage format like Dremel’s
Impala + Trevni = Dremel superset

Demo
impala-user@cloudera.com
alexander@cloudera.com
@mapredit
mapredit.blogspot.com
Web: http://goo.gl/7sxdp

