SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Running Cloudera Impala on PostgreSQL

By Chengzhong Liu
liuchengzhong@miaozhen.com
2013.12
Story coming from…
• Data gravity
• Why big data
• Why SQL on big data
Today agenda
•
•
•
•
•
•

Big data in Miaozhen 秒针系统
Overview of Cloudera Impala
Hacking practice in Cloudera Impala
Performance
Conclusions
Q&A
What happened in miaozhen
• 3 billion Ads impression per day
• 20TB data scan for report generation every morning
• 24 servers cluster

• Besides this
–
–
–
–

TV Monitor
Mobile Monitor
Site Monitor
…
Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Write a simple proxy
– <2s for 2TB data scan

• Mobile Monitor
– Hadoop-like distribute computing system
– Rabbit MQ + 3 computing servers
– Write a Map-Reduce in C++
– Handles 30 millions to 500 millions Ads impression
Problem & Chance
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is rational
– SQL interface
SQL on Hadoop
•
•
•
•
•

Google Dremel
Apache Drill
Cloudera Impala
Facebook Presto
EMC Greenplum/Pivotal

Latency matters

Pig

Impala/Drill
/Pivotal/Presto

Map Reduce

HDFS

Hive
What’s this
• A kind of MPP engine
• In memory processing
• Small to big join
– Broadcast join

• Small result size
Why Cloudera Impala
• The team move fast
– UDF coming out
– Better join strategy on the way

• Good code base
– Modularize
– Easy to add sub classes

• Really fast
– Llvm code generation
• 80s/95s – uv test

– Distributed aggregation Tree
– In-situ data processing (inside storage)
Typical Arch.
SQL Interface

Meta Store

Query
Planner

Query
Planner

Query
Planner

Coordinat
or

Coordinat
or

Coordinat
or

Exec
Engine

Exec
Engine

Exec
Engine
Our target
• A MPP database
– Build on PostgreSQL9.1
– Scale well
– Speed

• A mixed data source MPP query engine
– Join two tables in different sources
– In fact…
Hacking… from where
• Add, not change
– Scan Node type
– DB Meta info

• Put changes in configuration
– Thrift Protocol update
• TDBHostInfo
• TDBScanNode
Front end
• Meta store update
– Link data to the table name
– Table location management

• Front end
– Compute table location
Back end
• Coordinator
– pg host

• New scan node type
– db scan node
• Pg scan node
• Psql library using cursor
SQL Plan
• select count(distinct id)
from table
– MR like process

HDFS/PG scan
Aggr. : group by id

Exchange node
Aggr. : group by id
Aggr. : count(id)

Exchange node
Aggr.: sum(count(id)
Env.
• Ads impression logs
– 150 millions, 100KB/line

• 3 servers
–
–
–
–

24 cores
32 G mem
2T * 12 HD
100Mbps LAN

• Query
– Select count(id) from t group by campaign
– Select count(distinct id) from t group by campaign
– Select * from t where id = ‘xxxxxxxx’
Performance
• Group by speed / core
• 20 M /s
700
600
500
400

impala
hive

300

pg+impala

200
100
0
1

2

3
With index
Codegen on/off
• select count(distinct id)
from t group by c

100
90
80
70

• select distinct id
from t

60
50

en_codegen

40

dis_codegen

30

•

20
select id from t
10
group by id
0
having
uv_test
count(case when c = '1' then 1 else null end) > 0
and
count(case when c= 2' then 1 else null end) > 0
limit 10;

distinct

duplicated
Multi-users
Conclusion
• Source quality
– Readable
– Google C++ style
– Robust

• MPP solution based on PG
– Proved perf.
– Easy to scale

• Mixed engine usage
– HDFS and DB
What’s next
•
•
•
•
•

Yarn integrating
UDF
Join with Big table
BI roadmap
Fail over
Rerf.
• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and
%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meetimpala-open-source-real-time-sql-querying-onhadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.
de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh
Thanks!
Q&A

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMapR Technologies
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architectureMartinStrycek
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 

Was ist angesagt? (20)

Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Apache drill
Apache drillApache drill
Apache drill
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 

Andere mochten auch

Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Protecting Your IP with Perforce Helix and Interset
Protecting Your IP with Perforce Helix and IntersetProtecting Your IP with Perforce Helix and Interset
Protecting Your IP with Perforce Helix and IntersetPerforce
 
Cloudera Impalaをサービスに組み込むときに苦労した話
Cloudera Impalaをサービスに組み込むときに苦労した話Cloudera Impalaをサービスに組み込むときに苦労した話
Cloudera Impalaをサービスに組み込むときに苦労した話Yukinori Suda
 
Impala データサイエンティストのための 高速大規模分散基盤 #tokyowebmining
Impala データサイエンティストのための 高速大規模分散基盤 #tokyowebminingImpala データサイエンティストのための 高速大規模分散基盤 #tokyowebmining
Impala データサイエンティストのための 高速大規模分散基盤 #tokyowebminingSho Shimauchi
 
The moroccan ethnic groups of Morocco
The moroccan ethnic groups of MoroccoThe moroccan ethnic groups of Morocco
The moroccan ethnic groups of MoroccoMohsine Mahraj
 
Elephant Roads: a tour of Postgres forks
Elephant Roads: a tour of Postgres forksElephant Roads: a tour of Postgres forks
Elephant Roads: a tour of Postgres forksCommand Prompt., Inc
 
#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計Cloudera Japan
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Sematext Group, Inc.
 
R-CISC Summit 2016 Borderless Threat Intelligence
R-CISC Summit 2016 Borderless Threat IntelligenceR-CISC Summit 2016 Borderless Threat Intelligence
R-CISC Summit 2016 Borderless Threat IntelligenceJason Trost
 
HBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon
 
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)Uptime Technologies LLC (JP)
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance DataWorks Summit/Hadoop Summit
 

Andere mochten auch (17)

Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Protecting Your IP with Perforce Helix and Interset
Protecting Your IP with Perforce Helix and IntersetProtecting Your IP with Perforce Helix and Interset
Protecting Your IP with Perforce Helix and Interset
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
 
Cloudera Impalaをサービスに組み込むときに苦労した話
Cloudera Impalaをサービスに組み込むときに苦労した話Cloudera Impalaをサービスに組み込むときに苦労した話
Cloudera Impalaをサービスに組み込むときに苦労した話
 
Impala データサイエンティストのための 高速大規模分散基盤 #tokyowebmining
Impala データサイエンティストのための 高速大規模分散基盤 #tokyowebminingImpala データサイエンティストのための 高速大規模分散基盤 #tokyowebmining
Impala データサイエンティストのための 高速大規模分散基盤 #tokyowebmining
 
GoでKVSを書けるのか
GoでKVSを書けるのかGoでKVSを書けるのか
GoでKVSを書けるのか
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
The moroccan ethnic groups of Morocco
The moroccan ethnic groups of MoroccoThe moroccan ethnic groups of Morocco
The moroccan ethnic groups of Morocco
 
Elephant Roads: a tour of Postgres forks
Elephant Roads: a tour of Postgres forksElephant Roads: a tour of Postgres forks
Elephant Roads: a tour of Postgres forks
 
#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
 
R-CISC Summit 2016 Borderless Threat Intelligence
R-CISC Summit 2016 Borderless Threat IntelligenceR-CISC Summit 2016 Borderless Threat Intelligence
R-CISC Summit 2016 Borderless Threat Intelligence
 
HBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBaseHBaseCon 2015: Running ML Infrastructure on HBase
HBaseCon 2015: Running ML Infrastructure on HBase
 
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 

Ähnlich wie Cloudera Impala + PostgreSQL

LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...DataStax Academy
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysRahul Agarwal
 
MariaDB ColumnStore
MariaDB ColumnStoreMariaDB ColumnStore
MariaDB ColumnStoreMariaDB plc
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreMariaDB plc
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Crate.io
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraCaserta
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceSATOSHI TAGOMORI
 
WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...
WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...
WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...Łukasz Grala
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDBMongoDB
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehousePrecisely
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
MySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and GraphiteMySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and GraphiteDB-Art
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийGeeksLab Odessa
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101MongoDB
 
Getting started with amazon redshift - Toronto
Getting started with amazon redshift - TorontoGetting started with amazon redshift - Toronto
Getting started with amazon redshift - TorontoAmazon Web Services
 
Solr Power FTW: Powering NoSQL the World Over
Solr Power FTW: Powering NoSQL the World OverSolr Power FTW: Powering NoSQL the World Over
Solr Power FTW: Powering NoSQL the World OverAlex Pinkin
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon Web Services
 

Ähnlich wie Cloudera Impala + PostgreSQL (20)

LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion Days
 
MariaDB ColumnStore
MariaDB ColumnStoreMariaDB ColumnStore
MariaDB ColumnStore
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data Service
 
WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...
WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...
WyspaIT 2016 - Azure Stream Analytics i Azure Machine Learning w analizie str...
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
MySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and GraphiteMySQL performance monitoring using Statsd and Graphite
MySQL performance monitoring using Statsd and Graphite
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский Дмитрий
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
Getting started with amazon redshift - Toronto
Getting started with amazon redshift - TorontoGetting started with amazon redshift - Toronto
Getting started with amazon redshift - Toronto
 
Solr Power FTW: Powering NoSQL the World Over
Solr Power FTW: Powering NoSQL the World OverSolr Power FTW: Powering NoSQL the World Over
Solr Power FTW: Powering NoSQL the World Over
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
 

Kürzlich hochgeladen

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Kürzlich hochgeladen (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Cloudera Impala + PostgreSQL

  • 1. Running Cloudera Impala on PostgreSQL By Chengzhong Liu liuchengzhong@miaozhen.com 2013.12
  • 2. Story coming from… • Data gravity • Why big data • Why SQL on big data
  • 3. Today agenda • • • • • • Big data in Miaozhen 秒针系统 Overview of Cloudera Impala Hacking practice in Cloudera Impala Performance Conclusions Q&A
  • 4. What happened in miaozhen • 3 billion Ads impression per day • 20TB data scan for report generation every morning • 24 servers cluster • Besides this – – – – TV Monitor Mobile Monitor Site Monitor …
  • 5. Before Hadoop • Scrat – PostgreSQL 9.1 cluster – Write a simple proxy – <2s for 2TB data scan • Mobile Monitor – Hadoop-like distribute computing system – Rabbit MQ + 3 computing servers – Write a Map-Reduce in C++ – Handles 30 millions to 500 millions Ads impression
  • 6. Problem & Chance • Database cluster • SQL on Hadoop • Miscellaneous data • Requirements – Most data is rational – SQL interface
  • 7. SQL on Hadoop • • • • • Google Dremel Apache Drill Cloudera Impala Facebook Presto EMC Greenplum/Pivotal Latency matters Pig Impala/Drill /Pivotal/Presto Map Reduce HDFS Hive
  • 8. What’s this • A kind of MPP engine • In memory processing • Small to big join – Broadcast join • Small result size
  • 9. Why Cloudera Impala • The team move fast – UDF coming out – Better join strategy on the way • Good code base – Modularize – Easy to add sub classes • Really fast – Llvm code generation • 80s/95s – uv test – Distributed aggregation Tree – In-situ data processing (inside storage)
  • 10. Typical Arch. SQL Interface Meta Store Query Planner Query Planner Query Planner Coordinat or Coordinat or Coordinat or Exec Engine Exec Engine Exec Engine
  • 11. Our target • A MPP database – Build on PostgreSQL9.1 – Scale well – Speed • A mixed data source MPP query engine – Join two tables in different sources – In fact…
  • 12. Hacking… from where • Add, not change – Scan Node type – DB Meta info • Put changes in configuration – Thrift Protocol update • TDBHostInfo • TDBScanNode
  • 13. Front end • Meta store update – Link data to the table name – Table location management • Front end – Compute table location
  • 14. Back end • Coordinator – pg host • New scan node type – db scan node • Pg scan node • Psql library using cursor
  • 15. SQL Plan • select count(distinct id) from table – MR like process HDFS/PG scan Aggr. : group by id Exchange node Aggr. : group by id Aggr. : count(id) Exchange node Aggr.: sum(count(id)
  • 16. Env. • Ads impression logs – 150 millions, 100KB/line • 3 servers – – – – 24 cores 32 G mem 2T * 12 HD 100Mbps LAN • Query – Select count(id) from t group by campaign – Select count(distinct id) from t group by campaign – Select * from t where id = ‘xxxxxxxx’
  • 17. Performance • Group by speed / core • 20 M /s 700 600 500 400 impala hive 300 pg+impala 200 100 0 1 2 3
  • 19. Codegen on/off • select count(distinct id) from t group by c 100 90 80 70 • select distinct id from t 60 50 en_codegen 40 dis_codegen 30 • 20 select id from t 10 group by id 0 having uv_test count(case when c = '1' then 1 else null end) > 0 and count(case when c= 2' then 1 else null end) > 0 limit 10; distinct duplicated
  • 21. Conclusion • Source quality – Readable – Google C++ style – Robust • MPP solution based on PG – Proved perf. – Easy to scale • Mixed engine usage – HDFS and DB
  • 22. What’s next • • • • • Yarn integrating UDF Join with Big table BI roadmap Fail over
  • 23. Rerf. • Cloudera Impala online doc. & src • http://files.meetup.com/1727991/Impala%20and %20BigQuery.ppt • http://www.cubrid.org/blog/dev-platform/meetimpala-open-source-real-time-sql-querying-onhadoop/ • http://berlinbuzzwords.de/sites/berlinbuzzwords. de/files/slides/Impala%20tech%20talk.pdf • @datascientist, @dongxicheng, @flyingsk, @zhh