Cloudera Impala + PostgreSQL

Hacking Cloudera Impala to run on a PostgreSQL cluster as an MPP-style engine. Performance is verified for typical SQL statements and under concurrent use.

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu
liuchengzhong@miaozhen.com
2013.12

Story coming from…
• Data gravity
• Why big data
• Why SQL on big data

Today's agenda
• Big data in Miaozhen 秒针系统
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A

What happened in Miaozhen
• 3 billion ad impressions per day
• 20 TB data scan for report generation every morning
• 24-server cluster

• Besides this
– TV Monitor
– Mobile Monitor
– Site Monitor
– …

Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for a 2 TB data scan

• Mobile Monitor
– Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– Wrote a MapReduce in C++
– Handles 30 million to 500 million ad impressions

Problems & Opportunities
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL interface

SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal

Latency matters
[Diagram labels: Pig, Hive, Map Reduce, Impala/Drill/Pivotal/Presto, HDFS]

What’s this
• A kind of MPP engine
• In memory processing
• Small to big join
– Broadcast join

• Small result size
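
The broadcast join mentioned above can be sketched roughly as follows. This is a minimal single-process illustration, not Impala's implementation: the function name `broadcast_join`, the sample rows, and the list-of-partitions stand-in for cluster nodes are all assumptions made for the example.

```python
from collections import defaultdict


def broadcast_join(big_partitions, small_table, key):
    """Join a partitioned big table with a small table copied to every node."""
    # Build a hash table on the small side. In a real cluster every node
    # would receive its own copy of this table (the "broadcast").
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    results = []
    for partition in big_partitions:      # one iteration stands for one node
        for big_row in partition:         # the big table never moves
            for small_row in lookup.get(big_row[key], []):
                results.append({**small_row, **big_row})
    return results


# Hypothetical data: impressions are spread over two "nodes".
impressions = [
    [{"campaign": 1, "id": "a"}, {"campaign": 2, "id": "b"}],
    [{"campaign": 1, "id": "c"}],
]
campaigns = [{"campaign": 1, "name": "spring"}, {"campaign": 2, "name": "summer"}]
joined = broadcast_join(impressions, campaigns, "campaign")
```

Only the small table is shipped over the network, which is why this strategy suits small-to-big joins.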

Why Cloudera Impala
• The team moves fast
– UDFs coming out
– Better join strategy on the way

• Good code base
– Modular
– Easy to add subclasses

• Really fast
– LLVM code generation
• 80s/95s – uv test

– Distributed aggregation tree
– In-situ data processing (inside storage)

Typical Arch.
[Diagram: SQL Interface, Meta Store, and three instances each of Query Planner, Coordinator, and Exec Engine]

Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Speed

• A mixed data source MPP query engine
– Join two tables in different sources
– In fact…

Hacking… from where
• Add, not change
– Scan Node type
– DB Meta info

• Put changes in configuration
– Thrift Protocol update
• TDBHostInfo
• TDBScanNode

Front end
• Meta store update
– Link data to the table name
– Table location management

• Front end
– Compute table location
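
One way to picture "link data to the table name" and "compute table location" is a small catalog lookup. The sketch below is an assumption-laden illustration: `PgLocation`, `TABLE_LOCATIONS`, `compute_table_locations`, and the host names are hypothetical and do not come from the Impala source or from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PgLocation:
    """Hypothetical description of one PostgreSQL instance holding table data."""
    host: str
    port: int
    database: str


# Hypothetical catalog: table name -> PostgreSQL instances that hold its rows.
TABLE_LOCATIONS = {
    "impressions": [
        PgLocation("pg-node-1", 5432, "ads"),
        PgLocation("pg-node-2", 5432, "ads"),
        PgLocation("pg-node-3", 5432, "ads"),
    ],
}


def compute_table_locations(table_name, catalog=TABLE_LOCATIONS):
    """Return one scan assignment per instance that holds part of the table."""
    try:
        shards = catalog[table_name]
    except KeyError:
        raise LookupError(f"no location registered for table {table_name!r}") from None
    return [{"scan_id": i, "location": loc} for i, loc in enumerate(shards)]


assignments = compute_table_locations("impressions")
```

The planner would then hand each assignment to a scan node, in the same way HDFS block locations drive scan placement.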

Back end
• Coordinator
– PG host

• New scan node type
– DB scan node
• PG scan node
• Psql library using a cursor
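
The point of scanning through a cursor is that the scan node pulls rows in batches instead of materializing the whole result. The sketch below shows that pattern with Python's built-in `sqlite3` module as a stand-in for PostgreSQL, so it runs without a server; the helper `scan_in_batches`, the table, and the batch size are assumptions for illustration only.

```python
import sqlite3


def scan_in_batches(conn, query, batch_size=2):
    """Yield row batches from a cursor instead of loading the whole result."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)   # pull at most batch_size rows
        if not batch:                       # empty list: the scan is finished
            break
        yield batch
    cur.close()


# In-memory stand-in for a table stored on one PostgreSQL host.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("b", 1), ("c", 2), ("d", 2), ("e", 3)],
)
batches = list(scan_in_batches(conn, "SELECT id, campaign FROM t ORDER BY id"))
```

Each batch would correspond to one row batch handed upward to the aggregation or exchange node.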

SQL Plan
• select count(distinct id) from table
– MR-like process

HDFS/PG scan
Aggr.: group by id

Exchange node
Aggr.: group by id
Aggr.: count(id)

Exchange node
Aggr.: sum(count(id))
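
The plan above can be simulated in a few lines. This is a sketch of the general multi-phase technique under simplifying assumptions (everything runs in one process, partitions are Python lists, and the helper names are made up for the example); it is not Impala's planner or executor.

```python
def local_distinct(rows):
    """Scan node: group by id, which drops duplicates seen on this node."""
    return set(rows)


def exchange_by_hash(partials, num_nodes):
    """Exchange node: route each id by hash so equal ids meet on one node."""
    buckets = [[] for _ in range(num_nodes)]
    for ids in partials:
        for value in ids:
            buckets[hash(value) % num_nodes].append(value)
    return buckets


def count_distinct(scan_partitions, num_nodes=3):
    # Phase 1: local group by id on every scan node.
    partials = [local_distinct(p) for p in scan_partitions]
    # Phase 2: after the exchange, group by id again, then count(id) per node.
    buckets = exchange_by_hash(partials, num_nodes)
    per_node_counts = [len(set(bucket)) for bucket in buckets]
    # Phase 3: the coordinator sums the per-node counts.
    return sum(per_node_counts)


# Hypothetical ids spread over three scan partitions.
total = count_distinct([["a", "b", "a"], ["b", "c"], ["c", "d", "a"]])
```

Because every id lands on exactly one node after the hash exchange, the per-node counts never overlap and can simply be summed.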

Env.
• Ad impression logs
– 150 million lines, 100KB/line

• 3 servers
– 24 cores
– 32 GB memory
– 12 × 2 TB hard disks
– 100 Mbps LAN

• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'

Performance
• Group-by speed per core
• 20 M/s
[Chart: impala, hive, and pg+impala; x-axis 1 to 3, y-axis 0 to 700]

With index

Codegen on/off
• select count(distinct id) from t group by c

• select distinct id from t

• select id from t
group by id
having
count(case when c = '1' then 1 else null end) > 0
and
count(case when c = '2' then 1 else null end) > 0
limit 10;

[Chart: en_codegen vs. dis_codegen for uv_test, distinct, and duplicated; y-axis 0 to 100]
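
To make the HAVING query above concrete: it returns ids that have at least one row with c = '1' and at least one row with c = '2'. The sketch below runs it on Python's built-in `sqlite3` as a stand-in engine with made-up rows; it assumes the second condition reads `c = '2'` and that `c` is a text column, neither of which the slide states explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("u1", "1"), ("u1", "2"),   # seen under both values: qualifies
        ("u2", "1"), ("u2", "1"),   # only c = '1': filtered out
        ("u3", "2"),                # only c = '2': filtered out
    ],
)

# count() ignores NULLs, so each CASE expression counts only matching rows.
rows = conn.execute(
    """
    SELECT id FROM t
    GROUP BY id
    HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
       AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
    LIMIT 10
    """
).fetchall()
```

With this sample data only `u1` survives the filter, since it is the only id present under both values of `c`.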

Multi-users

Conclusion
• Source quality
– Readable
– Google C++ style
– Robust

• MPP solution based on PG
– Proven performance
– Easy to scale

• Mixed engine usage
– HDFS and DB

What’s next
• YARN integration
• UDF
• Join with big tables
• BI roadmap
• Failover

Ref.
• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meetimpala-open-source-real-time-sql-querying-onhadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh

Thanks! Q&A
