Speed up Cubing with Spark

This document discusses speeding up OLAP cube building in Apache Kylin using Spark. Cubing with MapReduce can be slow due to serialization overhead and repeated job submissions. Spark allows caching data in memory across cuboid layers in one job, significantly reducing build times compared to MapReduce as shown in a benchmark on a 160 million row dataset. Spark simplifies Kylin development and brings capabilities for real-time OLAP and cloud integration.

Luke Han, luke@kyligence.io
Shaofeng Shi, shaofeng@kyligence.io
Apache Kylin:
Speed up Cubing with Spark

About Us
Shaofeng Shi (史少锋)
• Apache Kylin PMC
• Sr. Architect, Kyligence Inc.
Luke Han (韩卿)
• VP of Apache Kylin
• Co-founder & CEO, Kyligence Inc.
Kyligence Inc.: formed by the team who created Apache Kylin
http://kyligence.io

Agenda
• What is Apache Kylin
• Challenges with MapReduce
• Speed up Cubing with Spark
• Q&A

About Apache Kylin
✓ Leading open source OLAP on Hadoop
✓ Fast-growing open source community
✓ Adopted by 200+ global organizations
✓ First Apache Top-Level Project to originate in China
✓ InfoWorld Bossie Award:
– Best Open Source Big Data Tool (2015)
– Best Open Source Big Data Tool (2016)
Kylin / ˈkiːˈlɪn / 麒麟
-- n. (in Chinese art) a mythical animal of composite form
Apache Kylin
Extreme OLAP Engine for Big Data
http://kylin.apache.org

Global Users: 200+ use cases in production

About Apache Kylin
• Extreme OLAP Engine on Hadoop
✓ High Performance
✓ High Concurrency
✓ ANSI SQL
✓ Native on Hadoop
✓ Cloud Ready

The Basic: OLAP Cube
• Pre-calculate metrics by dimensions
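
As a minimal sketch of what "pre-calculate metrics by dimensions" can mean, the Python below sums one measure for every combination of dimensions (every cuboid) over a tiny in-memory table. The column names and rows are invented for illustration, and this is not Kylin's implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (returned, status, quantity); quantity is the measure.
ROWS = [
    ("Y", "open", 5),
    ("Y", "closed", 3),
    ("N", "open", 2),
    ("N", "open", 4),
]
DIMENSIONS = ("returned", "status")


def build_cube(rows, dimensions):
    """Return {cuboid: {dimension values: SUM(measure)}} for every subset of dimensions."""
    cube = {}
    for size in range(len(dimensions), -1, -1):
        for dim_indexes in combinations(range(len(dimensions)), size):
            cuboid = tuple(dimensions[i] for i in dim_indexes)
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dim_indexes)
                totals[key] += row[-1]  # the measure is the last field
            cube[cuboid] = dict(totals)
    return cube


cube = build_cube(ROWS, DIMENSIONS)
print(cube[("returned",)])
```

With 2 dimensions this yields 4 cuboids; with N dimensions the count grows to 2^N, which is why the build strategy matters.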

The Magic: Pre-Calculation
[Diagram: the query plan (Tables, Join, Filter, Aggr, Sort) is split into two phases. "Offline: Cubing" builds the Cube; "Online: Query from Cube" applies Filter, Post Aggr. and Sort.]
SELECT
    returned,
    status,
    sum(quantity),
    sum(price)
FROM
    lineitem INNER JOIN orders
    ON l_orderkey = o_orderkey
WHERE
    shipdate = '2016-09-16'
GROUP BY
    returned, status
ORDER BY
    returned, status;
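
The online half of this idea can be sketched in Python as follows, assuming a cuboid over (shipdate, returned, status) has already been built offline. The cuboid contents and the function name query_from_cuboid are hypothetical; the point is only that the query touches pre-aggregated rows instead of the joined source tables.

```python
from collections import defaultdict

# Hypothetical pre-built cuboid keyed by (shipdate, returned, status),
# holding (SUM(quantity), SUM(price)); all values are invented.
CUBOID = {
    ("2016-09-16", "N", "O"): (10, 100.0),
    ("2016-09-16", "R", "F"): (4, 40.0),
    ("2016-09-17", "N", "O"): (7, 70.0),
    ("2016-09-16", "N", "F"): (1, 12.5),
}


def query_from_cuboid(cuboid, shipdate):
    """Filter on shipdate, post-aggregate by (returned, status), then sort."""
    totals = defaultdict(lambda: [0, 0.0])
    for (date, returned, status), (quantity, price) in cuboid.items():
        if date != shipdate:                 # Filter
            continue
        group = totals[(returned, status)]   # Post-aggregation
        group[0] += quantity
        group[1] += price
    # Sort by (returned, status), matching the ORDER BY clause
    return [(r, s, q, p) for (r, s), (q, p) in sorted(totals.items())]


result = query_from_cuboid(CUBOID, "2016-09-16")
print(result)
```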

Apache Kylin: O(1)
[Chart: response time vs. data size. Apache Kylin: O(1); other engines: O(N).]

Star-Schema Benchmark
Run SSB at 10, 20 and 40 million-row scales: Kylin's response time stays stable
Read more at: https://github.com/kyligence/ssb-kylin
Kylin vs Hive, O(1) : O(N)

Cubing Process
Cubing: Map -> Shuffle -> Reduce

Build Cube with MapReduce
• Calculate cuboids by layer: N-dim (base cuboid), N-1 dim, N-2 …, 1, 0
• Reuse the previous layer's result
• HDFS is used for data sharing
• N rounds of MR are needed in total
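
The "reuse previous layer's result" step can be illustrated with a small Python sketch: a child cuboid is obtained by dropping one dimension from a parent cuboid and re-summing, without going back to the source rows. The data and the helper name aggregate_child are made up for illustration.

```python
from collections import defaultdict


def aggregate_child(parent, drop_index):
    """Build an (N-1)-dimension cuboid by dropping one dimension of the parent and re-summing."""
    child = defaultdict(int)
    for key, total in parent.items():
        child_key = key[:drop_index] + key[drop_index + 1:]
        child[child_key] += total
    return dict(child)


# Hypothetical base cuboid over (year, carrier, origin) with SUM(flights).
base = {
    (2015, "AA", "SFO"): 3,
    (2015, "AA", "JFK"): 2,
    (2015, "UA", "SFO"): 4,
    (2016, "AA", "SFO"): 1,
}

year_carrier = aggregate_child(base, 2)       # one layer down: drop "origin"
year_only = aggregate_child(year_carrier, 1)  # next layer, built from the previous one
print(year_carrier, year_only)
```

In the MapReduce flow described above, each such step would be a separate job that reads its parent from HDFS and writes its result back.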

Challenges with MapReduce
• Slow data sharing
✓ Serialization, replication, deserialization…
• Repeated job submission
✓ Dependent jars/files are submitted repeatedly
✓ Jobs re-queue when the cluster is busy
• Lack of streaming support
✓ Cannot support low-latency data processing
• Limited storage support
✓ Cannot ingest data from non-HDFS sources

Speed up Cubing with Spark
• Abstract each layer of cuboids as an RDD
• Cache the parent RDD to generate the child RDD
• Export an RDD once its child has been generated
• Existing measure aggregators can be reused as-is with the Spark Java API
[Diagram: RDD-1, RDD-2, RDD-3, RDD-4, RDD-5]
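
The control flow of these bullets can be sketched in plain Python. This is a simulation only, not Spark code and not Kylin's implementation: parent_layer stands in for the cached parent RDD, and the export callback stands in for writing a layer to storage. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_by_layer(base_rows, n_dims, export):
    """Build every cuboid layer by layer, keeping only the parent layer in memory."""
    all_dims = tuple(range(n_dims))
    base = defaultdict(int)
    for *dims, measure in base_rows:
        base[tuple(dims)] += measure
    parent_layer = {all_dims: dict(base)}      # stands in for the cached parent RDD
    export(all_dims, parent_layer[all_dims])
    for size in range(n_dims - 1, -1, -1):
        child_layer = {}
        for child_dims in combinations(all_dims, size):
            # Pick any in-memory parent that contains all of the child's dimensions.
            parent_dims = next(p for p in parent_layer if set(child_dims) <= set(p))
            keep = [parent_dims.index(d) for d in child_dims]
            totals = defaultdict(int)
            for key, total in parent_layer[parent_dims].items():
                totals[tuple(key[i] for i in keep)] += total
            child_layer[child_dims] = dict(totals)
            export(child_dims, child_layer[child_dims])
        parent_layer = child_layer              # the previous layer can now be released


exported = {}


def export_to_dict(dims, data):
    """Hypothetical export target; a real build would write to storage."""
    exported[dims] = data


rows = [("Y", "open", 5), ("Y", "closed", 3), ("N", "open", 2), ("N", "open", 4)]
build_by_layer(rows, 2, export_to_dict)
print(sorted(exported))
```

Because every layer is derived from data already held in memory, the whole loop runs as one process, which mirrors the single-job property the slides attribute to the Spark approach.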

Speed up Cubing with Spark
Reuse data in memory

Speed up Cubing with Spark
One job finishes the aggregation of all layers!

Performance Benchmark
• Environment
✓ 4-node Hadoop cluster; each node has 28 GB RAM and 12 cores
✓ CDH 5.8, Apache Kylin 2.0
• Spark
✓ Spark 1.6.3 on YARN
✓ 6 executors, each with 4 cores and 5 GB memory
• Test Data
✓ Airline data from the US DoT, 160 million rows in total
✓ Cube: 10 dimensions, 5 measures (SUM)
• Test Scenarios
✓ Build the cube at different scales: 3 million, 50 million and 160 million rows

Spark Cubing vs. MR Cubing
Build Cube Time Comparison (build time in minutes, values read from the bar chart)

Source data rows    MapReduce    Spark
3 million           6            3
50 million          19           8
160 million         29           17
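
A small worked calculation of the time saved, as a sketch. It assumes the six chart values pair up as MapReduce 6/19/29 minutes and Spark 3/8/17 minutes for 3, 50 and 160 million rows; the extracted chart does not state this pairing explicitly, so treat the mapping as an assumption.

```python
# Build times in minutes. The pairing of chart values with engines and
# data scales is an assumption, not something the chart text spells out.
BUILD_MINUTES = {
    "3 million": {"mapreduce": 6, "spark": 3},
    "50 million": {"mapreduce": 19, "spark": 8},
    "160 million": {"mapreduce": 29, "spark": 17},
}


def time_saved_percent(mapreduce_minutes, spark_minutes):
    """Share of the MapReduce build time that Spark saves, in whole percent."""
    return round(100 * (mapreduce_minutes - spark_minutes) / mapreduce_minutes)


saved = {scale: time_saved_percent(t["mapreduce"], t["spark"])
         for scale, t in BUILD_MINUTES.items()}
print(saved)
```

Under that assumption the saving is between roughly 40% and 60% of the MapReduce build time at each scale.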

Spark Cubing vs. MR In-Mem Cubing
MR In-mem is slow when the dataset is big and randomly distributed
When the data is sharded, MR In-mem can be fast
Spark is fast and stable regardless of data distribution

Benefits with Spark
• Spark speeds up cubing by about 1x (roughly twice as fast)
✓ About half of the build time is saved
• Spark simplifies Kylin's development
✓ More functions, less code
• Spark brings Kylin to a new era
✓ Real-time OLAP, ad hoc query, cloud integration

Thank You.
Join Apache Kylin community
dev@kylin.apache.org
Visit Kylin & Kyligence home
http://kylin.apache.org
http://kyligence.io
