The Evolution of Apache Kylin
Realtime & Plugin Architecture in Kylin
Li, Yang | 李扬
Co-founder & CTO at Kyligence Inc.
Agenda
• What’s Apache Kylin?
• New Features in Kylin
  • Plugin Architecture
  • Fast Cubing
  • Parallel Scan
  • Streaming Cubing
  • User Defined Aggregation
• Summary
Extreme OLAP Engine for Big Data
Kylin is an open-source distributed analytics engine from eBay that
provides a SQL interface and multi-dimensional analysis (OLAP) on
Hadoop, supporting extremely large datasets.
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟
n. (in Chinese art) a mythical animal of composite form
• Nov 2014: Apache Incubator Project
• Nov 2015: Apache Top Level Project
Feature – SQL Interface
[Diagram: Hive Table → Build Cube (Index) → SQL Query]
Feature – Big Data
• eBay
Case                  Cube Size  Raw Records
Session Analysis      20 TB      81+ billion rows
Traffic Analysis      30 TB      28+ billion rows
Transaction Analysis  560 GB     1.2+ billion rows
Feature – Low Latency
• 90% of queries return in under 5 s
• The 90th-percentile query returns in 3 seconds
[Chart: query latency; dark-blue line: 90th-percentile queries, light-blue line: 95th-percentile queries]
Feature – BI Integration via ODBC, JDBC
Feature – Scalable Throughput
• Linear scale-out with more nodes
Plugin Architecture Overview
[Architecture diagram; recoverable labels listed below]
• Clients: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC
• Kylin: REST Server, Query Engine, Routing, Metadata, Cube Builder (MapReduce…)
• Abstractions: DataSource Abstraction, Engine Abstraction, Storage Abstraction
• Offline Data Flow: Star Schema Data (Hadoop, Hive) → Cube Builder → Key Value Data, OLAP Cubes (HBase)
• Online Analysis Data Flow: SQL → Query Engine → OLAP Cubes (HBase); low latency, seconds
• Clients/users interact with Kylin via SQL
• The OLAP cube is transparent to users
Plugin Architecture
[Diagram 1: Cube Metadata with SourceFactory, EngineFactory, and StorageFactory; the MR Engine has an IN side fed by the Hive Source and an OUT side written to the HBase Storage]
[Diagram 2: a Hive Adapter adapts the Hive Source to the engine's IN (load data); an HBase Adapter adapts the engine's OUT to the HBase Storage (save cube)]
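The source, engine, and storage split described above can be sketched in a few lines. This is a minimal illustration in Python, not Kylin's actual (Java) interfaces: the class names `ListSource`, `DictStorage`, and `SumEngine`, the `SOURCES`/`STORAGES` registries, and the metadata keys are all hypothetical stand-ins for the Hive adapter, the HBase adapter, the MR engine, and the factories.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Source(ABC):
    """DataSource abstraction: yields fact rows as dicts."""
    @abstractmethod
    def load(self): ...


class Storage(ABC):
    """Storage abstraction: persists cube rows."""
    @abstractmethod
    def save(self, cube): ...


class ListSource(Source):
    """Stand-in for a Hive adapter; reads rows from memory."""
    def __init__(self, rows):
        self.rows = rows

    def load(self):
        return iter(self.rows)


class DictStorage(Storage):
    """Stand-in for an HBase adapter; keeps the cube in a dict."""
    def __init__(self):
        self.table = {}

    def save(self, cube):
        self.table.update(cube)


class SumEngine:
    """Toy engine: groups by the given dimensions and sums one measure."""
    def __init__(self, dims, measure):
        self.dims, self.measure = dims, measure

    def build(self, source, storage):
        cube = defaultdict(int)
        for row in source.load():                 # IN side
            key = tuple(row[d] for d in self.dims)
            cube[key] += row[self.measure]
        storage.save(dict(cube))                  # OUT side


# Factories keyed by names taken from (hypothetical) cube metadata.
SOURCES = {"list": ListSource}
STORAGES = {"dict": DictStorage}


def build_cube(metadata, rows):
    source = SOURCES[metadata["source"]](rows)
    storage = STORAGES[metadata["storage"]]()
    SumEngine(metadata["dims"], metadata["measure"]).build(source, storage)
    return storage.table
```

Because the engine only sees `load()` and `save()`, swapping in another source or storage means registering one more adapter, which is the point the diagrams make.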
Developing Modules
• Engine
  • MR V1
  • MR V2
  • Spark (early)
  • Streaming (experimental)
• Source
  • Hive
  • Kafka
  • Spark SQL & DataFrames
• Storage
  • HBase
  • ? Kudu
  • ? Cassandra
The Freedom, Extensibility, Flexibility
• Freedom
  • Zoo break: not bound to Hadoop any more
  • Free to move to a better engine or storage
• Extensibility
  • Accept any input, e.g. Kafka
  • Embrace next-gen distributed platforms, e.g. Spark
• Flexibility
  • Choose a different engine for each data set
Layered Cubing (MR Engine V1)
[Diagram: Full Data → 4-D Cuboid (A,B,C,D) → 3-D Cuboids (A,B,C; A,B,D; A,C,D; B,C,D) → 2-D Cuboids → 1-D Cuboids → 0-D Cuboid, with one MR job per layer]
• Pros
  • Simple implementation: relies on the MR shuffle to merge-sort and then aggregate
  • Low memory requirement
• Cons
  • Aggregation happens on the reducer side
  • Mappers output raw data, so the shuffle is huge
  • Overhead of multiple MR rounds
  • Shuffle can be 100x the cube size, causing heavy I/O pressure
[Diagram: several mappers feeding one reducer]
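The layer-by-layer idea can be sketched as follows. This is a single-process Python illustration, assuming one summed measure and records given as `(dimension tuple, value)` pairs; each pass of the outer loop stands in for one MR round, and the function names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(rows, keep):
    """Group (key, value) rows by the key positions in `keep` and sum."""
    out = defaultdict(int)
    for key, value in rows:
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def layered_cubing(records, n_dims):
    """Return {dimension-index tuple: {group key: sum}} for all cuboids."""
    base = tuple(range(n_dims))
    cube = {base: aggregate(records, base)}       # round 1: N-D base cuboid
    for size in range(n_dims - 1, -1, -1):        # one round per lower layer
        for dims in combinations(range(n_dims), size):
            # Pick any parent cuboid one layer up that contains `dims`.
            parent = next(
                p for p in cube
                if len(p) == size + 1 and set(dims) <= set(p)
            )
            keep = [parent.index(d) for d in dims]
            cube[dims] = aggregate(cube[parent].items(), keep)
    return cube
```

Each layer reads only the layer above it, which keeps memory use low but means N+1 passes for N dimensions.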
Fast Cubing
• Pros
  • In-memory cubing algorithm that can be reused by Streaming, Spark, etc.
  • Mapper-side aggregation
  • Less shuffling, given the right data split
  • One round of MR
• Cons
  • Code complexity
  • High mapper CPU/memory consumption
[Diagram: Data Split, Data Split, Data Split, … → Merge Sort (Shuffle) → Final Cube]
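A rough sketch of the one-round approach, under the same assumptions as a toy model (one summed measure, records as `(dimension tuple, value)` pairs). The per-split function below enumerates every cuboid for each record, which is the naive way to cube in memory and not necessarily how Kylin's in-memory algorithm walks the cuboid tree; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cube_split(records, n_dims):
    """'Mapper': build every cuboid for one data split in memory."""
    partial = Counter()
    for key, value in records:
        for size in range(n_dims + 1):
            for dims in combinations(range(n_dims), size):
                partial[(dims, tuple(key[i] for i in dims))] += value
    return partial


def fast_cubing(splits, n_dims):
    """Single round: cube each split, then merge the partial cubes."""
    final = Counter()
    for split in splits:
        final.update(cube_split(split, n_dims))   # Counter.update adds values
    return dict(final)
```

The merge step plays the role of the shuffle: what crosses it is already aggregated per split, so its size depends on how much the splits' keys overlap.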
Fast Cubing (MR Engine V2)
• If data splits are unique
  • Fast cubing wins
• If data splits are common
  • Layered cubing wins
• The new cube engine chooses the right algorithm based on data sampling.
• Overall build time is 1.5x faster, summed over 500 jobs.
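One way to picture a sampling-based choice is shown below. It reads "unique" as "a key rarely appears in more than one split" and "common" as "the same keys recur across splits"; that reading, the `overlap_ratio` metric, and the threshold of 2.0 are all assumptions made for illustration, not the rule the engine actually applies.

```python
def overlap_ratio(splits):
    """Sum of per-split distinct keys divided by globally distinct keys.

    1.0 means no key appears in more than one split; larger values mean
    the same keys recur across splits.
    """
    per_split = [set(split) for split in splits]
    total = sum(len(s) for s in per_split)
    distinct = len(set().union(*per_split))
    return total / distinct if distinct else 1.0


def choose_algorithm(sampled_splits, threshold=2.0):
    """Hypothetical rule; the threshold is made up for illustration."""
    if overlap_ratio(sampled_splits) < threshold:
        return "fast"
    return "layered"
```

With low overlap, mapper-side aggregation leaves little for the merge step to do; with high overlap, every split emits the same keys and the merge grows accordingly.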
Parallel Scan
• Slow queries are 5-10x faster.
• The new HBase storage partitions cuboids that are big enough.
• Overall query time is 2x faster than before, summed over 10,000+ queries.
[Diagram: in one layout a Query reads Cuboid A, Cuboid B, and Cuboid C, each held whole on Server 1-3; in the other the large cuboids are partitioned (A1, A2, A3 and B1, B2) across Server 1-3 alongside C, and a Query scans the partitions in parallel]
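The scan-and-merge pattern can be sketched with a thread pool. This is a local Python illustration only: each list in `shards` stands in for one partition of a cuboid on one server, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_shard(shard, predicate):
    """Scan one partition of a cuboid and pre-aggregate matching rows."""
    partial = Counter()
    for key, value in shard:
        if predicate(key):
            partial[key] += value
    return partial


def parallel_scan(shards, predicate):
    """Scan all partitions concurrently and merge the partial results."""
    workers = max(1, len(shards))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda s: scan_shard(s, predicate), shards)
        result = Counter()
        for partial in partials:
            result.update(partial)
    return dict(result)
```

Since each partition is filtered and aggregated where it lives, only small partial results need to be merged at the end.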
Near Realtime Incremental Build
• Minute-level micro cubes
• Kafka source
• In-memory cubing
• Auto merge
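The micro-cube and auto-merge steps above can be sketched as two small functions. This is a simplified Python model: events are `(timestamp, key, value)` tuples standing in for Kafka messages, and the 60-second window and 300-second merge span are assumed values, not Kylin defaults.

```python
from collections import Counter


def build_micro_cubes(events, window_secs=60):
    """Bucket (timestamp, key, value) events into one small cube per window."""
    cubes = {}
    for ts, key, value in events:
        start = ts - ts % window_secs
        cubes.setdefault(start, Counter())[key] += value
    return cubes


def auto_merge(cubes, merge_secs=300):
    """Fold micro cubes into larger segments aligned to `merge_secs`."""
    merged = {}
    for start, cube in cubes.items():
        segment = start - start % merge_secs
        merged.setdefault(segment, Counter()).update(cube)
    return merged
```

Merging keeps the number of segments a query must touch small even though new micro cubes arrive every window.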
Future Lambda Architecture for Realtime
[Diagram: streaming data from Kafka feeds a Real-time In-Mem Store (Inverted Index) holding the latest seconds and, in minute batches, the Cube Storage (Cube); a SQL Query reads both through a Hybrid Storage Interface]
Use Case: SEO Operational Dashboard
Dimensions
• eBay Site
  • ebay.com, ebay.co.uk, ebay.de
• Buyer Country
  • US, CN, RU
• Search Engine
  • Google, Bing, Yahoo!
• Referrer
  • google.com, google.co.uk
• Page
  • Search, View Item, Product
• User Experience
  • Desktop, Mobile APP, mWeb
Measurements
• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items, etc.
User Defined Aggregation Types
• HyperLogLog Count Distinct
• TopN
• BitMap Precise Count Distinct
  • from Sun, Yerui (netease.com)
• Raw Records
  • from Wang, Xiaoyu (jd.com)
• Domain-specific aggregations now become easy
  • aggregate user events to detect time series or access patterns
  • draw a sketch of certain user groups
  • pre-calculate clusters of data points
  • histogram…
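What a user-defined aggregation type needs to provide can be sketched as a small contract: accept a value, merge with another partial result, and report the answer. The `Measure` interface and both example classes below are hypothetical Python stand-ins, not Kylin's extension API; a plain `set` replaces the compressed bitmap a real precise count distinct would use.

```python
from abc import ABC, abstractmethod


class Measure(ABC):
    """Hypothetical contract for a pre-aggregated measure type."""

    @abstractmethod
    def add(self, value): ...

    @abstractmethod
    def merge(self, other): ...

    @abstractmethod
    def result(self): ...


class PreciseCountDistinct(Measure):
    """Exact distinct count; a set stands in for a compressed bitmap."""

    def __init__(self):
        self.seen = set()

    def add(self, value):
        self.seen.add(value)

    def merge(self, other):
        self.seen |= other.seen

    def result(self):
        return len(self.seen)


class Histogram(Measure):
    """Fixed-width histogram as an example of a domain-specific aggregate."""

    def __init__(self, width):
        self.width = width
        self.buckets = {}

    def add(self, value):
        lo = (value // self.width) * self.width
        self.buckets[lo] = self.buckets.get(lo, 0) + 1

    def merge(self, other):
        for lo, n in other.buckets.items():
            self.buckets[lo] = self.buckets.get(lo, 0) + n

    def result(self):
        return dict(sorted(self.buckets.items()))
```

The `merge` method is what makes a type usable in a cube: partial results from different splits or segments must combine into the same answer as a single pass would give.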
TopN Support
Query:
select dt, loc, item, sum(gmv)
from test_kylin_fact
where dt='2015-10-1' and loc='CN'
group by dt, loc, item
order by 4 desc
limit 100
Cube pre-calculation:
DT,LOC        TopN
2015-10-1,CN  Item A, $500
              Item B, $300
              …
• TopN as a measure
• Approximate algorithm
  • SpaceSaving TopN
    • Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”. Proceedings of the 10th International Conference on Database Theory (ICDT'05), 2005.
  • A parallel version
    • Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution”. arXiv:1401.0702v12 [cs.DS], 19 Sep 2015.
• Answer TopN queries directly from pre-calculation
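A compact sketch of a Space-Saving summary with a merge step follows. It is written from the general description of the algorithm and in the spirit of the parallel variant cited above, not taken from Kylin's code: the class and method names are hypothetical, and the linear scan for the smallest counter is a simplification of the structures described in the papers.

```python
class SpaceSaving:
    """Space-Saving summary holding at most `capacity` counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def offer(self, item, weight=1):
        if item in self.counts:
            self.counts[item] += weight
        elif len(self.counts) < self.capacity:
            self.counts[item] = weight
        else:
            # Evict the smallest counter; the newcomer inherits its count,
            # so estimates never fall below the true total.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + weight

    def _floor(self):
        """Upper bound on the count of any item this summary does not hold."""
        full = len(self.counts) >= self.capacity
        return min(self.counts.values()) if full else 0

    def merge(self, other):
        """Combine two summaries; items missing on one side get its floor."""
        f1, f2 = self._floor(), other._floor()
        combined = {
            item: self.counts.get(item, f1) + other.counts.get(item, f2)
            for item in self.counts.keys() | other.counts.keys()
        }
        ranked = sorted(combined.items(), key=lambda kv: -kv[1])
        merged = SpaceSaving(self.capacity)
        merged.counts = dict(ranked[: self.capacity])
        return merged

    def top(self, n):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]
```

Stored as a measure, one such summary per dimension combination lets a top-N query be answered by merging a few summaries instead of sorting raw rows, at the cost of approximate counts for items near the cutoff.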
ODBC Enhancement
• Works with Tableau 9.1
• Works with MS Excel
• Works with MS Power BI
Zeppelin Integration
Summary
• New in Apache Kylin
  • Pluggable architecture
  • New MR Cube Engine with fast cubing (1.5x faster)
  • New HBase Storage with parallel scan (2x faster)
  • Near real-time analysis (experimental)
  • User defined aggregations
  • Excel / Power BI / Zeppelin integration
Thanks!
http://kylin.apache.org

Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

The Evolution of Apache Kylin

  • 1. The Evolution of Apache Kylin Realtime & Plugin Architecture in Kylin Li, Yang | 李扬 Co-founder & CTO at Kyligence Inc.
  • 2. Agenda
      - What’s Apache Kylin?
      - New Features in Kylin
      - Plugin Architecture
      - Fast Cubing
      - Parallel Scan
      - Streaming Cubing
      - User Defined Aggregation
      - Summary
  • 3. What’s Kylin: Extreme OLAP Engine for Big Data
      - Kylin is an open source distributed analytics engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets
      - kylin / ˈkiːˈlɪn / 麒麟, n. (in Chinese art): a mythical animal of composite form
      - Nov 2014: Apache Incubator Project
      - Nov 2015: Apache Top Level Project
  • 4. Feature – SQL Interface: Hive Table → Build Cube (Index) → SQL Query
  • 5. Feature – Big Data (eBay)
      Case                 | Cube Size | Raw Records
      Session Analysis     | 20 TB     | 81+ billion rows
      Traffic Analysis     | 30 TB     | 28+ billion rows
      Transaction Analysis | 560 GB    | 1.2+ billion rows
  • 6. Feature – Low Latency
      - 90% of queries return in under 5 s
      - Chart legend: dark-blue line, 90th-percentile queries; light-blue line, 95th-percentile queries
      - The 90th-percentile query returns in 3 seconds
  • 7. Feature – BI Integration via ODBC, JDBC
  • 8. Feature – Scalable Throughput: linear scale-out with more nodes
  • 9. Agenda
      - What’s Apache Kylin?
      - New Features in Kylin
      - Plugin Architecture
      - Fast Cubing
      - Parallel Scan
      - Streaming Cubing
      - User Defined Aggregation
      - Summary
  • 10. Plugin Architecture Overview
      - Diagram components: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC; REST Server; Query Engine; Routing; Metadata; Cube Builder (MapReduce…); DataSource Abstraction; Engine Abstraction; Storage Abstraction
      - Data: Star Schema Data (Hadoop, Hive); Key Value Data (OLAP Cubes in HBase); Data Cube
      - Data flows: Online Analysis Data Flow (SQL, low latency: seconds); Offline Data Flow
      - Clients/users interact with Kylin via SQL
      - OLAP cube is transparent to users
  • 11. Plugin Architecture
      - Diagram: Hive Source → IN → MR Engine → OUT → HBase Storage
      - Cube Metadata
      - SourceFactory, EngineFactory, StorageFactory
  • 12. Plugin Architecture
      - Diagram: Hive Source → Hive Adapter (load data, adapt to IN) → MR Engine → HBase Adapter (save cube, adapt to OUT) → HBase Storage
  • 13. Developing Modules
      - Engine: MR V1; MR V2; Spark (early); Streaming (experimental)
      - Source: Hive; Kafka; Spark SQL & DataFrames
      - Storage: HBase; ? Kudu; ? Cassandra
  • 14. The Freedom, Extensibility, Flexibility
      - Freedom: zoo break, not bound to Hadoop any more; free to go to a better engine or storage
      - Extensibility: accept any input, e.g. Kafka; embrace next-gen distributed platforms, e.g. Spark
      - Flexibility: choose a different engine for a different data set
  • 15. Layered Cubing (MR Engine V1)
      - Diagram: Full Data → 4-D Cuboid → 3-D Cuboid → 2-D Cuboid → 1-D Cuboid → 0-D Cuboid, one MR round per arrow; the 4-D cuboid A,B,C,D feeds the 3-D cuboids A,B,C; A,B,D; A,C,D; B,C,D
      - Pros: simple implementation that relies on the MR shuffle to merge-sort and then aggregate; low memory requirement
      - Cons: aggregation happens on the reducer side; mappers output raw data, so the shuffle is huge; multiple rounds of MR overhead; the shuffle can be 100x the cube size, which puts heavy pressure on I/O
  • 16. Fast Cubing
      - Diagram: Data Split → mapper (one per split) → Merge Sort (Shuffle) → reducer → Final Cube
      - Pros: in-memory cubing algorithm that can be reused by Streaming, Spark, etc.; mapper-side aggregation; less shuffling given the right data split; one round of MR
      - Cons: code complexity; high mapper CPU/memory consumption
  • 17. Fast Cubing (MR Engine V2)
      - If data splits are unique, fast cubing wins
      - If data splits are common, layered cubing wins
      - The new cube engine chooses the right algorithm based on data sampling
      - Overall build time is 1.5x faster (summed results from 500 jobs)
  • 18. Parallel Scan
      - Slow queries are 5-10x faster
      - New HBase storage enables partitioning of cuboids that are big enough
      - Overall query time is 2x faster than before (summed results from 10,000+ queries)
      - Diagram labels: Query; Cuboid A, Cuboid B, Cuboid C; A1, B1, A2, B2, A3, C; Server 1, Server 2, Server 3 (shown twice)
  • 19. Near Realtime Incremental Build
      - Minutes micro cubes
      - Kafka source
      - In-mem cubing
      - Auto merge
  • 20. Lambda Architecture for Realtime
      - Diagram labels: Kafka; streaming; minute batch; Cube; Cube Storage; Real-time In-Mem Store; Inverted Index; Latest second; Hybrid Storage Interface; SQL Query; Future
  • 21. Use Case: SEO Operational Dashboard
      - Dimensions
          eBay Site: ebay.com, ebay.co.uk, ebay.de
          Buyer Country: US, CN, RU
          Search Engine: Google, Bing, Yahoo!
          Referrer: google.com, google.co.uk
          Page: Search, View Item, Product
          User Experience: Desktop, Mobile APP, mWeb
      - Measurements: visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items, etc.
  • 22. User Defined Aggregation Types
      - HyperLogLog Count Distinct
      - TopN
      - BitMap Precise Count Distinct, from Sun, Yerui (netease.com)
      - Raw Records, from Wang, Xiaoyu (jd.com)
      - Domain-specific aggregations now become easy: aggregate user events to detect time series or access patterns; draw a sketch of certain user groups; pre-calculate clusters of data points; histogram…
  • 23. TopN Support
      - Query: select dt, loc, item, sum(gmv) from test_kylin_fact where dt='2015-10-1' and loc='CN' group by dt, loc, item order by 4 desc limit 100
      - Cube pre-calculation: for DT,LOC = 2015-10-1,CN the TopN measure holds Item A, $500; Item B, $300; …
      - TopN as a measure
      - Approximate algorithm: SpaceSaving TopN. Ahmed Metwally, et al. “Efficient computation of frequent and top-k elements in data streams”. Proceedings of the 10th International Conference on Database Theory (ICDT'05), 2005.
      - A parallel version: Massimo Cafaro, et al. “A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution”. arXiv: 1401.0702v12 [cs.DS], 19 Sept 2015.
      - Answer TopN queries directly from pre-calculation
  • 24. ODBC Enhancement
      - Works with Tableau 9.1
      - Works with MS Excel
      - Works with MS Power BI
  • 26. Agenda
      - What’s Apache Kylin?
      - New Features in Kylin
      - Plugin Architecture
      - Fast Cubing
      - Parallel Scan
      - Streaming Cubing
      - User Defined Aggregation
      - Summary
  • 27. Summary: New in Apache Kylin
      - Plugin-able architecture
      - New MR Cube Engine with fast cubing (1.5x faster)
      - New HBase Storage with parallel scan (2x faster)
      - Near real-time analysis (experimental)
      - User defined aggregations
      - Excel / PowerBI / Zeppelin integration
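
Slides 15 to 17 contrast layered cubing with fast cubing, where each mapper aggregates its own data split in memory and the reducer only merges pre-aggregated cells. The following is a minimal Python sketch of that idea, not Kylin's implementation: the function names (`aggregate`, `build_cube`, `merge_cubes`), the sample rows, and the single summed measure are all assumptions made for illustration; only the dimension names A, B, C, D come from the slides.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("A", "B", "C", "D")  # dimension names as on slide 15


def aggregate(rows, dims):
    """Group (record, measure) pairs by the given dimensions and sum the measure."""
    totals = defaultdict(int)
    for record, measure in rows:
        totals[tuple(record[d] for d in dims)] += measure
    return dict(totals)


def build_cube(rows):
    """In-memory cubing of one data split: one cuboid per subset of DIMS."""
    return {
        dims: aggregate(rows, dims)
        for n in range(len(DIMS) + 1)
        for dims in combinations(DIMS, n)
    }


def merge_cubes(cubes):
    """Reducer-side merge: add up pre-aggregated cells coming from every split."""
    merged = defaultdict(lambda: defaultdict(int))
    for cube in cubes:
        for dims, cuboid in cube.items():
            for key, value in cuboid.items():
                merged[dims][key] += value
    return {dims: dict(cells) for dims, cells in merged.items()}


# Hypothetical sample data: (dimension values, measure)
rows = [
    ({"A": "a1", "B": "b1", "C": "c1", "D": "d1"}, 10),
    ({"A": "a1", "B": "b1", "C": "c1", "D": "d2"}, 5),
    ({"A": "a2", "B": "b1", "C": "c2", "D": "d1"}, 7),
    ({"A": "a1", "B": "b2", "C": "c1", "D": "d1"}, 3),
]
splits = [rows[:2], rows[2:]]  # two "mapper" inputs
cube = merge_cubes(build_cube(split) for split in splits)
print(len(cube))     # number of cuboids, 2 ** len(DIMS)
print(cube[("A",)])  # the 1-D cuboid on dimension A
```

Because the measure is a plain sum, merging the per-split cubes gives the same result as cubing all rows at once. What gets shuffled is the aggregated cells rather than the raw rows, which is the saving the slides attribute to mapper-side aggregation.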

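Slide 23 names SpaceSaving (Metwally et al.) as the approximate algorithm behind the TopN measure. A minimal sketch of the counter-replacement idea follows, under stated assumptions: the class name `SpaceSaving`, its `offer` and `top` methods, the `capacity` parameter, and the sample events are hypothetical, and weighting each update by a GMV amount rather than by 1 is a generalization chosen to match the slide's example. It is not Kylin's code.

```python
class SpaceSaving:
    """Approximate top-N counter that tracks at most `capacity` items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def offer(self, item, weight=1):
        if item in self.counts:
            self.counts[item] += weight
        elif len(self.counts) < self.capacity:
            self.counts[item] = weight
        else:
            # Evict the smallest counter; the newcomer inherits its count,
            # so an estimate can overshoot but never undershoot the true total.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + weight

    def top(self, n):
        """Return the n largest (item, estimated total) pairs, largest first."""
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical (item, gmv) events for one (dt, loc) cell
events = [
    ("Item A", 200),
    ("Item B", 300),
    ("Item C", 20),
    ("Item A", 300),
    ("Item D", 5),
]
sketch = SpaceSaving(capacity=3)
for item, gmv in events:
    sketch.offer(item, gmv)
print(sketch.top(2))
```

With the capacity bounded, memory per cube cell stays fixed however many distinct items arrive. The price is that items in the tail, such as the evicted "Item C", lose their exact totals.
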
Editor's notes

  1. OLAP, big data; vs. Ubuntu Kylin. eBay's first open source project contributed to Apache, and also the first project contributed to Apache entirely by a team from China.
  2. Introduce query: one machine running 4 Tomcat instances can reach roughly 300 QPS.
  3. A high-level architecture for Kylin, which is a standard MOLAP architecture built on Hadoop. The data source used to build the MOLAP cubes is primarily Hive. A project is in the works for a Storage Abstraction Layer that will support other NoSQL stores such as Cassandra/Couchbase. An Engine Abstraction maintains the cube metadata and a Cube Builder, which today is a set of MapReduce jobs that build the cubes. A storage layer stores the cubes in HBase, primarily through a bulk load of the aggregates into HBase. We are looking for active community participation to build out additional Data Source, Engine and Storage plugins for Kylin. A Query Engine indexes directly into the multi-dimensional arrays built in HBase.
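
The source, engine and storage abstractions described in note 3 (and drawn on slides 11 and 12) can be pictured as three small interfaces plus a lookup that plays the role of the factories. The sketch below is illustrative only: every class and function name in it (`Source`, `Storage`, `Engine`, `ListSource`, `ListStorage`, `SumEngine`, `create_engine`) is hypothetical and is not Kylin's actual API, and the in-memory classes merely stand in for the Hive adapter, HBase adapter and MR engine.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    @abstractmethod
    def load(self):
        """Return input rows for the engine (the IN side)."""


class Storage(ABC):
    @abstractmethod
    def save(self, cube):
        """Persist a built cube (the OUT side)."""


class Engine(ABC):
    @abstractmethod
    def build(self, source, storage):
        """Read from the source, build a cube, write it to storage."""


class ListSource(Source):  # stand-in for a Hive adapter
    def __init__(self, rows):
        self.rows = rows

    def load(self):
        return list(self.rows)


class ListStorage(Storage):  # stand-in for an HBase adapter
    def __init__(self):
        self.cubes = []

    def save(self, cube):
        self.cubes.append(cube)


class SumEngine(Engine):  # stand-in for the MR engine
    def build(self, source, storage):
        totals = {}
        for key, value in source.load():
            totals[key] = totals.get(key, 0) + value
        storage.save(totals)


# Hypothetical registry playing the role of an engine factory
ENGINES = {"sum": SumEngine}


def create_engine(name):
    return ENGINES[name]()


storage = ListStorage()
source = ListSource([("CN", 5), ("US", 2), ("CN", 1)])
create_engine("sum").build(source, storage)
print(storage.cubes)
```

Since the engine only sees the `load` and `save` contracts, swapping Hive for Kafka or HBase for another store means adding an adapter, not touching the engine.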