Using Apache Kylin for Large Scale
Data Analytics
Apache Big Data Europe 2015
http://kylin.io
Agenda
• Big Data @ eBay
• What’s Apache Kylin?
• Cube Segments & Streaming
• Q&A
Big Data @ eBay
$28B GMV VIA MOBILE (2014)
266M MOBILE APP GLOBALLY
1B LISTINGS CREATED VIA MOBILE
157M ACTIVE BUYERS
25M ACTIVE SELLERS
800M ACTIVE LISTINGS
8.8M NEW LISTINGS EVERY WEEK
[Figure labels: CHECKOUT, WATCH LIST, SHARE, LOCAL, PREFERENCES, BID, CLICK, SEARCH]
100’s PB
9 Quarters of historical data
100M QUERIES/MONTH
20TB Behavior Data Capture every day
Hadoop @ eBay
[Timeline figure: 2007, 2009, 2010, 2011, 2012, 2013/2014]
Extreme OLAP Engine for Big Data
Apache Kylin is an open source distributed analytics engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014
Business Needs for Big Data Analysis
• Sub-second query latency on billions of rows
• ANSI SQL for both analysts and engineers
• Full OLAP capability to offer advanced functionality
• Seamless integration with BI tools
• Support of high cardinality and high dimensions
• High concurrency – thousands of end users
• Distributed and scale-out architecture for large data volume
• Extremely Fast OLAP Engine at Scale
– Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
• ANSI SQL Interface on Hadoop
– Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
• Seamless Integration with BI Tools
– Kylin currently offers integration capability with BI tools like Tableau
• Interactive Query Capability
– Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
• MOLAP Cube
– Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records
Features Highlights
Apache Kylin Adoption
External
25+ contributors in community
Adoption:
In Production: eBay, Baidu Maps
Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
eBay Use Cases
Case                    Cube Size   Raw Records
User Session Analysis   14+ TB      75+ billion rows
Traffic Analysis        21+ TB      20+ billion rows
Behavior Analysis       560+ GB     1.2+ billion rows
Cube Designer
Job Management
Query and Visualization
Tableau Integration
Kylin Architecture Overview
[Architecture diagram. Labels recovered from the figure:]
– Clients: 3rd Party App (Web App, Mobile…); SQL-Based Tool (BI Tools: Tableau…)
– Access: REST API, JDBC/ODBC, SQL
– Kylin: REST Server, Query Engine, Routing, Metadata, Cube Builder (MapReduce…)
– Abstractions: DataSource Abstraction, Engine Abstraction, Storage Abstraction
– Hadoop: Hive (Star Schema Data); OLAP Cubes in HBase (Key Value Data); Data Cube
– Data flows: Online Analysis Data Flow (Low Latency - Seconds); Offline Data Flow
• Clients/users interact with Kylin via SQL
• OLAP cube is transparent to users
Cube Storage in HBase
OLAP Cube – Balance between Space and Time
0-D (apex) cuboid: all dimensions aggregated
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
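The cuboid lattice can be enumerated mechanically. The following is a minimal Python sketch, assuming the four example dimensions from the slide; the function name enumerate_cuboids is illustrative and not part of Kylin.

```python
from itertools import combinations

# The four example dimensions used on the slide.
DIMENSIONS = ("time", "item", "location", "supplier")

def enumerate_cuboids(dimensions):
    """Return every cuboid (dimension combination), grouped by level k (k-D cuboids)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = enumerate_cuboids(DIMENSIONS)
for level, cuboids in lattice.items():
    print(f"{level}-D: {len(cuboids)} cuboid(s)")

# The cube is the set of all cuboids across all levels.
total = sum(len(cuboids) for cuboids in lattice.values())
print("cube size:", total, "cuboids")
```

For four dimensions this yields one apex cuboid, one base cuboid, and every combination in between.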
• Dynamic data management framework.
• Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.
• http://optiq.incubator.apache.org
Querying the Cube
• Metadata SPI
– Provide table schema from Kylin metadata
• Optimization Rules
– Translate the logical operators into Kylin operators
• Relational Operators
– Find the right cube
– Translate SQL into storage engine API calls
– Generate the physical execution plan using the linq4j Java implementation
• Result Enumerator
– Translate storage engine results into Java implementation results
• SQL Function
– Add HyperLogLog for distinct count
– Implement date/time-related functions (e.g. Quarter)
Query Engine based on Calcite
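To illustrate why HyperLogLog suits pre-aggregated distinct counts, here is a minimal Python sketch of the general technique. It is not Kylin's implementation: the class name, the choice of SHA-1 as the hash, and the register count are all assumptions made for illustration. The key property is that two sketches merge by taking the element-wise maximum of their registers, so partial results can be combined without revisiting raw rows.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash (assumed hash choice)
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise maximum of the registers."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return round(self.m * math.log(self.m / zeros))  # small-range correction
        return round(raw)

# Two partial sketches (for example, two cube segments) and one over all values.
part_a, part_b, whole = HyperLogLog(), HyperLogLog(), HyperLogLog()
for user_id in range(25000):
    part_a.add(user_id)
    whole.add(user_id)
for user_id in range(25000, 50000):
    part_b.add(user_id)
    whole.add(user_id)

part_a.merge(part_b)
print("merged estimate:", part_a.count(), "single-sketch estimate:", whole.count())
```

The merged sketch carries exactly the same registers as the sketch built over all values, which is what makes the estimate usable as a stored measure.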
• Full Cube
– Pre-aggregate all dimension combinations
– “Curse of dimensionality”: an N-dimension cube has 2^N cuboids
• Partial Cubes
– To avoid dimension explosion, we divide the dimensions into different aggregation groups
– 2^(N+M+L) → 2^N + 2^M + 2^L
– For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
– 2^30 → 2^10 + 2^10 + 2^10
• Tradeoff between online aggregation and offline pre-aggregation
Optimizing Cube Builds – Full Cube vs. Partial Cube
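The cuboid arithmetic for full versus partial cubes can be checked with a few lines of Python. This is a sketch of the counting formula only; the function names are illustrative.

```python
def full_cube_cuboids(num_dimensions):
    """A full cube over N dimensions has 2**N cuboids."""
    return 2 ** num_dimensions

def partial_cube_cuboids(group_sizes):
    """With aggregation groups, each group contributes 2**size cuboids."""
    return sum(2 ** size for size in group_sizes)

full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(f"full cube: {full:,} cuboids")
print(f"3 groups of 10: {partial:,} cuboids")
```

Queries that combine dimensions from different groups are no longer fully pre-computed, which is the online versus offline tradeoff the slide mentions.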
Optimizing Cube Builds: Partial Cubes
Optimizing Cube Building – Incremental Building
Optimizing Cube Builds: Improved Parallelism
• The current algorithm
– Many MR jobs, as many as the number of dimensions
– Huge shuffles (about 100x the total cube size), with aggregation on the reduce side
[Figure: Full Data and the 0-D, 1-D, 2-D, 3-D, and 4-D cuboids, connected by five MR jobs]
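The layered build can be sketched in plain Python. This is a simplified single-process illustration, not Kylin's MapReduce code: each loop iteration stands in for one MR round, the measure is a simple sum, and the sample rows are made up.

```python
from itertools import combinations

DIMS = ("time", "item", "location", "supplier")

def rollup(cells, parent, child):
    """Aggregate (key, measure) pairs from a parent cuboid down to a child cuboid."""
    keep = [parent.index(d) for d in child]
    out = {}
    for key, measure in cells:
        child_key = tuple(key[i] for i in keep)
        out[child_key] = out.get(child_key, 0) + measure
    return out

def build_by_layer(facts):
    """Build all cuboids layer by layer; each layer stands in for one MR round."""
    n = len(DIMS)
    cube = {DIMS: rollup(facts, DIMS, DIMS)}  # round 1: base cuboid from raw data
    rounds = 1
    for k in range(n - 1, -1, -1):            # one further round per lower layer
        for child in combinations(DIMS, k):
            # Any already-built cuboid one level up that contains the child will do.
            parent = next(p for p in cube if len(p) == k + 1 and set(child) <= set(p))
            cube[child] = rollup(cube[parent].items(), parent, child)
        rounds += 1
    return cube, rounds

# Hypothetical sample rows: (dimension values, measure).
facts = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 2),
    (("9/16", "milk", "Urbana", "Dairy_land"), 4),
]
cube, rounds = build_by_layer(facts)
print("rounds:", rounds, "cuboids:", len(cube))
print("apex:", cube[()])
print("<item, location>:", cube[("item", "location")])
```

Every round re-reads and re-shuffles the output of the previous one, which is where the large shuffle volume comes from.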
Optimizing Cube Builds: Improved Parallelism
• The to-be algorithm, 30%-50% faster
– One MR round
– Reduced shuffles (about 20x the total cube size), with map-side aggregation
– Hourly incremental build done in tens of minutes
[Figure: each mapper turns one data split into a cube segment; the cube segments are merge-sorted (shuffle) into the final cube]
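The one-round idea can be sketched as follows: each mapper pre-aggregates every cuboid for its own data split, and the merge step only sums matching cells. This is a simplified in-memory illustration under assumed names (cube_segment, merge_segments) and made-up rows; it is not Kylin's actual build code.

```python
from itertools import combinations

DIMS = ("time", "item", "location", "supplier")

def cube_segment(rows):
    """Mapper side: aggregate every cuboid for one data split in memory."""
    segment = {}
    for values, measure in rows:
        record = dict(zip(DIMS, values))
        for k in range(len(DIMS) + 1):
            for cuboid in combinations(DIMS, k):
                key = (cuboid, tuple(record[d] for d in cuboid))
                segment[key] = segment.get(key, 0) + measure
    return segment

def merge_segments(segments):
    """Merge side: combine pre-aggregated segments by summing matching cells."""
    final = {}
    for segment in segments:
        for key, measure in segment.items():
            final[key] = final.get(key, 0) + measure
    return final

# Hypothetical sample rows: (dimension values, measure).
rows = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 2),
    (("9/16", "milk", "Urbana", "Dairy_land"), 4),
    (("9/16", "bread", "Urbana", "Bakers"), 1),
]
splits = [rows[:2], rows[2:]]  # two data splits, one per mapper
final_cube = merge_segments(cube_segment(split) for split in splits)
print("apex total:", final_cube[((), ())])
print("cells in final cube:", len(final_cube))
```

Because aggregation happens before the shuffle, only already-summarized cells cross the network.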
Optimizing Cube Builds: Streaming
[Figure labels: Kafka Topics, Recent Data, Historical Data, Inverted Index, Cube Segments, Hybrid Storage Interface, SQL Query]
Use Case: SEO Operational Dashboard
Dimensions
• eBay Site
– ebay.com, ebay.co.uk, ebay.de
• Buyer Country
– US, CN, RU
• Search Engine
– Google, Bing, Yahoo!
• Referrer
– google.com, google.co.uk
• Page
– Search, View Item, Product
• User Experience
– Desktop, Mobile APP, mWeb
Measurements
• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.
Q & A
http://kylin.io
Agenda
• What’s Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
Cube Storage in HBase
• Pluggable storage engine
– Common iterator interface for storage engine
– Isolate query engine from underlying storage
• Translate cube query into HBase table scan
– Columns, Groups -> Cuboid ID
– Filters -> Scan Range (Row Key)
– Aggregations -> Measure Columns (Row Values)
• Scan HBase table and translate HBase result into cube result
– HBase Result (key + value) -> Cube Result (dimensions + measures)
How to Query Cube?
Storage Engine
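The query-to-scan translation can be sketched with an in-memory dictionary standing in for the HBase table. Everything here is a simplifying assumption for illustration: the row-key layout (cuboid ID followed by dimension values), the bitmask encoding, the function names, and the sample rows. No HBase API is used.

```python
DIMS = ("time", "item", "location", "supplier")  # assumed row-key dimension order

def cuboid_id(columns):
    """Columns, Groups -> Cuboid ID: one bit per dimension, first dimension highest."""
    cid = 0
    for d in DIMS:
        cid = (cid << 1) | (1 if d in columns else 0)
    return cid

def cuboid_dims(cid):
    """Dimensions present in a cuboid, in row-key order."""
    return [d for i, d in enumerate(DIMS) if (cid >> (len(DIMS) - 1 - i)) & 1]

def scan_prefix(cid, filters):
    """Filters -> Scan Range: cuboid ID plus leading dimensions fixed by equality."""
    prefix = [cid]
    for d in cuboid_dims(cid):
        if d not in filters:
            break
        prefix.append(filters[d])
    return tuple(prefix)

def query(table, group_by, filters):
    """Scan the matching rows and turn (key + value) into (dimensions + measure)."""
    cid = cuboid_id(set(group_by) | set(filters))
    dims = cuboid_dims(cid)
    prefix = scan_prefix(cid, filters)
    result = {}
    for row_key, measure in table.items():  # a real scan would seek to the range
        if row_key[:len(prefix)] != prefix:
            continue
        record = dict(zip(dims, row_key[1:]))
        if any(record[d] != v for d, v in filters.items()):
            continue
        out_key = tuple(record[d] for d in group_by)
        result[out_key] = result.get(out_key, 0) + measure
    return result

# Hypothetical pre-aggregated rows: (cuboid ID, dimension values...) -> SUM measure.
table = {
    (0b1100, "9/15", "milk"): 5,
    (0b1100, "9/15", "bread"): 1,
    (0b1100, "9/16", "milk"): 4,
    (0b0100, "milk"): 9,
    (0b0100, "bread"): 1,
}
# SELECT item, SUM(measure) ... WHERE time = '9/15' GROUP BY item
print(cuboid_id({"time", "item"}))
print(query(table, ["item"], {"time": "9/15"}))
```

The query only touches rows of the <time, item> cuboid whose key starts with the filtered time value.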
• Compression and Encoding Support
• Incremental Refresh of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Leverage HBase Coprocessor to reduce query latency
• Job Management and Monitoring
• Easy Web interface to manage, build, monitor and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration
Features Highlights…
Data Modeling
[Figure: Source (Star Schema: a Fact table with Dim tables) -> Mapping (Cube Metadata) -> Target (HBase Storage: Row Key, Column Family, Column; rows A to C with values Val 1 to Val 3)]
Cube Metadata:
– Cube: …
– Fact Table: …
– Dimensions: …
– Measures: …
– Storage (HBase): …
Roles: End User, Cube Modeler, Admin
• Data of different ages has different analysis needs, resulting in different storage architectures
• Inverted index may keep data long enough to support row-level analysis
Optimizing Cube Builds: Streaming
[Figure: streaming data is cubed from timestamp (ts) to minute to hour granularity]
Data age    Storage          Location    Granularity
Last Hour   inverted index   in memory   row level
Last Week   cube segments    in memory   minute granularity
Months      cube segments    on disk     hour granularity
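The age-based tiering can be expressed as a small lookup. This is a hedged sketch: the three tiers come from the slide, but the exact cut-off values (one hour, one week) and the function name tier_for are assumptions for illustration.

```python
from datetime import timedelta

# (maximum age, storage, location, granularity); None means "anything older".
TIERS = [
    (timedelta(hours=1), "inverted index", "in memory", "row level"),
    (timedelta(weeks=1), "cube segments", "in memory", "minute granularity"),
    (None, "cube segments", "on disk", "hour granularity"),
]

def tier_for(age):
    """Pick the storage tier that would serve data of the given age."""
    for max_age, storage, location, granularity in TIERS:
        if max_age is None or age <= max_age:
            return storage, location, granularity

for age in (timedelta(minutes=5), timedelta(days=3), timedelta(days=60)):
    print(age, "->", tier_for(age))
```

A query spanning several ages would read from more than one tier and combine the results.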
Analytics Query Taxonomy
• High Level Aggregation: very high level, e.g. GMV by site by vertical by weeks
• Analysis Query: middle level, e.g. GMV by site by vertical, by category (level x), past 12 weeks
• Drill Down to Detail: detail level (summary table)
• Low Level Aggregation: first-level aggregation
• Transaction Level: transaction data
[Figure labels: Strategy, Operation, Transaction; OLAP, OLTP]
Kylin is designed to accelerate 80+% of analytics queries on Hadoop
Cube Build Job Flow
• Huge volume data
– Table scan
• Big table joins
– Data shuffling
• Analysis on different granularity
– Runtime aggregation expensive
• Map Reduce job
– Batch processing
Technical Challenges
Big Data Era
• More and more data becoming available on Hadoop
• Limitations in existing Business Intelligence (BI) Tools
– Limited support for Hadoop
– Data size growing exponentially
– High latency of interactive queries
– Scale-Up architecture
• Challenges to adopt Hadoop as interactive analysis system
– Majority of analyst groups are SQL savvy
– No mature SQL interface on Hadoop
– OLAP capability on Hadoop ecosystem not ready yet
Streaming Cubes
• Cube takes time to build, how about real-time
analysis?
• Sometimes we want to drill down to row level
information
• Streaming Cubing
– Build micro cube segments from streaming
– Use inverted index to capture last minute data
Query Engine – Kylin Explain Plan
Streaming…
[Figure labels: Historic Store, Real-time Store, streaming, Kafka, SQL Query, hourly/daily batch, minutes batch, Inverted Index, Hybrid Storage Interface, Cube]
Kylin vs. Hive
#  Query Type              Return Dataset  Query on Kylin (s)  Query on Hive (s)  Comments
1  High Level Aggregation  4               0.129               157.437            1,217 times
2  Analysis Query          22,669          1.615               109.206            68 times
3  Drill Down to Detail    325,029         12.058              113.123            9 times
4  Drill Down to Detail    524,780         22.42               6383.21            278 times
5  Data Dump               972,002         49.054              N/A
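The speedup column can be recomputed from the reported timings. This is a small Python check using only the numbers in the table; note that the recomputed ratios for queries 1 and 4 come out slightly different from the figures on the slide, which does not say how its values were derived.

```python
# (query number, seconds on Kylin, seconds on Hive), as reported in the table.
timings = [
    (1, 0.129, 157.437),
    (2, 1.615, 109.206),
    (3, 12.058, 113.123),
    (4, 22.42, 6383.21),
]

# Speedup = Hive time divided by Kylin time.
speedups = {number: hive / kylin for number, kylin, hive in timings}
for number, ratio in speedups.items():
    print(f"Query {number}: {ratio:.0f} times")
```

Query 5 is omitted because no Hive timing is reported for it.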
[Chart: Hive vs. Kylin for SQL #1, SQL #2, and SQL #3 (axis 0 to 200), shown against the query taxonomy levels: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level]
Based on 12+B records case
Performance -- Concurrency
Linear scale out with more nodes
Performance - Query Latency
90%tile queries <5s
Green Line: 90%tile queries
Gray Line: 95%tile queries
Kylin Evolution Roadmap
• Kylin Core
– Fundamental framework of the Kylin OLAP engine
• Extension
– Plugins to support additional functions and features
• Integration
– Lifecycle management support to integrate with other applications
• Interface
– Allows third-party users to build more features via a user interface atop Kylin core
• Driver
– ODBC and JDBC drivers
Kylin OLAP Core
Extension
• Security
• Redis Storage
• Spark Engine
• Docker
Interface
• Web Console
• Customized BI
• Ambari/Hue Plugin
Integration
• ODBC Driver
• ETL
• Drill
• SparkSQL
Kylin Ecosystem
• Cubing Efficiency
– MR is not the optimal framework
– Spark Cubing Engine
• Source from SparkSQL
– Read data from SparkSQL instead of Hive
• Route to SparkSQL
– Unsupported queries can be covered by SparkSQL
Adding Spark Support
• Hive
– Input source
– Pre-join star schema during cube building
• MapReduce
– Pre-aggregate metrics during cube building
• HDFS
– Store intermediate files during cube building
• HBase
– Store data cube
– Serve queries on data cube
– Coprocessor is used for query processing
• Calcite
– Query parsing
– Query execution
The DRY Principle
If you want to go fast, go alone.
If you want to go far, go together.
--African Proverb
http://kylin.io

Apache Kylin @ Big Data Europe 2015

  • 1. Using Apache Kylin for Large Scale Data Analytics Apache Big Data Europe 2015
  • 2. http://kylin.io Agenda  Big Data @ eBay  What’s Apache Kylin?  Cube Segments & Streaming  Q&A
  • 3. Big Data @ eBay $28B GMV VIA MOBILE (2014) 266M MOBILE APP GLOBALLY 1B LISTINGS CREATED VIA MOBILE 157M ACTIVE BUYERS 25M ACTIVE SELLERS 800M ACTIVE LISTINGS 8.8M NEW LISTINGS EVERY WEEK
  • 4. CHECKOUT WATCH LIST SHARE LOCAL PREFERENCES BID 100’s PB 9 Quarters of historical data 100M QUERIES/MONTH 20TB Behavior Data Capture every day CLICK SEARCH
  • 6. Extreme OLAP Engine for Big Data Apache Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets What’s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form • Open Sourced on Oct 1st, 2014 • Accepted as Apache Incubator Project on Nov 25th, 2014
  • 7. Business Needs for Big Data Analysis  Sub-second query latency on billions of rows  ANSI SQL for both analysts and engineers  Full OLAP capability to offer advanced functionality  Seamless Integration with BI Tools  Support of high cardinality and high dimensions  High concurrency – thousands of end users  Distributed and scale out architecture for large data volume
  • 8. Features Highlights • Extremely Fast OLAP Engine at Scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data • ANSI SQL Interface on Hadoop: Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions • Seamless Integration with BI Tools: Kylin currently offers integration capability with BI tools like Tableau • Interactive Query Capability: users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset • MOLAP Cube: users can define a data model and pre-build it in Kylin with 10+ billion raw data records
  • 9. Apache Kylin Adoption. External: 25+ contributors in community. In production: eBay, Baidu Maps. Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
      eBay Use Cases:
      Case | Cube Size | Raw Records
      User Session Analysis | 14+ TB | 75+ billion rows
      Traffic Analysis | 21+ TB | 20+ billion rows
      Behavior Analysis | 560+ GB | 1.2+ billion rows
  • 14. Kylin Architecture Overview (diagram). Clients: 3rd-party apps (web app, mobile…) via REST API; SQL-based tools (BI tools: Tableau…) via JDBC/ODBC. Components: REST Server, Query Engine, Routing, Metadata, Cube Builder (MapReduce…). Layers: DataSource Abstraction (Hadoop, Hive; star schema data), Engine Abstraction, Storage Abstraction (OLAP cubes in HBase; key-value data). Flows: online analysis data flow (SQL, low latency, seconds) and offline data flow. • Clients/users interact with Kylin via SQL • OLAP cube is transparent to users
  • 16. OLAP Cube – Balance between Space and Time. Cuboid lattice for the dimensions time, item, location, supplier: 0-D (apex) cuboid; 1-D cuboids: time | item | location | supplier; 2-D cuboids: time, item | time, location | time, supplier | item, location | item, supplier | location, supplier; 3-D cuboids: time, item, location | time, item, supplier | time, location, supplier | item, location, supplier; 4-D (base) cuboid: time, item, location, supplier. • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 2. (9/15, milk, Urbana, *) - <time, item, location> 3. (*, milk, Urbana, *) - <item, location> 4. (*, milk, Chicago, *) - <item, location> 5. (*, milk, *, *) - <item> • Cuboid = one combination of dimensions • Cube = all combinations of dimensions (all cuboids)
  • 17. Querying the Cube • Calcite: a dynamic data management framework • Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others • http://optiq.incubator.apache.org
  • 18. Query Engine based on Calcite • Metadata SPI – Provide table schema from Kylin metadata • Optimization Rules – Translate the logical operators into Kylin operators • Relational Operators – Find the right cube – Translate SQL into storage engine API calls – Generate the physical execution plan using the linq4j Java implementation • Result Enumerator – Translate the storage engine result into a Java implementation result • SQL Function – Add HyperLogLog for distinct count – Implement date/time related functions (e.g. Quarter)
  • 19. Optimizing Cube Builds – Full Cube vs. Partial Cube • Full Cube: pre-aggregate all dimension combinations. “Curse of dimensionality”: an N-dimension cube has 2^N cuboids. • Partial Cubes: to avoid dimension explosion, we divide the dimensions into different aggregation groups: 2^(N+M+L) → 2^N + 2^M + 2^L. For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand: 2^30 → 2^10 + 2^10 + 2^10. • Tradeoff between online aggregation and offline pre-aggregation
  • 20. Optimizing Cube Builds: Partial Cubes
  • 21. Optimizing Cube Building – Incremental Building
  • 22. Optimizing Cube Builds: Improved Parallelism • The current algorithm – Many MR jobs, growing with the number of dimensions – Huge shuffles, aggregation at reduce side, 100x of total cube size (Diagram: Full Data and the 0-D, 1-D, 2-D, 3-D and 4-D cuboids, each built by its own MR job.)
  • 23. Optimizing Cube Builds: Improved Parallelism • The to-be algorithm, 30%-50% faster – 1 round of MR – Reduced shuffles, map-side aggregation, 20x total cube size – Hourly incremental build done in tens of minutes (Diagram: each mapper turns a data split into a cube segment; a merge sort (shuffle) combines the segments into the final cube.)
  • 24. Optimizing Cube Builds: Streaming (Diagram labels: Kafka Topics; Recent Data, Inverted Index; Historical Data, Cube Segments; Hybrid Storage Interface; SQL Query.)
  • 25. Use Case: SEO Operational Dashboard. Dimensions: • eBay Site – ebay.com, ebay.co.uk, ebay.de • Buyer Country – US, CN, RU • Search Engine – Google, Bing, Yahoo! • Referrer – google.com, google.co.uk • Page – Search, View Item, Product • User Experience – Desktop, Mobile APP, mWeb. Measurements: • Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.
  • 26. Q & A
  • 27. Agenda (http://kylin.io) • What’s Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  • 29. How to Query Cube? Storage Engine • Pluggable storage engine: common iterator interface for storage engines; isolates the query engine from the underlying storage • Translate cube query into HBase table scan: Columns, Groups -> Cuboid ID; Filters -> Scan Range (Row Key); Aggregations -> Measure Columns (Row Values) • Scan HBase table and translate HBase result into cube result: HBase Result (key + value) -> Cube Result (dimensions + measures)
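The translation step on slide 29 can be sketched as follows. This is an illustration only, not Kylin's storage engine: `plan_scan`, the dimension order, and the measure names are assumptions made for the example (the dimension names are borrowed from the SEO dashboard slide), and only equality filters are handled.

```python
# Hypothetical row-key dimension order, borrowed from the SEO dashboard use case.
DIMENSIONS = ("site", "country", "search_engine")

def plan_scan(group_by, equals_filters, aggregations):
    """Map a cube query onto (cuboid, row-key scan prefix, measure columns).

    Columns and groups select the cuboid, equality filters on the leading
    row-key dimensions narrow the scan range, and aggregations name the
    measure columns to read.
    """
    needed = set(group_by) | set(equals_filters)
    cuboid = tuple(d for d in DIMENSIONS if d in needed)
    prefix = []
    for dim in cuboid:
        # Only a contiguous run of filtered leading dimensions forms a key prefix.
        if dim not in equals_filters:
            break
        prefix.append((dim, equals_filters[dim]))
    return {
        "cuboid": cuboid,
        "scan_prefix": prefix,
        "measure_columns": list(aggregations),
    }

plan = plan_scan(
    group_by=["search_engine"],
    equals_filters={"site": "ebay.com"},
    aggregations=["SUM(gmb)", "COUNT(visits)"],
)
```

Filters that cannot be folded into the key prefix would still have to be applied to the scanned rows; the sketch leaves that part out.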
  • 30. Features Highlights… • Compression and Encoding Support • Incremental Refresh of Cubes • Approximate Query Capability for distinct Count (HyperLogLog) • Leverage HBase Coprocessor for query latency • Job Management and Monitoring • Easy Web interface to manage, build, monitor and query cubes • Security capability to set ACL at Cube/Project Level • Support LDAP Integration
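  • 31. Data Modeling (diagram). Source star schema: a fact table with dimension tables. Cube metadata: Cube, Fact Table, Dimensions, Measures, Storage (HBase). Target HBase storage: rows (row A, row B, row C) with a row key and a column family holding column values (Val 1, Val 2, Val 3). The cube metadata defines the mapping from source to target. Roles: End User, Cube Modeler, Admin.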
  • 33. Optimizing Cube Builds: Streaming • Data of different ages has different analysis needs, resulting in different storage architectures • The inverted index may keep data long enough to support row-level analysis. Tiers shown in the diagram: Last Hour: inverted index in memory, row level (per timestamp); Last Week: cube segments in memory, minute granularity; Months: cube segments on disk, hour granularity. Streaming data moves from one tier to the next by cubing.
  • 34. Analytics Query Taxonomy (axis labels: Strategy, Operation, Transaction; OLAP to OLTP) • High Level Aggregation: very high level, e.g. GMV by site by vertical by weeks • Analysis Query: middle level, e.g. GMV by site by vertical, by category (level x), past 12 weeks • Drill Down to Detail: detail level (summary table) • Low Level Aggregation: first-level aggregation • Transaction Level: transaction data. Kylin is designed to accelerate the performance of 80+% of analytics queries on Hadoop.
  • 36. Technical Challenges • Huge volume data: table scan • Big table joins: data shuffling • Analysis on different granularity: runtime aggregation expensive • Map Reduce job: batch processing
  • 37. Big Data Era • More and more data becoming available on Hadoop • Limitations in existing Business Intelligence (BI) tools: limited support for Hadoop; data size growing exponentially; high latency of interactive queries; scale-up architecture • Challenges to adopting Hadoop as an interactive analysis system: the majority of analyst groups are SQL savvy; no mature SQL interface on Hadoop; OLAP capability on the Hadoop ecosystem not ready yet
  • 38. Streaming Cubes • Cube takes time to build, how about real-time analysis? • Sometimes we want to drill down to row level information • Streaming Cubing – Build micro cube segments from streaming – Use inverted index to capture last minute data
  • 39. Query Engine – Kylin Explain Plan
  • 40. Streaming… (Diagram labels: Kafka, streaming; Real-time Store, Inverted Index, minutes batch; Historic Store, Cube, hourly/daily batch; Hybrid Storage Interface; SQL Query.)
  • 42. Kylin vs. Hive (based on a 12+B records case)
      # | Query Type | Return Dataset | Query on Kylin (s) | Query on Hive (s) | Comments
      1 | High Level Aggregation | 4 | 0.129 | 157.437 | 1,217 times
      2 | Analysis Query | 22,669 | 1.615 | 109.206 | 68 times
      3 | Drill Down to Detail | 325,029 | 12.058 | 113.123 | 9 times
      4 | Drill Down to Detail | 524,780 | 22.42 | 6383.21 | 278 times
      5 | Data Dump | 972,002 | 49.054 | N/A |
      (Chart: query time for SQL #1 to SQL #3, Hive vs. Kylin. Query taxonomy shown alongside: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level.)
  • 43. Performance -- Concurrency Linear scale out with more nodes
  • 44. Performance - Query Latency 90%tile queries <5s Green Line: 90%tile queries Gray Line: 95%tile queries
  • 47. Kylin Ecosystem • Kylin Core: fundamental framework of the Kylin OLAP engine • Extension: plugins to support additional functions and features • Integration: lifecycle management support to integrate with other applications • Interface: allows third-party users to build more features via a user interface atop Kylin core • Driver: ODBC and JDBC drivers. Diagram labels: Kylin OLAP Core; Extension (Security, Redis Storage, Spark Engine, Docker); Interface (Web Console, Customized BI, Ambari/Hue Plugin); Integration (ODBC Driver, ETL, Drill, SparkSQL)
  • 48. Adding Spark Support • Cubing Efficiency: MR is not the optimal framework; Spark Cubing Engine • Source from SparkSQL: read data from SparkSQL instead of Hive • Route to SparkSQL: unsupported queries are covered by SparkSQL
  • 49. The DRY Principle • Hive: input source; pre-join star schema during cube building • MapReduce: pre-aggregate metrics during cube building • HDFS: store intermediate files during cube building • HBase: store the data cube; serve queries on the data cube; coprocessor is used for query processing • Calcite: query parsing; query execution
  • 50. If you want to go fast, go alone. If you want to go far, go together. --African Proverb http://kylin.io

Editor's Notes

  1. Hello, my name is Seshu Adunuthula and with me is Branky Shao. A quick intro about myself: I am responsible for the Analytics Infrastructure at eBay, which includes Hadoop, Teradata, ETL and BI. The talk today will be on Apache Kylin, why we built Kylin and the latest features we have added to Kylin.
  2. Agenda for the talk: some of the stats on Big Data at eBay; the rationale for building Apache Kylin; new capabilities we are introducing in the 0.8 release by integrating with streaming. Branky will talk about the SEO optimization use case with streaming.
  3. Some numbers to address the aspects of scale at eBay. 157M active buyers. 25M active sellers: this is the key differentiation from our competitor up North. A lot of what we do with analytics is centered around making the seller successful. A key to that is the ability to analyze and put structure around the 800M listings we have on eBay, with more than 8.8M listings being added every day. As with everyone else, more than 40% of eBay revenue is generated on mobile devices, and that share is growing.
  4. Data we capture can be classified into behavioral data, or click stream analysis on user behavior on the eBay web site, and transactional data, which is the actual sale made, a listing added or an update in an auction. We retain up to 9 quarters of this data for seasonal analysis. The applications that get built on this data are what you would expect: personalization (personalizing the site based on your historical browsing behavior), targeting via email and mobile, deals, and improving our search capabilities and SEO.
  5. Evolution of Hadoop infrastructure: need driving innovation in the infrastructure. Hadoop started out as a single-use system for batch apps only, where the compute engine MR did both data processing and cluster resource management. First generation Hadoop met the need for affordable scalability and flexible data structure, and provided the democratization needed on the big data investment to give companies big and small the power and affordability of big data. However, it was only the first step. Limitations, like its batch-oriented job processing and consolidated resource management, drove the development of Yet Another Resource Negotiator (YARN), which was needed to take Hadoop to the next level. Summary: 1st Gen Hadoop was a monolithic system. If you compare it to, say, the Windows O/S, imagine the O/S and applications such as Notepad being tied together. Also, there was only one single app, i.e., MapReduce. This was limiting and not flexible. In its second major release, Hadoop took a giant leap forward, from batch oriented to enabling interactive capabilities. It also included the strategic decision to support the innovation of more independent execution models, including MapReduce as a YARN application separate from the data persistence of HDFS. Thus, via YARN, Hadoop became a true data operating system and platform for YARN applications, and the newer, improved two-tier framework enabled parallel innovation of data engines (both as Apache projects and in vendor-proprietary products). (Data lake strategy) Now, data from transaction oriented databases, document databases, and graph databases can all be stored within Hadoop, and also accessed via YARN applications without duplicating or moving data for a different workload usage.
While most major applications have not yet begun to migrate to Hadoop or NoSQL data stores, this trend is expected to grow as the benefits of having a unified data repository capable of data reuse without duplication or latency are realized. The Hadoop ecosystem enables innovation in Hadoop engines and YARN applications that will allow some to increase capabilities and performance (like Apache Hive) while other engines continue to mature and innovate (like Apache Drill). As different YARN applications for SQL access continue to improve, a critical factor will be those optimized for transaction processing with both record read and write operations with transaction integrity and atomicity, consistency, isolation, and durability (ACID) capability. This will enable current SQL-based applications to migrate to the Hadoop platform. And, as these migrations begin to happen, Hadoop will enable new benefits to be realized from having the data already in the data operating system for other applications to leverage, rather than deciding whether it’s cost effective and timely to move data.
  6. What is Kylin? We derived the project name from a mythical animal in Chinese legend, part of a set of four divine creatures including a phoenix, a dragon and a turtle. One other reason we chose it: it is supposed to live for 2000 years; if Apache Kylin gets 1/100 of its life we will be happy… So what is Kylin? It is an OLAP engine built for extreme scale on big data technology including Hive and HBase. We have been working on Hive for more than a year and open sourced in Oct and accepted into Apache Incubation in Nov.
  7. So why did we build Kylin? Like any enterprise, we internally deployed commercial OLAP engines like MicroStrategy but soon started hitting limits on scalability. We were collecting behavioral data at the rate of 20TB a day. A monthly analysis required petabyte-scale cubes. There were no products available commercially or in open source that supported cubes at this scale… We set out to build Kylin with the following requirements: interactive query capability on billions of rows of data; full ANSI SQL compliance; OLAP-style cubes; tight integration with BI tools like Tableau and MicroStrategy; support for a large number of dimensions and large cardinality for each dimension; highly available and secure; fundamentally different in that it is built from the ground up as distributed and supports scale out.
  8. Since we open sourced last November, we have seen rapid adoption in the industry: more than 25 contributors, an active dev mailing list, and production use outside of eBay at Baidu Maps and a couple of other companies. It is in evaluation at a variety of locations. There has been interest in a support model around this; unfortunately we are not there yet. It is still DIY if you want to deploy it. Within eBay some of the biggest use cases are the Traffic Data 21 TB cube and User Session data with 75 billion rows and 14 TB cubes.
  9. A high-level architecture for Kylin, which is a standard MOLAP architecture built on Hadoop. Data sources to build your MOLAP cubes, primarily Hive. We have a fantastic project in the works for a Storage Abstraction Layer to support other NoSQL stores such as Cassandra/Couchbase. An Engine Abstraction, which maintains the cube metadata, and a Cube Builder: today a set of MapReduce jobs to build the cubes. A storage layer to store the cubes in HBase, primarily through a bulk load of the aggregates into HBase. We are looking for active community participation to build out additional Data Source, Engine and Storage plugins into Kylin. A Query Engine that directly indexes into the multi-dimensional arrays built into HBase.
  10. Drilling a bit into the storage layer: you have a set of measures on which you want to compute your aggregates and the dimensions you want to aggregate on. The HBase row key is a combination of a cuboid ID and a set of dimension IDs. The row value holds the aggregates for all the measures.
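A minimal sketch of such a row key, under stated assumptions: the cuboid ID is modelled as a bitmask over the dimension order packed into 8 bytes, and each dimension value is replaced by a 2-byte dictionary ID. The widths, the dictionaries and the sample values are hypothetical choices for this example, not Kylin's actual on-disk encoding; the Year/Country/Category dimensions come from note 21.

```python
import struct

DIMENSIONS = ("year", "country", "category")

# Hypothetical dictionaries mapping each dimension value to a small integer ID.
DICTIONARIES = {
    "year": {"2014": 0, "2015": 1},
    "country": {"US": 0, "CN": 1, "RU": 2},
    "category": {"Fashion": 0, "Electronics": 1},
}

def cuboid_id(cuboid_dims):
    """Bitmask with one bit per dimension; the first dimension is the highest bit."""
    mask = 0
    for dim in DIMENSIONS:
        mask = (mask << 1) | (1 if dim in cuboid_dims else 0)
    return mask

def row_key(values):
    """Build cuboid ID + dimension IDs for the dimensions present in `values`."""
    key = struct.pack(">Q", cuboid_id(values.keys()))
    for dim in DIMENSIONS:
        if dim in values:
            key += struct.pack(">H", DICTIONARIES[dim][values[dim]])
    return key

key = row_key({"year": "2015", "category": "Electronics"})
```

Because the cuboid ID leads the key, all rows of one cuboid sit next to each other, which is what lets a query on that cuboid become a contiguous range scan.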
  11. What gets stored in HBase is a lattice of cuboids. In data warehousing literature a base cuboid stores the base measures and the apex cuboid captures the highest level of summarization. A data cube is a lattice of all cuboids between the base and the apex. For e.g. L1 has the base measure, L2 here has the aggregation for supplier, L2 has supplier and time and so on. Each of these values is stored in HBase using the row key format we had talked about earlier.
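The lattice can be illustrated with a short sketch that enumerates every cuboid for the four dimensions used on the cube slide and rolls a few base rows up to one of them. The helper names and the measure values are made up for the illustration.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")

def cuboids_by_level(dimensions):
    """Group every subset of the dimensions by size: 0-D apex up to the base cuboid."""
    return {
        k: [frozenset(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }

def aggregate(rows, dimensions, cuboid):
    """Roll base rows up to one cuboid; dimensions outside the cuboid become '*'."""
    totals = {}
    for *dims, measure in rows:
        key = tuple(v if d in cuboid else "*" for d, v in zip(dimensions, dims))
        totals[key] = totals.get(key, 0) + measure
    return totals

# Illustrative base rows: (time, item, location, supplier, measure).
rows = [
    ("9/15", "milk", "Urbana", "Dairy_land", 3),
    ("9/15", "milk", "Chicago", "Dairy_land", 5),
    ("9/16", "milk", "Urbana", "Dairy_land", 2),
]

lattice = cuboids_by_level(DIMENSIONS)
item_location = aggregate(rows, DIMENSIONS, frozenset({"item", "location"}))
```

With four dimensions the levels hold 1, 4, 6, 4 and 1 cuboids, 16 in total, which matches the lattice drawn on the slide.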
  12. Once we have stored the multi-dimensional array into HBase it needs to be queried. We built the query engine based on Calcite. Calcite has several extension points that we implement against: 1. the Metadata SPI; 2. pluggable rules in the query optimizer to efficiently execute the SQL queries and integrate with the storage APIs.
  13. A few more details, will skip because of time concerns, but this is the core of the Kylin engine, if you are interested in details please follow up on the Kylin Dev List.
  14. All of this is good. The efficiency of any MOLAP solution rests in its ability to efficiently build the cubes. As the number and cardinality of the dimensions increase you quickly run into a combinatorial explosion, which will render the MOLAP solution quite useless. In the next few slides we will walk through a series of ways to make the cube building process as efficient as possible. Step 1 is to avoid dimension explosion by dividing your dimensions into aggregation groups. A lot of the time you notice that you aggregate time and location, or time and category, but not all three together. You create separate aggregation groups that help build your cubes faster and take up less space, at the expense of query-time latency: results that were not pre-aggregated are now computed at query time.
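The arithmetic behind aggregation groups can be checked with a few lines. This sketch follows the slide's formula (each group of size n contributes 2^n cuboids); `covering_group` is a hypothetical helper showing when a query can be served from a pre-built cuboid and when it falls back to aggregation at query time.

```python
def full_cube_cuboids(num_dimensions):
    """A full cube pre-aggregates every subset of the dimensions."""
    return 2 ** num_dimensions

def partial_cube_cuboids(group_sizes):
    """Each aggregation group only pre-aggregates subsets of its own dimensions."""
    return sum(2 ** size for size in group_sizes)

def covering_group(query_dims, groups):
    """Return the first group containing all queried dimensions, or None.

    None means no pre-built cuboid covers the query, so the result has to be
    aggregated at query time.
    """
    wanted = set(query_dims)
    for group in groups:
        if wanted <= set(group):
            return group
    return None

full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])

groups = [("time", "location"), ("time", "category")]
hit = covering_group(["time", "category"], groups)
miss = covering_group(["location", "category"], groups)
```

For 30 dimensions the full cube has 1,073,741,824 cuboids, while three groups of 10 give 3,072, which is the "1 billion to 3 thousand" reduction quoted on the slide.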
  15. In Step 2, based on the cardinality of the dimensions, we automatically prune some of the cuboids in the lattice. Here the system determines the set of cuboids that can be pruned based on
  16. Another approach to optimizing cube build times is incremental building. The initial cube creation is done with all available measure values. Subsequent cube segments are built on daily refreshes of the data. A query spans each of the cube segments that constitute the cube. Periodically, cube segments are merged together to improve query performance.
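A small sketch of the idea, assuming additive measures (such as SUM or COUNT) so that segment totals can simply be added. The `Segment` class and helper names are illustrative and not part of Kylin.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_day: int  # inclusive
    end_day: int    # exclusive
    totals: Counter = field(default_factory=Counter)  # dimension key -> measure

def build_segment(start_day, end_day, rows):
    """Aggregate one refresh worth of (key, measure) rows into a new segment."""
    segment = Segment(start_day, end_day)
    for key, measure in rows:
        segment.totals[key] += measure
    return segment

def query(segments, key):
    """A query spans every segment of the cube and adds up the partial results."""
    return sum(segment.totals.get(key, 0) for segment in segments)

def merge(segments):
    """Fold several segments into one so later queries touch fewer segments."""
    ordered = sorted(segments, key=lambda s: s.start_day)
    merged = Segment(ordered[0].start_day, ordered[-1].end_day)
    for segment in ordered:
        merged.totals.update(segment.totals)  # Counter.update adds the counts
    return merged

day0 = build_segment(0, 1, [("US", 10)])
day1 = build_segment(1, 2, [("US", 5), ("CN", 2)])
before_merge = query([day0, day1], "US")
merged = merge([day0, day1])
after_merge = query([merged], "US")
```

The merge changes only how many segments a query has to visit; the answers stay the same.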
  17. The existing algorithm is a series of MapReduce jobs, each building a different level in the lattice of cuboids. Sequencing the different MR jobs increases the delay in building the cubes. The aggregations were done on the reduce side, resulting in shuffles of 100X the total cube size. Obviously this was an area ripe for improvement.
  18. With map-side aggregation the amount of data shuffled is reduced from 100X to 20X the total cube size. A single MR job now computes all the different cuboids in the lattice. This also paves the way for streaming support, or more accurately, for building cubes on micro batches as they arrive.
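Why map-side aggregation shrinks the shuffle can be seen in a toy simulation. It is plain Python rather than MapReduce, the function names and sample rows are invented, and it only contrasts emitting one record per row and cuboid with emitting pre-aggregated records per data split; it does not reproduce the 100X and 20X figures.

```python
from collections import Counter
from itertools import combinations

def cuboid_keys(row_dims):
    """Yield this row's key in every cuboid ('*' marks an aggregated-away dimension)."""
    n = len(row_dims)
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            yield tuple(row_dims[i] if i in kept else "*" for i in range(n))

def map_without_aggregation(split):
    """Emit one record per (row, cuboid); all aggregation is left to the reducers."""
    return [(key, m) for *dims, m in split for key in cuboid_keys(dims)]

def map_with_aggregation(split):
    """Aggregate inside the mapper, so each split emits one record per distinct key."""
    partial = Counter()
    for *dims, m in split:
        for key in cuboid_keys(dims):
            partial[key] += m
    return list(partial.items())

def reduce_merge(mapper_outputs):
    """Merge the shuffled records from all mappers into the final cube."""
    final = Counter()
    for output in mapper_outputs:
        for key, m in output:
            final[key] += m
    return final

# Two data splits of (time, item, location, measure) rows.
splits = [
    [("9/15", "milk", "Urbana", 1), ("9/15", "milk", "Urbana", 2)],
    [("9/15", "milk", "Chicago", 4)],
]
naive = [map_without_aggregation(s) for s in splits]
combined = [map_with_aggregation(s) for s in splits]
shuffled_naive = sum(len(o) for o in naive)
shuffled_combined = sum(len(o) for o in combined)
cube = reduce_merge(combined)
```

Both variants produce the same final cube; the difference is only in how many records cross the shuffle, and the gap widens as more rows in a split share the same dimension values.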
  19. With the addition of streaming capabilities, Kylin now acts as a Kafka consumer that subscribes to Kafka topics. Messages from the topic are kept in memory and an inverted index is maintained on the rows as they arrive. The Kylin query engine is extended to aggregate results from the in-memory tables. As the data ages, cube segments are built and combined with other cube segments.
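The in-memory part can be sketched as a small inverted index over incoming rows. The Kafka consumer is left out and messages are modelled as plain dicts; `InvertedIndex`, `hybrid_sum` and the sample rows are assumptions made for this illustration, not Kylin classes.

```python
from collections import defaultdict

class InvertedIndex:
    """Keeps recent rows in memory and indexes them by (dimension, value)."""

    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.rows = []
        self.postings = defaultdict(set)  # (dimension, value) -> row ids

    def append(self, row):
        row_id = len(self.rows)
        self.rows.append(row)
        for dim in self.dimensions:
            self.postings[(dim, row[dim])].add(row_id)

    def sum_where(self, measure, **equals):
        """Sum a measure over the rows matching every equality filter."""
        ids = None
        for dim, value in equals.items():
            matches = self.postings.get((dim, value), set())
            ids = matches if ids is None else ids & matches
        if ids is None:  # no filter: every row qualifies
            ids = range(len(self.rows))
        return sum(self.rows[i][measure] for i in ids)

def hybrid_sum(cube_total, index, measure, **equals):
    """Combine a pre-aggregated total from cube segments with the in-memory rows."""
    return cube_total + index.sum_where(measure, **equals)

index = InvertedIndex(("site", "search_engine"))
index.append({"site": "ebay.com", "search_engine": "Google", "visits": 3})
index.append({"site": "ebay.de", "search_engine": "Google", "visits": 4})
index.append({"site": "ebay.com", "search_engine": "Bing", "visits": 5})

recent = index.sum_where("visits", site="ebay.com", search_engine="Google")
total = hybrid_sum(100, index, "visits", search_engine="Google")
```

Because the rows themselves are retained, the same structure can also answer row-level drill-downs that a pre-aggregated cube cannot.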
  20. Measure eBay pages' SEO performance: how free traffic from major search engines is contributing to eBay business. 10 billion raw events. 5 million sessions. 10% of the traffic is generated through SEO. We are able to drill down on the different dimensions: Site/Country/Search Engine/Page/Device.
  21. Some additional details. For example, here we have a measure with dimensions Year, Country and Category. The HBase rows store the Cartesian product of the aggregates.
  22. We plan to generalize this implementation