SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Qian Qiao Neng
eBay Senior Product Manager
Sep 2017
When OLAP Meets Real-time, What
Happens in eBay?
PLATFORM ENGINEERING REVIEW
Agenda
What’s OLAP & Apache Kylin
Challenge of Real-time OLAP
eBay Real-time OLAP
Architecture
3 Stages Data Store
Segment Window
Others
2
PLATFORM ENGINEERING REVIEW
What’s OLAP
• OLAP(Online Analytical Processing), is an approach to
answering multi-dimensional analytical queries quickly
in computing.
• OLAP Cube, a data set consists of numeric facts
called measures that are categorized by dimensions.
• OLAP Cube Build is the data process pre-calculating
measures by multi-dimensions, and save them in storage as a
cube.
• OLAP Cube trades off data processing latency for online
query latency
3
PLATFORM ENGINEERING REVIEW
What is Apache Kylin
Extreme OLAP Engine for Big Data
Apache Kylin is an open source Distributed Analytics Engine
designed to provide SQL interface and multi-dimensional analysis
(OLAP) on Hadoop supporting extremely large datasets, initiated and
continuously contributed by eBay Inc.
kylin 麒麟
--n. (in Chinese art) a mythical animal of composite form
Kylin fills an important gap – OLAP on Hadoop
4
PLATFORM ENGINEERING REVIEW
Kylin Architecture
5
PLATFORM ENGINEERING REVIEW
The Challenge
Remember Kylin is about pre-aggregations, it successfully reduced the query latencies by spending
more time and resources on data processing.
But for real-time OLAP, cubing time and data visibility latency is critical.
Can we pre-build the cubes in real-time in a more cost effective manner? This is hard but still doable.
6
PLATFORM ENGINEERING REVIEW
Real-time OLAP
Milliseconds Query Latency + Milliseconds Data Processing Latency
7
PLATFORM ENGINEERING REVIEW
3 Stages Data Store
We save the unbounded incoming streaming data into 3 stages.
8
InMem Stage
On Disk Stage
Full Cubing Stage
Continuously InMem
Aggregations
Flush to disk, columnar
based storage and indexes
Full cubing with MR or
Spark, save to HBase.
Unbounded
streaming events
PLATFORM ENGINEERING REVIEW
3 Stages Data Store
99
InMem Stage On Disk Stage Full Cubing Stage
InMemoryScanner FragementScanner HBaseScanner
GTScanner Interface
Small Real-time Data Volume
Few Aggregations
Large Historical Data Volume
All Aggregations
Low Data
Process
Latency
Users
Transparent
Low Query
Latency
PLATFORM ENGINEERING REVIEW
eBay’s Real-time OLAP built on Apache Kylin
10
PLATFORM ENGINEERING REVIEW
Streaming
Receivers Cluster
11
ReplicaSet1Streaming Sources
Cube
Storage
(HBase)
new streaming cube request
1
2
3
4
5
Build EngineSteaming Coordinator
6
7
How Real-time Cube Build Engine Works
8
ReplicaSet2
ReplicaSet3 ReplicaSet4
ReplicaSet5
9
10
PLATFORM ENGINEERING REVIEW
Streaming Receivers
Cluster
12
SQL Query
Steaming CoordinatorQuery Engine
1
2 3
How Real-time Cube Query Engine Works
12
Cube
Storage
(HBase)
PLATFORM ENGINEERING REVIEW
Data Storage
13
Raw
Record_1
…
RR_x
RR_x+1
…
RR_y
Seg_LL
Seg_4
Seg_3
1 … I
1 … J
1 … K
In Memory Active Fragments
Seg_LL
Seg_2
Seg_1
1 … L
1 … M
1 … N
Immutable Fragments
InMemoryScanner FragementScanner HBaseScanner
Open to Write Close to Process
GTScanner Interface
How to support milliseconds query latency
Processed
Unbounded
streaming events
PLATFORM ENGINEERING REVIEW
Hierarchy Data on Disk Stage
14
Fragment
Segment
Cube Cube
Segment 1
Fragment 1 Fragment 2 ...
...
...
Meta data Meta data Meta data Meta data
PLATFORM ENGINEERING REVIEW
Segments are divided by event time.
New created segments are first
In Active state.
After a preconfigured data retention
period, no further events coming,
The segment will be changed to
Immutable state and then write to
remote HDFS.
15
Segment Window And States
PLATFORM ENGINEERING REVIEW 16
Seg_4
1 … J
Active Segments
Seg_3
Seg_2
Seg_1
1 … L
1 … M
1 … N
Immutable Segments
Open to Write Close to Process
Segment States
Unbounded
streaming events
In Memory
Store
Fragments
PLATFORM ENGINEERING REVIEW 17
Segment Store On Disk
PLATFORM ENGINEERING REVIEW 18
Columnar Based Fragment File Design
Dimensions : D1,D2…Dn.
Metrics : M1,M2...Mm.
RawRecords : R1,R2...Rr.
D_1
…
D_n
Metric_1
…
Metric_m
Dimension_1 Dictionary
Dimension_1 Encoded
Value by Row
Index Data
Dimensions
Metrics
PLATFORM ENGINEERING REVIEW 19
Dimension_1 Dictionary
Dimension_1 Encoded
Value by Row
Inverted Index
Dim_1Value_1 Dic_1 Value_1
… …
Dim_1Value_x Dic_1 Value_x
Row_1’s Dic_1 Value
…
Row_r’s Dic_1 Value
Dic_1Value_1 Offset_1
… …
Dic_1Value_x Offset_x
Dic_1Value_1‘s occurrence of Rows
…
Dic_1Value_x‘s occurrence of Rows
Offset_1
Offset_...
Offset_x R_1
Occu
R_2
Occu
…
R_r
Occu
PLATFORM ENGINEERING REVIEW 20
Dimensions : Site, Category, Date
Metrics : SPS, PV.
RawRecords :
Site Category Date SPS PV
US Collectibles 8/27/16 11757 204143
US Collectibles 8/28/16 10949 192350
GE Fashion 8/28/16 5338 186659
US Collectibles 8/29/16 9609 186283
US Fashion 8/27/16 15174 177123
US Fashion 8/28/16 14612 172552
GE Home & Garden 8/27/16 7336 171364
US Fashion 8/29/16 13927 171235
GE Home & Garden 8/28/16 8615 169846
US Home & Garden 8/29/16 4560 166506
PLATFORM ENGINEERING REVIEW 21
D1
…
Dn
Metrics1
…
Metrics m
Dimension_1 Dictionary
Dimension_1 Encoded
Value by Row
Inverted Index
Dimensions
Metrics
US 1
GE 2
1121112121
1 offset1
2 offset2
1101110101
0010001010
11757
10949
5338
9609
15174
14612
7336
13927
8615
4560
Record_1 M1
Record_2 M1
……
Record_r M1
PLATFORM ENGINEERING REVIEW 22
Replica Set
The concept of Replica Set is similar to many other
distributed systems like Kafka, Mongo, Kubernetes, etc.
Replica Set ensures the replications for HA purpose.
Streaming receiver instances in the same replica set
have the exact same local state.
Streaming receiver instances are pre-allocated to join the
Replica Set.
Streaming
Receivers Cluster
ReplicaSet1 ReplicaSet2
ReplicaSet3 ReplicaSet4
ReplicaSet5
PLATFORM ENGINEERING REVIEW 23
Replica Set
Replica Set
Receiver1
Receiver2
Assignment:
“cube1”:[1,2]
“cube2”:[2,3]
Zookeeper
• All receivers in the set
share the same
assignment.
• The lead of the
ReplicaSet responsible
to upload segments to
HDFS
• Use Zookeeper to do
leader election
PLATFORM ENGINEERING REVIEW 24
Date Time Topic Name Partition # Offset SeqID of Active Segments
2016/10/01
12:00:00
Topic_1 1 x Seg_5, I Seg_4, J Seg_3, K
Check Point
Kafka
Seg_5
Seg_4
Seg_3
Topic_1 Part_1
1 … I
1 … J
1 … K
Active Fragments
... x x+1 … y y+1 …
Kafka
PLATFORM ENGINEERING REVIEW 25
Question?

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
Databricks
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
DataWorks Summit
 

Was ist angesagt? (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Deep Learning to Production with MLflow & RedisAI
Deep Learning to Production with MLflow & RedisAIDeep Learning to Production with MLflow & RedisAI
Deep Learning to Production with MLflow & RedisAI
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Tailored for Spark
Tailored for SparkTailored for Spark
Tailored for Spark
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 

Ähnlich wie When OLAP Meets Real-Time, What Happens in eBay?

Ähnlich wie When OLAP Meets Real-Time, What Happens in eBay? (20)

Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
 
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big Data
 
OLAP Basics and Fundamentals by Bharat Kalia
OLAP Basics and Fundamentals by Bharat Kalia OLAP Basics and Fundamentals by Bharat Kalia
OLAP Basics and Fundamentals by Bharat Kalia
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
 
Ensuring Quality in Data Lakes (D&D Meetup Feb 22)
Ensuring Quality in Data Lakes  (D&D Meetup Feb 22)Ensuring Quality in Data Lakes  (D&D Meetup Feb 22)
Ensuring Quality in Data Lakes (D&D Meetup Feb 22)
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
What’s Evolving in the Elastic Stack
What’s Evolving in the Elastic StackWhat’s Evolving in the Elastic Stack
What’s Evolving in the Elastic Stack
 
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoExtending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
 
Extending Analytic Reach
Extending Analytic ReachExtending Analytic Reach
Extending Analytic Reach
 
Replicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analyticsReplicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analytics
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
 

Mehr von DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

When OLAP Meets Real-Time, What Happens in eBay?

  • 1. Qian Qiao Neng eBay Senior Product Manager Sep 2017 When OLAP Meets Real-time, What Happens in eBay?
  • 2. PLATFORM ENGINEERING REVIEW Agenda What’s OLAP & Apache Kylin Challenge of Real-time OLAP eBay Real-time OLAP Architecture 3 Stages Data Store Segment Window Others 2
  • 3. PLATFORM ENGINEERING REVIEW What’s OLAP • OLAP(Online Analytical Processing), is an approach to answering multi-dimensional analytical queries quickly in computing. • OLAP Cube, a data set consists of numeric facts called measures that are categorized by dimensions. • OLAP Cube Build is the data process pre-calculating measures by multi-dimensions, and save them in storage as a cube. • OLAP Cube trades off data processing latency for online query latency 3
  • 4. PLATFORM ENGINEERING REVIEW What is Apache Kylin Extreme OLAP Engine for Big Data Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets, initiated and continuously contributed by eBay Inc. kylin 麒麟 --n. (in Chinese art) a mythical animal of composite form Kylin fills an important gap – OLAP on Hadoop 4
  • 6. PLATFORM ENGINEERING REVIEW The Challenge Remember Kylin is about pre-aggregations, it successfully reduced the query latencies by spending more time and resources on data processing. But for real-time OLAP, cubing time and data visibility latency is critical. Can we pre-build the cubes in real-time in a more cost effective manner? This is hard but still doable. 6
  • 7. PLATFORM ENGINEERING REVIEW Real-time OLAP Milliseconds Query Latency + Milliseconds Data Processing Latency 7
  • 8. PLATFORM ENGINEERING REVIEW 3 Stages Data Store We save the unbounded incoming streaming data into 3 stages. 8 InMem Stage On Disk Stage Full Cubing Stage Continuously InMem Aggregations Flush to disk, columnar based storage and indexes Full cubing with MR or Spark, save to HBase. Unbounded streaming events
  • 9. PLATFORM ENGINEERING REVIEW 3 Stages Data Store 99 InMem Stage On Disk Stage Full Cubing Stage InMemoryScanner FragementScanner HBaseScanner GTScanner Interface Small Real-time Data Volume Few Aggregations Large Historical Data Volume All Aggregations Low Data Process Latency Users Transparent Low Query Latency
  • 10. PLATFORM ENGINEERING REVIEW eBay’s Real-time OLAP built on Apache Kylin 10
  • 11. PLATFORM ENGINEERING REVIEW Streaming Receivers Cluster 11 ReplicaSet1Streaming Sources Cube Storage (HBase) new streaming cube request 1 2 3 4 5 Build EngineSteaming Coordinator 6 7 How Real-time Cube Build Engine Works 8 ReplicaSet2 ReplicaSet3 ReplicaSet4 ReplicaSet5 9 10
  • 12. PLATFORM ENGINEERING REVIEW Streaming Receivers Cluster 12 SQL Query Steaming CoordinatorQuery Engine 1 2 3 How Real-time Cube Query Engine Works 12 Cube Storage (HBase)
  • 13. PLATFORM ENGINEERING REVIEW Data Storage 13 Raw Record_1 … RR_x RR_x+1 … RR_y Seg_LL Seg_4 Seg_3 1 … I 1 … J 1 … K In Memory Active Fragments Seg_LL Seg_2 Seg_1 1 … L 1 … M 1 … N Immutable Fragments InMemoryScanner FragementScanner HBaseScanner Open to Write Close to Process GTScanner Interface How to support milliseconds query latency Processed Unbounded streaming events
  • 14. PLATFORM ENGINEERING REVIEW Hierarchy Data on Disk Stage 14 Fragment Segment Cube Cube Segment 1 Fragment 1 Fragment 2 ... ... ... Meta data Meta data Meta data Meta data
  • 15. PLATFORM ENGINEERING REVIEW Segments are divided by event time. New created segments are first In Active state. After a preconfigured data retention period, no further events coming, The segment will be changed to Immutable state and then write to remote HDFS. 15 Segment Window And States
  • 16. PLATFORM ENGINEERING REVIEW 16 Seg_4 1 … J Active Segments Seg_3 Seg_2 Seg_1 1 … L 1 … M 1 … N Immutable Segments Open to Write Close to Process Segment States Unbounded streaming events In Memory Store Fragments
  • 17. PLATFORM ENGINEERING REVIEW 17 Segment Store On Disk
  • 18. PLATFORM ENGINEERING REVIEW 18 Columnar Based Fragment File Design Dimensions : D1,D2…Dn. Metrics : M1,M2...Mm. RawRecords : R1,R2...Rr. D_1 … D_n Metric_1 … Metric_m Dimension_1 Dictionary Dimension_1 Encoded Value by Row Index Data Dimensions Metrics
  • 19. PLATFORM ENGINEERING REVIEW 19 Dimension_1 Dictionary Dimension_1 Encoded Value by Row Inverted Index Dim_1Value_1 Dic_1 Value_1 … … Dim_1Value_x Dic_1 Value_x Row_1’s Dic_1 Value … Row_r’s Dic_1 Value Dic_1Value_1 Offset_1 … … Dic_1Value_x Offset_x Dic_1Value_1‘s occurrence of Rows … Dic_1Value_x‘s occurrence of Rows Offset_1 Offset_... Offset_x R_1 Occu R_2 Occu … R_r Occu
  • 20. PLATFORM ENGINEERING REVIEW 20 Dimensions : Site, Category, Date Metrics : SPS, PV. RawRecords : Site Category Date SPS PV US Collectibles 8/27/16 11757 204143 US Collectibles 8/28/16 10949 192350 GE Fashion 8/28/16 5338 186659 US Collectibles 8/29/16 9609 186283 US Fashion 8/27/16 15174 177123 US Fashion 8/28/16 14612 172552 GE Home & Garden 8/27/16 7336 171364 US Fashion 8/29/16 13927 171235 GE Home & Garden 8/28/16 8615 169846 US Home & Garden 8/29/16 4560 166506
  • 21. PLATFORM ENGINEERING REVIEW 21 D1 … Dn Metrics1 … Metrics m Dimension_1 Dictionary Dimension_1 Encoded Value by Row Inverted Index Dimensions Metrics US 1 GE 2 1121112121 1 offset1 2 offset2 1101110101 0010001010 11757 10949 5338 9609 15174 14612 7336 13927 8615 4560 Record_1 M1 Record_2 M1 …… Record_r M1
  • 22. PLATFORM ENGINEERING REVIEW 22 Replica Set The concept of Replica Set is similar to many other distributed systems like Kafka, Mongo, Kubernetes, etc. Replica Set ensures the replications for HA purpose. Streaming receiver instances in the same replica set have the exact same local state. Streaming receiver instances are pre-allocated to join the Replica Set. Streaming Receivers Cluster ReplicaSet1 ReplicaSet2 ReplicaSet3 ReplicaSet4 ReplicaSet5
  • 23. PLATFORM ENGINEERING REVIEW 23 Replica Set Replica Set Receiver1 Receiver2 Assignment: “cube1”:[1,2] “cube2”:[2,3] Zookeeper • All receivers in the set share the same assignment. • The lead of the ReplicaSet responsible to upload segments to HDFS • Use Zookeeper to do leader election
  • 24. PLATFORM ENGINEERING REVIEW 24 Date Time Topic Name Partition # Offset SeqID of Active Segments 2016/10/01 12:00:00 Topic_1 1 x Seg_5, I Seg_4, J Seg_3, K Check Point Kafka Seg_5 Seg_4 Seg_3 Topic_1 Part_1 1 … I 1 … J 1 … K Active Fragments ... x x+1 … y y+1 … Kafka