SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Real-Time Analytics
With StarRocks
A Linux Foundation Project
Albert Wong, albert.wong@celerdata.com
Agenda
● Trends and challenges in today's real-time
analytics
● How StarRocks solves the challenges of
real-time analytics
● Case studies of how StarRocks is being
used for real-time analytics
Community User Quote
“Real-time analytics is like trying to drink
from a fire hose. The data is constantly
flowing, and it can be difficult to keep up.
But if you can do it, the insights you gain can
be invaluable.”
Profiting off of near real time knowledge - Battle of Waterloo, Nathan Rothschild and London Stock Exchange.
Trends in real-time analytics
1 Make better decisions 2 Increased user
engagement
3
Staying ahead of the
competition
Real-time analytics is the process of collecting, processing, and analyzing data as it is
generated, in order to gain insights into the present state of a system or process. This can lead
to a number of benefits, such as:
Key trends in real-time analytics:
Rise of streaming data Growth of Edge Computing Increasing use of machine
learning
Democratization of real-time
analytics
Trends in OLAP databases
1 Cloud Native 2
A
Sub-second vs. Second/Minute
Query Response Time
3
Data Warehouse vs. Data Lake vs.
Data Lakehouse
Online analytical processing (OLAP) databases are evolving rapidly to meet the demands of
modern data analytics. Here are some of the key trends in OLAP databases:
2
B Streaming vs. Batch Data
2
C Mutable vs. Immutable Data
2
D
Remote (Object) Storage vs. Local
(SSD) Storage
2
E
Open Table Format vs. Product
Native Storage Format
Proprietary / Hybrid Open
Open Storage
Trends in OLAP databases
Compute
Table Format
Storage Format
Open Lakehouse vs Proprietary / Hybrid Lakehouse
Challenges in real-time analytics
1
Ingestion speed and data updates:
Real-time analytics requires high-quality
data that is available in real time. The logic
and mathematics required to handle
unbounded data sets is different than static.
2
Query Response Time: Real-time analytics
needs to be able to deliver insights with
sub-second response times versus seconds
or minutes long query times.
3
Scalability: Real-time analytics solutions
need to be scalable to handle large volumes
of data and high concurrency. Performance
and cost are always put to the test as a data
set grows over time.
4
Cost: Real-time analytics solutions can be
expensive to implement and maintain.
Real-Time Analytics (RTA) is a powerful tool that can help businesses deliver insights to their
users, but it is not without its challenges. Here are some of the key challenges with RTA:
How StarRocks solves the
challenges of real-time
analytics
StarRocks was designed to address the challenges of
real-time analytics, including the need to support high
concurrency, low latency, a wide range of analytical
workloads.
StarRocks enables:
● Directly query data on data lake and get data warehouse performance.
● Sub-second joins and aggregations on billions of rows.
● Serving hundreds of thousands of concurrent end-user requests.
● No need for external data transformation tool for denormalization or pre-aggregation
● Compatibility with standard sql protocol and Trino dialect.
● Perform significantly more queries per second at lower latency on less hardware.
● Increase the flexibility and capability of your data lake or analytics system.
● Reduced cost and complexity by eliminating the need for costly data transformation pipelines.
Low Latency Queries Solution
Sub-second query with thousands of QPS and beyond
High Concurrency
Sub-second query with thousands of QPS and beyond
IO Bound Workload
● Less complex queries such as point
queries.
● Very High QPS.
● Typically requires Latency in the low 10s
of ms on massive amount of data
(billions of rows).
CPU Bound Workload
● Complex OLAP-style queries with JOINs
and AGGs.
● Interactive analytics.
● Reports, dashboards, etc.
High Concurrency Solution
IO Bound Workload
Scan the least amount of data possible
1. Bucketing: Fine-grained control over how the
data is distributed in your cluster.
● Avoid data skew.
● Choose a high cardinality column that
often appears in your `WHERE` clauses.
2. Indexing: Use prefix index (order key), Bitmap,
Bloom Filter index to minimize disk access.
3. IO-optimized Hardware:
● Configure (multiple) high-performance
SSD drives.
● Good network in terms of latency and
throughput.
CPU Bound Workload
Pre-computation or reusing previous results to take load off
your CPUs
1. Intelligent Caching:
● Final Result Cache: For the folks who keep
hitting the same button just for fun.
● Query Cache (Intermediate Result Cache):
Reuse partial results even if the queries are not
identical.
1. Pre-computation:
● Denormalization through Column Partial
Update.
● Pre-aggregation through MVs and Aggregate
Table.
● Generated Column (3.1): Materialize a column.
Fresh Mutable Data Mutable data is important!!
Existing solutions relies on merge on read – Huge
compromise on query performance.
Fresh Mutable Data Solution
Delete and insert mechanism with Primary Key index.
Support Real-Time mutable data with no compromise on query
performance.
Approach Pros Cons
Delete and insert
Simple, efficient for
read-heavy workloads
Can be inefficient for
write-heavy workloads
Merge on read
Can be efficient for
write-heavy workloads
More complex than delete
and insert
Resource Isolation
Why Resource Isolation is Needed:
● Maintaining Business-Critical Operations: Uninterrupted and undisturbed performance for key tasks.
● Quality of Service Assurance: Resource assurance for varying user groups in a multi-tenant environment.
Current Approach: Physical Isolation:
● Pros:
○ Effective isolation of resources.
● Cons:
○ Low Hardware Utilization: Results in underutilized resources due to lack of sharing or
overprovisioning.
○ High Costs: Need to meet peak demand of each user group leads to idle resources and escalated
costs, especially in large-scale operations.
○ Gets worse when you grow to tens of thousands of tenants.
Resource Isolation Solution
Resource Group: Support over-provisioning CPU resources.
Three types of workloads defined:
● Short query: Time sensitive queries with dedicated resource (no over-provisioning).
● Query queue: Queries that are not time sensitive but are critical (cannot be killed).
● The rest: supports customized rules to kill big queries.
StarRocks was designed to address the
challenges of real-time analytics, including
the need to support high concurrency, low
latency, a wide range of analytical
workloads and offers the ability to query
data directly from data lakes.
Reduce the time and cost of
developing data analytics projects.
● No ingestion and data copying.
● Stop paying to denormalize
data that may never be
queried.
● Keep the tools you’ve been
using.
Sub-second query while serve
millions of users.
● Multi-dimensional interactive
analytics through on-the-fly
computations.
● Single source of truth on open
data lake analytics, with no
external system for caching or
pre-computation.
Make real-time decisions in all
business scenarios, especially when
updated data, like in logistics, is
needed.
● Real-time update without
sacrificing query performance.
● Simpler real-time data pipeline
● Ditch stateful stream
processing jobs
(denormalization &
preaggregation) with efficient
on-the-fly computation.
Data warehouse query performance on the
data lakehouse with no data copying: Natively
integrates with open data lake including Hive,
Hudi, Iceberg, and Delta Lake.
Ditch Denormalization: Perform joins on
multiple tables with millions of rows in seconds.
No need for external data transformation tool:
Perform on-demand pre-computation (like
denormalization) within StarRocks, eliminating
another processing tool in your data pipeline.
Intelligent Query Planning
● Cost-based optimizer generates
optimized query plan.
● Global runtime filter.
Efficient query execution
● In-memory data shuffling enables fast
and scalable JOIN operation.
● C++ SIMD-optimized columnar storage
and vectorized query executions deliver
the industry's fastest query
performance.
High concurrency
● Built-in Materialized View.
● An Intelligent caching system.
● Secondary Indexes.
Ditch stream processing tools for
denormalization: StarRocks’ multi-table
performance allows you to ditch the rigid and
expensive stateful stream processing jobs for
denormalization and pre-aggregation.
Real-time updates: Through primary key table,
support real-time data updates while having no
impact on query performance.
Real-time Query Performance at scale: The
synchronized materialized view (MV) further
accelerates aggregated queries at scale for
real-time analytics.
Benchmark StarRocks Offers 2.2x Performance over ClickHouse and 8.9x
Performance over Apache Druid® in Wide-table Scenarios Out
of the Box using product native table format.
Benchmark StarRocks Delivers 5.54x Query Performance over Trino in
Multi-table Scenarios using Apache Iceberg table format with
Parquet files.
Use Case: Tableau
Dashboard at Airbnb
The Airbnb Tableau Dashboard project is designed to serve both
internal and external users by providing interactive dashboards. It
requires a quick response to user queries. However, the query
latency of previous solutions is over 10 mins, which is not
acceptable. This project was just suspended until StarRocks is
adopted.
StarRocks Solution:
● StarRocks can directly connect and works very well with
Tableau.
● 3 tables (0.5B rows, 6B rows, 100M rows) + 4 joins + 3
distinct count + JSON functions and regex at same time,
response time just 3.6s.
● Reduce the query response time from mins level to
sub-seconds level.
Use Case: Game and
User Behavior Analytics
at Tencent IEG
● 400+ game data analysis and user behavior analysis
● Operation reports need to be real-time.
● Using ClickHouse for real-time analysis and Trino for
Ad-hoc before, but they want to integrate them all.
● Using Iceberg + COS store, need better performance.
● Need elastic in ad-hoc query to deduce cost.
StarRocks Solution:
● Using StarRocks Primary key to solve update problem.
● Using compute node on k8s to auto-scaling.
● Get much more performance in ad-hoc query.
Use Case: Trust
Analytics at Airbnb
To enhance security, Airbnb needs a real-time fraud detection
system (Trust Analytics) to identify various attacks and take
actions ASAP. This system must support Ad-Hoc query and
real-time update.
StarRocks Solution:
● StarRocks hosts real-time updated datasets via Primary
Key.
● Dataset import from Kafka has a sub-minute delay.
● StarRocks provides second-level query latency for
complex joins.
● Alerting can be achieved by just running a SQL query
regularly.
History of StarRocks and CelerData
StarRocks was designed to address the challenges of real-time analytics, including the need to support
high concurrency, low latency, and a wide range of analytical workloads. StarRocks also offers a number
of features that are not available in other real-time analytics databases, such as the ability to query data
directly from data lakes.
2020
Birth of StarRocks
StarRocks is created as a commercialized fork of the
Apache Doris database. Over time, 90% of the
original codebase has been re-written.
2022
CelerData is founded
CelerData is founded as a company to develop and
commercialize StarRocks.
2023
StarRocks moves to Linux Foundation
CelerData contributes StarRocks to the Linux
Foundation and moves to Apache 2.0 license.
2023
CelerData Cloud Launched
CelerData launches its managed cloud service for
StarRocks.
2023
Benchmarks outperform competition
Latest TPC-DS and SSB benchmarks shows 2x-9x
speed performance over Trino, Clickhouse and
Apache Druid.
Thank you.
● Website starrocks.io
● Managed Service cloud.celerdata.com
StarRocks Project

Weitere ähnliche Inhalte

Ähnlich wie Real-Time Analytics With StarRocks (DWH+DL).pdf

Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio
 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsSamantha Quiñones
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Codemotion
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureMarco van der Hart
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Differencejeetendra mandal
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...DATAVERSITY
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Hitachi streaming data platform v8
Hitachi streaming data platform v8Hitachi streaming data platform v8
Hitachi streaming data platform v8Navaid Khan
 
Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8Navaid Khan
 
Hitachi Streaming Data Platform
Hitachi Streaming Data PlatformHitachi Streaming Data Platform
Hitachi Streaming Data PlatformNavaid Khan
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructureSimon Belak
 
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...Mydbops
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 

Ähnlich wie Real-Time Analytics With StarRocks (DWH+DL).pdf (20)

Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time Metrics
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Automated Analytics at Scale
Automated Analytics at ScaleAutomated Analytics at Scale
Automated Analytics at Scale
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochure
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Hitachi streaming data platform v8
Hitachi streaming data platform v8Hitachi streaming data platform v8
Hitachi streaming data platform v8
 
Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8
 
Hitachi Streaming Data Platform
Hitachi Streaming Data PlatformHitachi Streaming Data Platform
Hitachi Streaming Data Platform
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructure
 
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 

Kürzlich hochgeladen

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 

Kürzlich hochgeladen (20)

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 

Real-Time Analytics With StarRocks (DWH+DL).pdf

  • 1. Real-Time Analytics With StarRocks A Linux Foundation Project Albert Wong, albert.wong@celerdata.com
  • 2. Agenda ● Trends and challenges in today's real-time analytics ● How StarRocks solves the challenges of real-time analytics ● Case studies of how StarRocks is being used for real-time analytics
  • 3. Community User Quote “Real-time analytics is like trying to drink from a fire hose. The data is constantly flowing, and it can be difficult to keep up. But if you can do it, the insights you gain can be invaluable.” Profiting off of near real time knowledge - Battle of Waterloo, Nathan Rothschild and London Stock Exchange.
  • 4. Trends in real-time analytics 1 Make better decisions 2 Increased user engagement 3 Staying ahead of the competition Real-time analytics is the process of collecting, processing, and analyzing data as it is generated, in order to gain insights into the present state of a system or process. This can lead to a number of benefits, such as: Key trends in real-time analytics: Rise of streaming data Growth of Edge Computing Increasing use of machine learning Democratization of real-time analytics
  • 5. Trends in OLAP databases 1 Cloud Native 2 A Sub-second vs. Second/Minute Query Response Time 3 Data Warehouse vs. Data Lake vs. Data Lakehouse Online analytical processing (OLAP) databases are evolving rapidly to meet the demands of modern data analytics. Here are some of the key trends in OLAP databases: 2 B Streaming vs. Batch Data 2 C Mutable vs. Immutable Data 2 D Remote (Object) Storage vs. Local (SSD) Storage 2 E Open Table Format vs. Product Native Storage Format
  • 6. Proprietary / Hybrid Open Open Storage Trends in OLAP databases Compute Table Format Storage Format Open Lakehouse vs Proprietary / Hybrid Lakehouse
  • 7. Challenges in real-time analytics 1 Ingestion speed and data updates: Real-time analytics requires high-quality data that is available in real time. The logic and mathematics required to handle unbounded data sets is different than static. 2 Query Response Time: Real-time analytics needs to be able to deliver insights with sub-second response times versus seconds or minutes long query times. 3 Scalability: Real-time analytics solutions need to be scalable to handle large volumes of data and high concurrency. Performance and cost are always put to the test as a data set grows over time. 4 Cost: Real-time analytics solutions can be expensive to implement and maintain. Real-Time Analytics (RTA) is a powerful tool that can help businesses deliver insights to their users, but it is not without its challenges. Here are some of the key challenges with RTA:
  • 8. How StarRocks solves the challenges of real-time analytics
  • 9. StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, a wide range of analytical workloads. StarRocks enables: ● Directly query data on data lake and get data warehouse performance. ● Sub-second joins and aggregations on billions of rows. ● Serving hundreds of thousands of concurrent end-user requests. ● No need for external data transformation tool for denormalization or pre-aggregation ● Compatibility with standard sql protocol and Trino dialect. ● Perform significantly more queries per second at lower latency on less hardware. ● Increase the flexibility and capability of your data lake or analytics system. ● Reduced cost and complexity by eliminating the need for costly data transformation pipelines.
  • 10. Low Latency Queries Solution Sub-second query with thousands of QPS and beyond
  • 11. High Concurrency Sub-second query with thousands of QPS and beyond IO Bound Workload ● Less complex queries such as point queries. ● Very High QPS. ● Typically requires Latency in the low 10s of ms on massive amount of data (billions of rows). CPU Bound Workload ● Complex OLAP-style queries with JOINs and AGGs. ● Interactive analytics. ● Reports, dashboards, etc.
  • 12. High Concurrency Solution IO Bound Workload Scan the least amount of data possible 1. Bucketing: Fine-grained control over how the data is distributed in your cluster. ● Avoid data skew. ● Choose a high cardinality column that often appears in your `WHERE` clauses. 2. Indexing: Use prefix index (order key), Bitmap, Bloom Filter index to minimize disk access. 3. IO-optimized Hardware: ● Configure (multiple) high-performance SSD drives. ● Good network in terms of latency and throughput. CPU Bound Workload Pre-computation or reusing previous results to take load off your CPUs 1. Intelligent Caching: ● Final Result Cache: For the folks who keep hitting the same button just for fun. ● Query Cache (Intermediate Result Cache): Reuse partial results even if the queries are not identical. 1. Pre-computation: ● Denormalization through Column Partial Update. ● Pre-aggregation through MVs and Aggregate Table. ● Generated Column (3.1): Materialize a column.
  • 13. Fresh Mutable Data Mutable data is important!! Existing solutions relies on merge on read – Huge compromise on query performance.
  • 14. Fresh Mutable Data Solution Delete and insert mechanism with Primary Key index. Support Real-Time mutable data with no compromise on query performance. Approach Pros Cons Delete and insert Simple, efficient for read-heavy workloads Can be inefficient for write-heavy workloads Merge on read Can be efficient for write-heavy workloads More complex than delete and insert
  • 15. Resource Isolation Why Resource Isolation is Needed: ● Maintaining Business-Critical Operations: Uninterrupted and undisturbed performance for key tasks. ● Quality of Service Assurance: Resource assurance for varying user groups in a multi-tenant environment. Current Approach: Physical Isolation: ● Pros: ○ Effective isolation of resources. ● Cons: ○ Low Hardware Utilization: Results in underutilized resources due to lack of sharing or overprovisioning. ○ High Costs: Need to meet peak demand of each user group leads to idle resources and escalated costs, especially in large-scale operations. ○ Gets worse when you grow to tens of thousands of tenants.
  • 16. Resource Isolation Solution Resource Group: Support over-provisioning CPU resources. Three types of workloads defined: ● Short query: Time sensitive queries with dedicated resource (no over-provisioning). ● Query queue: Queries that are not time sensitive but are critical (cannot be killed). ● The rest: supports customized rules to kill big queries.
  • 17. StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, a wide range of analytical workloads and offers the ability to query data directly from data lakes.
  • 18. Reduce the time and cost of developing data analytics projects. ● No ingestion and data copying. ● Stop paying to denormalize data that may never be queried. ● Keep the tools you’ve been using. Sub-second query while serve millions of users. ● Multi-dimensional interactive analytics through on-the-fly computations. ● Single source of truth on open data lake analytics, with no external system for caching or pre-computation. Make real-time decisions in all business scenarios, especially when updated data, like in logistics, is needed. ● Real-time update without sacrificing query performance. ● Simpler real-time data pipeline ● Ditch stateful stream processing jobs (denormalization & preaggregation) with efficient on-the-fly computation.
  • 19. Data warehouse query performance on the data lakehouse with no data copying: Natively integrates with open data lake including Hive, Hudi, Iceberg, and Delta Lake. Ditch Denormalization: Perform joins on multiple tables with millions of rows in seconds. No need for external data transformation tool: Perform on-demand pre-computation (like denormalization) within StarRocks, eliminating another processing tool in your data pipeline. Intelligent Query Planning ● Cost-based optimizer generates optimized query plan. ● Global runtime filter. Efficient query execution ● In-memory data shuffling enables fast and scalable JOIN operation. ● C++ SIMD-optimized columnar storage and vectorized query executions deliver the industry's fastest query performance. High concurrency ● Built-in Materialized View. ● An Intelligent caching system. ● Secondary Indexes. Ditch stream processing tools for denormalization: StarRocks’ multi-table performance allows you to ditch the rigid and expensive stateful stream processing jobs for denormalization and pre-aggregation. Real-time updates: Through primary key table, support real-time data updates while having no impact on query performance. Real-time Query Performance at scale: The synchronized materialized view (MV) further accelerates aggregated queries at scale for real-time analytics.
  • 20. Benchmark StarRocks Offers 2.2x Performance over ClickHouse and 8.9x Performance over Apache Druid® in Wide-table Scenarios Out of the Box using product native table format.
  • 21. Benchmark StarRocks Delivers 5.54x Query Performance over Trino in Multi-table Scenarios using Apache Iceberg table format with Parquet files.
  • 22. Use Case: Tableau Dashboard at Airbnb The Airbnb Tableau Dashboard project is designed to serve both internal and external users by providing interactive dashboards. It requires a quick response to user queries. However, the query latency of previous solutions is over 10 mins, which is not acceptable. This project was just suspended until StarRocks is adopted. StarRocks Solution: ● StarRocks can directly connect and works very well with Tableau. ● 3 tables (0.5B rows, 6B rows, 100M rows) + 4 joins + 3 distinct count + JSON functions and regex at same time, response time just 3.6s. ● Reduce the query response time from mins level to sub-seconds level.
  • 23. Use Case: Game and User Behavior Analytics at Tencent IEG ● 400+ game data analysis and user behavior analysis ● Operation reports need to be real-time. ● Using ClickHouse for real-time analysis and Trino for Ad-hoc before, but they want to integrate them all. ● Using Iceberg + COS store, need better performance. ● Need elastic in ad-hoc query to deduce cost. StarRocks Solution: ● Using StarRocks Primary key to solve update problem. ● Using compute node on k8s to auto-scaling. ● Get much more performance in ad-hoc query.
  • 24. Use Case: Trust Analytics at Airbnb To enhance security, Airbnb needs a real-time fraud detection system (Trust Analytics) to identify various attacks and take actions ASAP. This system must support Ad-Hoc query and real-time update. StarRocks Solution: ● StarRocks hosts real-time updated datasets via Primary Key. ● Dataset import from Kafka has a sub-minute delay. ● StarRocks provides second-level query latency for complex joins. ● Alerting can be achieved by just running a SQL query regularly.
  • 25. History of StarRocks and CelerData StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, and a wide range of analytical workloads. StarRocks also offers a number of features that are not available in other real-time analytics databases, such as the ability to query data directly from data lakes. 2020 Birth of StarRocks StarRocks is created as a commercialized fork of the Apache Doris database. Over time, 90% of the original codebase has been re-written. 2022 CelerData is founded CelerData is founded as a company to develop and commercialize StarRocks. 2023 StarRocks moves to Linux Foundation CelerData contributes StarRocks to the Linux Foundation and moves to Apache 2.0 license. 2023 CelerData Cloud Launched CelerData launches its managed cloud service for StarRocks. 2023 Benchmarks outperform competition Latest TPC-DS and SSB benchmarks shows 2x-9x speed performance over Trino, Clickhouse and Apache Druid.
  • 26. Thank you. ● Website starrocks.io ● Managed Service cloud.celerdata.com StarRocks Project