Build User-Facing
Analytics Application
That Scales
Using StarRocks, a Linux Foundation Project
Albert Wong, albert.wong@celerdata.com
Agenda
● Trends and challenges in today's
user-facing analytics
● How StarRocks solves the challenges of
user-facing analytics
● Case studies of how StarRocks is being
used for user-facing analytics
Community User Quote
“Real-time analytics is like trying to drink
from a fire hose. The data is constantly
flowing, and it can be difficult to keep up.
But if you can do it, the insights you gain can
be invaluable.”
Profiting off of near real time knowledge - Battle of Waterloo, Nathan Rothschild and London Stock Exchange.
Trends in user-facing analytics
User-facing analytics (UFA) is a rapidly growing field that is transforming the way businesses
deliver insights to their users. UFA empowers users to explore and analyze data for themselves,
without the need for technical expertise. This can lead to a number of benefits, such as:
1 Improved decision making
2 Increased user engagement
3 Reduced reliance on IT
Key trends in user-facing analytics:
● Self-service Analytics
● Embedded Analytics
● Real-Time Analytics
● Augmented Analytics
● Conversational Analytics
Trends in OLAP databases
Online analytical processing (OLAP) databases are evolving rapidly to meet the demands of
modern data analytics. Here are some of the key trends in OLAP databases:
1 Cloud Native
2 A Sub-second vs. Second/Minute Query Response Time
2 B Streaming vs. Batch Data
2 C Mutable vs. Immutable Data
2 D Remote (Object) Storage vs. Local (SSD) Storage
2 E Open Table Format vs. Product Native Storage Format
3 Data Warehouse vs. Data Lake vs. Data Lakehouse
Trends in OLAP databases
[Figure: Open Lakehouse vs Proprietary / Hybrid Lakehouse. Columns: Proprietary / Hybrid, Open. Layers: Compute, Table Format, Storage Format. Label: Open Storage.]
Challenges in user-facing analytics
User-facing analytics (UFA) is a powerful tool that can help businesses deliver insights to their
users, but it is not without its challenges. Here are some of the key challenges with UFA:
1 Ingestion speed and data updates: User-facing analytics requires high-quality data that is
available in real time. The logic and mathematics required to handle unbounded data sets
differ from those for static data sets.
2 Query response time: User-facing analytics needs to deliver insights with sub-second
response times rather than query times of seconds or minutes.
3 Scalability: User-facing analytics solutions need to scale to handle large volumes of data
and high concurrency. Performance and cost are always put to the test as a data set grows
over time.
4 Cost: User-facing analytics solutions can be expensive to implement and maintain.
How StarRocks solves the
challenges of user-facing
analytics
StarRocks is an open-source query engine that
delivers data warehouse performance on the
data lake.
StarRocks enables:
● Direct queries on the data lake with data warehouse performance.
● Sub-second joins and aggregations on billions of rows.
● Serving hundreds of thousands of concurrent end-user requests.
● No need for an external data transformation tool for denormalization or pre-aggregation.
● Compatibility with the standard SQL protocol and the Trino dialect.
● Significantly more queries per second at lower latency on less hardware.
● Increased flexibility and capability for your data lake or analytics system.
● Reduced cost and complexity by eliminating the need for costly data transformation pipelines.
Low Latency Queries Solution
Sub-second query with thousands of QPS and beyond
High Concurrency
IO Bound Workload
● Less complex queries, such as point
queries.
● Very high QPS.
● Typically requires latency in the low tens
of milliseconds on massive amounts of data
(billions of rows).
CPU Bound Workload
● Complex OLAP-style queries with JOINs
and AGGs.
● Interactive analytics.
● Reports, dashboards, etc.
High Concurrency Solution
IO Bound Workload
Scan the least amount of data possible
1. Bucketing: Fine-grained control over how the
data is distributed in your cluster.
● Avoid data skew.
● Choose a high cardinality column that
often appears in your `WHERE` clauses.
2. Indexing: Use a prefix index (order key), Bitmap index,
or Bloom Filter index to minimize disk access.
3. IO-optimized Hardware:
● Configure (multiple) high-performance
SSD drives.
● Good network in terms of latency and
throughput.
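The bucketing advice above can be illustrated with a small Python sketch. It is a conceptual toy, not StarRocks code: the helper names (`bucket_of`, `build_buckets`, `point_lookup`, `skew_ratio`), the use of CRC32 as the hash, and the bucket count of 8 are all assumptions made for illustration. It shows that hashing rows on a high-cardinality column lets an equality filter on that column scan a single bucket, while a low-cardinality key concentrates rows in a few buckets (data skew).

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for this example


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def build_buckets(rows, key_column, num_buckets=NUM_BUCKETS):
    """Distribute rows (dicts) into buckets by hashing one column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_column], num_buckets)].append(row)
    return buckets


def point_lookup(buckets, key_column, value, num_buckets=NUM_BUCKETS):
    """Answer `WHERE key_column = value` by scanning only the matching bucket."""
    candidates = buckets.get(bucket_of(value, num_buckets), [])
    matches = [row for row in candidates if row[key_column] == value]
    return matches, len(candidates)


def skew_ratio(buckets, num_buckets=NUM_BUCKETS):
    """Largest bucket size divided by the mean bucket size (1.0 is perfectly even)."""
    sizes = [len(buckets.get(b, [])) for b in range(num_buckets)]
    return max(sizes) / (sum(sizes) / num_buckets)


# Synthetic rows: user_id is high cardinality, country has only two values.
rows = [{"user_id": i, "country": "US" if i % 10 else "DE"} for i in range(10_000)]
by_user = build_buckets(rows, "user_id")
by_country = build_buckets(rows, "country")

matches, scanned = point_lookup(by_user, "user_id", 4242)
print("matches:", len(matches), "rows scanned:", scanned, "of", len(rows))
print("skew by user_id:", round(skew_ratio(by_user), 2))
print("skew by country:", round(skew_ratio(by_country), 2))
```

In this toy, the lookup touches only the rows that share a bucket with the requested key, which is the "scan the least amount of data possible" idea in miniature.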
CPU Bound Workload
Pre-computation or reusing previous results to take load off
your CPUs
1. Intelligent Caching:
● Final Result Cache: For the folks who keep
hitting the same button just for fun.
● Query Cache (Intermediate Result Cache):
Reuse partial results even if the queries are not
identical.
2. Pre-computation:
● Denormalization through Column Partial
Update.
● Pre-aggregation through MVs and Aggregate
Table.
● Generated Column (3.1): Materialize a column.
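As a rough illustration of two of the ideas above, the following Python sketch keeps a pre-aggregated rollup up to date at ingest time and places a final-result cache in front of it. Everything here is hypothetical (the class `PreAggregatedSales`, its methods, the query text used as a cache key); it is not how StarRocks implements materialized views or its caches, and it does not model the intermediate-result Query Cache.

```python
from collections import defaultdict


class PreAggregatedSales:
    """Toy table that maintains a rollup at ingest time (all names hypothetical)."""

    def __init__(self):
        self.raw_rows = []                     # detail rows: (region, amount)
        self.sum_by_region = defaultdict(int)  # pre-aggregated rollup
        self.result_cache = {}                 # final results keyed by query text

    def ingest(self, region, amount):
        self.raw_rows.append((region, amount))
        self.sum_by_region[region] += amount   # incremental rollup maintenance
        self.result_cache.clear()              # cached answers may now be stale

    def total_full_scan(self, region):
        """Baseline: aggregate over every detail row at query time."""
        return sum(amount for r, amount in self.raw_rows if r == region)

    def total_from_rollup(self, region):
        """Pre-aggregation: answer from the rollup without touching detail rows."""
        return self.sum_by_region.get(region, 0)

    def cached_query(self, query_text, compute):
        """Final-result cache: an identical query text reuses the stored answer."""
        if query_text not in self.result_cache:
            self.result_cache[query_text] = compute()
        return self.result_cache[query_text]


table = PreAggregatedSales()
for i in range(1000):
    table.ingest("emea" if i % 2 else "apac", i)

compute_calls = []


def compute_emea_total():
    compute_calls.append(1)  # record how often the query is actually computed
    return table.total_from_rollup("emea")


QUERY = "SELECT SUM(amount) FROM sales WHERE region = 'emea'"
first = table.cached_query(QUERY, compute_emea_total)
second = table.cached_query(QUERY, compute_emea_total)
print(first == table.total_full_scan("emea"), first == second, len(compute_calls))
```

The design trade-off in this sketch is that each write does a little extra work (updating the rollup and invalidating the cache) so that repeated reads avoid scanning detail rows.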
Fresh Mutable Data
Mutable data is important!
Existing solutions rely on merge on read, which is a huge
compromise on query performance.
Fresh Mutable Data Solution
Delete-and-insert mechanism with a Primary Key index.
Supports real-time mutable data with no compromise on query
performance.
Approach | Pros | Cons
Delete and insert | Simple, efficient for read-heavy workloads | Can be inefficient for write-heavy workloads
Merge on read | Can be efficient for write-heavy workloads | More complex than delete and insert
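The difference between the two approaches can be sketched in a few lines of Python. This is a deliberately simplified model, not the StarRocks storage engine: the class names, the in-memory lists, and the counters `read_merge_steps` and `write_index_lookups` are assumptions introduced only to show where each approach does its work.

```python
class MergeOnReadTable:
    """Writes append versions; every read merges versions to find the latest per key."""

    def __init__(self):
        self.versions = []         # (key, value) pairs, newest last
        self.read_merge_steps = 0  # work done at read time

    def upsert(self, key, value):
        self.versions.append((key, value))  # cheap write: append only

    def scan(self):
        latest = {}
        for key, value in self.versions:    # read-time merge over all versions
            latest[key] = value
            self.read_merge_steps += 1
        return latest


class DeleteAndInsertTable:
    """Writes consult a primary key index and flag the old row; reads skip the merge."""

    def __init__(self):
        self.rows = []                 # [key, value, is_live]
        self.pk_index = {}             # key -> position of the live row
        self.write_index_lookups = 0   # work done at write time

    def upsert(self, key, value):
        self.write_index_lookups += 1
        old_pos = self.pk_index.get(key)
        if old_pos is not None:
            self.rows[old_pos][2] = False     # delete: flag the old version
        self.pk_index[key] = len(self.rows)
        self.rows.append([key, value, True])  # insert: append the new version

    def scan(self):
        # Only a liveness check per row; no per-key version comparison is needed.
        return {key: value for key, value, live in self.rows if live}


updates = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "shipped"),
    ("order-1", "delivered"),
]
mor, dai = MergeOnReadTable(), DeleteAndInsertTable()
for key, value in updates:
    mor.upsert(key, value)
    dai.upsert(key, value)

print(mor.scan() == dai.scan())
print("merge steps at read:", mor.read_merge_steps)
print("index lookups at write:", dai.write_index_lookups)
```

Both tables return the same latest values; the sketch only moves the reconciliation cost from the read path (merge on read) to the write path (delete and insert), which matches the pros and cons listed in the table.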
Resource Isolation
Why Resource Isolation is Needed:
● Maintaining Business-Critical Operations: Uninterrupted and undisturbed performance for key tasks.
● Quality of Service Assurance: Resource assurance for varying user groups in a multi-tenant environment.
Current Approach: Physical Isolation:
● Pros:
○ Effective isolation of resources.
● Cons:
○ Low Hardware Utilization: Results in underutilized resources due to lack of sharing or
overprovisioning.
○ High Costs: The need to meet the peak demand of each user group leads to idle resources and escalated
costs, especially in large-scale operations.
○ The problem gets worse as you grow to tens of thousands of tenants.
Resource Isolation Solution
Resource Group: Support over-provisioning CPU resources.
Three types of workloads defined:
● Short query: Time sensitive queries with dedicated resource (no over-provisioning).
● Query queue: Queries that are not time sensitive but are critical (cannot be killed).
● The rest: supports customized rules to kill big queries.
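One way to picture the three workload types is as a small admission function. The sketch below is purely illustrative: the `Query` fields, the labels `short_query`, `query_queue`, and `rest`, and the row-count kill threshold are assumptions for this example and do not reflect the actual StarRocks resource group syntax or classifier rules.

```python
from dataclasses import dataclass

BIG_QUERY_ROW_LIMIT = 1_000_000  # hypothetical "big query" kill threshold


@dataclass
class Query:
    name: str
    time_sensitive: bool  # needs dedicated resources and low latency
    critical: bool        # must finish, but may wait
    scanned_rows: int     # stand-in for a resource-usage estimate


def classify(query):
    """Assign a query to one of the three workload types described above."""
    if query.time_sensitive:
        return "short_query"
    if query.critical:
        return "query_queue"
    return "rest"


def admit(query, row_limit=BIG_QUERY_ROW_LIMIT):
    """Return (workload type, action) for a query under the toy rules."""
    workload = classify(query)
    if workload == "rest" and query.scanned_rows > row_limit:
        return workload, "killed"   # customized rule: kill big queries
    if workload == "query_queue":
        return workload, "queued"   # never killed, but may wait its turn
    return workload, "run"


queries = [
    Query("dashboard_tile", time_sensitive=True, critical=False, scanned_rows=5_000),
    Query("nightly_report", time_sensitive=False, critical=True, scanned_rows=50_000_000),
    Query("adhoc_explore", time_sensitive=False, critical=False, scanned_rows=80_000_000),
    Query("adhoc_small", time_sensitive=False, critical=False, scanned_rows=10_000),
]
for q in queries:
    print(q.name, admit(q))
```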
StarRocks is an open-source
query engine that delivers data
warehouse performance on the
data lake.
Reduce the time and cost of
developing data analytics projects.
● No ingestion and data copying.
● Stop paying to denormalize
data that may never be
queried.
● Keep the tools you’ve been
using.
Sub-second queries while serving
millions of users.
● Multi-dimensional interactive
analytics through on-the-fly
computations.
● Single source of truth on open
data lake analytics, with no
external system for caching or
pre-computation.
Make real-time decisions in all
business scenarios, especially where
up-to-date data is needed, as in
logistics.
● Real-time update without
sacrificing query performance.
● Simpler real-time data pipeline
● Ditch stateful stream
processing jobs
(denormalization &
preaggregation) with efficient
on-the-fly computation.
Data warehouse query performance on the
data lakehouse with no data copying: Natively
integrates with open data lake including Hive,
Hudi, Iceberg, and Delta Lake.
Ditch Denormalization: Perform joins on
multiple tables with millions of rows in seconds.
No need for external data transformation tool:
Perform on-demand pre-computation (like
denormalization) within StarRocks, eliminating
another processing tool in your data pipeline.
Intelligent Query Planning
● Cost-based optimizer generates
optimized query plan.
● Global runtime filter.
Efficient query execution
● In-memory data shuffling enables fast
and scalable JOIN operation.
● C++ SIMD-optimized columnar storage
and vectorized query executions deliver
the industry's fastest query
performance.
High concurrency
● Built-in Materialized View.
● An Intelligent caching system.
● Secondary Indexes.
Ditch stream processing tools for
denormalization: StarRocks’ multi-table
performance allows you to ditch the rigid and
expensive stateful stream processing jobs for
denormalization and pre-aggregation.
Real-time updates: Through primary key table,
support real-time data updates while having no
impact on query performance.
Real-time Query Performance at scale: The
synchronized materialized view (MV) further
accelerates aggregated queries at scale for
real-time analytics.
Benchmark
StarRocks offers 2.2x performance over ClickHouse and 8.9x performance over Apache Druid® in
wide-table scenarios out of the box, using the product-native table format.
Benchmark
StarRocks delivers 5.54x query performance over Trino in multi-table scenarios, using the
Apache Iceberg table format with Parquet files.
Use Case: Tableau
Dashboard at Airbnb
The Airbnb Tableau Dashboard project is designed to serve both
internal and external users by providing interactive dashboards. It
requires a quick response to user queries. However, the query
latency of previous solutions was over 10 minutes, which was not
acceptable. The project was suspended until StarRocks was
adopted.
StarRocks Solution:
● StarRocks connects directly to Tableau and works very well
with it.
● 3 tables (0.5B rows, 6B rows, 100M rows) + 4 joins + 3
distinct counts + JSON functions and regex at the same time,
with a response time of just 3.6s.
● Reduced the query response time from the minutes level to
the sub-second level.
Use Case: Game and
User Behavior Analytics
at Tencent IEG
● Data analysis and user behavior analysis for 400+ games.
● Operation reports need to be real-time.
● Previously used ClickHouse for real-time analysis and Trino for
ad-hoc queries, but wanted to integrate them into one system.
● Uses Iceberg + COS storage and needs better performance.
● Needs elasticity for ad-hoc queries to reduce cost.
StarRocks Solution:
● Uses the StarRocks Primary Key table to solve the update problem.
● Uses compute nodes on Kubernetes for auto-scaling.
● Gets much better performance on ad-hoc queries.
Use Case: Trust
Analytics at Airbnb
To enhance security, Airbnb needs a real-time fraud detection
system (Trust Analytics) to identify various attacks and take
actions ASAP. This system must support Ad-Hoc query and
real-time update.
StarRocks Solution:
● StarRocks hosts real-time updated datasets via Primary
Key.
● Dataset import from Kafka has a sub-minute delay.
● StarRocks provides second-level query latency for
complex joins.
● Alerting can be achieved by just running a SQL query
regularly.
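The "alerting by running a SQL query regularly" pattern can be sketched as follows. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in database; the table `login_events`, its columns, and the failed-login threshold are invented for illustration and are not taken from the Airbnb deployment. In a real setup the same kind of query would be issued against the analytics database by a scheduler at a fixed interval.

```python
import sqlite3

# Hypothetical alert rule: accounts with many failed logins since a given timestamp.
ALERT_SQL = """
SELECT account_id, COUNT(*) AS failed_logins
FROM login_events
WHERE success = 0 AND event_ts >= ?
GROUP BY account_id
HAVING COUNT(*) >= ?
ORDER BY account_id
"""


def check_alerts(conn, since_ts, threshold):
    """Run the alert query once and return the offending (account_id, count) rows."""
    return conn.execute(ALERT_SQL, (since_ts, threshold)).fetchall()


# In-memory stand-in for a table that is continuously updated in real time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login_events (account_id TEXT, event_ts INTEGER, success INTEGER)"
)
events = [("acct-1", 100 + i, 0) for i in range(5)]
events += [("acct-2", 101, 0), ("acct-2", 102, 1)]
conn.executemany("INSERT INTO login_events VALUES (?, ?, ?)", events)
conn.commit()

# A scheduler would call this periodically; here it runs once.
alerts = check_alerts(conn, since_ts=100, threshold=3)
for account_id, failed_logins in alerts:
    print(f"ALERT: {account_id} has {failed_logins} failed logins")
```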
History of StarRocks and CelerData
StarRocks was designed to address the challenges of real-time analytics, including the need to support
high concurrency, low latency, and a wide range of analytical workloads. StarRocks also offers a number
of features that are not available in other real-time analytics databases, such as the ability to query data
directly from data lakes.
2020
Birth of StarRocks
StarRocks is created as a commercialized fork of the
Apache Doris database. Over time, 90% of the
original codebase has been re-written.
2022
CelerData is founded
CelerData is founded as a company to develop and
commercialize StarRocks.
2023
StarRocks moves to Linux Foundation
CelerData contributes StarRocks to the Linux
Foundation and moves to Apache 2.0 license.
2023
CelerData Cloud Launched
CelerData launches its managed cloud service for
StarRocks.
2023
Benchmarks outperform competition
Latest TPC-DS and SSB benchmarks show 2x-9x
faster performance than Trino, ClickHouse, and
Apache Druid.
Thank you.
● Website starrocks.io
● Managed Service cloud.celerdata.com
StarRocks Project

Weitere ähnliche Inhalte

Ähnlich wie Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf

Using Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-PurposeUsing Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-PurposeDATAVERSITY
 
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Codemotion
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...DATAVERSITY
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
 
Hitachi streaming data platform v8
Hitachi streaming data platform v8Hitachi streaming data platform v8
Hitachi streaming data platform v8Navaid Khan
 
Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8Navaid Khan
 
Hitachi Streaming Data Platform
Hitachi Streaming Data PlatformHitachi Streaming Data Platform
Hitachi Streaming Data PlatformNavaid Khan
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...InfluxData
 
Sql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.pptSql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.pptQingsong Yao
 
How Yellowbrick Data Integrates to Existing Environments Webcast
How Yellowbrick Data Integrates to Existing Environments WebcastHow Yellowbrick Data Integrates to Existing Environments Webcast
How Yellowbrick Data Integrates to Existing Environments WebcastYellowbrick Data
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureMarco van der Hart
 
Amplitude wave architecture - Test
Amplitude wave architecture - TestAmplitude wave architecture - Test
Amplitude wave architecture - TestKiran Naiga
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsStreamsets Inc.
 
HyperconvergedFantasyAnalytics
HyperconvergedFantasyAnalyticsHyperconvergedFantasyAnalytics
HyperconvergedFantasyAnalyticsJerry Jermann
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationDenodo
 

Ähnlich wie Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf (20)

Using Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-PurposeUsing Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-Purpose
 
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Hitachi streaming data platform v8
Hitachi streaming data platform v8Hitachi streaming data platform v8
Hitachi streaming data platform v8
 
Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8Hitachi Streaming Data Platform_v8
Hitachi Streaming Data Platform_v8
 
Hitachi Streaming Data Platform
Hitachi Streaming Data PlatformHitachi Streaming Data Platform
Hitachi Streaming Data Platform
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
 
Sql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.pptSql azure cluster dashboard public.ppt
Sql azure cluster dashboard public.ppt
 
How Yellowbrick Data Integrates to Existing Environments Webcast
How Yellowbrick Data Integrates to Existing Environments WebcastHow Yellowbrick Data Integrates to Existing Environments Webcast
How Yellowbrick Data Integrates to Existing Environments Webcast
 
Automated Analytics at Scale
Automated Analytics at ScaleAutomated Analytics at Scale
Automated Analytics at Scale
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochure
 
Amplitude wave architecture - Test
Amplitude wave architecture - TestAmplitude wave architecture - Test
Amplitude wave architecture - Test
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
 
HyperconvergedFantasyAnalytics
HyperconvergedFantasyAnalyticsHyperconvergedFantasyAnalytics
HyperconvergedFantasyAnalytics
 
The new EDW
The new EDWThe new EDW
The new EDW
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 

Kürzlich hochgeladen

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 

Kürzlich hochgeladen (20)

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 

Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf

  • 1. Build User-Facing Analytics Application That Scales Using StarRocks, a Linux Foundation Project Albert Wong, albert.wong@celerdata.com
  • 2. Agenda ● Trends and challenges in today's user-facing analytics ● How StarRocks solves the challenges of user-facing analytics ● Case studies of how StarRocks is being used for user-facing analytics
  • 3. Community User Quote “Real-time analytics is like trying to drink from a fire hose. The data is constantly flowing, and it can be difficult to keep up. But if you can do it, the insights you gain can be invaluable.” Profiting off of near real time knowledge - Battle of Waterloo, Nathan Rothschild and London Stock Exchange.
  • 4. Trends in user-facing analytics: User-facing analytics (UFA) is a rapidly growing field that is transforming the way businesses deliver insights to their users. UFA empowers users to explore and analyze data for themselves, without the need for technical expertise. This can lead to a number of benefits, such as: 1. Improved decision making 2. Increased user engagement 3. Reduced reliance on IT. Key trends in user-facing analytics: Self-service Analytics, Embedded Analytics, Real-Time Analytics, Augmented Analytics, Conversational Analytics.
  • 5. Trends in OLAP databases: Online analytical processing (OLAP) databases are evolving rapidly to meet the demands of modern data analytics. Here are some of the key trends in OLAP databases: 1. Cloud Native 2A. Sub-second vs. Second/Minute Query Response Time 2B. Streaming vs. Batch Data 2C. Mutable vs. Immutable Data 2D. Remote (Object) Storage vs. Local (SSD) Storage 2E. Open Table Format vs. Product Native Storage Format 3. Data Warehouse vs. Data Lake vs. Data Lakehouse
  • 6. Trends in OLAP databases: Open Lakehouse vs. Proprietary / Hybrid Lakehouse. [Diagram comparing the two architectures across the Compute, Table Format, and Storage Format layers; surviving labels: Proprietary / Hybrid, Open, Open Storage.]
  • 7. Challenges in user-facing analytics: User-facing analytics (UFA) is a powerful tool that can help businesses deliver insights to their users, but it is not without its challenges. Here are some of the key challenges with UFA: 1. Ingestion speed and data updates: User-facing analytics requires high-quality data that is available in real time. The logic and mathematics required to handle unbounded data sets differ from those for static data sets. 2. Query response time: User-facing analytics needs to deliver insights with sub-second response times rather than query times of seconds or minutes. 3. Scalability: User-facing analytics solutions need to scale to handle large volumes of data and high concurrency. Performance and cost are always put to the test as a data set grows over time. 4. Cost: User-facing analytics solutions can be expensive to implement and maintain.
  • 8. How StarRocks solves the challenges of user-facing analytics
  • 9. StarRocks is an open-source query engine that delivers data warehouse performance on the data lake. StarRocks enables: ● Direct queries on the data lake with data warehouse performance. ● Sub-second joins and aggregations on billions of rows. ● Serving hundreds of thousands of concurrent end-user requests. ● No need for an external data transformation tool for denormalization or pre-aggregation. ● Compatibility with the standard SQL protocol and the Trino dialect. ● Significantly more queries per second at lower latency on less hardware. ● Increased flexibility and capability for your data lake or analytics system. ● Reduced cost and complexity by eliminating the need for costly data transformation pipelines.
  • 10. Low Latency Queries Solution Sub-second query with thousands of QPS and beyond
  • 11. High Concurrency: Sub-second queries at thousands of QPS and beyond. IO Bound Workload ● Less complex queries such as point queries. ● Very high QPS. ● Typically requires latency in the low tens of milliseconds on massive amounts of data (billions of rows). CPU Bound Workload ● Complex OLAP-style queries with JOINs and AGGs. ● Interactive analytics. ● Reports, dashboards, etc.
  • 12. High Concurrency Solution. IO Bound Workload: Scan the least amount of data possible. 1. Bucketing: Fine-grained control over how the data is distributed in your cluster. ● Avoid data skew. ● Choose a high-cardinality column that often appears in your `WHERE` clauses. 2. Indexing: Use the prefix index (order key), Bitmap index, and Bloom Filter index to minimize disk access. 3. IO-optimized Hardware: ● Configure (multiple) high-performance SSD drives. ● Good network in terms of latency and throughput. CPU Bound Workload: Pre-compute or reuse previous results to take load off your CPUs. 1. Intelligent Caching: ● Final Result Cache: For the folks who keep hitting the same button just for fun. ● Query Cache (Intermediate Result Cache): Reuse partial results even if the queries are not identical. 2. Pre-computation: ● Denormalization through Column Partial Update. ● Pre-aggregation through MVs and Aggregate Table. ● Generated Column (3.1): Materialize a column.
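The bucketing advice on slide 12 (avoid data skew, prefer a high-cardinality column) can be illustrated with a small simulation. This is only a sketch under stated assumptions: CRC32 stands in for whatever hash function the engine actually uses, and the `user_id` and `country` columns, the row counts, and the bucket count are all made up for the example.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for the demo


def bucket_of(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket with a deterministic hash (CRC32, illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_sizes(keys, num_buckets):
    """Count how many rows land in each bucket."""
    counts = Counter(bucket_of(k, num_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


def skew(sizes):
    """Size of the largest bucket relative to the mean bucket size (1.0 = perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical table: 10,000 rows, one distinct user_id per row, only two countries.
rows = [
    {"user_id": f"user-{i}", "country": "US" if i % 10 else "DE"}
    for i in range(10_000)
]

by_user = bucket_sizes([r["user_id"] for r in rows], NUM_BUCKETS)
by_country = bucket_sizes([r["country"] for r in rows], NUM_BUCKETS)

print("bucketed by user_id:", by_user, "skew:", round(skew(by_user), 2))
print("bucketed by country:", by_country, "skew:", round(skew(by_country), 2))
```

Bucketing by the two-valued `country` column can fill at most two buckets, so most of the cluster would sit idle while one node does nearly all the scanning. Bucketing by `user_id` spreads rows across all buckets, and a point query with `user_id` in its `WHERE` clause only needs to read the single bucket that `bucket_of` returns.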
  • 13. Fresh Mutable Data: Mutable data is important!! Existing solutions rely on merge on read, which is a huge compromise on query performance.
  • 14. Fresh Mutable Data Solution: Delete-and-insert mechanism with a Primary Key index. Supports real-time mutable data with no compromise on query performance. Comparison of approaches (pros; cons): ● Delete and insert: simple, efficient for read-heavy workloads; can be inefficient for write-heavy workloads. ● Merge on read: can be efficient for write-heavy workloads; more complex than delete and insert.
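The trade-off on slide 14 can be sketched with two toy in-memory tables. This is a simplified model, not StarRocks' storage engine: the class names, the list-based storage, and the dictionary used as a primary key index are all assumptions made for illustration.

```python
class MergeOnReadTable:
    """Writes only append a new version; every read must reconcile versions per key."""

    def __init__(self):
        self.versions = []  # append-only list of (key, sequence_number, value)
        self.seq = 0

    def upsert(self, key, value):
        self.seq += 1
        self.versions.append((key, self.seq, value))  # cheap write

    def scan(self):
        latest = {}
        for key, seq, value in self.versions:  # read pays for the merge
            if key not in latest or seq > latest[key][0]:
                latest[key] = (seq, value)
        return {key: value for key, (_, value) in latest.items()}


class DeleteAndInsertTable:
    """Writes look up the primary key index and mark the old row deleted."""

    def __init__(self):
        self.rows = []      # list of (key, value)
        self.deleted = []   # parallel delete flags, one per row
        self.pk_index = {}  # key -> position of the live row

    def upsert(self, key, value):
        old_pos = self.pk_index.get(key)
        if old_pos is not None:
            self.deleted[old_pos] = True  # extra work happens at write time
        self.pk_index[key] = len(self.rows)
        self.rows.append((key, value))
        self.deleted.append(False)

    def scan(self):
        # Reads just skip flagged rows; no version comparison is needed.
        return {k: v for (k, v), dead in zip(self.rows, self.deleted) if not dead}


updates = [("order-1", "created"), ("order-2", "created"),
           ("order-1", "shipped"), ("order-1", "delivered")]

mor, dai = MergeOnReadTable(), DeleteAndInsertTable()
for key, value in updates:
    mor.upsert(key, value)
    dai.upsert(key, value)

print(mor.scan())
print(dai.scan())
```

Both tables return the same latest state. The difference is where the cost lands: the merge-on-read table compares versions on every scan, while the delete-and-insert table does the index lookup once per write and leaves scans as a simple filter.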
  • 15. Resource Isolation Why Resource Isolation is Needed: ● Maintaining Business-Critical Operations: Uninterrupted and undisturbed performance for key tasks. ● Quality of Service Assurance: Resource assurance for varying user groups in a multi-tenant environment. Current Approach: Physical Isolation: ● Pros: ○ Effective isolation of resources. ● Cons: ○ Low Hardware Utilization: Results in underutilized resources due to lack of sharing or overprovisioning. ○ High Costs: Need to meet peak demand of each user group leads to idle resources and escalated costs, especially in large-scale operations. ○ Gets worse when you grow to tens of thousands of tenants.
  • 16. Resource Isolation Solution Resource Group: Support over-provisioning CPU resources. Three types of workloads defined: ● Short query: Time sensitive queries with dedicated resource (no over-provisioning). ● Query queue: Queries that are not time sensitive but are critical (cannot be killed). ● The rest: supports customized rules to kill big queries.
  • 17. StarRocks is an open-source query engine that delivers data warehouse performance on the data lake.
  • 18. Reduce the time and cost of developing data analytics projects. ● No ingestion and data copying. ● Stop paying to denormalize data that may never be queried. ● Keep the tools you’ve been using. Sub-second queries while serving millions of users. ● Multi-dimensional interactive analytics through on-the-fly computations. ● Single source of truth on open data lake analytics, with no external system for caching or pre-computation. Make real-time decisions in all business scenarios, especially when up-to-date data is needed, as in logistics. ● Real-time updates without sacrificing query performance. ● Simpler real-time data pipeline. ● Ditch stateful stream processing jobs (denormalization and pre-aggregation) with efficient on-the-fly computation.
  • 19. Data warehouse query performance on the data lakehouse with no data copying: Natively integrates with open data lakes, including Hive, Hudi, Iceberg, and Delta Lake. Ditch denormalization: Perform joins on multiple tables with millions of rows in seconds. No need for an external data transformation tool: Perform on-demand pre-computation (like denormalization) within StarRocks, eliminating another processing tool in your data pipeline. Intelligent query planning ● Cost-based optimizer generates an optimized query plan. ● Global runtime filter. Efficient query execution ● In-memory data shuffling enables fast and scalable JOIN operations. ● C++ SIMD-optimized columnar storage and vectorized query execution deliver the industry's fastest query performance. High concurrency ● Built-in materialized views. ● An intelligent caching system. ● Secondary indexes. Ditch stream processing tools for denormalization: StarRocks’ multi-table performance allows you to ditch the rigid and expensive stateful stream processing jobs for denormalization and pre-aggregation. Real-time updates: Primary key tables support real-time data updates with no impact on query performance. Real-time query performance at scale: The synchronized materialized view (MV) further accelerates aggregated queries at scale for real-time analytics.
  • 20. Benchmark StarRocks Offers 2.2x Performance over ClickHouse and 8.9x Performance over Apache Druid® in Wide-table Scenarios Out of the Box using product native table format.
  • 21. Benchmark StarRocks Delivers 5.54x Query Performance over Trino in Multi-table Scenarios using Apache Iceberg table format with Parquet files.
  • 22. Use Case: Tableau Dashboard at Airbnb. The Airbnb Tableau Dashboard project is designed to serve both internal and external users by providing interactive dashboards. It requires a quick response to user queries. However, the query latency of previous solutions was over 10 minutes, which is not acceptable. The project was suspended until StarRocks was adopted. StarRocks Solution: ● StarRocks connects directly to Tableau and works very well with it. ● 3 tables (0.5B rows, 6B rows, 100M rows) + 4 joins + 3 distinct counts + JSON functions and regex at the same time, with a response time of just 3.6s. ● Reduced the query response time from the minutes level to the sub-second level.
  • 23. Use Case: Game and User Behavior Analytics at Tencent IEG ● Data analysis and user behavior analysis for 400+ games. ● Operation reports need to be real-time. ● Previously used ClickHouse for real-time analysis and Trino for ad-hoc queries, but wanted to consolidate them. ● Uses Iceberg + COS storage and needs better performance. ● Needs elasticity for ad-hoc queries to reduce cost. StarRocks Solution: ● Uses StarRocks Primary Key tables to solve the update problem. ● Uses compute nodes on Kubernetes (k8s) for auto-scaling. ● Much better performance for ad-hoc queries.
  • 24. Use Case: Trust Analytics at Airbnb To enhance security, Airbnb needs a real-time fraud detection system (Trust Analytics) to identify various attacks and take actions ASAP. This system must support Ad-Hoc query and real-time update. StarRocks Solution: ● StarRocks hosts real-time updated datasets via Primary Key. ● Dataset import from Kafka has a sub-minute delay. ● StarRocks provides second-level query latency for complex joins. ● Alerting can be achieved by just running a SQL query regularly.
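The last point on slide 24, alerting by running a SQL query on a schedule, could look roughly like the sketch below. Everything specific here is an assumption for illustration: an in-memory SQLite database stands in for the real cluster so the example runs on its own, and the `login_events` table, its columns, and the threshold are hypothetical rather than taken from Airbnb's system.

```python
import sqlite3

# Hypothetical alert rule: accounts with many failed logins since a given timestamp.
ALERT_SQL = """
SELECT account_id, COUNT(*) AS failed_logins
FROM login_events
WHERE success = 0 AND event_ts >= ?
GROUP BY account_id
HAVING COUNT(*) >= ?
ORDER BY failed_logins DESC
"""


def find_suspicious_accounts(conn, since_ts, threshold):
    """Run the alert query once and return (account_id, failed_logins) rows."""
    return conn.execute(ALERT_SQL, (since_ts, threshold)).fetchall()


# SQLite stands in for the analytics database in this self-contained demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login_events (account_id TEXT, event_ts INTEGER, success INTEGER)"
)
events = (
    [("acct-1", 100 + i, 0) for i in range(5)]    # burst of recent failures
    + [("acct-2", 101, 0), ("acct-2", 102, 1)]    # one failure, then success
    + [("acct-3", 10 + i, 0) for i in range(5)]   # old failures, outside the window
)
conn.executemany("INSERT INTO login_events VALUES (?, ?, ?)", events)

alerts = find_suspicious_accounts(conn, since_ts=100, threshold=3)
for account_id, failed_logins in alerts:
    print(f"ALERT: {account_id} had {failed_logins} failed logins in the window")
```

In practice a scheduler such as cron would call `find_suspicious_accounts` at a fixed interval. Because the dataset is updated in real time, each run sees fresh rows without a separate stream processing job.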
  • 25. History of StarRocks and CelerData. StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, and a wide range of analytical workloads. StarRocks also offers a number of features that are not available in other real-time analytics databases, such as the ability to query data directly from data lakes. ● 2020, Birth of StarRocks: StarRocks is created as a commercialized fork of the Apache Doris database. Over time, 90% of the original codebase has been rewritten. ● 2022, CelerData is founded: CelerData is founded as a company to develop and commercialize StarRocks. ● 2023, StarRocks moves to the Linux Foundation: CelerData contributes StarRocks to the Linux Foundation and moves to the Apache 2.0 license. ● 2023, CelerData Cloud launched: CelerData launches its managed cloud service for StarRocks. ● 2023, Benchmarks outperform competition: The latest TPC-DS and SSB benchmarks show 2x-9x faster performance than Trino, ClickHouse, and Apache Druid.
  • 26. Thank you. ● Website: starrocks.io ● Managed Service: cloud.celerdata.com ● StarRocks Project