The Rise of Operational Analytics Data Store


Rommel Garcia

Apache Druid basics and history. Where it fits. What it's best for. How it differs from similar tech.

Technology

Apache Druid
The rise of operational analytic data stores
Rommel Garcia
rommel.garcia@imply.io

Who am I?
Rommel Garcia
Director, Solutions Engineering at
Author: Virtualizing Hadoop
11 years working on distributed, scalable systems

Agenda
● How Druid was born
● Druid Architecture
● Use Cases
● Data Modeling
● The Future

The Former Team @ Metamarkets
Vadim Ogievetsky Gian Merlino Fangjin Yang

Their Challenges
● Scale: when data is large, we need a lot of servers
● Speed: aiming for sub-second response time
● Complexity: data too fine-grained to precompute
● High dimensionality: 10s or 100s of dimensions
● Concurrency: many users and tenants
● Freshness: load from streams
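The complexity and high-dimensionality points can be made concrete with a little arithmetic. This is a minimal sketch, assuming a classic data-cube approach in which every subset of dimensions gets its own precomputed aggregate; that framing and the function name are illustrative assumptions, not something the slides spell out.

```python
def grouping_count(num_dimensions: int) -> int:
    """Number of distinct dimension subsets a full data cube would precompute."""
    return 2 ** num_dimensions


# "10s or 100s of dimensions" quickly becomes unmanageable.
for d in (10, 20, 100):
    print(f"{d} dimensions -> {grouping_count(d):,} possible groupings")
```

Even before counting the distinct values inside each dimension, the number of groupings doubles with every added dimension, which is why precomputing everything stops being practical.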

What They Tried
● Relational
● NoSQL
● Hadoop

Key features
● Low latency ingestion from Kafka
● Bulk load from Hadoop
● Can pre-aggregate data during ingestion
● “Schema light”
● Ad-hoc queries
● Exact and approximate algorithms
● Can keep a lot of history (years are ok)
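Pre-aggregation during ingestion can be illustrated in a few lines. The following is a rough sketch of the idea only, not Druid's implementation: it truncates each event's timestamp to the hour and sums a metric for every unique combination of hour and dimension values. The event fields (`country`, `page`, `bytes`) and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def rollup_hourly(events, dimensions, metric):
    """Collapse raw events into one row per (hour, dimension values)."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [event count, metric sum]
    for event in events:
        hour = datetime.fromisoformat(event["timestamp"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (hour.isoformat(), *(event[d] for d in dimensions))
        buckets[key][0] += 1
        buckets[key][1] += event[metric]
    return [
        {"hour": key[0], **dict(zip(dimensions, key[1:])),
         "count": count, f"sum_{metric}": total}
        for key, (count, total) in sorted(buckets.items())
    ]


# Hypothetical raw events, as they might arrive from a stream.
raw_events = [
    {"timestamp": "2019-05-01T10:05:00", "country": "US", "page": "/home", "bytes": 120},
    {"timestamp": "2019-05-01T10:40:00", "country": "US", "page": "/home", "bytes": 80},
    {"timestamp": "2019-05-01T10:59:00", "country": "DE", "page": "/home", "bytes": 50},
    {"timestamp": "2019-05-01T11:01:00", "country": "US", "page": "/home", "bytes": 30},
]

rolled = rollup_hourly(raw_events, dimensions=["country", "page"], metric="bytes")
for row in rolled:
    print(row)
```

Four raw events become three stored rows here. The trade-off is that anything finer than the chosen granularity, such as the individual events within an hour, can no longer be queried.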

Druid: Decoupled, Distributed, Redundant

Druid: Ingest/Query Mechanism
[Diagram components: Files, Streams, Indexers (3), Segments, Historicals (3), Brokers (2), Queries]

Segments
● Fundamental storage unit in Druid
● Immutable once created
● No contention between reads and writes
● One thread scans one segment
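The segment properties above suggest a simple execution model, sketched here under stated assumptions: each segment is a frozen (immutable) object, each scan runs as a single task on one worker thread, and the partial results are merged afterwards. The `Segment` class, its fields, and the `query` function are hypothetical stand-ins, not Druid classes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a segment; frozen, so it cannot be mutated."""
    interval: str
    rows: Tuple[Tuple[str, int], ...]  # (country, clicks) pairs


def scan(segment: Segment) -> dict:
    """Scan one segment and return partial per-country click totals."""
    partial = {}
    for country, clicks in segment.rows:
        partial[country] = partial.get(country, 0) + clicks
    return partial


def query(segments) -> dict:
    """Run one scan task per segment on a thread pool, then merge the partials."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        for partial in pool.map(scan, segments):
            for country, clicks in partial.items():
                merged[country] = merged.get(country, 0) + clicks
    return merged


segments = [
    Segment("2019-05-01/2019-05-02", (("US", 5), ("DE", 1))),
    Segment("2019-05-02/2019-05-03", (("US", 7), ("DE", 3))),
]
result = query(segments)
print(result)
```

Because the segments never change after creation, the scans need no locks, which is the sense in which reads and writes do not contend.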

Powered by Druid
Source: http://druid.io/druid-powered.html

What is Druid?
● “high performance”: low query latency, high ingest rates
● “analytics”: counting, ranking, groupBy, time trend
● “data store”: the cluster stores a copy of your data
● “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
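As an illustration of the "analytics" bullet (counting, ranking, groupBy, time trend), here is a sketch of a single Druid SQL query that does all four, wrapped in a request for Druid's SQL HTTP endpoint (`/druid/v2/sql`). The host and port, the `clickstream` datasource, and the `page` column are assumptions made for the example. The request is only built, not sent.

```python
import json
from urllib.request import Request

# Assumption: a Druid Router or Broker is reachable at this address.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Assumption: a datasource named "clickstream" with a "page" dimension.
TOP_PAGES_PER_HOUR = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "hour",
  page,
  COUNT(*) AS views
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY views DESC
LIMIT 10
"""


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> Request:
    """Wrap a SQL string in the JSON body the SQL endpoint expects."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(TOP_PAGES_PER_HOUR)
print(request.get_method(), request.full_url)
# Against a live cluster, urllib.request.urlopen(request) would send it.
```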

New class of data store
● Column oriented
● High concurrency
● Scalable to 100s of servers, millions of messages/sec
● Partition key for query pruning
● May or may not have secondary indexes
● Query through SQL
● Rapid queries on denormalized data
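The "partition key for query pruning" point can be sketched as follows, assuming time is the partition key and each partition covers a half-open time interval. The `SegmentMeta` type and the daily partitions are hypothetical. The idea is that a query only touches partitions whose interval overlaps its time filter.

```python
from datetime import datetime
from typing import NamedTuple


class SegmentMeta(NamedTuple):
    """Hypothetical metadata for one time partition."""
    segment_id: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def prune(segments, query_start, query_end):
    """Keep only partitions whose interval overlaps [query_start, query_end)."""
    return [s for s in segments if s.start < query_end and query_start < s.end]


# Four daily partitions covering 2019-05-01 through 2019-05-04.
daily = [
    SegmentMeta(f"day-{day}", datetime(2019, 5, day), datetime(2019, 5, day + 1))
    for day in range(1, 5)
]

to_scan = prune(daily, datetime(2019, 5, 2, 12), datetime(2019, 5, 3, 6))
print([s.segment_id for s in to_scan])
```

Only two of the four partitions need to be scanned for this 18-hour query window; the rest are skipped without reading any data.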

New class of data store
● “Operational analytics” or “big OLAP” data stores
● Examples
○ Apache Druid [incubating] (open source community)
○ Scuba (from Facebook)
○ Pinot (from LinkedIn)
○ Doris, formerly Palo (from Baidu)
○ ClickHouse (from Yandex)

Use cases
● Clickstreams, user behavior
● Application performance management
● Network flows
● IoT
● Digital marketing
● OLAP / business intelligence

Optimized For A Reason
● Denormalized
● To roll up or not to roll up
● Query-time vs. ingest-time aggregation
● No joins
● Lookups for slowly changing dimensions
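The lookup idea can be sketched like this, assuming a lookup behaves as a key-to-value map applied at query time: the stored fact rows keep only a stable key, and the slowly changing attribute is resolved when the query runs. The column names, the sample data, and the `apply_lookup` helper are hypothetical.

```python
def apply_lookup(rows, column, lookup, output_column, default="unknown"):
    """Return copies of rows with output_column resolved through the lookup map."""
    return [
        {**row, output_column: lookup.get(row[column], default)}
        for row in rows
    ]


# Stored, denormalized fact rows keep only the stable key.
stored_rows = [
    {"store_id": "s1", "sales": 10},
    {"store_id": "s2", "sales": 5},
]

# The dimension changes slowly: store s2 is reassigned to another region.
regions_before = {"s1": "West", "s2": "East"}
regions_after = {"s1": "West", "s2": "Central"}

before = apply_lookup(stored_rows, "store_id", regions_before, "region")
after = apply_lookup(stored_rows, "store_id", regions_after, "region")
print(before)
print(after)
```

Only the small map changes between the two queries; the stored rows are never rewritten, which is what makes this a substitute for a join on a slowly changing dimension.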

Druid roadmap
● Parallel loading of data files without Hadoop
● Automatic compaction
● Smaller, faster compression (FastPFOR, etc.)
● Subtotals, SQL “grouping sets”
● SQL standard null handling
● Vectorized query engine
● Garbage-free expression engine
● More to come!!

Download
Druid community site (current): http://druid.io/
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started

Contribute
https://github.com/apache/incubator-druid

Stay in touch
@druidio
http://druid.io/community
