How to design a database that can ingest
more than four million events per second
Javier Ramirez
Head of Developer Relations
@supercoco9
Some things I will talk about
● Accept you are not PostgreSQL. You are not for everyone and cannot do everything
● Make the right assumptions
● Take advantage of modern hardware and operating systems
● Obsess about storage
● Reduce/control your dependencies
● Measure-implement-repeat continuously to improve performance
We would like to be known for:
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
Quick demo/overview
https://demo.questdb.io/
But I don’t need 4 million rows per second
Good, because you
probably aren’t getting
them.
Try out query performance on open datasets
https://demo.questdb.io/
All benchmarks are lies (but they give us a ballpark)
Ingesting over 1.4 million rows per second (using 5 CPU threads)
https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-benchmark/
While running queries scanning over 4 billion rows per second (16 CPU threads)
https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale/
Time-series specialised benchmark
https://github.com/timescale/tsbs
https://github.com/questdb/questdb/releases/tag/7.0.1 (Feb 2023)
If you can use only one
database for everything,
go with PostgreSQL
Not all (big) (or fast)
data problems are the
same
Do you have a time-series problem? Write patterns
● You mostly insert data. You rarely update or delete individual rows
● It is likely you write data more frequently than you read data
● Since data keeps growing, you will very likely end up with much bigger
data than your typical operational database would be happy with
● Your data origin might experience bursts or lag, but keeping the correct
order of events is important to you
● Both ingestion and querying speed are critical for your business
Do you have a time-series problem? Read patterns
● Most of your queries are scoped to a time range
● You typically access recent/fresh data rather than older data
● But you still want to keep older data around for occasional analytics
● You often need to resample your data for aggregations/analytics
● You often need to align timestamps from multiple data series (see the sketch after this list)
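Both resampling and alignment are first-class in QuestDB SQL: SAMPLE BY resamples a series into fixed time buckets, and ASOF JOIN aligns two series on time. A minimal sketch of an hourly resample, assuming a hypothetical trades table with a designated timestamp column ts and a price column, QuestDB's default pg-wire endpoint (port 8812, database qdb, user admin, password quest), and the PostgreSQL JDBC driver on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResampleSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement stmt = conn.createStatement();
             // Resample raw trades into hourly averages for the last day
             ResultSet rs = stmt.executeQuery(
                     "SELECT ts, avg(price) FROM trades " +
                     "WHERE ts > dateadd('d', -1, now()) " +
                     "SAMPLE BY 1h")) {
            while (rs.next()) {
                System.out.println(rs.getTimestamp(1) + " avg=" + rs.getDouble(2));
            }
        }
    }
}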
We can make many
assumptions about the shape
of the data and usage patterns
Some assumptions when reading data

Data will most often be queried in a continuous range, and recent data will be preferred =>
Store data physically sorted by “designated timestamp” on disk (and deal with out-of-order data)
Store data in partitions, so we can skip a lot of data quickly
Aggressive use of prefetching by the file system

Most queries are not a SELECT *, but aggregations on timestamp + a few columns =>
Columnar storage model: open only the files for the columns the query needs

Most rows will have some sort of non-unique ID (string or numeric) to scope on =>
Special SYMBOL type: looks like a String, behaves like a Number. Faster and smaller

(The DDL sketch below shows how these choices look in a table definition.)
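A minimal DDL sketch under the same hypothetical trades table and default pg-wire connection as before: TIMESTAMP(ts) declares the designated timestamp the table is physically sorted by, PARTITION BY DAY gives the engine partitions to skip, and SYMBOL is the interned non-unique ID type.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS trades (" +
                "  ts TIMESTAMP," +
                "  pair SYMBOL," +      // looks like a string, stored as an int
                "  price DOUBLE" +
                ") TIMESTAMP(ts)" +     // designated timestamp: physical sort order
                " PARTITION BY DAY");   // one set of column files per day
        }
    }
}

Each column lands in its own file inside the partition directory, which is what makes the “open only what you need” columnar model possible.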
Some assumptions when writing data

Data will arrive fast and continuously =>
Keep (configurable) buffers to reduce write operations

Slower reads should not slow down writes =>
Shared CPU/thread pool model, with a separate ingestion thread by default and the option to dedicate threads to parsing or other tasks

Stale data is useful, but longer latencies are fine =>
Allow mounting old partitions on slower/cheaper drives

Old data needs to be removed eventually =>
Allow unmounting/deleting partitions (archiving into object storage is on the roadmap); see the retention sketch below
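Partition housekeeping is then plain SQL. A sketch under the same assumptions as the earlier DDL (the six-month retention window is arbitrary; recent versions also offer ALTER TABLE … DETACH PARTITION for moving data aside rather than deleting it):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RetentionJobSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement stmt = conn.createStatement()) {
            // Dropping a partition removes whole directories of column files,
            // instead of rewriting the table row by row.
            stmt.execute(
                "ALTER TABLE trades DROP PARTITION " +
                "WHERE ts < dateadd('M', -6, now())");
        }
    }
}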
Some assumptions when connecting

Queries should allow for reasonably complex filters and aggregations =>
Implement SQL, with PostgreSQL wire protocol (pg-wire) compatibility

Writes should be fast, and some users might already be using another TSDB =>
Implement the InfluxDB Line Protocol (ILP) for speed and compatibility, and provide client libraries, as ILP is less widely known (see the sketch below)

Many integrations might come from IoT or simple devices driven by bash scripting =>
Implement an HTTP endpoint for querying, importing, and exporting data

Operations teams will want to read QuestDB metrics, not stored data =>
Implement health and metrics endpoints, with Prometheus compatibility
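A sketch of the ILP write path using QuestDB's Java client (the Sender API; 9009 is the default ILP TCP port, and the table and column names are the same hypothetical ones as before):

import io.questdb.client.Sender;

public class IlpIngestSketch {
    public static void main(String[] args) {
        try (Sender sender = Sender.builder()
                .address("localhost:9009")       // default ILP TCP endpoint
                .build()) {
            sender.table("trades")
                  .symbol("pair", "BTC-USD")     // SYMBOL column
                  .doubleColumn("price", 23450.5)
                  .atNow();                      // let the server assign the timestamp
        }
    }
}

Rows are buffered client-side and flushed in batches, which is a large part of why ILP ingestion is so much faster than INSERT over pg-wire.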
Say no to nice-to-have features that would degrade performance.
But also say yes, when it makes sense.
Technical decisions and trade-offs
Java memory management

Native “unsafe” memory, shared across languages and the OS: Java, C/C++, Rust*

Mmap: https://db.cs.cmu.edu/mmap-cidr2022/
(a minimal mmap sketch follows)

* https://github.com/questdb/rust-maven-plugin
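Memory-mapped files are central here: reads and writes to a mapped column file go through the OS page cache, which also does the prefetching mentioned earlier. A minimal illustration using the standard FileChannel API (price.d is a hypothetical column file; QuestDB's own layer calls mmap natively):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("price.d", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map 8 KB of the file; the OS pages data in and out as needed.
            MappedByteBuffer col = ch.map(FileChannel.MapMode.READ_WRITE, 0, 8 * 1024);
            col.putDouble(0, 23450.5);    // no explicit syscall per value written
            System.out.println(col.getDouble(0));
        }
    }
}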
SIMD vectorization and our own JIT compiler

Single Instruction, Multiple Data (SIMD): parallelizes/vectorizes an operation across multiple values per register. QuestDB currently supports it only on Intel and AMD (x86-64) processors.

JIT compiler: compiles SQL statements to native code

EXPLAIN: helps understand execution plans and vectorization
SELECT count(), max(total_amount),
avg(total_amount)
FROM trips
WHERE total_amount > 150 AND passenger_count = 1;
(Trips table has 1.6 billion rows and
24 columns, but we only access 2 of
them)
You can try it live at
https://demo.questdb.io
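That query is exactly the shape of work that columnar storage, SIMD, and the JIT reward: a tight, branch-light loop over primitive arrays that touches only the referenced columns. The following is not QuestDB code, just an illustrative sketch of the access pattern on synthetic data:

import java.util.Random;

public class VectorizableScanSketch {
    public static void main(String[] args) {
        int n = 1_000_000;
        double[] totalAmount = new double[n];   // one primitive array per column
        int[] passengerCount = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            totalAmount[i] = rnd.nextDouble() * 200;
            passengerCount[i] = rnd.nextInt(5) + 1;
        }

        long count = 0;
        double sum = 0;
        double max = Double.NEGATIVE_INFINITY;
        // One pass over the two columns the query needs; the other 22 are never read.
        for (int i = 0; i < n; i++) {
            if (totalAmount[i] > 150 && passengerCount[i] == 1) {
                count++;
                sum += totalAmount[i];
                if (totalAmount[i] > max) {
                    max = totalAmount[i];
                }
            }
        }
        System.out.printf("count=%d max=%.2f avg=%.2f%n", count, max, sum / count);
    }
}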
Re-implement the Java standard library

● Java classes work with heap memory; we need off-heap
● Java classes tend to do too many things (they are generic) and a lot of type conversions
● So we re-implemented IO, logging, atomics, collections… using zero GC and native memory
● Zero dependencies (except for testing) in our pom.xml (see the sketch below)
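An illustrative sketch of the spirit of that re-implementation (not QuestDB's actual class): a growable off-heap long list whose appends never allocate on the GC heap. Class and method names are mine; sun.misc.Unsafe is obtained reflectively, which still works on current JDKs but is not a supported API.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class DirectLongList implements AutoCloseable {
    private static final Unsafe UNSAFE = getUnsafe();
    private long address;
    private long capacity; // in elements
    private long size;     // in elements

    public DirectLongList(long initialCapacity) {
        capacity = initialCapacity;
        address = UNSAFE.allocateMemory(capacity * 8); // off-heap, invisible to GC
    }

    public void add(long value) {
        if (size == capacity) {
            capacity *= 2;
            address = UNSAFE.reallocateMemory(address, capacity * 8);
        }
        UNSAFE.putLong(address + size * 8, value);
        size++;
    }

    public long get(long index) {
        return UNSAFE.getLong(address + index * 8);
    }

    @Override
    public void close() {
        UNSAFE.freeMemory(address); // manual lifetime management
        address = 0;
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        try (DirectLongList list = new DirectLongList(4)) {
            for (long i = 0; i < 10; i++) {
                list.add(i * i);
            }
            System.out.println(list.get(9)); // 81
        }
    }
}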
Down to the nanosecond
Benchmark Mode Cnt Score Error Units
LogBenchmark.testLogOneIntBlocking avgt 2 265.391 ns/op
LogBenchmark.testLogOneInt avgt 2 82.985 ns/op
LogBenchmark.testLogOneIntDisabled avgt 2 0.661 ns/op
Log4jBenchmark.testLogOneInt avgt 2 877.266 ns/op
Log4jBenchmark.testLogOneIntDisabled avgt 2 1.368 ns/op
How would *YOU* efficiently sort a multi-GB
unordered CSV file?
Improved batch import (3 million rows/second)*

● The file doesn't fit into memory, so we need to rely on disk IO for sorting
● We designed a multi-pass parallel strategy
● We use the new io_uring Linux IO interface to max out disk-access concurrency (see the sketch below)

Before: a 76 GB heavily unordered CSV file took ~28 minutes to ingest
After: the same file takes 335 seconds to ingest, at about 3 million rows per second (we also changed the disk type)

https://questdb.io/blog/2022/09/12/importing-3m-rows-with-io-uring
* Remember: all benchmarks are lies
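From SQL, I believe the parallel importer is exposed as a COPY statement, with the CSV sitting under the server's configured import root (cairo.sql.copy.root). A sketch with example file, table, and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvImportSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement stmt = conn.createStatement()) {
            // Kicks off the parallel import on the server side.
            stmt.execute(
                "COPY trips FROM 'trips.csv' " +
                "WITH HEADER true TIMESTAMP 'pickup_datetime'");
        }
    }
}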
Some things we are trying out next for performance

● Compression, and exploring data formats like Arrow/Parquet
● Our own ingestion protocol
● Embedding Julia in the database for custom code/UDFs
● Moving some parts to Rust
● Second-level partitioning
● Improved vectorization of some operations (GROUP BY on multiple columns or on expressions)
● Specific join optimizations (index nested-loop joins, for example)
https://github.com/questdb/questdb
https://questdb.io/cloud/
Quick recap
● Accept you are not PostgreSQL. You are not for everyone and cannot do
everything
● Make the right assumptions
● Take advantage of modern hardware and operating systems
● Obsess about storage
● Reduce/control your dependencies
● Measure-implement-repeat continuously to improve performance
● All benchmarks are lies, but if you like them, take a look at
https://questdb.io/blog/tags/engineering/
More info
https://questdb.io
https://demo.questdb.io
https://github.com/javier/questdb-quickstart
We 💕 contributions and ⭐ stars
github.com/questdb/questdb
THANKS!
Javier Ramirez, Head of Developer Relations at QuestDB, @supercoco9