Everything We Learned About In-Memory Data Layout While Building VoltDB
John Hugg
June 20th, 2017
@johnhugg / jhugg@voltdb.com
Who Am I?
• Engineer #1 at VoltDB
• Responsible for many poor decisions and even a few good ones.
• jhugg@voltdb.com
• @johnhugg
• http://chat.voltdb.com
Origins
VoltDB History
The End of an Architectural Era
(It’s Time for a Complete Rewrite)
Michael Stonebraker
Samuel Madden
Daniel J. Abadi
Stavros Harizopoulos
MIT CSAIL
{stonebraker, madden, dna,
stavros}@csail.mit.edu
Nabil Hachem
AvantGarde Consulting, LLC
nhachem@agdba.com
Pat Helland
Microsoft Corporation
phelland@microsoft.com
ABSTRACT
In previous papers [SC05, SBC+07], some of us predicted the end
of “one size fits all” as a commercial relational DBMS paradigm.
These papers presented reasons and experimental evidence that
showed that the major RDBMS vendors can be outperformed by
1-2 orders of magnitude by specialized engines in the data
warehouse, stream processing, text, and scientific database
markets.
Assuming that specialized engines dominate these markets over
time, the current relational DBMS code lines will be left with the
business data processing (OLTP) market and hybrid markets
where more than one kind of capability is required. In this paper
we show that current RDBMSs can be beaten by nearly two
orders of magnitude in the OLTP market as well. The
experimental evidence comes from comparing a new OLTP
prototype, H-Store, which we have built at M.I.T., to a popular
RDBMS on the standard transactional benchmark, TPC-C.
We conclude that the current RDBMS code lines, while
attempting to be a “one size fits all” solution, in fact, excel at
nothing. Hence, they are 25 year old legacy code lines that should
be retired in favor of a collection of “from scratch” specialized
engines. The DBMS vendors (and the research community)
should start with a clean sheet of paper and design systems for
tomorrow’s requirements, not continue to push code lines and
architectures designed for yesterday’s needs.
1. INTRODUCTION
The popular relational DBMSs all trace their roots to System R
from the 1970s. For example, DB2 is a direct descendent of
System R, having used the RDS portion of System R intact in
their first release. Similarly, SQL Server is a direct descendent of
Sybase System 5, which borrowed heavily from System R.
Lastly, the first release of Oracle implemented the user interface …
All three systems were architected more than 25 years ago, when
hardware characteristics were much different than today.
Processors are thousands of times faster and memories are
thousands of times larger. Disk volumes have increased
enormously, making it possible to keep essentially everything, if
one chooses to. However, the bandwidth between disk and main
memory has increased much more slowly. One would expect this
relentless pace of technology to have changed the architecture of
database systems dramatically over the last quarter of a century,
but surprisingly the architecture of most DBMSs is essentially
identical to that of System R.
Moreover, at the time relational DBMSs were conceived, there
was only a single DBMS market, business data processing. In the
last 25 years, a number of other markets have evolved, including
data warehouses, text management, and stream processing. These
markets have very different requirements than business data
processing.
Lastly, the main user interface device at the time RDBMSs were
architected was the dumb terminal, and vendors imagined
operators inputting queries through an interactive terminal
prompt. Now it is a powerful personal computer connected to the
World Wide Web. Web sites that use OLTP DBMSs rarely run
interactive transactions or present users with direct SQL
interfaces.
In summary, the current RDBMSs were architected for the
business data processing market in a time of different user
interfaces and different hardware characteristics. Hence, they all
include the following System R architectural features:
• Disk oriented storage and indexing structures
• Multithreading to hide latency
• Locking-based concurrency control mechanisms
• Log-based recovery
Buffer Pool Management → Use Main Memory
Concurrency → Single Threaded
• But: waiting on users leaves the CPU idle
• But: single-threaded doesn’t jibe with a multicore world
Waiting on Users
• Don’t.
• External transaction control and performance are not friends.
• Use server side transactional logic.
• Move the logic to the data, not the other way around.
• (Server side processing using Java/JVM)
Using *ALL* the Cores
• Partitioning data is a requirement for scale-out.
• Single-threaded is desired for efficiency.
• Why not partition to the core instead of the node?
• Concurrency via scheduling, not shared memory.
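A minimal sketch of what “partition to the core” can look like, with illustrative names (not VoltDB’s actual classes): one single-threaded work queue per partition, and concurrency by routing instead of by locking.

#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Each partition is owned by exactly one single-threaded engine;
// its queue is drained by one thread, in order, with no locks on data.
struct Partition {
    std::deque<std::function<void()>> workQueue;
};

struct Host {
    std::vector<Partition> partitions;  // typically one per core

    explicit Host(size_t cores) : partitions(cores) {}

    // Route a transaction to the partition that owns its key.
    void submit(int64_t partitionKey, std::function<void()> txn) {
        size_t p = static_cast<size_t>(partitionKey) % partitions.size();
        partitions[p].workQueue.push_back(std::move(txn));
    }
};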
Important Point
• VoltDB is not just a traditional RDBMS with some tweaks, sitting on RAM rather than spindles.
• VoltDB is weird and new and exciting, and not very compatible with Hibernate.
• VoltDB sounds like MemSQL, NuoDB, Clustrix, HANA or whomever at first blush, but has a really, really different architecture.
Open Source on GitHub
Commodity Cluster Scale Out / Active-Active HA
Millions of ACID serializable operations per second
Synchronous Disk Persistence
Avg latency under 1 ms; 99.999th percentile under 50 ms.
Multi-TB customers in production.
To make good architecture
choices, you need to decide
what success looks like first!
What kind of workload?
• OLTP vs OLAP? Search/Fulltext? Graph? Machine Learning?
• Ratio of reads to writes?
• CRUD vs scans?
• Bursty? Batch? Regular? None of the above?
• Concurrency needed?
Focus on VoltDB Assumptions
• Mutable data.
• Many parallel shorter operations.
• Lots of read queries supported by indexes, materialized views,
etc…
• Few full table scans.
What’s the goal?
SPEED! WHOOSH! SIZE! SO CUTE!
WE WANT THE FASTEST DATABASE!
YOLO
Need to define the problem a bit…
What’s an Operational Event?
• Something happened, and it’s the database’s job to understand and respond to it.
• Examples:
• User is trying to make a call. Should we let the call through?
• User is swiping a credit card. Should we process the charge?
• User is loading a web page; which ad should we show?
Measurements
• Throughput: events per second
• Latency: time per event
Rathole Warning
How to Measure Throughput & Latency?
• Average measurement
• Min and Max values
• Percentile Measurement (99%, 99.9%, 99.99%… etc)
• What is the latency measurement of the 99.9% operation?
• What percent of operations fail to meet SLA?
• What percent of wall-clock time will hypothetical requests fail SLA?
¯\_(ツ)_/¯
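To make the percentile vocabulary above concrete, here is a toy nearest-rank computation; real systems usually keep a histogram (e.g. HdrHistogram) rather than sorting raw samples.

#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over a non-empty set of latency samples.
double percentile(std::vector<double> latencies, double p) {
    std::sort(latencies.begin(), latencies.end());
    size_t idx = static_cast<size_t>(p / 100.0 * (latencies.size() - 1));
    return latencies[idx];
}
// percentile(samples, 99.9) answers: what latency did the slowest
// 0.1% of operations exceed?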
How VoltDB does Data Layout
Literally how we started:
• “NValue” is a variant type
• typedef std::vector<NValue> Tuple;
• typedef std::vector<Tuple> Table;
• typedef std::vector<NValue> Key;
• typedef std::map<Key, Tuple> Index;
• typedef std::vector<int64_t> FreeList;
4 TPS
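A compilable sketch of that starting point, with std::variant standing in for NValue (an assumption; the real NValue is a hand-rolled variant type):

#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Stand-in for VoltDB's NValue variant type.
using NValue = std::variant<int64_t, double, std::string>;

using Tuple    = std::vector<NValue>;   // every row: a vector of boxed values
using Table    = std::vector<Tuple>;    // every table: a vector of rows
using Key      = std::vector<NValue>;   // every index key: more boxed values
using Index    = std::map<Key, Tuple>;  // index entries copy the whole row!
using FreeList = std::vector<int64_t>;

// Every value is tagged, rows scatter across small allocations, and
// index inserts deep-copy entire tuples; hence 4 TPS.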
1.0
• “NValue” is a variant type
• TuplePtr understands schema and points to memory that contains the tuple.
• Linked list of blocks of memory, each holding a fixed number of tuples.
• Variable-sized columns are stored in the tuple as pointers to heap-allocated memory. (See the sketch below.)
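A sketch of that 1.0 layout under assumed names: fixed-size tuples packed into blocks, blocks chained in a list, and a TuplePtr that interprets raw bytes through the table’s schema.

#include <cstdint>
#include <cstring>
#include <list>
#include <vector>

struct Schema {
    std::vector<size_t> columnOffsets;  // byte offset of each column
    size_t tupleLength = 0;             // fixed size of one packed tuple
};

struct Block {
    static constexpr size_t kTuplesPerBlock = 1024;
    std::vector<char> storage;          // kTuplesPerBlock * tupleLength bytes
    size_t usedTuples = 0;
    explicit Block(const Schema& s) : storage(s.tupleLength * kTuplesPerBlock) {}
};

using BlockList = std::list<Block>;     // linked list of tuple blocks

struct TuplePtr {
    const Schema* schema;               // knows how to read the bytes
    char* data;                         // points into some block's storage
    int64_t bigintAt(size_t col) const {
        int64_t v;
        std::memcpy(&v, data + schema->columnOffsets[col], sizeof(v));
        return v;                       // decode only the requested column
    }
};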
Logical to Physical Schema
CREATE TABLE order
(
user_id BIGINT,
item_code VARCHAR(20 BYTES),
price FLOAT,
qty SMALLINT,
ts TIMESTAMP,
notes VARCHAR(1024)
);
[Diagram: each row packs user_id | item_code | price | qty | ts | notes (ptr); each notes pointer refers to a heap value of up to 1 KB]
Perfectly packed blocks of fixed-size tuples.
Variable-sized stuff is heap-allocated by malloc (glibc).
[Diagram: the block list as a std::vector of blocks, each with free space at the end]
[Diagram: the same block list, plus a free list as a std::vector tracking slots freed by deletes]
Indexes
• std::map is usually a Red-Black tree
• Balanced with O(log n) operations
• Predictable memory allocation
CREATE INDEX usertime ON order (user_id, ts);
[Diagram: an index tree node holding a color bit C, the key (user_id, ts), left and right child pointers (tree node or NIL), a parent pointer (NIL if root), and a value pointer to the tuple in block storage]
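The same node as a struct, with field names assumed from the diagram:

#include <cstdint>

// A red-black tree node whose value is a pointer into block storage.
struct IndexNode {
    bool red;            // the color bit ("C" in the diagram)
    int64_t user_id;     // key, part 1
    int64_t ts;          // key, part 2
    IndexNode* left;     // tree node or NIL
    IndexNode* right;    // tree node or NIL
    IndexNode* parent;   // NIL if root
    char* tuple;         // value pointer: the indexed tuple in its block
};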
Non-Unique Indexes
• Users will put a secondary index on low-cardinality things like “country”.
• If we want to delete a record, we also have to delete its index entry.
• But “country” might have millions of entries per key, and we only want to delete the one that points to the deleted record. This is stop-the-world pain for deleting a single record!
• Worse, even high-cardinality columns can have a sneaky repeated value like NULL.
• Lesson learned: make all index keys unique, even if you have to jam a pointer value in there. (See the sketch below.)
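A sketch of that trick, with assumed names: widen the key with the tuple’s address so every entry is unique, and deleting one record’s entry becomes a single O(log n) erase instead of a scan over millions of duplicates.

#include <map>
#include <string>
#include <utility>

using TupleAddr    = const void*;
using CountryKey   = std::pair<std::string, TupleAddr>;  // (country, tuple addr)
using CountryIndex = std::map<CountryKey, TupleAddr>;

// Deleting tuple t under "US" lands exactly on the entry for t,
// instead of scanning every "US" entry to find the right one.
void eraseEntry(CountryIndex& index, const std::string& country, TupleAddr t) {
    index.erase({country, t});
}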
Snapshots & Durability Logs
MVCC where M = 2
[Diagram: copy-on-write snapshots. A COW pointer sweeps through the tuple blocks; tuples the snapshot has already passed are changed in place, while changing a tuple the snapshot has not reached yet first moves the old version to the COW tuple area, so the snapshot still sees it]
A Compaction Story
Customer Complains of Memory Use
• Allocated Memory Constant
• RSS increases over time
• Deleting doesn’t reduce RSS
[Chart: RSS memory use climbing over time while allocated memory stays constant]
Malloc has a hard job…
• Start with some memory: 48 bytes free.
• malloc three times (still 5 bytes free): 17 bytes | 15 bytes | 11 bytes | 5 bytes free.
• Free the middle allocation: 17 bytes | 15 bytes free | 11 bytes | 5 bytes free.
• Now allocate 18 bytes? 8 bytes?
• Malloc can’t move the 11 bytes to make room, because malloc doesn’t know what points there.
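A toy program that reproduces the picture above: free half the bytes, yet a larger request still can’t reuse them, because the holes are pinned apart by live allocations the allocator is not allowed to move.

#include <cstdlib>
#include <vector>

int main() {
    std::vector<void*> small;
    for (int i = 0; i < 100000; ++i)
        small.push_back(std::malloc(64));      // many small allocations
    for (size_t i = 0; i < small.size(); i += 2)
        std::free(small[i]);                   // free every other one
    // ~50% of those bytes are now free, but no contiguous hole is
    // bigger than 64 bytes, so this request needs fresh memory:
    void* big = std::malloc(1 << 20);
    std::free(big);
    for (size_t i = 1; i < small.size(); i += 2)
        std::free(small[i]);
    return 0;
}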
Some malloc better than others
• GLibC malloc on Linux is not great.*
• Use jemalloc, tcmalloc, etc…
• Default OSX malloc is much smarter than GLibC.
• We built VoltDB to run on GLibC malloc because we are masochists? Because we hate dependencies? Because some of the others are tricky to use with JNI/JVM?
*unless it got great recently
GLibC Memory Tip
• If you mmap chunks of memory over 2 MB, GLibC doesn’t try to be smart.
• Allocation increases RSS by the mmap size.
• Free decreases RSS by the mmap size.
• This is awesome!
• Only catch: a process is sometimes capped at ~64k mmap allocations, after which this breaks down.
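A sketch of leaning on that behavior directly, using plain POSIX mmap (nothing VoltDB-specific): grab and release 2 MB slabs yourself, so RSS tracks what you actually hold.

#include <sys/mman.h>
#include <cstddef>

constexpr std::size_t kSlabSize = 2 * 1024 * 1024;

void* allocSlab() {
    void* p = mmap(nullptr, kSlabSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;  // RSS grows as pages are touched
}

void freeSlab(void* p) {
    munmap(p, kSlabSize);                  // RSS drops immediately
}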
Need Defrag/Compaction
• Tuple Storage
• String Storage
• Indexes
[Diagrams repeated from earlier: the tuple block list with free-space holes and its free list, the heap-allocated notes values, and the index tree nodes. All three kinds of storage need defrag]
Options:
• On-Demand Defrag: when the free list becomes 20% of the table, move tuples (and update any indexes that point to them) to use fewer blocks.
• Active Defrag: whenever a tuple is deleted that isn’t the last tuple in the physical storage of the table, move the actual last tuple into the spot the deleted tuple was using and update any indexes that point to it. (Sketch below.)
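A sketch of the active option under assumed names; updateIndexes() stands in for repointing every index entry that referenced the moved tuple.

#include <cstring>
#include <vector>

struct PackedTable {
    std::vector<char> storage;  // fixed-size tuples, no holes
    size_t tupleLength = 0;
    size_t tupleCount = 0;

    char* tupleAt(size_t i) { return storage.data() + i * tupleLength; }
    void updateIndexes(char* from, char* to);  // assumed elsewhere

    void deleteTuple(size_t i) {
        size_t last = tupleCount - 1;
        if (i != last) {
            // Fill the hole with the physically-last tuple...
            std::memcpy(tupleAt(i), tupleAt(last), tupleLength);
            // ...and repoint any index entries at its new home.
            updateIndexes(tupleAt(last), tupleAt(i));
        }
        --tupleCount;  // storage stays perfectly packed; no free list
    }
};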
Lessons?
• For on-demand defrag, many workloads never hit defrag at all.
• For active defrag, you pay a cost on delete every time.
No: Active Wins Definitively
• Deletes are less expensive than you would expect in practice, somewhat offset by not maintaining free lists at all.
• When you do need on-demand defrag, you have increased processing costs that need to be absorbed. These costs are higher than the active option’s, and they hit peak throughput and peak latency.
• Customers might hit compaction rarely, like after they’ve been in production for a while. Surprises are bad.
• Code that executes only sometimes is WAY harder to test than code that executes all the time. Surprises are bad.
Changes to VoltDB
• Adding compacting (active-defrag) pools that replace heap-based string and blob storage. Pools allocate in ≥2 MB mmap slabs.
• Rewrote std::map indexes as our own Red-Black Tree that allocates in ≥2 MB mmap slabs and actively fills holes. We added more customizations, like rank support.
• Table blocks are now mmapped, but use periodic defrag. We’ve made this better over the years, but eventually we will switch to active, leave-no-holes defrag.
Results
• VoltDB often uses half as much memory for the same data as other systems we’ve compared ourselves to:
• Primarily defrag/compaction and smart layout.
• Also less overhead for concurrency and for MVCC.
• Tight layout helps performance too! Memory locality is good.
Don’t be afraid to be boring and predictable!
Other Tricks & Topics
Lazy Deserialization
• When moving tuples, we tightly pack rows together in an append-only immutable structure.
• We combine that with lazy deserialization: we don’t deserialize a column value from bytes until you ask for the value.
• This works really well for moving data around. (Sketch below.)
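A sketch of the combination, with assumed names: rows are appended once as raw bytes, and a column is decoded only when somebody asks for it.

#include <cstdint>
#include <cstring>
#include <vector>

class PackedRows {
    std::vector<char> buf_;   // append-only, tightly packed, immutable rows
    size_t rowLength_;
public:
    explicit PackedRows(size_t rowLength) : rowLength_(rowLength) {}

    void append(const char* row) {               // serialize once, on write
        buf_.insert(buf_.end(), row, row + rowLength_);
    }
    size_t rowCount() const { return buf_.size() / rowLength_; }

    // Lazy: only this row's, this column's bytes are ever decoded.
    int64_t bigintAt(size_t row, size_t columnOffset) const {
        int64_t v;
        std::memcpy(&v, buf_.data() + row * rowLength_ + columnOffset, sizeof(v));
        return v;
    }
};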
The Block Whisperer
• Transient: VoltTable code takes blocks and lets you read schema, iterate rows, and read values.
• Persistent: TuplePtr is a similar “Block Whisperer” for interpreting big blocks of memory filled with tuples.
Compression: Block vs. Value
Block Compression
• The more data you have, the better the compression ratio you get. You don’t want to compress one smallish row using LZW or similar.
• The problem for in-memory storage is that the increased speed makes it hard to hide block compression. The best approach might be to compress blocks that haven’t been touched in a while…
• But this kind of scheme gets complex quickly, and has similar problems to defrag. Sometimes being too smart isn’t worth being less boring.
Value Compression
• Column stores are fantastic at this, but row stores less so.
• Variable-integer compression: now rows are no longer all the same size => complexity. (Sketch below.)
• String dictionaries.
• Blob compression.
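For illustration, a LEB128-style variable-integer codec: small values shrink to one byte at the price of variable row sizes. This is the general technique, not necessarily VoltDB’s encoding.

#include <cstdint>
#include <vector>

// Emit v as 7-bit groups, low bits first; the high bit of each byte
// says "more bytes follow".
void encodeVarint(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Read groups until a byte with a clear high bit; advances p past them.
uint64_t decodeVarint(const uint8_t*& p) {
    uint64_t v = 0;
    int shift = 0;
    while (*p & 0x80) {
        v |= static_cast<uint64_t>(*p++ & 0x7F) << shift;
        shift += 7;
    }
    v |= static_cast<uint64_t>(*p++) << shift;
    return v;
}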
Experiments!
• This may be grossly unsatisfying, but I want you to know we thought about this stuff.
• Most of what I’m describing happened in the past, so things might be different if we tried again with newer hardware, approaches, or tools.
• This is only a subset of our experiments, due to time pressure and business reasons.
NUMA
• We tried using system-level NUMA tools and OS APIs to make a VoltDB process faster.
• We failed, but that doesn’t mean we can’t do better.
• Running two VoltDB processes on two NUMA nodes typically runs faster than one process spread across nodes.
• What now? Better support for multiple processes? Architecture changes for NUMA?
Indexes
• In-memory indexes are a fun research topic, mostly because they are easy to microbenchmark.
Micro-Benchmarks are Fantasy
• Mutable data means holes; deal with them somehow!
• Micro-benchmarks are useful, but it’s easy for them to drift from production workloads and concerns.
In-Memory Index
• In-memory B-Trees do well on micro-benchmarks, but less well in VoltDB. What’s up with that?
• Some benchmarks use 4-byte keys. So if you make your B-Tree page one or more cache lines wide, you can fit a lot of 4-byte keys in a page.
• Some papers even use 32-bit memory addresses!
• But most VoltDB indexes are at least 8 bytes and many are multi-column. Many have keys that are most of a cache line (64 B). This is what we measure.
In-Memory Index
• We could probably do better in a lot of ways, especially on space usage.
• A trie index might be nice for string-based keys.
• A custom B-tree might save some space if tweaked a bit.
• BUT: our current RB-Tree index has ranking support and active compaction support, so any replacement may need to support those.
• Persistent indexes are also interesting for MVCC and long-running queries.
Covering Indexes
• A covering index is an index where the index node has ALL the column data for the tuple, so you don’t need blocks of tuples.
• This would save a lot of memory in VoltDB.
• It would mess up snapshots based on simple copy-on-write, though.
• Do we need multi-version indexes all of a sudden? Or some hybrid?
• I’m interested in this!
MVCC
• Long-running, consistent queries in VoltDB are hard because we don’t really do MVCC.
• MVCC has overhead. Andy Pavlo at CMU seems to think what MySQL does works best (paper citation), which is similar to how we do snapshots.
• Index maintenance is a nightmare. Garbage collection violates the BE BORING rule.
• Still, we’re experimenting here.
Caching
• When does it make sense to cache an in-memory database?
• Blanket caching of hot data is pretty dumb.
• Caching results in the DB seems lousy to me. If you have slow-to-compute results that you can use stale, cache them in the app server, proxy, or client application, where the rules for how they’re used are clear.
• VoltDB caches the partition hash ring in clients, so requests can be sent to the best node to process the data. (Sketch below.)
• VoltDB also caches query plans and other such things.
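A sketch of that client-side routing; the ring structure and hashing here are assumptions for illustration, not VoltDB’s actual scheme. The client hashes the partitioning key, finds the owning node locally, and sends the request straight there, with no extra hop to ask the cluster.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct CachedRing {
    std::map<uint64_t, std::string> tokenToNode;  // ring position -> node

    // Assumes a non-empty ring.
    const std::string& nodeFor(int64_t partitionKey) const {
        uint64_t h = std::hash<int64_t>{}(partitionKey);
        auto it = tokenToNode.lower_bound(h);
        if (it == tokenToNode.end()) it = tokenToNode.begin();  // wrap around
        return it->second;
    }
};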
Key Takeaways
• Think about what you want to measure and how you want to measure it, then actually measure it.
• Even when added complexity makes it faster or smaller, sometimes it’s not worth the cost in engineering.
• The biggest waste we’ve seen in in-memory layout is memory fragmentation. Leave-no-holes FTW.
• Some things we’ve read in research papers are true as measured, but less useful for us. Verify things against your own assumptions.
chat.voltdb.com
jhugg@voltdb.com
@johnhugg / @voltdb
Thank You!
• Please ask me questions now or later.
• Feedback on what was interesting, helpful, confusing, or boring is ALWAYS welcome.
• Happy to talk about: data management, systems software dev, distributed systems, Japanese preschools.
[Diagram: Venn diagram placing THIS TALK across “Stuff I Know”, “Stuff I Don’t Know”, and “BS”]