Everything We Learned About In-Memory Data Layout While Building VoltDB
John Hugg
June 20th, 2017
@johnhugg / jhugg@voltdb.com
Who Am I?
• Engineer #1 at VoltDB
• Responsible for many poor decisions and even a few good ones.
• jhugg@voltdb.com
• @johnhugg
• http://chat.voltdb.com
Origins
VoltDB History
The End of an Architectural Era
(It’s Time for a Complete Rewrite)
Michael Stonebraker
Samuel Madden
Daniel J. Abadi
Stavros Harizopoulos
MIT CSAIL
{stonebraker, madden, dna,
stavros}@csail.mit.edu
Nabil Hachem
AvantGarde Consulting, LLC
nhachem@agdba.com
Pat Helland
Microsoft Corporation
phelland@microsoft.com
ABSTRACT
In previous papers [SC05, SBC+07], some of us predicted the end
of “one size fits all” as a commercial relational DBMS paradigm.
These papers presented reasons and experimental evidence that
showed that the major RDBMS vendors can be outperformed by
1-2 orders of magnitude by specialized engines in the data
warehouse, stream processing, text, and scientific database
markets.
Assuming that specialized engines dominate these markets over
time, the current relational DBMS code lines will be left with the
business data processing (OLTP) market and hybrid markets
where more than one kind of capability is required. In this paper
we show that current RDBMSs can be beaten by nearly two
orders of magnitude in the OLTP market as well. The
experimental evidence comes from comparing a new OLTP
prototype, H-Store, which we have built at M.I.T., to a popular
RDBMS on the standard transactional benchmark, TPC-C.
We conclude that the current RDBMS code lines, while
attempting to be a “one size fits all” solution, in fact, excel at
nothing. Hence, they are 25 year old legacy code lines that should
be retired in favor of a collection of “from scratch” specialized
engines. The DBMS vendors (and the research community)
should start with a clean sheet of paper and design systems for
tomorrow’s requirements, not continue to push code lines and
architectures designed for yesterday’s needs.
1. INTRODUCTION
The popular relational DBMSs all trace their roots to System R
from the 1970s. For example, DB2 is a direct descendent of
System R, having used the RDS portion of System R intact in
their first release. Similarly, SQL Server is a direct descendent of
Sybase System 5, which borrowed heavily from System R.
Lastly, the first release of Oracle implemented the user interface …
All three systems were architected more than 25 years ago, when
hardware characteristics were much different than today.
Processors are thousands of times faster and memories are
thousands of times larger. Disk volumes have increased
enormously, making it possible to keep essentially everything, if
one chooses to. However, the bandwidth between disk and main
memory has increased much more slowly. One would expect this
relentless pace of technology to have changed the architecture of
database systems dramatically over the last quarter of a century,
but surprisingly the architecture of most DBMSs is essentially
identical to that of System R.
Moreover, at the time relational DBMSs were conceived, there
was only a single DBMS market, business data processing. In the
last 25 years, a number of other markets have evolved, including
data warehouses, text management, and stream processing. These
markets have very different requirements than business data
processing.
Lastly, the main user interface device at the time RDBMSs were
architected was the dumb terminal, and vendors imagined
operators inputting queries through an interactive terminal
prompt. Now it is a powerful personal computer connected to the
World Wide Web. Web sites that use OLTP DBMSs rarely run
interactive transactions or present users with direct SQL
interfaces.
In summary, the current RDBMSs were architected for the
business data processing market in a time of different user
interfaces and different hardware characteristics. Hence, they all
include the following System R architectural features:
• Disk oriented storage and indexing structures
• Multithreading to hide latency
• Locking-based concurrency control mechanisms
• Log-based recovery
Buffer Pool Management → Use Main Memory
Concurrency → Single Threaded
• But: waiting on users leaves the CPU idle
• But: single-threaded doesn’t jibe with a multicore world
Waiting on Users
• Don’t.
• External transaction control and performance are not friends.
• Use server side transactional logic.
• Move the logic to the data, not the other way around.
• (Server side processing using Java/JVM)
Using *ALL* the Cores
• Partitioning data is a requirement for scale-out.
• Single-threaded is desired for efficiency.
• Why not partition to the core instead of the node?
• Concurrency via scheduling, not shared memory.
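A minimal sketch of what “partition to the core” can look like, with illustrative names (not VoltDB’s actual classes): one single-threaded work queue per partition, and concurrency by routing instead of by locking.

#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Each partition is owned by exactly one single-threaded engine;
// its queue is drained by one thread, in order, with no locks on data.
struct Partition {
    std::deque<std::function<void()>> workQueue;
};

struct Host {
    std::vector<Partition> partitions;  // typically one per core

    explicit Host(size_t cores) : partitions(cores) {}

    // Route a transaction to the partition that owns its key.
    void submit(int64_t partitionKey, std::function<void()> txn) {
        size_t p = static_cast<size_t>(partitionKey) % partitions.size();
        partitions[p].workQueue.push_back(std::move(txn));
    }
};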
Important Point
• VoltDB is not just a traditional RDBMS with some tweaks, sitting on RAM rather than spindles.
• VoltDB is weird and new and exciting, and not very compatible with Hibernate.
• VoltDB sounds like MemSQL, NuoDB, Clustrix, HANA or whomever at first blush, but has a really, really different architecture.
Open Source on GitHub
Commodity Cluster Scale Out / Active-Active HA
Millions of ACID serializable operations per second
Synchronous Disk Persistence
Avg latency under 1 ms; 99.999th percentile under 50 ms.
Multi-TB customers in production.
To make good architecture
choices, you need to decide
what success looks like first!
What kind of workload?
• OLTP vs OLAP? Search/Fulltext? Graph? Machine Learning?
• Ratio of reads to writes?
• CRUD vs scans?
• Bursty? Batch? Regular? None of the above?
• Concurrency needed?
Focus on VoltDB Assumptions
• Mutable data.
• Many parallel shorter operations.
• Lots of read queries supported by indexes, materialized views,
etc…
• Few full table scans.
What’s the goal?
SPEED! WHOOSH! SIZE! SO CUTE!
WE WANT THE FASTEST DATABASE!
YOLO
Need to define the problem a bit…
What’s an Operational Event?
• Something happened, and it’s the database’s job to understand and respond to it.
• Examples:
• User is trying to make a call. Should we let the call through?
• User is swiping a credit card. Should we process the charge?
• User is loading a web page; which ad should we show?
Measurements
• Throughput: events per second
• Latency: time per event
Rathole Warning
How to Measure Throughput & Latency?
• Average measurement
• Min and Max values
• Percentile Measurement (99%, 99.9%, 99.99%… etc)
• What is the latency measurement of the 99.9% operation?
• What percent of operations fail to meet SLA?
• What percent of wall-clock time will hypothetical requests fail SLA?
¯\_(ツ)_/¯
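To make the percentile vocabulary above concrete, here is a toy nearest-rank computation; real systems usually keep a histogram (e.g. HdrHistogram) rather than sorting raw samples.

#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over a non-empty set of latency samples.
double percentile(std::vector<double> latencies, double p) {
    std::sort(latencies.begin(), latencies.end());
    size_t idx = static_cast<size_t>(p / 100.0 * (latencies.size() - 1));
    return latencies[idx];
}
// percentile(samples, 99.9) answers: what latency did the slowest
// 0.1% of operations exceed?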
How VoltDB does Data Layout
Literally how we started:
• “NValue” is a variant type
• typedef std::vector<NValue> Tuple;
• typedef std::vector<Tuple> Table;
• typedef std::vector<NValue> Key;
• typedef std::map<Key, Tuple> Index;
• typedef std::vector<int64_t> FreeList;
4 TPS
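A compilable sketch of that starting point, with std::variant standing in for NValue (an assumption; the real NValue is a hand-rolled variant type):

#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Stand-in for VoltDB's NValue variant type.
using NValue = std::variant<int64_t, double, std::string>;

using Tuple    = std::vector<NValue>;   // every row: a vector of boxed values
using Table    = std::vector<Tuple>;    // every table: a vector of rows
using Key      = std::vector<NValue>;   // every index key: more boxed values
using Index    = std::map<Key, Tuple>;  // index entries copy the whole row!
using FreeList = std::vector<int64_t>;

// Every value is tagged, rows scatter across small allocations, and
// index inserts deep-copy entire tuples; hence 4 TPS.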
1.0
• “NValue” is a variant type
• TuplePtr understands schema and points to memory that contains the tuple.
• Linked list of blocks of memory, each holding a fixed number of tuples.
• Variable-sized columns are stored in the tuple as pointers to heap-allocated memory. (See the sketch below.)
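A sketch of that 1.0 layout under assumed names: fixed-size tuples packed into blocks, blocks chained in a list, and a TuplePtr that interprets raw bytes through the table’s schema.

#include <cstdint>
#include <cstring>
#include <list>
#include <vector>

struct Schema {
    std::vector<size_t> columnOffsets;  // byte offset of each column
    size_t tupleLength = 0;             // fixed size of one packed tuple
};

struct Block {
    static constexpr size_t kTuplesPerBlock = 1024;
    std::vector<char> storage;          // kTuplesPerBlock * tupleLength bytes
    size_t usedTuples = 0;
    explicit Block(const Schema& s) : storage(s.tupleLength * kTuplesPerBlock) {}
};

using BlockList = std::list<Block>;     // linked list of tuple blocks

struct TuplePtr {
    const Schema* schema;               // knows how to read the bytes
    char* data;                         // points into some block's storage
    int64_t bigintAt(size_t col) const {
        int64_t v;
        std::memcpy(&v, data + schema->columnOffsets[col], sizeof(v));
        return v;                       // decode only the requested column
    }
};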
Logical to Physical Schema
CREATE TABLE order
(
user_id BIGINT,
item_code VARCHAR(20 BYTES),
price FLOAT,
qty SMALLINT,
ts TIMESTAMP,
notes VARCHAR(1024)
);
[Diagram: each row packs user_id | item_code | price | qty | ts | notes (ptr); each notes pointer refers to a heap value of up to 1 KB]
Perfectly packed blocks of fixed-size tuples.
Variable-sized stuff is heap-allocated by malloc (glibc).
[Diagram: the block list as a std::vector of blocks, each with free space at the end]
[Diagram: the same block list, plus a free list as a std::vector tracking slots freed by deletes]
Indexes
• std::map is usually a Red-Black tree
• Balanced with O(log n) operations
• Predictable memory allocation
CREATE INDEX usertime ON order (user_id, ts);
[Diagram: an index tree node holding a color bit C, the key (user_id, ts), left and right child pointers (tree node or NIL), a parent pointer (NIL if root), and a value pointer to the tuple in block storage]
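The same node as a struct, with field names assumed from the diagram:

#include <cstdint>

// A red-black tree node whose value is a pointer into block storage.
struct IndexNode {
    bool red;            // the color bit ("C" in the diagram)
    int64_t user_id;     // key, part 1
    int64_t ts;          // key, part 2
    IndexNode* left;     // tree node or NIL
    IndexNode* right;    // tree node or NIL
    IndexNode* parent;   // NIL if root
    char* tuple;         // value pointer: the indexed tuple in its block
};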
Non-Unique Indexes
• Users will put a secondary index on low-cardinality things like “country”.
• If we want to delete a record, we also have to delete its index entry.
• But “country” might have millions of entries per key, and we only want to delete the one that points to the deleted record. This is stop-the-world pain for deleting a single record!
• Worse, even high-cardinality columns can have a sneaky repeated value like NULL.
• Lesson learned: make all index keys unique, even if you have to jam a pointer value in there. (See the sketch below.)
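A sketch of that trick, with assumed names: widen the key with the tuple’s address so every entry is unique, and deleting one record’s entry becomes a single O(log n) erase instead of a scan over millions of duplicates.

#include <map>
#include <string>
#include <utility>

using TupleAddr    = const void*;
using CountryKey   = std::pair<std::string, TupleAddr>;  // (country, tuple addr)
using CountryIndex = std::map<CountryKey, TupleAddr>;

// Deleting tuple t under "US" lands exactly on the entry for t,
// instead of scanning every "US" entry to find the right one.
void eraseEntry(CountryIndex& index, const std::string& country, TupleAddr t) {
    index.erase({country, t});
}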
Snapshots & Durability Logs
MVCC where M = 2
[Diagram: copy-on-write snapshots. A COW pointer sweeps through the tuple blocks; tuples the snapshot has already passed are changed in place, while changing a tuple the snapshot has not reached yet first moves the old version to the COW tuple area, so the snapshot still sees it]
A Compaction Story
Customer Complains of Memory Use
• Allocated Memory Constant
• RSS increases over time
• Deleting doesn’t reduce RSS
[Chart: RSS memory use climbing over time while allocated memory stays constant]
Malloc has a hard job…
• Start with some memory: 48 bytes free.
• malloc three times (still 5 bytes free): 17 bytes | 15 bytes | 11 bytes | 5 bytes free.
• Free the middle allocation: 17 bytes | 15 bytes free | 11 bytes | 5 bytes free.
• Now allocate 18 bytes? 8 bytes?
• Malloc can’t move the 11 bytes to make room, because malloc doesn’t know what points there.
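A toy program that reproduces the picture above: free half the bytes, yet a larger request still can’t reuse them, because the holes are pinned apart by live allocations the allocator is not allowed to move.

#include <cstdlib>
#include <vector>

int main() {
    std::vector<void*> small;
    for (int i = 0; i < 100000; ++i)
        small.push_back(std::malloc(64));      // many small allocations
    for (size_t i = 0; i < small.size(); i += 2)
        std::free(small[i]);                   // free every other one
    // ~50% of those bytes are now free, but no contiguous hole is
    // bigger than 64 bytes, so this request needs fresh memory:
    void* big = std::malloc(1 << 20);
    std::free(big);
    for (size_t i = 1; i < small.size(); i += 2)
        std::free(small[i]);
    return 0;
}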
Some malloc better than others
• GLibC malloc on Linux is not great.*
• Use jemalloc, tcmalloc, etc…
• Default OSX malloc is much smarter than GLibC.
• We built VoltDB to run on GLibC malloc because we are masochists? Because we hate dependencies? Because some of the others are tricky to use with JNI/JVM?
*unless it got great recently
GLibC Memory Tip
• If you mmap chunks of memory over 2 MB, GLibC doesn’t try to be smart.
• Allocation increases RSS by the mmap size.
• Free decreases RSS by the mmap size.
• This is awesome!
• Only catch: a process is sometimes capped at ~64k mmap allocations, after which this breaks down.
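A sketch of leaning on that behavior directly, using plain POSIX mmap (nothing VoltDB-specific): grab and release 2 MB slabs yourself, so RSS tracks what you actually hold.

#include <sys/mman.h>
#include <cstddef>

constexpr std::size_t kSlabSize = 2 * 1024 * 1024;

void* allocSlab() {
    void* p = mmap(nullptr, kSlabSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;  // RSS grows as pages are touched
}

void freeSlab(void* p) {
    munmap(p, kSlabSize);                  // RSS drops immediately
}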
Need Defrag/Compaction
• Tuple Storage
• String Storage
• Indexes
[Diagrams repeated from earlier: the tuple block list with free-space holes and its free list, the heap-allocated notes values, and the index tree nodes. All three kinds of storage need defrag]
Options:
• On-Demand Defrag: when the free list becomes 20% of the table, move tuples (and update any indexes that point to them) to use fewer blocks.
• Active Defrag: whenever a tuple is deleted that isn’t the last tuple in the physical storage of the table, move the actual last tuple into the spot the deleted tuple was using and update any indexes that point to it. (Sketch below.)
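A sketch of the active option under assumed names; updateIndexes() stands in for repointing every index entry that referenced the moved tuple.

#include <cstring>
#include <vector>

struct PackedTable {
    std::vector<char> storage;  // fixed-size tuples, no holes
    size_t tupleLength = 0;
    size_t tupleCount = 0;

    char* tupleAt(size_t i) { return storage.data() + i * tupleLength; }
    void updateIndexes(char* from, char* to);  // assumed elsewhere

    void deleteTuple(size_t i) {
        size_t last = tupleCount - 1;
        if (i != last) {
            // Fill the hole with the physically-last tuple...
            std::memcpy(tupleAt(i), tupleAt(last), tupleLength);
            // ...and repoint any index entries at its new home.
            updateIndexes(tupleAt(last), tupleAt(i));
        }
        --tupleCount;  // storage stays perfectly packed; no free list
    }
};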
Lessons?
• For on-demand defrag, many workloads never hit defrag at all.
• For active defrag, you pay a cost on delete every time.
No: Active Wins Definitively
• Deletes are less expensive than you would expect in practice, somewhat offset by not maintaining free lists at all.
• When you do need on-demand defrag, you have increased processing costs that need to be absorbed. These costs are higher than the active option’s, and they hit peak throughput and peak latency.
• Customers might hit compaction rarely, like after they’ve been in production for a while. Surprises are bad.
• Code that executes only sometimes is WAY harder to test than code that executes all the time. Surprises are bad.
Changes to VoltDB
• Adding compacting (active-defrag) pools that replace heap-based string and blob storage. Pools allocate in ≥2 MB mmap slabs.
• Rewrote std::map indexes as our own Red-Black Tree that allocates in ≥2 MB mmap slabs and actively fills holes. We added more customizations, like rank support.
• Table blocks are now mmapped, but use periodic defrag. We’ve made this better over the years, but eventually we will switch to active, leave-no-holes defrag.
Results
• VoltDB often uses half as much memory for the same data as other systems we’ve compared ourselves to:
• Primarily defrag/compaction and smart layout.
• Also less overhead for concurrency and for MVCC.
• Tight layout helps performance too! Memory locality is good.
Don’t be afraid to be boring and predictable!
Other Tricks & Topics
Lazy Deserialization
• When moving tuples, we tightly pack rows together in an append-only immutable structure.
• We combine that with lazy deserialization: we don’t deserialize a column value from bytes until you ask for the value.
• This works really well for moving data around. (Sketch below.)
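A sketch of the combination, with assumed names: rows are appended once as raw bytes, and a column is decoded only when somebody asks for it.

#include <cstdint>
#include <cstring>
#include <vector>

class PackedRows {
    std::vector<char> buf_;   // append-only, tightly packed, immutable rows
    size_t rowLength_;
public:
    explicit PackedRows(size_t rowLength) : rowLength_(rowLength) {}

    void append(const char* row) {               // serialize once, on write
        buf_.insert(buf_.end(), row, row + rowLength_);
    }
    size_t rowCount() const { return buf_.size() / rowLength_; }

    // Lazy: only this row's, this column's bytes are ever decoded.
    int64_t bigintAt(size_t row, size_t columnOffset) const {
        int64_t v;
        std::memcpy(&v, buf_.data() + row * rowLength_ + columnOffset, sizeof(v));
        return v;
    }
};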
The Block Whisperer
• Transient: VoltTable code takes blocks and lets you read schema, iterate rows, and read values.
• Persistent: TuplePtr is a similar “Block Whisperer” for interpreting big blocks of memory filled with tuples.
Compression: Block vs. Value
Block Compression
• The more data you have, the better the compression ratio you get. You don’t want to compress one smallish row using LZW or similar.
• The problem for in-memory storage is that the increased speed makes it hard to hide block compression. The best approach might be to compress blocks that haven’t been touched in a while…
• But this kind of scheme gets complex quickly, and has similar problems to defrag. Sometimes being too smart isn’t worth being less boring.
Value Compression
• Column stores are fantastic at this, but row stores less so.
• Variable-integer compression: now rows are no longer all the same size => complexity. (Sketch below.)
• String dictionaries.
• Blob compression.
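For illustration, a LEB128-style variable-integer codec: small values shrink to one byte at the price of variable row sizes. This is the general technique, not necessarily VoltDB’s encoding.

#include <cstdint>
#include <vector>

// Emit v as 7-bit groups, low bits first; the high bit of each byte
// says "more bytes follow".
void encodeVarint(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Read groups until a byte with a clear high bit; advances p past them.
uint64_t decodeVarint(const uint8_t*& p) {
    uint64_t v = 0;
    int shift = 0;
    while (*p & 0x80) {
        v |= static_cast<uint64_t>(*p++ & 0x7F) << shift;
        shift += 7;
    }
    v |= static_cast<uint64_t>(*p++) << shift;
    return v;
}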
Experiments!
• This may be grossly unsatisfying, but I want you to know we thought about this stuff.
• Most of what I’m describing happened in the past, so things might be different if we tried again with newer hardware, approaches, or tools.
• This is only a subset of our experiments, due to time pressure and business reasons.
NUMA
• We tried using system-level NUMA tools and OS APIs to make a VoltDB process faster.
• We failed, but that doesn’t mean we can’t do better.
• Running two VoltDB processes on two NUMA nodes typically runs faster than one process spread across nodes.
• What now? Better support for multiple processes? Architecture changes for NUMA?
Indexes
• In-memory indexes are a fun research topic, mostly because they are easy to microbenchmark.
Micro-Benchmarks are Fantasy
• Mutable data means holes; deal with them somehow!
• Micro-benchmarks are useful, but it’s easy for them to drift from production workloads and concerns.
In-Memory Index
• In-memory B-Trees do well on micro-benchmarks, but less well in VoltDB. What’s up with that?
• Some benchmarks use 4-byte keys. So if you make your B-Tree page one or more cache lines wide, you can fit a lot of 4-byte keys in a page.
• Some papers even use 32-bit memory addresses!
• But most VoltDB indexes are at least 8 bytes and many are multi-column. Many have keys that are most of a cache line (64 B). This is what we measure.
In-Memory Index
• We could probably do better in a lot of ways, especially on space usage.
• A trie index might be nice for string-based keys.
• A custom B-tree might save some space if tweaked a bit.
• BUT: our current RB-Tree index has ranking support and active compaction support, so any replacement may need to support those.
• Persistent indexes are also interesting for MVCC and long-running queries.
Covering Indexes
• A covering index is an index where the index node has ALL the column data for the tuple, so you don’t need blocks of tuples.
• This would save a lot of memory in VoltDB.
• It would mess up snapshots based on simple copy-on-write, though.
• Do we need multi-version indexes all of a sudden? Or some hybrid?
• I’m interested in this!
MVCC
• Long-running, consistent queries in VoltDB are hard because we don’t really do MVCC.
• MVCC has overhead. Andy Pavlo at CMU seems to think what MySQL does works best (paper citation), which is similar to how we do snapshots.
• Index maintenance is a nightmare. Garbage collection violates the BE BORING rule.
• Still, we’re experimenting here.
Caching
• When does it make sense to cache an in-memory database?
• Blanket caching of hot data is pretty dumb.
• Caching results in the DB seems lousy to me. If you have slow-to-compute results that you can use stale, cache them in the app server, proxy, or client application, where the rules for how they’re used are clear.
• VoltDB caches the partition hash ring in clients, so requests can be sent to the best node to process the data. (Sketch below.)
• VoltDB also caches query plans and other such things.
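A sketch of that client-side routing; the ring structure and hashing here are assumptions for illustration, not VoltDB’s actual scheme. The client hashes the partitioning key, finds the owning node locally, and sends the request straight there, with no extra hop to ask the cluster.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct CachedRing {
    std::map<uint64_t, std::string> tokenToNode;  // ring position -> node

    // Assumes a non-empty ring.
    const std::string& nodeFor(int64_t partitionKey) const {
        uint64_t h = std::hash<int64_t>{}(partitionKey);
        auto it = tokenToNode.lower_bound(h);
        if (it == tokenToNode.end()) it = tokenToNode.begin();  // wrap around
        return it->second;
    }
};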
Key Takeaways
• Think about what you want to measure and how you want to measure it, then actually measure it.
• Even when added complexity makes it faster or smaller, sometimes it’s not worth the cost in engineering.
• The biggest waste we’ve seen in in-memory layout is memory fragmentation. Leave-no-holes FTW.
• Some things we’ve read in research papers are true as measured, but less useful for us. Verify things against your own assumptions.
chat.voltdb.com
jhugg@voltdb.com
@johnhugg / @voltdb
Thank You!
• Please ask me questions now or later.
• Feedback on what was interesting, helpful, confusing, or boring is ALWAYS welcome.
• Happy to talk about: data management, systems software dev, distributed systems, Japanese preschools.
[Diagram: Venn diagram placing THIS TALK across “Stuff I Know”, “Stuff I Don’t Know”, and “BS”]