This talk covers the concepts and internal mechanisms behind how PostgreSQL, a popular open-source database, operates. Along the way, I'll also draw comparisons with other RDBMSs such as Oracle, MySQL, and SQL Server.
Topics covered in this presentation:
- PostgreSQL internal concepts: table, index, page, heap, vacuum, TOAST, etc.
- MVCC and relational transactions
- Indexes and how they affect performance
- A discussion of Uber's blog post about moving from PostgreSQL to MySQL
The talk is suitable for a technical audience (software engineers, data analysts) who have worked with databases before and want to learn about their internal mechanisms.
Speaker: Huy Nguyen, CTO & Cofounder, Holistics Software
Huy is currently the CTO of Holistics, a Business Intelligence (BI) and Data Infrastructure product. Holistics helps customers generate reports and insights from their data. Its customers include tech companies such as Grab, Traveloka, The Coffee House, Tech In Asia, and e27.
Before Holistics, Huy worked at Viki, helping build an end-to-end data platform that scaled to over 100M records a day. Previously, Huy spent a year writing medical simulations in Europe and interned at Facebook HQ, working on the growth team.
Huy's proudest achievement is a score of 251 on Flappy Bird.
Language: Vietnamese, with slides in English.
2. About Me
Education:
● Pho Thong Nang Khieu, Tin 04-07
● National University of Singapore (NUS), Computer Science Major.
Work:
● Software Engineer Intern, SenseGraphics (Stockholm, Sweden)
● Software Engineer Intern, Facebook (California, US)
● Data Infrastructure Engineer, Viki (Singapore)
Now:
● Co-founder & CTO, Holistics Software
● Co-founder, Grokking Vietnam
huy@holistics.io facebook.com/huy bit.ly/huy-linkedin
3. ● This talk covers a very small part of PostgreSQL concepts/internals.
● As with any RDBMS, PostgreSQL is a complex system, and it’s still evolving.
● It mainly revolves around explaining the “Uber’s MySQL vs PostgreSQL” article.
● Not covered: memory management, query planning, replication, etc.
Agenda
● Uber’s Article
● Table Heap
● B-Tree Index
● MVCC
● MySQL Structure
● PostgreSQL vs MySQL (Uber Use Case)
● Index-only Scan
● Heap-only Tuple (HOT)
5. Uber’s Use Case
● Table with lots of indexes (covering almost all, or all, columns)
● Lots of UPDATEs
⇒ MySQL handles this better than PostgreSQL
● Read more: https://eng.uber.com/mysql-migration/
6. Physical Structure
● Everything is under the base directory ($PGDATA), e.g. /var/lib/postgresql/9.x/main
● Each database is a folder named after its oid
http://www.interdb.jp/pg/pgsql01.html
7. Physical Structure
demodb=# select oid, relname, relfilenode
from pg_class where relname = 'test';
  oid   | relname | relfilenode
--------+---------+-------------
 416854 | test    |      416854
(1 row)
Each table’s data is stored in one or more files (max 1GB each)
9. demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416854
(1 row)
demodb=# truncate test;
TRUNCATE TABLE
INSERT 0 1
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416857
(1 row)
10. Tuple Address (ctid)
 ctid   | id | name
--------+----+---------
 (0, 2) |  1 | Alice
 (0, 5) |  2 | Bob
 (1, 3) |  3 | Charlie
ctid (tuple ID): a pair (block number, position within the block) that locates the tuple in the data file.
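As a rough sketch of how a ctid relates to the files on disk, the snippet below works out which segment file and byte offset hold a given block. It assumes the default 8KB page size and the 1GB per-file limit; the helper name `locate_block` is made up for illustration, and the relfilenode is borrowed from the earlier pg_class output. The first segment is named after the relfilenode, and later segments get a `.1`, `.2`, ... suffix.

```python
PAGE_SIZE = 8 * 1024                           # default page (block) size: 8KB
SEGMENT_SIZE = 1024 ** 3                       # each table file holds at most 1GB
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE  # 131072 pages per file


def locate_block(relfilenode, ctid):
    """Return (file name, byte offset of the page, line pointer number)."""
    block, item = ctid
    segment, block_in_segment = divmod(block, PAGES_PER_SEGMENT)
    file_name = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return file_name, block_in_segment * PAGE_SIZE, item


print(locate_block(416854, (1, 3)))       # second page of the first file
print(locate_block(416854, (131072, 1)))  # first page of the second file
```

The item number is not a byte offset: it selects a line pointer inside the page, which in turn points at the tuple.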
11. Heap Table Structure
Page: a block of data, 8KB each by default.
Line pointers: 4-byte entries, each holding a pointer to a tuple within the page.
For tuples larger than about 2KB, a special storage method called TOAST is used.
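A toy model of a heap page may help here. In this sketch line pointers are appended at the front of the page while tuple data is packed from the back, so a tuple is always reached through its line pointer. The `Page` class is purely illustrative: it leaves out the page header, free-space checks, and TOAST.

```python
PAGE_SIZE = 8 * 1024  # default page size


class Page:
    """Toy heap page: line pointers grow from the front, tuples from the back."""

    def __init__(self):
        self.line_pointers = []       # one (offset, length) entry per tuple
        self.data = bytearray(PAGE_SIZE)
        self.free_end = PAGE_SIZE     # the next tuple is placed just before this offset

    def insert(self, payload: bytes) -> int:
        """Store a tuple and return its 1-based line pointer number."""
        offset = self.free_end - len(payload)
        self.data[offset:self.free_end] = payload
        self.free_end = offset
        self.line_pointers.append((offset, len(payload)))
        return len(self.line_pointers)

    def read(self, item: int) -> bytes:
        """Follow line pointer `item` to the tuple bytes."""
        offset, length = self.line_pointers[item - 1]
        return bytes(self.data[offset:offset + length])


page = Page()
alice = page.insert(b"1,Alice")
bob = page.insert(b"2,Bob")
print(alice, bob, page.read(bob))
```

Because readers go through the line pointer, a tuple can be moved inside its page without changing its ctid.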
12. MVCC - Multi-version Concurrency Control
● Problem: someone is reading data while someone else is writing to it
● The reader might see an inconsistent piece of data
● MVCC: allows reads and writes to happen concurrently
13. MVCC - Table
 xmin | xmax | id | name
------+------+----+---------
    1 |    5 |  1 | Alice
    2 |    3 |  2 | Bob
    3 |      |  2 | Robert
    4 |      |  3 | Charlie
1. INSERT Alice
2. INSERT Bob
3. UPDATE Bob → Robert
4. INSERT Charlie
5. DELETE Alice
● xmin: ID of the transaction that inserted this tuple
● xmax: ID of the transaction that removed this tuple (by DELETE or UPDATE)
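The five operations above can be replayed in a small simulation to see where the xmin and xmax values come from. This is only a sketch: the `Table` class is hypothetical, each operation is assumed to run in its own transaction that commits in order, and real PostgreSQL visibility checks also consider commit status and in-progress transactions.

```python
class Table:
    """Toy MVCC heap: every tuple version carries xmin and xmax."""

    def __init__(self):
        self.tuples = []  # each entry: {"xmin", "xmax", "id", "name"}

    def insert(self, xid, id_, name):
        self.tuples.append({"xmin": xid, "xmax": None, "id": id_, "name": name})

    def delete(self, xid, id_):
        for t in self.tuples:
            if t["id"] == id_ and t["xmax"] is None:
                t["xmax"] = xid  # the old version stays in the heap

    def update(self, xid, id_, name):
        self.delete(xid, id_)        # mark the current version as removed
        self.insert(xid, id_, name)  # append the new version

    def visible(self, snapshot_xid):
        """Names visible to a snapshot taken right after snapshot_xid committed."""
        return [t["name"] for t in self.tuples
                if t["xmin"] <= snapshot_xid
                and (t["xmax"] is None or t["xmax"] > snapshot_xid)]


table = Table()
table.insert(1, 1, "Alice")
table.insert(2, 2, "Bob")
table.update(3, 2, "Robert")
table.insert(4, 3, "Charlie")
table.delete(5, 1)

for t in table.tuples:
    print(t["xmin"], t["xmax"], t["id"], t["name"])
print(table.visible(2))  # snapshot after transaction 2
print(table.visible(5))  # snapshot after transaction 5
```

Note that the table still holds four tuple versions at the end even though only two rows are live; the dead versions stay until VACUUM cleans them up.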
17. Table Bloat
Because each UPDATE creates a new tuple (and marks the old tuple as deleted), lots of UPDATEs will quickly increase the table’s physical size.
18. Index (B-tree)
Balanced search tree.
Root node and inner nodes contain keys and pointers to lower-level nodes.
Leaf nodes contain keys and pointers to the heap (ctid).
When the table gets a new tuple, a new entry is added to the index tree.
[Figure: B-tree with nodes H, B, A, C, D; leaf entries (e.g. A1) point by ctid to tuples in the heap]
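A minimal sketch of an index lookup, using the rows from the ctid example earlier in the deck. A sorted list with binary search stands in for the B-tree here (it is not a real tree); the point is only that the index yields a ctid, and the row itself is then fetched from the heap. The index on `name` and the function name are assumptions for illustration.

```python
import bisect

# Heap: ctid -> row (id, name).
heap = {(0, 2): (1, "Alice"), (0, 5): (2, "Bob"), (1, 3): (3, "Charlie")}

# Leaf level of an index on "name": sorted (key, ctid) pairs.
leaf_entries = sorted((row[1], ctid) for ctid, row in heap.items())
keys = [key for key, _ in leaf_entries]


def index_lookup(name):
    """Binary-search the leaf entries, then fetch the row from the heap by ctid."""
    pos = bisect.bisect_left(keys, name)
    if pos == len(keys) or keys[pos] != name:
        return None
    _, ctid = leaf_entries[pos]
    return ctid, heap[ctid]


print(index_lookup("Bob"))
print(index_lookup("Dave"))
```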
19. Write Amplification
● Each UPDATE inserts a new tuple.
● Each index gets a new index tuple pointing to it
● ⇒ multiple writes
● Extra overhead in the Write-ahead Log (WAL)
● Carried over the network
● Applied on the slave
[Figure: B-tree index whose leaf entries point by ctid to tuples in the heap]
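To make the write count concrete, here is a toy model of a regular (non-HOT) UPDATE on a table with three indexes. The table, its columns, and the `update_row` helper are made up for illustration; the sketch only counts heap and index writes and leaves out the WAL records that each of them would also generate.

```python
def update_row(heap, indexes, new_ctid, new_row):
    """Apply a non-HOT UPDATE to a toy heap and return the list of writes."""
    writes = []
    heap[new_ctid] = new_row                        # new tuple version in the heap
    writes.append(("heap", new_ctid))
    for column, entries in indexes.items():
        entries.append((new_row[column], new_ctid))  # new index tuple, even if the key is unchanged
        writes.append(("index", column))
    # The old tuple stays in the heap as a dead version until VACUUM removes it.
    return writes


heap = {(0, 1): {"id": 1, "name": "Bob", "city": "Hanoi"}}
indexes = {col: [(heap[(0, 1)][col], (0, 1))] for col in ("id", "name", "city")}

writes = update_row(heap, indexes, (0, 2),
                    {"id": 1, "name": "Robert", "city": "Hanoi"})
print(len(writes), writes)
```

Only `name` changed, yet the `id` and `city` indexes each received a new entry as well, because the new tuple lives at a new ctid.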
20. MySQL / InnoDB
● MVCC: In-place update of tuples
● Table Layout: B+ tree on Primary Key
● Index: points to primary key
21. MySQL Table (B+ tree)
MySQL table data is stored as a B+ tree (on the primary key).
Leaf nodes contain the actual row data.
[Figure: B+ tree keyed on the primary key, with row data in the leaf nodes]
22. MySQL Index
● MySQL: each secondary index entry stores the primary key value
● A lookup on a secondary index requires 2 index traversals: secondary index + primary index.
[Figure: secondary index whose leaf entries point by primary key into the table]
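The two traversals can be sketched as follows. As a simplification, each B+ tree is modelled as a sorted list searched with binary search, and the table contents and function names are invented for the example.

```python
import bisect

# Clustered index: rows stored in primary-key order (leaf nodes hold the row data).
clustered = [(1, {"name": "Alice"}), (2, {"name": "Bob"}), (3, {"name": "Charlie"})]
pk_keys = [pk for pk, _ in clustered]

# Secondary index on "name": sorted (name, primary key) pairs.
secondary = sorted((row["name"], pk) for pk, row in clustered)
name_keys = [name for name, _ in secondary]


def search(keys, key):
    """One index traversal, modelled as a binary search over sorted keys."""
    pos = bisect.bisect_left(keys, key)
    return pos if pos < len(keys) and keys[pos] == key else None


def lookup_by_name(name):
    pos = search(name_keys, name)          # traversal 1: secondary index
    if pos is None:
        return None
    pk = secondary[pos][1]
    return clustered[search(pk_keys, pk)]  # traversal 2: primary-key tree


print(lookup_by_name("Bob"))
```

The upside of this design is that a row can move or change without touching secondary indexes, as long as its primary key and the indexed columns stay the same.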
24. PostgreSQL vs MySQL (Uber case)
              | PostgreSQL                     | MySQL
--------------+--------------------------------+---------------------------------------------------
 MVCC         | New tuple per UPDATE           | In-place update of tuple (with rollback segments)
 Index lookup | Stores physical address (ctid) | By primary key
 Table layout | Heap-table structure           | Primary-key table structure
25. PostgreSQL vs MySQL (Uber case)
                       | PostgreSQL                       | MySQL
-----------------------+----------------------------------+--------------------------------------
 select on primary key | log(n) + heap read               | log(n) + direct read
 update                | Update all indexes; 1 data write | Do not update indexes; 2 data writes
 select on index key   | log(n) + O(1) heap read          | log(n) + log(n) primary index read
 sequential scan       | Page sequential scan             | Index-order scan
26. Index-only Scan (Covering Index)
Index on (product_id, revenue)
SELECT SUM(revenue) FROM table WHERE product_id = 123
If the index itself has all the data needed, no heap table lookup is required.
27. Visibility Map
One entry per page of the table
VM[i] is set: all tuples in page i are visible to all current transactions
VM bits are set by VACUUM and cleared when the page is modified
https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
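A rough sketch of how an index-only scan and the visibility map work together for a query like SUM(revenue) on one product_id. The data, the third column, and the function name are invented for illustration; in this toy, every tuple is assumed to turn out visible once the heap has been checked.

```python
# Heap rows keyed by ctid; columns: product_id, revenue, note.
heap = {
    (0, 1): (123, 10, "a"),
    (0, 2): (123, 20, "b"),
    (1, 1): (123, 5, "c"),
    (1, 2): (456, 99, "d"),
}
# Index on (product_id, revenue): entries are ((product_id, revenue), ctid).
index = sorted(((row[0], row[1]), ctid) for ctid, row in heap.items())
# Visibility map: page 0 is all-visible, page 1 was modified since the last VACUUM.
visibility_map = {0: True, 1: False}


def sum_revenue(product_id):
    """SUM(revenue) for one product; returns (total, number of heap fetches)."""
    total, heap_fetches = 0, 0
    for (pid, revenue), ctid in index:
        if pid != product_id:
            continue
        page = ctid[0]
        if not visibility_map[page]:
            heap_fetches += 1        # must visit the heap to check visibility
            revenue = heap[ctid][1]
        total += revenue             # otherwise the index entry alone is enough
    return total, heap_fetches


print(sum_revenue(123))
```

The fewer pages are marked all-visible, the more an index-only scan degrades toward an ordinary index scan, which is one reason regular vacuuming matters for read performance.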
28. Heap-only Tuple (HOT)
● No index needs to be updated
Conditions:
● Must not update a column that’s indexed
● New tuple must be in the same page as the old tuple
http://slideplayer.com/slide/9883483/
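The two conditions can be written down as a small check. This is a simplified sketch rather than PostgreSQL's actual logic: the function name, its parameters, and the example columns are hypothetical, and "same page" is reduced to "the new tuple fits in the free space of the old tuple's page".

```python
def can_use_hot(changed_columns, indexed_columns, free_bytes_in_page, new_tuple_size):
    """Return True when an UPDATE meets both HOT conditions from the slide."""
    touches_index = bool(set(changed_columns) & set(indexed_columns))
    fits_in_page = new_tuple_size <= free_bytes_in_page
    return not touches_index and fits_in_page


indexed = {"id", "email"}
# Non-indexed column changed, page has room: HOT applies.
print(can_use_hot({"last_login"}, indexed, free_bytes_in_page=500, new_tuple_size=120))
# Indexed column changed: a regular update with new index tuples.
print(can_use_hot({"email"}, indexed, free_bytes_in_page=500, new_tuple_size=120))
# Page is too full: the new tuple goes to another page, so no HOT.
print(can_use_hot({"last_login"}, indexed, free_bytes_in_page=50, new_tuple_size=120))
```

This also explains why Uber's table, with indexes covering almost all columns, rarely qualifies: nearly every UPDATE touches an indexed column.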
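29. VACUUM
● Cleans up dead tuples
● Freezes old tuples (prevents transaction ID wraparound)
● Plain VACUUM only marks the space of dead tuples as reusable; it generally does not return it to the operating system
● VACUUM FULL reclaims disk space, but blocks reads and writes while it runs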
30. Safe & Unsafe Operations In PostgreSQL
● Add a new column (safe)
● Add a column with a default (unsafe)
● Add a column that is non-nullable (unsafe)
● Drop a column (safe)
● Add a default value to an existing column (safe)
● Add an index (unsafe)
http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
31. References
● Why Uber Engineering Switched from PostgreSQL to MySQL - https://eng.uber.com/mysql-migration/
● PostgreSQL Documentation - https://www.postgresql.org/docs/current/static/
● The Internals of PostgreSQL - http://www.interdb.jp/pg/
● http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
● http://slideplayer.com/slide/9883483/
● https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
34. Transaction Isolation
BEGIN TRANSACTION;
SELECT * FROM table;
SELECT pg_sleep(10);
SELECT * FROM table;
COMMIT;
Under READ COMMITTED, the second SELECT may return different data. A concurrent transaction may update a record, delete it, or insert new records, and once it commits, the second SELECT will see the new data.
Under REPEATABLE READ, the second SELECT is guaranteed to see the rows read by the first SELECT unchanged. The SQL standard allows new rows added by a concurrent transaction during the sleep to show up, but existing rows never appear deleted or changed. (PostgreSQL's implementation is stricter: the second SELECT does not see such new rows either.)
Under SERIALIZABLE, the second SELECT is guaranteed to see exactly the same rows as the first. No changes, deletions, or insertions made by a concurrent transaction are visible.
https://stackoverflow.com/questions/4034976/difference-between-read-commit-and-repeatable-read
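In PostgreSQL the difference between the first two levels comes down to when the snapshot is taken: READ COMMITTED takes a new snapshot for every statement, while REPEATABLE READ keeps the snapshot of the transaction's first statement. The toy below illustrates just that; the `Database` and `Transaction` classes are invented for the example and model neither locking nor serialization failures.

```python
class Database:
    """Toy store that keeps every committed state so snapshots can be replayed."""

    def __init__(self, rows):
        self.versions = [list(rows)]   # versions[i] = table contents after commit i

    def commit(self, rows):
        self.versions.append(list(rows))

    def snapshot(self):
        return len(self.versions) - 1  # index of the latest committed state


class Transaction:
    def __init__(self, db, isolation):
        self.db, self.isolation = db, isolation
        self.first_snapshot = None

    def select_all(self):
        if self.isolation == "READ COMMITTED":
            snap = self.db.snapshot()              # new snapshot for every statement
        else:                                      # REPEATABLE READ
            if self.first_snapshot is None:
                self.first_snapshot = self.db.snapshot()
            snap = self.first_snapshot             # reuse the first snapshot
        return self.db.versions[snap]


db = Database(["Alice", "Bob"])
rc = Transaction(db, "READ COMMITTED")
rr = Transaction(db, "REPEATABLE READ")
first = (rc.select_all(), rr.select_all())
db.commit(["Alice", "Robert", "Charlie"])  # a concurrent transaction commits during the sleep
second = (rc.select_all(), rr.select_all())
print(first)
print(second)
```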
35. PostgreSQL Processes
There are multiple processes handling different use cases.
● postmaster process: handles database cluster management.
● Many backend processes (one for each connection)
● Background processes: stats collector, autovacuum, checkpointer, WAL writer, etc.
http://www.interdb.jp/pg/pgsql02.html
36. Database Cluster
● Database cluster: a collection of databases managed by a single server instance on one machine.
● A database contains many database objects (schema, table, index, view, function, etc.)
● Each object is identified by an oid
[Figure: a database cluster holding Database 1, Database 2, ... Database n; objects shown include tables, indexes, views, materialized views, functions, schemas, sequences, ...]
role
(user/group