3. HANS-JÜRGEN SCHÖNIG
CEO & SENIOR DATABASE CONSULTANT
■ PostgreSQL since 1999
■ author of various database books
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
WEB www.cybertec-postgresql.com
4. DATABASE SERVICES
DATA Science
▪ Artificial Intelligence
▪ Machine Learning
▪ Big Data
▪ Business Intelligence
▪ Data Mining
▪ etc.
POSTGRESQL Services
▪ 24/7 Support
▪ Training
▪ Consulting
▪ Performance Tuning
▪ Clustering
▪ etc.
6. CLIENT SECTORS
▪ ICT
▪ University
▪ Government
▪ Automotive
▪ Industry
▪ Trade
▪ Finance
▪ etc.
7. AGENDA
■ traditional tables
■ table bloat and VACUUM
■ Why a new storage system?
■ zheap: the goal
■ zheap: basic architecture
■ zheap: transaction slots, etc.
■ performance impacts
■ roadmap
13. MAIN ISSUE: TABLE BLOAT
test=# CREATE TABLE a (aid int) WITH (autovacuum_enabled = off);
CREATE TABLE
test=# INSERT INTO a SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
35 MB
(1 row)
14. MAIN ISSUE: TABLE BLOAT
test=# UPDATE a SET aid = aid + 1;
UPDATE 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
15. MAIN ISSUE: TABLE BLOAT
test=# VACUUM VERBOSE a;
INFO: vacuuming "public.a"
INFO: "a": removed 1000000 row versions in 4425 pages
INFO: "a": found 1000000 removable, 1000000 nonremovable row versions in 8850 out of 8850 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 539
...
VACUUM
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
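VACUUM marks dead rows as reusable, but it can only hand space back to the operating system by truncating empty pages at the end of the file. In this example the live row versions were written to the second half of the file, so the size stays at 69 MB. A minimal sketch for inspecting this, assuming the pgstattuple contrib extension is installed (it is not used in the original slides):

```sql
-- Assumption: the pgstattuple contrib module is available on this server.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- table_len:        size of the table file in bytes
-- dead_tuple_count: dead row versions that have not been cleaned up yet
-- free_percent:     share of the file that is free and can be reused
SELECT table_len, tuple_count, dead_tuple_count, free_percent
FROM pgstattuple('a');
```

After the VACUUM above, roughly half of the file should be reported as free space that later INSERTs and UPDATEs can reuse.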
16. ONE WORD ABOUT VACUUM
■ VACUUM is not always allowed to reclaim dead rows
■ A row must be REALLY dead (invisible to every running transaction) for VACUUM to do its job
■ Long transactions can be an enemy
→ Once you are in pain it tends not to go away
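Long transactions can be spotted in pg_stat_activity. A minimal sketch of such a check; the LIMIT and the 60-character query prefix are arbitrary choices for illustration:

```sql
-- List the oldest open transactions; these can keep VACUUM
-- from removing row versions they might still need to see.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       backend_xmin,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
```

Sessions that are "idle in transaction" with a large xact_age are the usual suspects.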
17. WAYS OUT
■ VACUUM FULL: Needs a table lock
■ pg_squeeze:
■ Shrinks tables with less locking
■ Moves tables between tablespaces
■ Index-organizes tables
HINT: Try to avoid bloat in the first place!
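Applied to the bloated table a from the example, the first way out could look like the sketch below. It assumes that a table lock is acceptable while the table is rewritten; the second statement simply switches autovacuum back on, since it was disabled when a was created.

```sql
-- Rewrites the table into a new file; blocks reads and writes on "a"
-- until it is done.
VACUUM FULL a;

-- The table should be back to roughly its original size.
SELECT pg_size_pretty(pg_relation_size('a'));

-- Avoid bloat in the first place: let autovacuum do its work again.
ALTER TABLE a SET (autovacuum_enabled = on);
```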
19. ZHEAP: DESIGN GOALS
■ Perform UPDATE in place
■ Have smaller tables
■ smaller tuple headers
■ improved alignment
■ Reduce writes as much as possible
■ avoid dirtying pages unless data is modified
■ normal heaps dirty pages in some cases during reads
■ Reuse space more quickly
■ Get rid of VACUUM
21. ZHEAP: TUPLE HEADERS
■ Heap: 20+ bytes per row
■ Zheap: 5 bytes per row
How can this be achieved?
■ The tuple header controls “visibility”
■ “Normalize tuple header”
■ Move visibility info to the page level
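The 35 MB of the one-column table a from the bloat example can be reconstructed from the heap row layout. This is a back-of-the-envelope sketch, assuming standard 8 kB pages with a 24-byte page header, a 4-byte line pointer per row, a 23-byte tuple header padded to 24 bytes, and the 4-byte int padded to 8 bytes:

```sql
-- All numbers in "layout" are assumptions about the standard heap format.
WITH layout AS (
    SELECT 8192 AS page_size,
           24   AS page_header,
           4    AS line_pointer,
           24   AS tuple_header,   -- 23 bytes, aligned to 24
           8    AS aligned_data    -- 4-byte int, aligned to 8
)
SELECT line_pointer + tuple_header + aligned_data AS bytes_per_row,
       (page_size - page_header)
           / (line_pointer + tuple_header + aligned_data) AS rows_per_page,
       ceil(1000000.0 / ((page_size - page_header)
           / (line_pointer + tuple_header + aligned_data))) AS pages_needed
FROM layout;
-- bytes_per_row = 36, rows_per_page = 226, pages_needed = 4425
```

4425 pages is what VACUUM VERBOSE reported for the old row versions, and 4425 pages of 8 kB are about 35 MB. A 4-byte payload costs 36 bytes on disk, which is why a smaller tuple header matters.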
23. ZHEAP: TRANSACTION SLOTS
Transaction slots:
■ 16 bytes of storage
■ contains the following information
■ transaction id
■ epoch
■ latest undo record pointer of that transaction
What if we need more slots?
24. ZHEAP: TPD PAGES
■ TPD: Stores additional transaction slots if the 4 slots on a page are not enough
■ TPD pages are interleaved with normal pages
26. OPERATION: INSERT
■ Allocate a transaction slot
■ Emit an undo entry to fix things on error
■ Space can be reclaimed instantly after a ROLLBACK
→ Most simplistic operation
27. OPERATION: UPDATE
■ More complicated, there are two cases:
■ The new row fits into the old space
■ The new row does not fit into the old space
28. OPERATION: UPDATE FITS
■ If the row is shorter:
■ We can overwrite it
■ Emit undo record
In short: We hold the new row in zheap and a copy of the old row in undo so
that we can copy it back to the old structure in case it is needed.
29. OPERATION: UPDATE DOESN’T FIT
■ Performance will be worse:
■ DELETE old row
■ INSERT new row in a different place
■ Less efficient than an in-place UPDATE
Space can instantly be reclaimed in the following cases:
■ When updating a row to a shorter version
■ When non-inplace UPDATEs are performed
30. OPERATION: DELETE
■ How it works
■ Emit undo record
■ DELETE row from zheap
Old row can be moved back into zheap during ROLLBACK.
33. ROLLBACK
■ In case a ROLLBACK happens:
■ undo has to make sure that the old state of the table is restored.
■ Old rows have to be copied back
■ ROLLBACK takes longer than on a normal heap!
Undo itself can be removed in three cases:
■ as soon as no transaction can see the data anymore
■ as soon as all undo actions have been completed
■ for committed transactions, once they are all-visible
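The cost of a ROLLBACK can be observed directly in psql. A minimal sketch, assuming a zheap-enabled build and the z1 table from the test setup shown later in this deck; no timings are given here because they depend on the machine:

```sql
-- Assumption: zheap-enabled build, table z1 exists (see LOADING A ZHEAP).
\timing on
BEGIN;
UPDATE z1 SET id = id + 1;   -- old row versions are written to undo
ROLLBACK;                    -- old rows have to be restored from undo
SELECT pg_size_pretty(pg_relation_size('z1'));
```

Comparing the ROLLBACK time with the same sequence on a heap table gives a feeling for the trade-off: the heap rolls back almost for free but leaves dead rows behind, while zheap has to put the old rows back.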
34. UNDO WORKERS
■ Discarding the undo logs is performed by the discard worker
■ The undo launcher checks the rollback_hash_table periodically
■ It spawns new undo workers to perform the rollback
■ Each spawned undo worker processes the rollback requests for a particular database.
37. PREPARING DATA
■ Creating some random data
test=# SET temp_buffers TO '1 GB';
SET
test=# CREATE TEMP TABLE raw AS
SELECT id,
hashtext(id::text) as name,
random() * 10000 AS n, true AS b
FROM generate_series(1, 10000000) AS id;
SELECT 10000000
38. LOADING A HEAP
■ Populating a normal table
test=# \timing
Timing is on.
test=# CREATE TABLE h1 (LIKE raw) USING heap;
CREATE TABLE
Time: 7.836 ms
test=# INSERT INTO h1 SELECT * FROM raw;
INSERT 0 10000000
Time: 7495.798 ms (00:07.496)
39. LOADING A ZHEAP
■ Mind the runtime
test=# CREATE TABLE z1 (LIKE raw) USING zheap;
CREATE TABLE
Time: 8.045 ms
test=# INSERT INTO z1 SELECT * FROM raw;
INSERT 0 10000000
Time: 27947.516 ms (00:27.948)
40. ZHEAP IS SMALLER
■ Smaller tuple headers make a difference
test=# \d+
List of relations
Schema | Name | Type | Owner | Persistence | Size | ...
-----------+------+-------+-------+-------------+--------+----
pg_temp_5 | raw | table | hs | temporary | 498 MB |
public | h1 | table | hs | permanent | 498 MB |
public | z1 | table | hs | permanent | 251 MB |
41. ZHEAP IN ACTION
test=# BEGIN;
BEGIN
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
test=*# UPDATE z1 SET id = id + 1;
UPDATE 10000000
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
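For comparison, the same test can be run against the heap table h1 created earlier. This is a sketch only; the resulting sizes are not part of the original slides and are therefore not shown:

```sql
BEGIN;
SELECT pg_size_pretty(pg_relation_size('h1'));   -- size before
UPDATE h1 SET id = id + 1;                       -- writes a new version of every row
SELECT pg_size_pretty(pg_relation_size('h1'));   -- size after
ROLLBACK;
```

Based on the bloat example at the beginning, the heap table can be expected to roughly double in size, while z1 stays at 251 MB.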
42. UNDO IN ACTION
[hs@hs-MS-7817 undo]$ pwd
/home/hs/db13/base/undo
[hs@hs-MS-7817 undo]$ ls -l | tail -n 10
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EC00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003ED00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EE00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EF00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F000000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F100000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F200000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F300000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F400000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F500000
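The amount of undo on disk can also be checked from SQL. A minimal sketch, assuming the undo directory is base/undo below the data directory (as the path above suggests) and a role that is allowed to call pg_ls_dir and pg_stat_file, such as a superuser:

```sql
-- Assumption: undo segments live in base/undo relative to the data directory.
SELECT count(*) AS undo_files,
       pg_size_pretty(sum((pg_stat_file('base/undo/' || f)).size)) AS undo_size
FROM pg_ls_dir('base/undo') AS f;
```

Running this before and after a large UPDATE shows undo growing in 1 MB segments while the table itself keeps its size.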
44. WHAT WE ARE WORKING ON
■ agree on final design issues
■ fix bugs in current code
■ large code base
■ not easy to handle
■ preparing a patch to move “undo” to core
■ “undo” is core infrastructure
We hope to bring this into core some day.
45. QUESTIONS?
Feel free to contact me!
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
TWITTER @postgresql_007