HOW ZHEAP WORKS
REINVENTING POSTGRESQL STORAGE
BY HANS-JÜRGEN SCHÖNIG
ABOUT ME AND MY COMPANY
■ Who is the guy?
■ Who is CYBERTEC?
HANS-JÜRGEN SCHÖNIG
CEO & SENIOR DATABASE CONSULTANT
■ PostgreSQL since 1999
■ author of various database books
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
WEB www.cybertec-postgresql.com
DATABASE SERVICES
DATA Science
▪ Artificial Intelligence
▪ Machine Learning
▪ Big Data
▪ Business Intelligence
▪ Data Mining
▪ etc.
POSTGRESQL Services
▪ 24/7 Support
▪ Training
▪ Consulting
▪ Performance Tuning
▪ Clustering
▪ etc.
CLIENT SECTORS
▪ ICT
▪ University
▪ Government
▪ Automotive
▪ Industry
▪ Trade
▪ Finance
▪ etc.
AGENDA
■ traditional tables
■ table bloat and VACUUM
■ Why a new storage system?
■ zheap: the goal
■ zheap: basic architecture
■ zheap: transaction slots, etc.
■ performance impacts
■ roadmap
TRADITIONAL TABLES
HEAP: STANDARD TABLES
■ Data structure looks as follows:
HEAP AND TRANSACTIONS
UPDATES AND VISIBILITY
PROBLEMS WITH HEAP
MAIN ISSUE: TABLE BLOAT
test=# CREATE TABLE a (aid int) WITH (autovacuum_enabled = off);
CREATE TABLE
test=# INSERT INTO a SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
35 MB
(1 row)
MAIN ISSUE: TABLE BLOAT
test=# UPDATE a SET aid = aid + 1;
UPDATE 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
MAIN ISSUE: TABLE BLOAT
test=# VACUUM VERBOSE a;
INFO: vacuuming "public.a"
INFO: "a": removed 1000000 row versions in 4425 pages
INFO: "a": found 1000000 removable, 1000000 nonremovable row versions in 8850 out of 8850 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 539
...
VACUUM
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
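The sizes in this demo follow directly from the page counts in the VACUUM VERBOSE output. A minimal sketch of the arithmetic, assuming PostgreSQL's default block size of 8 kB (the block size is not stated on the slides):

```python
# Assumption: default PostgreSQL block size of 8192 bytes.
BLOCK_SIZE = 8192


def relation_size_mib(pages: int) -> float:
    """Size in MiB of a relation that occupies `pages` blocks."""
    return pages * BLOCK_SIZE / (1024 * 1024)


pages_after_update = 8850    # "in 8850 out of 8850 pages"
pages_with_dead_rows = 4425  # "removed 1000000 row versions in 4425 pages"

print(round(relation_size_mib(pages_after_update)))    # 69
print(round(relation_size_mib(pages_with_dead_rows)))  # 35
```

VACUUM marks the space in the 4425 pages of dead rows as reusable, but it does not hand those pages back to the filesystem, which is why the table still reports 69 MB.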
ONE WORD ABOUT VACUUM
■ VACUUM is not always allowed to reclaim dead rows
■ A row must be REALLY dead for VACUUM to do its job
■ Long transactions can be an enemy
→ Once you are in pain it tends not to go away
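Why long transactions hurt can be sketched with a toy visibility check. This is a simplified model, not PostgreSQL's actual visibility code: it assumes the deleting transaction has committed and compares its id with the oldest xmin among running transactions (the "oldest xmin" value that VACUUM VERBOSE reports). The names `RowVersion` and `removable` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RowVersion:
    # Id of the (committed) transaction that deleted or updated the row;
    # None means the row version is still live.
    xmax: Optional[int]


def removable(row: RowVersion, active_xmins: Iterable[int], next_xid: int) -> bool:
    """A dead row is removable only if no running transaction can still see it."""
    if row.xmax is None:
        return False
    # With no running transactions, the horizon is the next transaction id.
    oldest_xmin = min(active_xmins, default=next_xid)
    return row.xmax < oldest_xmin


dead_row = RowVersion(xmax=540)
print(removable(dead_row, active_xmins=[539], next_xid=600))  # False
print(removable(dead_row, active_xmins=[], next_xid=600))     # True
```

As long as the transaction with xmin 539 stays open, every row version deleted after it started remains off limits to VACUUM.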
WAYS OUT
■ VACUUM FULL: Needs a table lock
■ pg_squeeze:
■ Shrinking tables with less locking
■ Move between tablespaces
■ Index-organize tables
HINT: Try to avoid bloat in the first place!
ZHEAP
COMING TO THE RESCUE
ZHEAP: DESIGN GOALS
■ Perform UPDATE in place
■ Have smaller tables
  ■ smaller tuple headers
  ■ improved alignment
■ Reduce writes as much as possible
  ■ avoid dirtying pages unless data is modified
  ■ normal heaps dirty pages in some cases during reads
■ Reuse space more quickly
■ Get rid of VACUUM
ZHEAP: TUPLE HEADERS
■ Heap: 20+ bytes per row
■ Zheap: 5 bytes per row
How can this be achieved?
■ The tuple header controls “visibility”
■ “Normalize tuple header”
■ Move visibility info to the page level
ZHEAP: TRANSACTION SLOTS
Transaction slots hold transactional visibility
ZHEAP: TRANSACTION SLOTS
Transaction slots:
■ 16 bytes of storage
■ contains the following information
■ transaction id
■ epoch
■ latest undo record pointer of that transaction
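The three fields listed above can add up to exactly 16 bytes. A minimal sketch, assuming a 32-bit epoch, a 32-bit transaction id and a 64-bit undo record pointer; these field widths and the helper names are assumptions for illustration, not taken from the slides:

```python
import struct

# Assumed layout (hypothetical field widths): 32-bit epoch, 32-bit transaction id,
# 64-bit undo record pointer. "<" disables padding so the sizes simply add up.
SLOT_FORMAT = "<IIQ"


def pack_slot(epoch: int, xid: int, undo_ptr: int) -> bytes:
    """Serialize one transaction slot into its fixed-size form."""
    return struct.pack(SLOT_FORMAT, epoch, xid, undo_ptr)


def unpack_slot(raw: bytes) -> tuple:
    """Return (epoch, xid, undo_ptr) from a packed slot."""
    return struct.unpack(SLOT_FORMAT, raw)


slot = pack_slot(epoch=0, xid=539, undo_ptr=0x3EC00000)
print(len(slot))  # 16
```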
What if we need more slots?
ZHEAP: TPD PAGES
■ TPD: stores additional transaction slots if the “4” slots on a page are not enough
■ TPD pages are interleaved with normal pages
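The overflow behaviour can be sketched as follows. This toy model only shows the spill-over to TPD once the four slots on a page are taken; it deliberately leaves out slot reuse and the on-disk format, and the class and method names are hypothetical.

```python
SLOTS_PER_PAGE = 4  # the "4" from the slide


class ToyPage:
    def __init__(self):
        self.slots = []      # transaction ids holding a slot on the page itself
        self.tpd_slots = []  # overflow slots, kept on a separate TPD page

    def allocate_slot(self, xid: int) -> str:
        """Return where the slot for transaction `xid` lives: "page" or "tpd"."""
        if xid in self.slots:
            return "page"
        if xid in self.tpd_slots:
            return "tpd"
        if len(self.slots) < SLOTS_PER_PAGE:
            self.slots.append(xid)
            return "page"
        self.tpd_slots.append(xid)
        return "tpd"


page = ToyPage()
print([page.allocate_slot(xid) for xid in range(100, 105)])
# ['page', 'page', 'page', 'page', 'tpd']
```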
UNDO: HANDLING STALE DATA
OPERATION: INSERT
■ Allocate a transaction slot
■ Emit an undo entry to fix things on error
■ Space can be reclaimed instantly after a ROLLBACK
→ The simplest operation
OPERATION: UPDATE
■ More complicated:
■ The new row fits into the old space
■ The new row does not fit into the old space
OPERATION: UPDATE FITS
■ If the row is shorter:
■ We can overwrite it
■ Emit undo record
In short: we hold the new row in zheap and a copy of the old row in undo, so that the old row can be copied back into the table if it is needed.
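The interplay between the table and undo can be illustrated with a toy, single-transaction model. It is a sketch of the idea only, not zheap's page or undo record format; `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """Toy single-transaction model; not zheap's real page or undo format."""

    def __init__(self, rows):
        self.rows = dict(rows)  # row id -> current row value ("zheap")
        self.undo = []          # (row id, old value) records, newest last

    def update_in_place(self, row_id, new_value):
        # Emit the undo record first, then overwrite the row where it lives.
        self.undo.append((row_id, self.rows[row_id]))
        self.rows[row_id] = new_value

    def rollback(self):
        # Apply undo records newest-first to restore the old state.
        while self.undo:
            row_id, old_value = self.undo.pop()
            self.rows[row_id] = old_value

    def commit(self):
        # Simplification: real undo is discarded later, not at commit time.
        self.undo.clear()


table = ToyTable({1: "hello world"})
table.update_in_place(1, "hello")
table.rollback()
print(table.rows[1])  # hello world
```

The table never holds more than one version of the row; the previous version lives only in undo until it is no longer needed.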
OPERATION: UPDATE DOESN’T FIT
■ Costs more than an in-place UPDATE
■ DELETE old row
■ INSERT new row in a different place
■ Less efficient
Space can instantly be reclaimed in the following cases:
■ When updating a row to a shorter version
■ When non-inplace UPDATEs are performed
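The choice between the two UPDATE paths can be summarized in a small decision helper. This is a sketch under the assumption that "fits" means the new row is no longer than the old one; `plan_update` is a hypothetical name, and the real decision may take more into account (for example, free space on the page).

```python
def plan_update(old_len: int, new_len: int) -> str:
    """Pick the update path from the row sizes (hypothetical helper)."""
    if new_len <= old_len:
        # The new row fits into the old space: overwrite it and emit undo.
        return "in-place"
    # The new row does not fit: delete the old row, insert the new one elsewhere.
    return "delete+insert"


print(plan_update(old_len=40, new_len=40))  # in-place
print(plan_update(old_len=40, new_len=64))  # delete+insert
```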
OPERATION: DELETE
■ How it works
■ Emit undo record
■ DELETE row from zheap
Old row can be moved back into zheap during ROLLBACK.
UNDO PAGE FORMAT
ROLLBACK
ROLLBACK
■ In case a ROLLBACK happens:
■ undo has to make sure that the old state of the table is restored.
■ Old rows have to be copied back
■ ROLLBACK takes longer!
Undo itself can be removed in three cases:
■ as soon as no transaction can see the data anymore
■ as soon as all undo actions have been completed
■ for committed transactions: once their changes are all-visible
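These discard rules can be written down as a small predicate. It is a simplified reading of the slide, not the actual discard logic: "all-visible" is approximated as "older than the oldest xmin of any running transaction", and the function and parameter names are hypothetical.

```python
def undo_discardable(state: str, xid: int, oldest_active_xmin: int,
                     undo_applied: bool = False) -> bool:
    """Decide whether the undo of transaction `xid` may be thrown away."""
    if state == "aborted":
        # Rolled-back work: undo is needed until every undo action has run.
        return undo_applied
    if state == "committed":
        # Committed work: keep undo while an older snapshot may still need it.
        return xid < oldest_active_xmin
    # In-progress transactions always keep their undo.
    return False


print(undo_discardable("committed", xid=540, oldest_active_xmin=539))  # False
print(undo_discardable("committed", xid=540, oldest_active_xmin=600))  # True
print(undo_discardable("aborted", xid=541, oldest_active_xmin=600,
                       undo_applied=True))                              # True
```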
UNDO WORKERS
■ Discarding the undo logs is performed by the discard worker
■ The undo launcher checks the rollback_hash_table periodically
■ It spawns new undo workers to perform the rollback
■ Each spawned undo worker processes the rollback requests for a particular database.
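The per-database split of the rollback work can be pictured like this. The sketch only shows the grouping step; the request format (database name, transaction id) and the function name are assumptions, and no real background workers are started.

```python
from collections import defaultdict


def assign_workers(rollback_requests):
    """Group pending rollback requests by database: one worker per database."""
    per_database = defaultdict(list)
    for database, xid in rollback_requests:
        per_database[database].append(xid)
    return dict(per_database)


pending = [("test", 10), ("prod", 11), ("test", 12)]
print(assign_workers(pending))
# {'test': [10, 12], 'prod': [11]}
```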
UNDO LOG PROCESSING
OBSERVATIONS
PREPARING DATA
■ Creating some random data
test=# SET temp_buffers TO '1 GB';
SET
test=# CREATE TEMP TABLE raw AS
SELECT id,
hashtext(id::text) as name,
random() * 10000 AS n, true AS b
FROM generate_series(1, 10000000) AS id;
SELECT 10000000
LOADING A HEAP
■ Populating a normal table
test=# \timing
Timing is on.
test=# CREATE TABLE h1 (LIKE raw) USING heap;
CREATE TABLE
Time: 7.836 ms
test=# INSERT INTO h1 SELECT * FROM raw;
INSERT 0 10000000
Time: 7495.798 ms (00:07.496)
LOADING A ZHEAP
■ Mind the runtime
test=# CREATE TABLE z1 (LIKE raw) USING zheap;
CREATE TABLE
Time: 8.045 ms
test=# INSERT INTO z1 SELECT * FROM raw;
INSERT 0 10000000
Time: 27947.516 ms (00:27.948)
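To put the two load times side by side, a quick calculation using only the timings shown on the slides:

```python
ROWS = 10_000_000
heap_ms = 7495.798    # INSERT INTO h1 ...
zheap_ms = 27947.516  # INSERT INTO z1 ...

slowdown = zheap_ms / heap_ms
heap_rate = ROWS / (heap_ms / 1000)    # rows per second
zheap_rate = ROWS / (zheap_ms / 1000)  # rows per second

print(f"zheap load is {slowdown:.1f}x slower")  # zheap load is 3.7x slower
print(f"heap:  {heap_rate:,.0f} rows/s")
print(f"zheap: {zheap_rate:,.0f} rows/s")
```

These are single runs on the presenter's machine, so the ratio is an indication rather than a benchmark.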
ZHEAP IS SMALLER
■ Smaller tuple headers make a difference
test=# \d+
                       List of relations
  Schema   | Name | Type  | Owner | Persistence |  Size  | ...
-----------+------+-------+-------+-------------+--------+----
 pg_temp_5 | raw  | table | hs    | temporary   | 498 MB |
 public    | h1   | table | hs    | permanent   | 498 MB |
 public    | z1   | table | hs    | permanent   | 251 MB |
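The listed sizes translate into an average on-disk footprint per row. A short calculation from the slide's numbers; the result includes page-level overhead, so it is an average and not the exact tuple size:

```python
MIB = 1024 * 1024
ROWS = 10_000_000
sizes_mib = {"h1 (heap)": 498, "z1 (zheap)": 251}

bytes_per_row = {name: size * MIB / ROWS for name, size in sizes_mib.items()}

for name, value in bytes_per_row.items():
    print(f"{name}: {value:.1f} bytes per row")
# h1 (heap): 52.2 bytes per row
# z1 (zheap): 26.3 bytes per row
```

The difference of roughly 26 bytes per row is larger than the header reduction alone (20+ bytes down to 5), which suggests that the improved alignment named in the design goals contributes as well.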
ZHEAP IN ACTION
test=# BEGIN;
BEGIN
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
test=*# UPDATE z1 SET id = id + 1;
UPDATE 10000000
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
UNDO IN ACTION
[hs@hs-MS-7817 undo]$ pwd
/home/hs/db13/base/undo
[hs@hs-MS-7817 undo]$ ls -l | tail -n 10
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EC00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003ED00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EE00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EF00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F000000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F100000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F200000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F300000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F400000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F500000
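The listing is consistent with undo being written as fixed-size segment files. The sketch below assumes (this is not confirmed by the slides) that the part before the dot is an undo log number and the part after it is a hexadecimal byte offset; `parse_undo_segment` is a hypothetical helper.

```python
MIB = 1024 * 1024


def parse_undo_segment(name: str):
    """Split "LOG.OFFSET" into two integers, reading both parts as hex (assumption)."""
    log_number, offset = name.split(".")
    return int(log_number, 16), int(offset, 16)


names = ["000001.003EC00000", "000001.003ED00000", "000001.003F500000"]
offsets = [parse_undo_segment(name)[1] for name in names]

# Neighbouring files are exactly one file size (1048576 bytes) apart.
print(offsets[1] - offsets[0])       # 1048576
# End of the newest segment, i.e. how far this undo log has grown.
print((offsets[-1] + MIB) // MIB)    # 1014
```

Under that reading, the UPDATE of 10 million rows has pushed this undo log to roughly 1 GB.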
ROADMAP
WHAT WE ARE WORKING ON
■ agree on final design issues
■ fix bugs in current code
■ large code base
■ not easy to handle
■ preparing a patch to move “undo” to core
■ “undo” is core infrastructure
We hope to bring this into core some day.
QUESTIONS?
Feel free to contact me!
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
TWITTER @postgresql_007
