The document discusses PostgreSQL's physical storage structure. It describes the various directories within the PGDATA directory that stores the database, including the global directory containing shared objects and the critical pg_control file, the base directory containing numeric files for each database, the pg_tblspc directory containing symbolic links to tablespaces, and the pg_xlog directory which contains write-ahead log (WAL) segments that are critical for database writes and recovery. It notes that tablespaces allow spreading database objects across different storage devices to optimize performance.
A couple of things about PostgreSQL...
1. A couple of things to know about PostgreSQL...
(Before start coding)
Federico Campoli
9 July 2013
Federico Campoli () A couple of things to know about PostgreSQL... 9 July 2013 1 / 58
5. Introduction
What is blue, bigger on the inside and with time travel capabilities?
If your answer is the TARDIS, then yes, you’re close enough, but the correct
answer is PostgreSQL.
And regarding the couple of things...
I lied.
7. Introduction
PostgreSQL is a wild beast.
We’ll talk about the common mistakes, the confusing jargon, the online manual’s
lost pages and the best practices to spare your DBA headaches (if you have
one, of course).
The major version used in this talk is 9.2.
So, let’s start with the table of contents.
14. Table of contents
A byte it’s a byte, it’s a byte it’s a byte
The database physical storage.
The magic of the MVCC
How PostgreSQL keeps things consistent.
TOAST Please, and don’t forget the Marmite
The power of the out of line storage up to 1 GB and free of charge.
It’s bigger in the inside
The database memory, how to stick an elephant in a smart car.
The answer is 42
Explaining the unexplainable, the CBO and the execution plan.
Why do we fall?
Crashing the most advanced open source database is easy...
And I thought my jokes were bad
And then I’ll need a back door to escape....
15. A byte it’s a byte, it’s a byte it’s a byte
PostgreSQL stores the data in a dedicated directory, also known as the cluster,
identified by the environment variable $PGDATA on Unix and %PGDATA% on Windows.
The location is initialized by the utility initdb and contains various subdirectories.
Each folder has a specific function.
Figure: PGDATA
19. A byte it’s a byte, it’s a byte it’s a byte
The global directory
Contains the cluster’s shared objects like pg_database, pg_tablespace...
and a small 8 kB file, pg_control, probably the most important file in the entire
system.
pg_control tracks the database’s vital activities.
With a corrupted or missing pg_control the database cannot start.
20. A byte it’s a byte, it’s a byte it’s a byte
The base directory
The default location when a new database is created without the TABLESPACE
clause.
Contains numeric subfolders, one for each database.
The number is the database object id and is stored in the pg_database system
table.
Contains an optional folder, pgsql_tmp, used for external sorts and temporary files.
The location is mapped in the pg_tablespace system table with the name
pg_default.
28. A byte it’s a byte, it’s a byte it’s a byte
The base directory
Each subdirectory contains... just guess... numeric files.
Each file can grow up to 1 GB, then a new chunk is generated with a sequence
number suffix.
The data files are organized in fixed-size blocks, by default 8192 bytes.
Any change to the block size requires a build from source and a new initdb.
The data files are called nodes and are mapped to the relations in the pg_class
system table...
And yes, we are dealing with an object-relational database system.
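
The arithmetic behind the 1 GB chunks and the 8192-byte blocks can be sketched in a few lines of Python. This is only an illustration under the default sizes quoted above; the function name locate_block and the sample node number are made up for the example and are not part of PostgreSQL.

```python
BLOCK_SIZE = 8192                  # default block size, in bytes
SEGMENT_SIZE = 1024 ** 3           # a data file grows up to 1 GB
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def locate_block(relfilenode, block_number):
    """Return (file name, byte offset) for a block of a relation.

    Hypothetical helper: it only mirrors the naming rule described in
    the slide (first chunk without suffix, then .1, .2, ...).
    """
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    name = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return name, block_in_segment * BLOCK_SIZE


# Example with an invented node number.
print(locate_block(16384, 0))
print(locate_block(16384, BLOCKS_PER_SEGMENT + 1))
```

With the default sizes each chunk holds 131072 blocks, so the block right after the first full chunk lands at offset 0 of the file with the .1 suffix.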
35. A byte it’s a byte, it’s a byte it’s a byte
The pg_tblspc directory
Contains the symbolic links to the tablespaces.
A tablespace is a logical location for physical objects.
Useful to spread tables and indices across different physical devices.
Combined with logical volume management it can dramatically improve
performance...
or drive the project to a complete failure.
An object’s tablespace can be safely changed, but this requires an
exclusive lock on the affected object.
The system table pg_tablespace maps the tablespace names to their identifiers.
46. A byte it’s a byte, it’s a byte it’s a byte
The pg_xlog directory
WARNING INCOMING AIRSTRIKE
Also known as the write-ahead log (WAL) directory.
It is the most important and critical directory in the cluster.
Contains 16 MB segments used by the database to save the block changes.
Each segment contains the blocks changed in volatile memory.
Not needed for recovery when the database is stopped cleanly.
Absolutely critical when a crash or unclean shutdown happens.
A single block corruption results in an unrecoverable instance.
The number of segments is automatically managed by the database.
Putting the location on a dedicated and highly reliable device is vital.
47. A byte it’s a byte, it’s a byte it’s a byte
Pages
Voyage to the centre of the data file
Each block is structured in almost the same way for tables and indices.
Figure: Page schema
48. A byte it’s a byte, it’s a byte it’s a byte
Pages
Each page starts with a 24-byte header.
Right after the header, in the upper section, reside the tuple pointers, 4
bytes each.
The physical tuples are stored starting from the page’s end.
The optional bitmap that tracks nulls belongs to each tuple’s header, not to the
page header.
Figure: Page header
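
As a rough illustration of this layout, the free space left in a page can be estimated from the numbers above. The sketch below is a simplification: it ignores alignment padding and any special space at the page’s end, and the function name free_space is made up for the example.

```python
PAGE_SIZE = 8192           # default block size
PAGE_HEADER_SIZE = 24      # page header
ITEM_POINTER_SIZE = 4      # one pointer per tuple


def free_space(tuple_sizes):
    """Bytes left between the pointer array and the tuple area.

    Simplified model: pointers grow downwards from the header,
    tuples grow upwards from the page's end; no padding is counted.
    """
    pointers_end = PAGE_HEADER_SIZE + ITEM_POINTER_SIZE * len(tuple_sizes)
    tuples_start = PAGE_SIZE - sum(tuple_sizes)
    if tuples_start < pointers_end:
        raise ValueError("tuples do not fit in one page")
    return tuples_start - pointers_end


print(free_space([]))            # empty page
print(free_space([100] * 10))    # ten tuples of 100 bytes each
```

The two areas grow towards each other, which is why a page can fill up either with many small tuples or with a few large ones.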
53. A byte it’s a byte, it’s a byte it’s a byte
Page header
The page header contains a couple of interesting things...
pd_lsn is the most recent WAL sequence number for the page.
pd_tli is the page’s timeline id.
Yes, PostgreSQL has timelines...
When a point-in-time recovery is performed a new timeline is created to avoid
conflicts and paradoxes.
54. A byte it’s a byte, it’s a byte it’s a byte
Page header
People assume that transactions in PostgreSQL are a strict progression of xid, but
actually from a non-linear, non-subjective viewpoint and thanks to the timelines,
it’s more like a big ball of wibbly wobbly... timey wimey... stuff
Figure: Would you like a jelly baby?
55. A byte it’s a byte, it’s a byte it’s a byte
The tuples
Now we can finally look at the physical tuples and discover another header, 23
bytes long on most machines. The numbers are the bytes used by the single values.
Each tuple, even a simple boolean value, has a 23-byte overhead.
The user data can be the actual data stream or the pointer to the out-of-line
data stream.
Figure: Tuple structure
56. The magic of the MVCC
PostgreSQL consistency
Statements in PostgreSQL happen inside transactions.
By default, when a single statement completes successfully the database
commits the transaction automatically.
It’s possible to wrap multiple statements in a single transaction using the
keywords BEGIN; ... COMMIT; (or ROLLBACK;).
The lowest transaction isolation level actually available is READ COMMITTED.
Only the committed changes become visible to other sessions.
Any error or ROLLBACK statement during the transaction cancels the entire
operation, leaving the data in a consistent state at any time during the database
activity.
PostgreSQL supports savepoints to partially roll back a long transaction.
57. The magic of the MVCC
How PostgreSQL keeps things consistent
To keep everything consistent PostgreSQL uses Multi-Version Concurrency
Control, also known as MVCC.
The base logic seems simple.
A 4-byte unsigned integer called xid is incremented by 1 and assigned to the
current transaction.
Every committed xid lower than the current xid is in the past and therefore visible to
the current session.
Every xid greater than the current xid is in the future and therefore invisible to the
current session.
The commit status is managed in $PGDATA using the directory pg_clog, where
small files made of 8 kB pages track the transaction statuses.
The xid match is performed on the tuple’s header seen before.
60. The magic of the MVCC
t_xmin contains the xid generated at tuple insert.
t_xmax contains the xid generated at tuple delete.
t_cid contains the internal command id to track the sequence inside the same
transaction.
There’s something missing, isn’t there? Where is the field to store the UPDATE xid?
Figure: Tuple structure
61. The magic of the MVCC
Well, PostgreSQL actually NEVER performs an update in place.
When an UPDATE statement is issued the updated rows are inserted as new versions
with t_xmin set to the current xid value.
The old row versions are marked as dead by writing the current transaction id
into the t_xmax field.
The database manages the tuple’s visibility using this simple routine:
62. The magic of the MVCC
Source code comment in src/backend/utils/time/tqual.c:
/*
 *
 * The satisfaction of "now" requires the following:
 *
 * ((Xmin == my-transaction &&          inserted by the current transaction
 *   Cmin < my-command &&               before this command, and
 *   (Xmax is null ||                   the row has not been deleted, or
 *    (Xmax == my-transaction &&        it was deleted by the current transaction
 *     Cmax >= my-command)))            but not before this command,
 * ||                                   or
 *  (Xmin is committed &&               the row was inserted by a committed transaction, and
 *   (Xmax is null ||                   the row has not been deleted, or
 *    (Xmax == my-transaction &&        the row is being deleted by this transaction
 *     Cmax >= my-command) ||           but it’s not deleted "yet", or
 *    (Xmax != my-transaction &&        the row was deleted by another transaction
 *     Xmax is not committed))))        that has not been committed
 *
 */
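
The rule in the comment can be tried out with a toy model in Python. This is a sketch, not the PostgreSQL implementation: the class TupleVersion, the functions visible_now and update, and the set of committed xids are all invented for the example, and real snapshots are more involved than a plain set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    value: str
    xmin: int                     # xid that inserted this version
    cmin: int                     # command id of the insert
    xmax: Optional[int] = None    # xid that deleted it, if any
    cmax: Optional[int] = None    # command id of the delete


def visible_now(t, my_xid, my_cid, committed):
    """Toy version of the "now" rule quoted in the comment."""
    if t.xmin == my_xid:
        # Inserted by the current transaction: must be before this command.
        if not t.cmin < my_cid:
            return False
        if t.xmax is None:
            return True
        return t.xmax == my_xid and t.cmax >= my_cid
    if t.xmin in committed:
        if t.xmax is None:
            return True
        if t.xmax == my_xid:
            return t.cmax >= my_cid        # being deleted, but not "yet"
        return t.xmax not in committed     # deleter has not committed
    return False


def update(heap, old, new_value, xid, cid):
    """An UPDATE marks the old version dead and inserts a new one."""
    old.xmax, old.cmax = xid, cid
    heap.append(TupleVersion(new_value, xmin=xid, cmin=cid))


committed = {100}
heap = [TupleVersion("v1", xmin=100, cmin=0)]
update(heap, heap[0], "v2", xid=101, cid=0)

# Another session (xid 102) while transaction 101 is still open.
before = [t.value for t in heap if visible_now(t, 102, 0, committed)]
# Transaction 101 itself, at its next command.
own_next = [t.value for t in heap if visible_now(t, 101, 1, committed)]
committed.add(101)
after = [t.value for t in heap if visible_now(t, 102, 0, committed)]
print(before, own_next, after)
```

In this model the other session keeps seeing the old version until the updating transaction commits, while the updating transaction sees its own new version from the following command onwards.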
63. The magic of the MVCC
Source code comment in src/backend/utils/time/tqual.c:
* HeapTupleSatisfiesNow
* True iff heap tuple is valid "now".
*
* Here, we consider the effects of:
* all committed transactions (as of the current instant)
* previous commands of this transaction
*
* Note we do _not_ include changes made by the current command. This
* solves the "Halloween problem" wherein an UPDATE might try to re-update
* its own output tuples, http://en.wikipedia.org/wiki/Halloween_Problem.
*
* Note:
* Assumes heap tuple is valid.
64. The magic of the MVCC
The dead tuples are not immediately reclaimed and add overhead to any I/O
operation, as the whole block is read to determine which tuples are visible.
To free the space the VACUUM command should be used.
The command is absolutely safe.
It’s designed to have minimal impact on the database’s normal activity.
VACUUM scans the relation and its indices for dead tuples no longer visible to
any open transaction.
It is absolutely vital to run VACUUM on each database in the cluster at least
once every 2 billion transactions.
69. The magic of the MVCC
XID is a 4-byte unsigned integer.
Every 4 billion transactions the value wraps around.
PostgreSQL uses the modulo-2^31 comparison method.
For each value, 2 billion XIDs are in the future and 2 billion in the past.
When an xid’s age gets too close to 2 billion, VACUUM freezes the value
to a hardcoded xid that is in the past by definition.
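
The circular comparison can be sketched in Python as follows. It is a minimal illustration for normal xids only: the function name xid_precedes is made up, and the special reserved xids (such as the frozen one) are deliberately left out.

```python
def xid_precedes(a, b):
    """True if xid a is logically older than xid b (normal xids only)."""
    diff = (a - b) & 0xFFFFFFFF      # 32-bit unsigned subtraction
    if diff >= 0x80000000:           # reinterpret the result as signed
        diff -= 0x100000000
    return diff < 0


print(xid_precedes(100, 200))          # plain case
print(xid_precedes(4294967000, 50))    # 50 was assigned after the wrap
```

Because only the signed difference matters, an xid assigned shortly after the counter wraps still counts as newer than one assigned shortly before it.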
72. The magic of the MVCC
If for any reason an xid gets within 10 million transactions of the wraparound
failure, the database starts emitting scary messages:
WARNING: database "mydb" must be vacuumed within 177009986 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
If an xid gets within 1 million transactions of the wraparound failure, the
database simply shuts down and can be started only in single-user mode to perform
the VACUUM.
Anyway, the autovacuum daemon, even if turned off, starts the required VACUUM
long before this catastrophic scenario happens.
77. TOAST Please, and don’t forget the Marmite
TOAST, the best thing since sliced bread
Funny people indeed
TOAST is the acronym for The Oversized-Attribute Storage Technique.
The attribute is also known as field.
TOAST can store up to 1 GB in out-of-line storage (free of charge).
81. TOAST Please, and don’t forget the Marmite
Fixed-length data types such as integer, date and timestamp are not TOASTable.
Their data is stored right after the tuple header.
Varlena data types, such as character varying without an upper bound, text or
bytea, are stored either in line or out of line.
The storage technique used depends on the size of the data and on the storage
method assigned to the attribute.
Depending on the chosen strategy, the data can be stored in an external
relation or compressed with a fast algorithm of the LZ family (pglz).
82. TOAST Please, and don’t forget the Marmite
TOAST permits four storage strategies (shamelessly copied from the online
manual).
PLAIN prevents either compression or out-of-line storage. This is the only
possible strategy for columns of non-TOAST-able data types.
EXTENDED allows both compression and out-of-line storage. This is the
default for most TOAST-able data types. Compression will be attempted
first, then out-of-line storage if the row is still too big.
EXTERNAL allows out-of-line storage but not compression. Use of
EXTERNAL will make substring operations on wide text and bytea columns
faster at the penalty of increased storage space.
MAIN allows compression but not out-of-line storage. Actually, out-of-line
storage will still be performed for such columns, but only as a last resort.
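The four strategies can be pictured with a small model. This is only a sketch under stated assumptions: the 2000 byte threshold is illustrative, the decision is taken on a single value rather than on the whole row, zlib merely stands in for PostgreSQL's own compressor, and the function name toast_actions is made up for the example.

```python
import random
import zlib

# (may compress, may move out of line) for each strategy described above.
STRATEGIES = {
    "PLAIN": (False, False),
    "EXTENDED": (True, True),
    "EXTERNAL": (False, True),
    "MAIN": (True, True),  # out-of-line storage only as a last resort
}

# Assumed size limit in bytes, for illustration only.
THRESHOLD = 2000


def toast_actions(value: bytes, strategy: str, threshold: int = THRESHOLD) -> list:
    """Return the actions this toy model takes to store `value`."""
    may_compress, may_move = STRATEGIES[strategy]
    actions = []
    size = len(value)
    if size > threshold and may_compress:
        compressed = len(zlib.compress(value))  # zlib is only a stand-in
        if compressed < size:
            size = compressed
            actions.append("compress")
    if size > threshold and may_move:
        actions.append("out-of-line")
    return actions or ["inline"]


repetitive = b"a" * 10_000                      # compresses very well
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(10_000))  # barely compressible

for strategy in STRATEGIES:
    print(strategy, toast_actions(repetitive, strategy), toast_actions(noisy, strategy))
```

In this model a highly repetitive value stays in line once compressed, while a noisy one is moved out of line by every strategy except PLAIN.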
83. TOAST Please, and don’t forget the Marmite
When out-of-line storage is used, the data is stored as bytea and, if needed,
split into multiple chunks.
A unique index on chunk_id and chunk_seq both prevents duplicate data and
speeds up the lookups.
Figure: Toast table
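The chunking can be sketched in a few lines. This is a simplified model, not the server code: the chunk size of 2000 bytes and the chunk id 16385 are assumptions chosen for the example, and the helper names toast_chunks and detoast are hypothetical.

```python
CHUNK_SIZE = 2000  # assumed chunk size in bytes, for illustration only


def toast_chunks(chunk_id: int, value: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split `value` into (chunk_id, chunk_seq, chunk_data) rows."""
    return [
        (chunk_id, seq, value[offset:offset + chunk_size])
        for seq, offset in enumerate(range(0, len(value), chunk_size))
    ]


def detoast(rows: list, chunk_id: int) -> bytes:
    """Rebuild a value by reading its chunks in chunk_seq order."""
    parts = sorted((seq, data) for cid, seq, data in rows if cid == chunk_id)
    return b"".join(data for _, data in parts)


value = bytes(range(256)) * 20        # 5120 bytes of sample data
rows = toast_chunks(16385, value)     # 16385 is a made-up chunk id
keys = [(cid, seq) for cid, seq, _ in rows]

print([len(data) for _, _, data in rows])   # chunk sizes
print(len(keys) == len(set(keys)))          # (chunk_id, chunk_seq) is unique
print(detoast(rows, 16385) == value)        # the value is rebuilt intact
```

The pair (chunk_id, chunk_seq) is what the unique index covers: it identifies each chunk and gives the order in which the chunks are read back.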
84. It’s bigger in the inside
A PostgreSQL instance is a shared memory segment used by multiple processes
accessing the data directory.
When a new connection arrives, a new postgres process is forked and attached to
the shared memory, also known as the shared buffer.
PostgreSQL is a multiprocess database system, not a multithreaded one.
Each process can use only one processor or core.
To keep things consistent, every single block, whether read or written, must
pass through the shared buffer.
As the shared buffer is smaller than the database, and often smaller than a
single table, the blocks in memory must be managed and the space allocation
must adapt to the actual usage.
85. It’s bigger in the inside
Jargon
backend process: a postgres process attached to the shared buffer
heap page: a table’s data page
index page: an index data page
buffer: a page, index or heap, loaded in the shared buffer
dirty buffer: a buffer modified and WAL-logged but not yet written to disk
clean buffer: a buffer whose content has been consolidated on disk
pinned buffer: buffer held by a backend process
unpinned buffer: buffer released and available to be pinned again
86. It’s bigger in the inside
In the early days of PostgreSQL 7.x a simple least recently used (LRU) buffer
replacement policy was used.
After the unpin, the algorithm moved the buffer to the top of a simple
FIFO list.
During the revolutionary 8.0 development, a new powerful algorithm was
introduced:
the Adaptive Replacement Cache (ARC), capable of adapting by itself the size of
two pools dedicated to the most recently used and the most frequently used
buffers.
This algorithm was removed a few weeks before the stable release because of a
software patent.
An emergency two-queue algorithm was adopted, making the memory
management less brilliant than expected.
The next year, release 8.1 adopted the clock sweep memory manager.
The algorithm, simple, flexible and free, is still in use with a few improvements.
87. It’s bigger in the inside
The buffer manager’s main goal is to keep the most recently used blocks cached
in memory and to adapt dynamically to the most frequently used blocks.
To do this a small portion of memory is used as a free list of the buffers
available for eviction.
Figure: Free list
88. It’s bigger in the inside
Besides its reference count (pin count), each buffer has a usage counter. Every
time the buffer is pinned the usage counter is incremented by one, up to a small
limit.
Figure: Block usage counter
89. It’s bigger in the inside
Shamelessly copied from the file src/backend/storage/buffer/README
There is a "free list" of buffers that are prime candidates for replacement. In
particular, buffers that are completely free (contain no valid page) are always in
this list.
To choose a victim buffer to recycle when there are no free buffers available, we
use a simple clock-sweep algorithm, which avoids the need to take system-wide
locks during common operations.
90. It’s bigger in the inside
It works like this:
Each buffer header contains a usage counter, which is incremented (up to a small
limit value) whenever the buffer is pinned. (This requires only the buffer header
spinlock, which would have to be taken anyway to increment the buffer reference
count, so it’s nearly free.)
The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
through all the available buffers. NextVictimBuffer is protected by the
BufFreelistLock.
91. It’s bigger in the inside
The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain BufFreelistLock.
2. If buffer free list is nonempty, remove its head buffer. If the buffer is pinned
   or has a nonzero usage count, it cannot be used; ignore it and return to the
   start of step 2. Otherwise, pin the buffer, release BufFreelistLock, and return
   the buffer.
3. Otherwise, select the buffer pointed to by NextVictimBuffer, and circularly
   advance NextVictimBuffer for next time.
4. If the selected buffer is pinned or has a nonzero usage count, it cannot be
   used. Decrement its usage count (if nonzero) and return to step 3 to
   examine the next buffer.
5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
(Note that if the selected buffer is dirty, we will have to write it out before we can
recycle it; if someone else pins the buffer meanwhile we will have to give up and
try another buffer. This however is not a concern of the basic
select-a-victim-buffer algorithm.)
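The five steps can be tried out with a toy model. This is a single-threaded sketch, not the real buffer manager: locking is left out, the cap MAX_USAGE = 5 is an assumed value for the "small limit", and the class and method names are invented for the example.

```python
from dataclasses import dataclass

MAX_USAGE = 5  # assumed cap for the usage counter


@dataclass
class Buffer:
    page: object = None  # None means the buffer holds no valid page
    pins: int = 0        # reference count
    usage: int = 0       # usage counter


class BufferPool:
    def __init__(self, size: int):
        self.buffers = [Buffer() for _ in range(size)]
        self.free_list = list(range(size))  # completely free buffers
        self.next_victim = 0                # the "clock hand"

    def pin(self, idx: int) -> None:
        buf = self.buffers[idx]
        buf.pins += 1
        buf.usage = min(buf.usage + 1, MAX_USAGE)

    def unpin(self, idx: int) -> None:
        self.buffers[idx].pins -= 1

    def get_victim(self) -> int:
        """Steps 2 to 5 of the list above (step 1, the lock, is omitted)."""
        # Step 2: try the free list first.
        while self.free_list:
            idx = self.free_list.pop(0)
            buf = self.buffers[idx]
            if buf.pins == 0 and buf.usage == 0:
                self.pin(idx)
                return idx
        if all(buf.pins > 0 for buf in self.buffers):
            raise RuntimeError("no unpinned buffers available")
        # Steps 3 and 4: sweep the clock hand, decrementing usage counters.
        while True:
            idx = self.next_victim
            self.next_victim = (self.next_victim + 1) % len(self.buffers)
            buf = self.buffers[idx]
            if buf.pins == 0 and buf.usage == 0:
                self.pin(idx)  # step 5
                return idx
            if buf.usage > 0:
                buf.usage -= 1


pool = BufferPool(4)
for page in "ABCD":                 # load four pages, then release them
    idx = pool.get_victim()
    pool.buffers[idx].page = page
    pool.unpin(idx)
for _ in range(2):                  # page A is accessed two more times
    pool.pin(0)
    pool.unpin(0)

victim = pool.get_victim()
print(victim, pool.buffers[victim].page)
```

The frequently pinned page A survives the sweep because its usage counter needs more passes of the clock hand to reach zero than the counters of the other pages.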
92. It’s bigger in the inside
Figure: The NextVictimBuffer
93. It’s bigger in the inside
Since version 8.3 the buffer manager has a ring buffer strategy.
Operations which require a large number of buffers, like VACUUM or sequential
scans of large tables, use a dedicated 256 kB ring buffer, small enough to
fit in the processor’s L2 cache.
The strategy improves buffer loading and eviction and protects the rest of the
shared buffer.
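A rough way to picture the ring: assuming the default 8 kB block size, a 256 kB ring holds 32 buffers, and a large scan keeps recycling those 32 slots instead of spreading over the whole shared buffer. The class RingScan below is a hypothetical model written for this illustration only.

```python
PAGE_SIZE = 8 * 1024                  # default PostgreSQL block size
RING_SIZE = 256 * 1024 // PAGE_SIZE   # 32 buffers in a 256 kB ring


class RingScan:
    """Toy model of a scan that recycles its own small ring of buffers."""

    def __init__(self, ring_size: int = RING_SIZE):
        self.ring = [None] * ring_size  # block number held by each slot
        self.position = 0

    def read_block(self, block_no: int) -> int:
        slot = self.position
        self.ring[slot] = block_no      # overwrite the oldest block
        self.position = (self.position + 1) % len(self.ring)
        return slot


scan = RingScan()
slots_used = {scan.read_block(block_no) for block_no in range(1000)}
print(len(slots_used), min(scan.ring), max(scan.ring))
```

However many blocks the scan reads, it never occupies more than the ring's slots, so the buffers cached for other backends are left alone.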
94. The answer is 42
How PostgreSQL executes a query
After the physical storage and the memory, let’s take a look at how the database
interacts with the backends from the logical point of view.
Jargon
OID: Object ID, a 4-byte unsigned integer used to map any system object to a
unique value
class: any relational object, table, index, view, sequence...
attribute: basically table fields
95. The answer is 42
The parser stage
When a query is sent for processing, PostgreSQL first performs a syntax analysis
using the query parser.
Any error in this phase stops the execution with a syntax error.
As this stage doesn’t require access to the system catalogue, no xid is wasted.
If the syntax is correct, the parser returns a parse tree ready for the next step.
96. The answer is 42
The query tree
The second stage is still managed by the parser, which accesses the system
catalogue and generates a query tree from the parse tree.
This is a logical representation of the query in which every object and
attribute is uniquely identified.
97. The answer is 42
The query tree
Figure: A simple query tree
98. The answer is 42
The query tree
To generate the query tree the parser accesses the system catalogue and retrieves
the corresponding OID for each class and attribute in the query.
Ambiguous names generate an error.
The optional filtering elements are translated into the query tree as well.
100. The answer is 42
The planner stage
The query tree is then sent to the query planner, which traverses the tree and
generates the possible execution plans, each with a cost in arbitrary units
estimated from the statistics collected on the database.
The plan with the minimum estimated cost is chosen for processing and sent to
the executor.
Let me stress again the word estimate.
A database with old or missing statistics will generate inefficient plans,
resulting in slow queries.
101. The answer is 42
The executor
The planner then returns the execution plan, a sequence of steps to retrieve the
requested data, manipulate the data or change the database structure.
The last stage is the executor. The steps of the execution plan are executed,
then the output, if any, is returned to the backend.
105. The answer is 42
EXPLAIN (or EXTERMINATE)
The EXPLAIN statement returns the estimated execution plan for the
query that follows it.
The optional ANALYZE clause actually executes the query, discards the
results and returns the real execution plan.
DML queries run with EXPLAIN ANALYZE will change the data. They should be
wrapped between BEGIN; and ROLLBACK; to avoid unwanted results.
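As a small illustration of that wrapping, a helper can build the three statements to send to the server. The function name explain_analyze_safely is hypothetical, it only builds text and does not talk to a database, and t_test is the test table created in Listing 1.

```python
def explain_analyze_safely(statement: str) -> str:
    """Wrap a statement so that EXPLAIN ANALYZE leaves the data untouched."""
    body = statement.strip().rstrip(";")
    return f"BEGIN;\nEXPLAIN ANALYZE {body};\nROLLBACK;"


script = explain_analyze_safely("DELETE FROM t_test WHERE i_id = 20;")
print(script)
```

The DELETE is really executed to measure it, and the ROLLBACK then discards its effects.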
Let’s see EXPLAIN in action.
106. The answer is 42
For our purpose we’ll create a test table with two fields:
an identifier, a 4-byte integer with an auto-incrementing value, and a character
varying field to store md5 values.
The serial pseudo-type is shorthand for CREATE SEQUENCE t_test_i_id_seq (self
generated name);
followed by
integer NOT NULL DEFAULT nextval('t_test_i_id_seq'::regclass)
Listing 1: Create table
test=# CREATE TABLE t_test
(
    i_id serial,
    v_value character varying(50)
)
;
NOTICE: CREATE TABLE will create implicit sequence "t_test_i_id_seq" for serial column "t_test.i_id"
CREATE TABLE
107. The answer is 42
Now let’s add some rows to our table
Listing 2: Insert in table
test=# INSERT INTO t_test
    (v_value)
SELECT
    v_value
FROM
    (
        SELECT
            generate_series(1,1000) as i_cnt,
            md5(random()::text) as v_value
    ) t_gen
;
INSERT 0 1000
108. The answer is 42
Let’s generate the estimated plan for a query returning one row
Listing 3: EXPLAIN
test=# EXPLAIN SELECT * FROM t_test WHERE i_id=20;
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..21.50 rows=1 width=37)
   Filter: (i_id = 20)
(2 rows)
As the table has no indices, the only possible action is a sequential scan of the table.
The cost is expressed in arbitrary units.
The first number is the startup cost needed to deliver the first row to the next
operator or to the backend.
The second number is the total cost needed to complete the operation.
The rows value says how many rows the database expects to get from the operation,
and the width is the estimated average row width in bytes.
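The total of 21.50 can be reproduced by hand. The sketch below uses the documented default planner constants (seq_page_cost, cpu_tuple_cost, cpu_operator_cost) and assumes the 1000 rows occupy 9 pages; the page count is not shown in the listing, it is simply the value that makes the arithmetic match.

```python
SEQ_PAGE_COST = 1.0         # default seq_page_cost
CPU_TUPLE_COST = 0.01       # default cpu_tuple_cost
CPU_OPERATOR_COST = 0.0025  # default cpu_operator_cost


def seq_scan_cost(pages: int, rows: int, filters: int = 1) -> float:
    """Estimated total cost of a sequential scan with simple filter clauses."""
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = rows * CPU_TUPLE_COST + rows * filters * CPU_OPERATOR_COST
    return io_cost + cpu_cost


total = seq_scan_cost(pages=9, rows=1000)   # 9 pages is an assumption
print(f"{total:.2f}")                       # 21.50 with the assumed 9 pages
```

Every page and every row must be visited whatever the filter returns, which is why the estimate is the same for one row or for most of the table.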
109. The answer is 42
Let’s generate the real execution plan for a query returning one row
Listing 4: EXPLAIN ANALYZE
test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id=20;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..21.50 rows=1 width=37) (actual time=0.021..0.198 rows=1 loops=1)
   Filter: (i_id = 20)
   Rows Removed by Filter: 999
 Total runtime: 0.235 ms
(4 rows)
The second group of values gives the real time, in milliseconds, needed for the
startup and for the whole operation.
The loops value shows how many times the operator was executed.
At the bottom, the total runtime gives the real execution time of the query.
110. The answer is 42
Let’s add an index on the id field...
Listing 5: CREATE INDEX
test=# CREATE INDEX idx_i_id ON t_test (i_id);
CREATE INDEX
111. The answer is 42
and generate a new execution plan
Listing 6: EXPLAIN ANALYZE WITH INDEX
test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id=20;
                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using idx_i_id on t_test  (cost=0.00..8.27 rows=1 width=37) (actual time=0.019..0.020 rows=1 loops=1)
   Index Cond: (i_id = 20)
 Total runtime: 0.055 ms
(3 rows)
The scan itself is about ten times faster (0.020 ms against 0.198 ms) and the
total runtime drops from 0.235 ms to 0.055 ms.
112. The answer is 42
The cost-based optimizer keeps getting cleverer. For example, if we ask for
more than half of the estimated rows of the table, the database chooses the
cheaper execution plan, here the sequential scan, even though an index is available.
Listing 7: EXPLAIN ANALYZE WITH INDEX
test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id>20;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..21.50 rows=980 width=37) (actual time=0.013..0.148 rows=980 loops=1)
   Filter: (i_id > 20)
   Rows Removed by Filter: 20
 Total runtime: 0.209 ms
(4 rows)
test=# SET enable_seqscan='off';
SET
test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id>20;
                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Index Scan using idx_i_id on t_test  (cost=0.00..49.40 rows=980 width=37) (actual time=0.042..0.390 rows=980 loops=1)
   Index Cond: (i_id > 20)
 Total runtime: 0.507 ms
(3 rows)
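The choice made in these listings boils down to picking the candidate with the lowest estimated total cost. A minimal sketch using only the totals printed by EXPLAIN; the function choose_plan and the constant DISABLE_PENALTY are illustrative, and the sketch assumes that switching off enable_seqscan makes the sequential scan look extremely expensive rather than removing it from the candidates.

```python
DISABLE_PENALTY = 1.0e10  # assumed very large cost added to a disabled node


def choose_plan(candidates: dict) -> str:
    """Return the name of the plan with the lowest estimated total cost."""
    return min(candidates, key=candidates.get)


one_row = {"Seq Scan": 21.50, "Index Scan": 8.27}      # WHERE i_id=20
most_rows = {"Seq Scan": 21.50, "Index Scan": 49.40}   # WHERE i_id>20
seqscan_off = {"Seq Scan": 21.50 + DISABLE_PENALTY, "Index Scan": 49.40}

print(choose_plan(one_row))
print(choose_plan(most_rows))
print(choose_plan(seqscan_off))
```

With the sequential scan penalised, the index scan wins even though its estimate (49.40) is more than twice the estimate of the plan the optimizer would normally pick.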
115. The answer is 42
Scan nodes
seq scan: scans all the blocks of the table sequentially and discards the rows
that do not match
index scan: reads the index tree with random disk reads. It returns
ordered data
bitmap index/heap scan: reads the index sequentially, generating a bitmap of the
blocks to recheck on the table. It doesn’t return ordered data. It’s a good
compromise between a seq scan and a full index scan
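The difference between the three scan nodes can be shown on a tiny in-memory table. This is a toy model: the heap is a list of four-row pages, the index is a sorted list of pointers, and the function names are made up for the example.

```python
import bisect

ROWS_PER_PAGE = 4  # tiny pages, for illustration only

# Heap: keys stored in insertion order, not sorted.
keys = [7, 3, 9, 1, 8, 2, 6, 4, 5, 0]
heap = [keys[i:i + ROWS_PER_PAGE] for i in range(0, len(keys), ROWS_PER_PAGE)]

# Index: sorted (key, page, slot) entries pointing into the heap.
index = sorted(
    (key, page_no, slot)
    for page_no, page in enumerate(heap)
    for slot, key in enumerate(page)
)


def first_entry_above(low: int) -> int:
    """Position of the first index entry with key > low."""
    return bisect.bisect_right(index, (low, float("inf"), float("inf")))


def seq_scan(low: int) -> list:
    """Read every page in order and discard the rows that do not match."""
    return [key for page in heap for key in page if key > low]


def index_scan(low: int) -> list:
    """Walk the index in key order, fetching one heap row per entry."""
    return [heap[p][s] for _, p, s in index[first_entry_above(low):]]


def bitmap_scan(low: int) -> list:
    """Collect the matching pages first, then visit each page once."""
    pages = sorted({p for _, p, _ in index[first_entry_above(low):]})
    return [key for p in pages for key in heap[p] if key > low]  # recheck


print(seq_scan(5))
print(index_scan(5))
print(bitmap_scan(5))
```

Only the index scan returns the keys in order; the bitmap scan returns them in page order and skips the pages that hold no matching row.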
118. The answer is 42
Join nodes
nested loop: for each row of the relation on the left, apply the filter to the
relation on the right
hash join:
merge join:
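The slide leaves hash join and merge join without a description. As a rough guide only, here are textbook versions of the three join methods on in-memory lists; this is not the PostgreSQL executor, and the function names and sample rows are invented for the example.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """For each left row, check every right row."""
    return [(l, r) for l in left for r in right if key(l) == key(r)]


def hash_join(left, right, key):
    """Build a hash table on the right side, then probe it with the left."""
    table = defaultdict(list)
    for r in right:
        table[key(r)].append(r)
    return [(l, r) for l in left for r in table.get(key(l), [])]


def merge_join(left, right, key):
    """Sort both sides on the key, then advance through them in step."""
    left, right = sorted(left, key=key), sorted(right, key=key)
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Pair every left row with that key to the whole run.
            while i < len(left) and key(left[i]) == kl:
                result.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return result


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (2, "w")]
by_id = lambda row: row[0]

for join in (nested_loop_join, hash_join, merge_join):
    print(join.__name__, len(join(left, right, by_id)))
```

All three return the same pairs; they differ in how much work is needed to find them, which is what the planner's cost estimate weighs.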
119. Why do we fall?
120. And I thought my jokes were bad
121. A couple of things to know about PostgreSQL...
(Before start coding)
Federico Campoli
9 July 2013