A couple of things to know about PostgreSQL...
(Before you start coding)
Federico Campoli
9 July 2013
Federico Campoli () A couple of things to know about PostgreSQL... 9 July 2013 1 / 58
Introduction
What is blue, bigger on the inside and with time travel capabilities?
If your answer is the TARDIS then yes, you’re close enough, but the correct
answer is PostgreSQL.
And regarding the couple of things...
I lied.
PostgreSQL is a wild beast.
We’ll talk about the common mistakes, the confusing jargon, the online manual’s
lost pages and the best practices to spare your DBA headaches (if you have
one, of course).
The major version used in this talk is 9.2.
So, let’s start with the TOC.
Table of contents
A byte it’s a byte, it’s a byte it’s a byte
The database physical storage.
The magic of the MVCC
How PostgreSQL keeps things consistent.
TOAST Please, and don’t forget the Marmite
The power of out-of-line storage: up to 1 GB and free of charge.
It’s bigger on the inside
The database memory: how to fit an elephant in a smart car.
The answer is 42
Explaining the unexplainable: the CBO and the execution plan.
Why do we fall?
Crashing the most advanced open source database is easy...
And I thought my jokes were bad
And then I’ll need a back door to escape...
A byte it’s a byte, it’s a byte it’s a byte
PostgreSQL stores the data in a dedicated directory identified by the environment
variable $PGDATA on Unix and %PGDATA% on Windows.
The location is initialized by the initdb utility and contains various subdirectories.
Each folder has a specific function. The whole location is also known as the cluster.
Figure: PGDATA
The global directory
Contains the cluster’s shared objects like pg_database, pg_tablespace...
and a small 8 kB file, pg_control, probably the most important file in the entire
system.
pg_control tracks the database’s vital activities.
With a corrupted or missing pg_control the database cannot start.
The base directory
The default location when a new database is created without the TABLESPACE
clause.
Contains numeric subfolders, one for each database.
The number is the database object id (OID) and is stored in the pg_database system
table.
Contains an optional folder, pgsql_tmp, used for external sorts and temporary files.
The location is mapped in the pg_tablespace system table under the name
pg_default.
Each subdirectory contains... just guess... numeric files.
Each file can grow up to 1 GB, then a new chunk is generated with a sequence
number suffix.
The data files are organized in fixed-size blocks, 8192 bytes by default.
Any change to the block size requires a build from source and a new initdb.
The data files are called nodes and are mapped to the relations in the pg_class system
table...
And yes, we are dealing with an object-relational database system.
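The file layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical (they are not part of PostgreSQL), the sizes are the defaults quoted in the slides, and the OIDs in the example are made up.

```python
import math

BLOCK_SIZE = 8192          # default block size, fixed at build time
SEGMENT_SIZE = 1024 ** 3   # each data file grows up to 1 GB


def relation_files(db_oid, relfilenode, size_bytes):
    """Return the data file paths, relative to $PGDATA, for one relation.

    The first file is named after the file node; every further 1 GB
    chunk gets a numeric suffix (.1, .2, ...).
    """
    segments = max(1, math.ceil(size_bytes / SEGMENT_SIZE))
    base = f"base/{db_oid}/{relfilenode}"
    return [base if n == 0 else f"{base}.{n}" for n in range(segments)]


def block_count(size_bytes):
    """Number of fixed-size blocks stored in size_bytes of data files."""
    return size_bytes // BLOCK_SIZE


if __name__ == "__main__":
    # Hypothetical OIDs: database 16384, file node 16397, a 2.5 GB table.
    for path in relation_files(16384, 16397, int(2.5 * SEGMENT_SIZE)):
        print(path)
    print(block_count(SEGMENT_SIZE), "blocks in a full segment")
```

With these made-up values the table spans three files (16397, 16397.1 and 16397.2), and a full 1 GB segment holds 131072 blocks of 8192 bytes.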
The pg_tblspc directory
Contains the symbolic links to the tablespaces.
A tablespace is a logical location for physical objects.
Useful to spread tables and indices across different physical devices.
Combined with logical volume management, it can dramatically improve
performance...
or drive the project to complete failure.
An object’s tablespace can be safely changed, but this requires an
exclusive lock on the affected object.
The pg_tablespace system table maps the tablespace names to their identifiers.
The pg_xlog directory
WARNING: INCOMING AIRSTRIKE
Also known as the write-ahead log (WAL) directory.
It is the most important and critical directory in the cluster.
Contains 16 MB segments used by the database to save the block changes.
Each segment contains the blocks changed in volatile memory.
Not used when the database is stopped cleanly.
Absolutely critical when a crash or unclean shutdown happens.
A single block corruption results in an unrecoverable instance.
The number of segments is managed automatically by the database.
Putting the location on a dedicated and highly reliable device is vital.
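Why the log is idle after a clean stop but vital after a crash can be shown with a toy model. The class below is a sketch only: the names are invented for the example and it does not reproduce PostgreSQL's real WAL format, segments or checkpoints.

```python
class ToyCluster:
    """A toy model of write-ahead logging, not PostgreSQL's real format."""

    def __init__(self):
        self.wal = []          # durable: survives a crash
        self.data_files = {}   # durable: block number -> content
        self.memory = {}       # volatile: changed blocks not yet flushed

    def change_block(self, block_no, content):
        # The change is logged BEFORE the in-memory block is touched.
        self.wal.append((block_no, content))
        self.memory[block_no] = content

    def clean_shutdown(self):
        # Every changed block reaches the data files, so the log is
        # no longer needed for recovery.
        self.data_files.update(self.memory)
        self.memory.clear()
        self.wal.clear()

    def crash(self):
        # Volatile memory is lost; the data files miss the latest changes.
        self.memory.clear()

    def recover(self):
        # Replay the log in order to rebuild the lost changes.
        for block_no, content in self.wal:
            self.data_files[block_no] = content
        self.wal.clear()


if __name__ == "__main__":
    cluster = ToyCluster()
    cluster.change_block(0, "v1")
    cluster.change_block(0, "v2")
    cluster.crash()
    print("after crash:", cluster.data_files)
    cluster.recover()
    print("after recovery:", cluster.data_files)
```

In this model the data files are empty right after the crash, and replaying the log brings block 0 back to "v2". Lose the log as well and nothing can rebuild the change, which is the reason for the warning above.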
Pages
Voyage to the centre of the data file
Each block is structured in almost the same way for tables and indices.
Figure: Page schema
Each page starts with a 24-byte header. The optional bitmap that tracks nulls
belongs to each tuple’s header, not to the page header.
Right after the page header, in the upper section, reside the tuple pointers,
4 bytes each.
The physical tuples are stored at the page’s end.
Figure: Page header
Page header
The page header contains a couple of interesting things...
pd_lsn is the most recent WAL sequence number for the page
pd_tli is the page’s timeline id
yes, PostgreSQL has timelines...
when a point-in-time recovery is performed, a new timeline is created to avoid
conflicts and paradoxes
People assume that transactions in PostgreSQL are a strict progression of xid, but
actually from a non-linear, non-subjective viewpoint and thanks to the timelines,
it’s more like a big ball of wibbly wobbly... timey wimey... stuff
Figure: Would you like a jelly baby?
The tuples
Now we can finally look at the physical tuples and discover another header.
The numbers in the figure are the bytes used by the single fields; they add up to
27, but t_cid and t_xvac overlay the same 4 bytes, so the fixed header occupies
23 bytes on most machines.
Each tuple, even a single boolean value, carries this overhead.
The user data can be the actual data stream or the pointer to the out-of-line
data stream.
Figure: Tuple structure
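To get a feeling for what the per-tuple overhead costs, here is a rough estimate of how many rows fit in one page. It is a sketch under assumptions: the header size is a parameter whose default of 23 bytes follows the PostgreSQL storage documentation (where t_cid and t_xvac share the same bytes), and 8-byte alignment is assumed, as on a typical 64-bit platform. The function names are hypothetical.

```python
PAGE_SIZE = 8192     # default block size
PAGE_HEADER = 24     # page header size
ITEM_POINTER = 4     # one tuple pointer per row
TUPLE_HEADER = 23    # assumption: fixed tuple header size from the docs
MAXALIGN = 8         # assumption: typical 64-bit platform


def align(size, boundary=MAXALIGN):
    """Round size up to the next multiple of boundary."""
    return (size + boundary - 1) // boundary * boundary


def tuples_per_page(data_bytes, tuple_header=TUPLE_HEADER):
    """Estimate how many rows with data_bytes of user data fit in a page.

    Ignores the null bitmap and per-column alignment padding.
    """
    tuple_size = align(align(tuple_header) + data_bytes)
    return (PAGE_SIZE - PAGE_HEADER) // (tuple_size + ITEM_POINTER)


if __name__ == "__main__":
    for width in (1, 4, 100):
        print(width, "bytes of data:", tuples_per_page(width), "rows per page")
```

Under these assumptions a one-byte boolean row and a four-byte integer row both give 226 rows per page: the header and the alignment padding dominate. Passing a different tuple_header shows the effect of a larger header.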
The magic of the MVCC
PostgreSQL consistency
Statements in PostgreSQL happen within transactions.
By default, when a single statement completes successfully, the database
automatically commits the transaction.
It’s possible to wrap multiple statements in a single transaction using the
keywords BEGIN; ... COMMIT; or ROLLBACK;
The lowest transaction isolation level available is READ COMMITTED.
Only the committed changes become visible to other sessions.
Any error or rollback statement during the transaction cancels the entire
operation, leaving the data in a consistent state at any time during the database
activity.
PostgreSQL supports savepoints to partially roll back a long transaction.
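The BEGIN / SAVEPOINT / ROLLBACK flow can be tried without a server. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's sqlite3 module rather than PostgreSQL, only because it runs anywhere; the table and savepoint names are made up. The statements themselves are standard SQL that PostgreSQL also accepts.

```python
import sqlite3


def savepoint_demo():
    # isolation_level=None leaves transaction control to explicit statements.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, note TEXT)")

    conn.execute("BEGIN")
    conn.execute("INSERT INTO t VALUES (1, 'kept')")
    conn.execute("SAVEPOINT before_risky_part")
    conn.execute("INSERT INTO t VALUES (2, 'discarded')")
    # Undo only the work done after the savepoint; the transaction stays open.
    conn.execute("ROLLBACK TO SAVEPOINT before_risky_part")
    conn.execute("INSERT INTO t VALUES (3, 'kept too')")
    conn.execute("COMMIT")

    # A plain ROLLBACK cancels the whole transaction.
    conn.execute("BEGIN")
    conn.execute("INSERT INTO t VALUES (4, 'never seen')")
    conn.execute("ROLLBACK")

    rows = conn.execute("SELECT id, note FROM t ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(savepoint_demo())
```

Only rows 1 and 3 survive: row 2 was undone by the partial rollback, row 4 by the full one.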
How PostgreSQL keeps things consistent
To keep everything consistent PostgreSQL uses Multi Version Concurrency
Control, also known as MVCC.
The base logic seems simple.
A 4-byte unsigned integer called xid is incremented by 1 and assigned to the
current transaction.
Every committed xid less than the current xid is in the past and therefore visible to
the current session.
Every xid greater than the current xid is in the future and therefore invisible to the
current session.
The commit status is managed in $PGDATA, in the pg_clog directory, where
small files made of 8 kB pages track the transaction statuses.
The xid match is performed against the tuple’s header seen before.
t_xmin contains the xid generated at tuple insert
t_xmax contains the xid generated at tuple delete
t_cid contains the internal command id to track the sequence inside the same
transaction
There’s something missing, isn’t there? Where is the field to store the UPDATE xid?
Figure: Tuple structure
Well, PostgreSQL actually NEVER performs an update.
When an UPDATE statement is issued, the updated rows are inserted as new versions
with t_xmin set to the current XID value.
The old row versions are marked as dead by writing the current transaction id
into the t_xmax field.
The database manages the tuple’s visibility using this simple routine
Source code comment in src/backend/utils/time/tqual.c:
/*
 *
 * The satisfaction of "now" requires the following:
 *
 * ((Xmin == my-transaction &&          inserted by the current transaction
 *   Cmin < my-command &&               before this command, and
 *   (Xmax is null ||                   the row has not been deleted, or
 *    (Xmax == my-transaction &&        it was deleted by the current transaction
 *     Cmax >= my-command)))            but not before this command,
 * ||                                   or
 *  (Xmin is committed &&               the row was inserted by a committed transaction, and
 *   (Xmax is null ||                   the row has not been deleted, or
 *    (Xmax == my-transaction &&        the row is being deleted by this transaction
 *     Cmax >= my-command) ||           but it’s not deleted "yet", or
 *    (Xmax != my-transaction &&        the row was deleted by another transaction
 *     Xmax is not committed))))        that has not been committed
 *
 */
Source code comment in src/backend/utils/time/tqual.c:
 * HeapTupleSatisfiesNow
 *      True iff heap tuple is valid "now".
 *
 *  Here, we consider the effects of:
 *      all committed transactions (as of the current instant)
 *      previous commands of this transaction
 *
 * Note we do _not_ include changes made by the current command.  This
 * solves the "Halloween problem" wherein an UPDATE might try to re-update
 * its own output tuples, http://en.wikipedia.org/wiki/Halloween_Problem.
 *
 * Note:
 *      Assumes heap tuple is valid.
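The rule in the source comment, together with the "an UPDATE is really an insert" behaviour, can be condensed into a toy model. This is a simplified sketch, not the real routine: command ids (Cmin/Cmax), aborted transactions and xid wraparound are deliberately ignored, and all class and method names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    value: str
    xmin: int                    # xid that inserted this version
    xmax: Optional[int] = None   # xid that deleted it, None if still live


class ToyHeap:
    """Toy MVCC heap; ignores command ids, aborts and xid wraparound."""

    def __init__(self):
        self.versions = []
        self.committed = set()

    def insert(self, xid, value):
        self.versions.append(TupleVersion(value, xmin=xid))

    def delete(self, xid, value):
        for v in self.versions:
            if v.value == value and v.xmax is None:
                v.xmax = xid

    def update(self, xid, old, new):
        # An UPDATE marks the old version dead and inserts a new one.
        self.delete(xid, old)
        self.insert(xid, new)

    def commit(self, xid):
        self.committed.add(xid)

    def visible(self, v, my_xid):
        inserted = v.xmin == my_xid or v.xmin in self.committed
        deleted = v.xmax is not None and (
            v.xmax == my_xid or v.xmax in self.committed
        )
        return inserted and not deleted

    def scan(self, my_xid):
        return [v.value for v in self.versions if self.visible(v, my_xid)]


if __name__ == "__main__":
    heap = ToyHeap()
    heap.insert(100, "a")
    heap.commit(100)
    heap.update(101, "a", "b")            # transaction 101 is still open
    print(heap.scan(102), heap.scan(101))
    heap.commit(101)
    print(heap.scan(102), len(heap.versions))
```

While transaction 101 is open, another session (102) still sees "a" and 101 itself sees "b". After the commit everyone sees "b", yet both versions remain stored in the heap.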
The dead tuples are not immediately reclaimed and add overhead to every I/O
operation, as the whole block must be read to determine which tuples are visible.
To free the space the VACUUM command should be used.
The command is absolutely safe.
It’s designed to have minimal impact on the normal database activity.
VACUUM scans the relation and its indices for dead tuples no longer visible to
any open transaction.
It is absolutely vital to run VACUUM on every database in the cluster at least
once every 2 billion transactions.
XID is a 4-byte unsigned integer.
Every 4 billion transactions the value wraps.
PostgreSQL uses the modulo-2^31 comparison method.
For each value, 2 billion XIDs are in the future and 2 billion in the past.
When an xid’s age gets too close to 2 billion, VACUUM freezes the value
to a hardcoded xid that is in the past by definition.
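The circular comparison can be written down in a few lines. The function below is a Python sketch of the idea, not PostgreSQL's code; the constants for the reserved xids (frozen xid 2, first normal xid 3) reflect the PostgreSQL source as far as I know and should be read as assumptions of this example.

```python
FROZEN_XID = 2         # assumption: reserved value written when freezing
FIRST_NORMAL_XID = 3   # assumption: xids below this are reserved


def xid_precedes(a, b):
    """True if xid a is logically older than xid b.

    Normal xids are 32-bit counters: the difference is reduced modulo
    2**32 and read as a signed number, so every xid sees about 2**31
    values behind it and 2**31 ahead of it.
    """
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a < b                # reserved xids: plain comparison
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000       # negative when read as signed 32-bit


if __name__ == "__main__":
    print(xid_precedes(5, 10))                  # ordinary case
    print(xid_precedes(2**32 - 5, 10))          # older, despite the wrap
    print(xid_precedes(10, 10 + 2**31 + 1))     # too far ahead: seen as past
    print(xid_precedes(FROZEN_XID, 4_000_000_000))
```

The third call shows the danger: an xid more than 2 billion transactions away flips from "future" to "past". A frozen tuple avoids this because its xid compares as older than every normal xid.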
If for any reason an xid gets within 10 million transactions of the wraparound
failure, the database starts emitting scary messages:
WARNING: database "mydb" must be vacuumed within 177009986 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
If an xid gets within 1 million transactions of the wraparound failure, the
database simply shuts down and can be started only in single-user mode to perform
the VACUUM.
Anyway, the autovacuum daemon, even if turned off, starts the required VACUUM
long before this catastrophic scenario happens.
TOAST Please, and don’t forget the Marmite
TOAST, the best thing since sliced bread
Funny people indeed
TOAST is the acronym for The Oversized-Attribute Storage Technique
The attribute is also known as a field
TOAST can store up to 1 GB in the out-of-line storage (free of charge)
Fixed-length data types like integer, date and timestamp are not TOASTable.
The data is stored right after the tuple header.
Varlena data types, such as character varying without an upper bound, text or bytea,
are stored in line or out of line.
The storage technique used depends on the data stream size and on the storage
method assigned to the attribute.
Depending on the chosen strategy, it is possible to store the data in external
relations and to compress it with pglz, a fast algorithm of the LZ family.
TOAST permits four storage strategies (shamelessly copied from the online
manual).
PLAIN prevents either compression or out-of-line storage; this is the only
possible strategy for columns of non-TOAST-able data types.
EXTENDED allows both compression and out-of-line storage. This is the
default for most TOAST-able data types. Compression will be attempted
first, then out-of-line storage if the row is still too big.
EXTERNAL allows out-of-line storage but not compression. Use of
EXTERNAL will make substring operations on wide text and bytea columns
faster at the penalty of increased storage space.
MAIN allows compression but not out-of-line storage. Actually, out-of-line
storage will still be performed for such columns, but only as a last resort.
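The four strategies can be read as a small decision table. What follows is only a toy model in Python, under loud assumptions: a single value is looked at in isolation, "too big" means larger than a threshold of about 2 kB, and the names `toast_plan`, `STRATEGIES` and `compressed_size` are made up for the example. MAIN is modelled like EXTENDED, which hides its "last resort" behaviour.

```python
# Toy model of the four TOAST strategies; not PostgreSQL's actual code.
# Assumption: one value is considered on its own and "too big" means larger
# than THRESHOLD bytes (about 2 kB by default in PostgreSQL).
THRESHOLD = 2000

# strategy -> (compression allowed, out-of-line storage allowed)
STRATEGIES = {
    "PLAIN":    (False, False),
    "EXTENDED": (True,  True),
    "EXTERNAL": (False, True),
    "MAIN":     (True,  True),   # simplification: out of line is a last resort
}

def toast_plan(strategy, raw_size, compressed_size):
    """Return the steps a value of raw_size bytes would go through."""
    can_compress, can_externalise = STRATEGIES[strategy]
    steps = []
    size = raw_size
    if size > THRESHOLD and can_compress:
        steps.append("compress")        # compression is attempted first
        size = compressed_size
    if size > THRESHOLD and can_externalise:
        steps.append("out of line")     # still too big: move it out of line
    else:
        steps.append("in line")
    return steps
```

For example, `toast_plan("EXTENDED", 10000, 3000)` gives both steps, while `toast_plan("EXTERNAL", 10000, 3000)` skips compression and goes straight out of line.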
TOAST Please, and don’t forget the Marmite
When out-of-line storage is used, the data is encoded as bytea and, if necessary,
split into multiple chunks.
A unique index on chunk_id and chunk_seq both prevents duplicate data and
speeds up the lookups.
Figure: Toast table
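The chunking can be sketched in a few lines of Python. This is an illustration, not the server's code: `CHUNK_SIZE` is a round number picked for the example (the real chunk size is a bit under 2 kB with the default 8 kB page), and `toast_chunks` and `detoast` are hypothetical helper names.

```python
# Sketch of how an out-of-line value could be split into TOAST chunks.
# Assumption: CHUNK_SIZE is illustrative, not the exact server constant.
CHUNK_SIZE = 2000

def toast_chunks(chunk_id, data):
    """Split data (bytes) into (chunk_id, chunk_seq, chunk_data) rows."""
    return [
        (chunk_id, seq, data[start:start + CHUNK_SIZE])
        for seq, start in enumerate(range(0, len(data), CHUNK_SIZE))
    ]

def detoast(rows, chunk_id):
    """Rebuild a value by reading its chunks in chunk_seq order."""
    parts = sorted((seq, chunk) for cid, seq, chunk in rows if cid == chunk_id)
    return b"".join(chunk for _, chunk in parts)
```

Every (chunk_id, chunk_seq) pair is produced exactly once, which is the property the unique index enforces.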
It’s bigger in the inside
A PostgreSQL instance is a memory segment shared between multiple processes
accessing the data directory.
When a new connection arrives, a new postgres process is forked and attached to the
shared memory, also known as the shared buffer.
PostgreSQL is a multi-process database system, not a multi-threaded one.
Each process can use only one processor or core.
To keep things consistent every single block, whether read or written, must
pass through the shared buffer.
As the shared buffer is usually smaller than the database, and often smaller than a
single table, the blocks in memory must be managed and the space allocation
must adapt to the required usage.
It’s bigger in the inside
Jargon
backend process: a postgres process attached to the shared buffer
heap page: a table’s data page
index page: an index data page
buffer: a page, either index or heap, loaded in the shared buffer
dirty buffer: a buffer whose changes are WAL logged but not yet written to disk
clean buffer: a buffer whose content is consolidated on disk
pinned buffer: a buffer held by a backend process
unpinned buffer: a buffer released and available to be pinned again
It’s bigger in the inside
In the early days of PostgreSQL 7.x a simple least recently used (LRU) buffer
replacement was used.
After the unpin, this simple algorithm moved the buffer to the top of a simple
FIFO list.
During the revolutionary 8.0 development, a new powerful algorithm was
introduced:
the Adaptive Replacement Cache (ARC), capable of self-adapting the size of two pools
dedicated to the most recently used and the most frequently used buffers.
This algorithm was dropped soon afterwards because of a pending software
patent.
An emergency two-queue (2Q) algorithm was adopted in its place, which made the memory
management less brilliant than expected.
Release 8.1 then adopted the clock-sweep memory manager.
The algorithm, still in use with a few improvements, is simple, flexible and free.
It’s bigger in the inside
The buffer manager’s main goal is to keep the most recently used blocks cached
in memory and to adapt dynamically to the most frequently used blocks.
To do this, a small portion of memory is used as a free list of the buffers available for
eviction.
Figure: Free list
It’s bigger in the inside
Besides its reference (pin) count, each buffer has a usage counter. Every time a buffer
is pinned the usage counter is incremented by one, up to a small limit.
Figure: Block usage counter
It’s bigger in the inside
Shamelessly copied from the file src/backend/storage/buffer/README
There is a "free list" of buffers that are prime candidates for replacement. In
particular, buffers that are completely free (contain no valid page) are always in
this list.
To choose a victim buffer to recycle when there are no free buffers available, we
use a simple clock-sweep algorithm, which avoids the need to take system-wide
locks during common operations.
It’s bigger in the inside
It works like this:
Each buffer header contains a usage counter, which is incremented (up to a small
limit value) whenever the buffer is pinned. (This requires only the buffer header
spinlock, which would have to be taken anyway to increment the buffer reference
count, so it’s nearly free.)
The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
through all the available buffers. NextVictimBuffer is protected by the
BufFreelistLock.
It’s bigger in the inside
The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain BufFreelistLock.
2. If buffer free list is nonempty, remove its head buffer. If the buffer is pinned
   or has a nonzero usage count, it cannot be used; ignore it and return to the
   start of step 2. Otherwise, pin the buffer, release BufFreelistLock, and return
   the buffer.
3. Otherwise, select the buffer pointed to by NextVictimBuffer, and circularly
   advance NextVictimBuffer for next time.
4. If the selected buffer is pinned or has a nonzero usage count, it cannot be
   used. Decrement its usage count (if nonzero) and return to step 3 to
   examine the next buffer.
5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
(Note that if the selected buffer is dirty, we will have to write it out before we can
recycle it; if someone else pins the buffer meanwhile we will have to give up and
try another buffer. This however is not a concern of the basic
select-a-victim-buffer algorithm.)
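The steps above can be simulated in a few lines of Python. This is a minimal sketch, not the server's implementation: locking (BufFreelistLock, the header spinlock) and dirty-buffer handling are left out, `MAX_USAGE` stands in for the "small limit value", and `Buffer`, `BufferPool` and their methods are illustrative names.

```python
# Minimal simulation of the victim-buffer selection; locking is omitted.
# Assumption: MAX_USAGE models the "small limit value" of the usage counter.
MAX_USAGE = 5

class Buffer:
    def __init__(self):
        self.pins = 0    # reference count: backends currently using the buffer
        self.usage = 0   # usage counter, bumped on every pin up to MAX_USAGE

class BufferPool:
    def __init__(self, size):
        self.buffers = [Buffer() for _ in range(size)]
        self.free_list = list(range(size))   # every buffer starts out free
        self.next_victim = 0                 # the "clock hand"

    def pin(self, idx):
        buf = self.buffers[idx]
        buf.pins += 1
        buf.usage = min(buf.usage + 1, MAX_USAGE)

    def unpin(self, idx):
        self.buffers[idx].pins -= 1

    def get_victim(self):
        # Step 2: try the free list first, skipping unusable entries.
        while self.free_list:
            idx = self.free_list.pop(0)
            buf = self.buffers[idx]
            if buf.pins == 0 and buf.usage == 0:
                self.pin(idx)
                return idx
        # Steps 3-5: sweep the clock hand until a usable buffer is found.
        # This sketch loops forever if every buffer stays pinned.
        while True:
            idx = self.next_victim
            self.next_victim = (self.next_victim + 1) % len(self.buffers)
            buf = self.buffers[idx]
            if buf.pins == 0 and buf.usage == 0:
                self.pin(idx)
                return idx
            if buf.usage > 0:
                buf.usage -= 1   # step 4: age the buffer and move on
```

A buffer that is pinned often keeps a high usage count and survives several passes of the clock hand, while a buffer touched once is recycled after a single pass.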
It’s bigger in the inside
Figure: The NextVictimBuffer
It’s bigger in the inside
Since version 8.3 the buffer manager has a ring buffer strategy.
Operations that need a large number of buffers in memory, such as VACUUM or
sequential scans of large tables, get a dedicated 256 kB ring buffer, small enough to
fit in the processor’s L2 cache.
The strategy improves buffer loading and eviction and protects the rest of the
shared buffer.
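A quick way to picture the ring: with the default 8 kB block size, 256 kB is room for 32 buffers, and a large scan keeps reusing those same 32 slots instead of flooding the shared buffer. The sketch below is a toy model in Python; `RingScan` and `read_block` are illustrative names and the 8 kB block size is assumed to be the default.

```python
# Sketch of a ring buffer strategy for a large sequential scan.
# Assumption: 8 kB is the default block size, so a 256 kB ring has 32 slots.
BLOCK_SIZE = 8 * 1024
RING_BYTES = 256 * 1024
RING_SLOTS = RING_BYTES // BLOCK_SIZE

class RingScan:
    def __init__(self, slots=RING_SLOTS):
        self.ring = [None] * slots   # block number cached in each slot
        self.current = 0

    def read_block(self, block_no):
        """Load block_no into the next ring slot; return the evicted block."""
        evicted = self.ring[self.current]
        self.ring[self.current] = block_no
        self.current = (self.current + 1) % len(self.ring)
        return evicted
```

However many blocks the scan reads, it never holds more than `RING_SLOTS` buffers at once.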
The answer is 42
How PostgreSQL executes a query
After the physical storage and the memory, let’s take a look at how the database
interacts with the backends from the logical point of view.
Jargon
OID: Object ID, a 4-byte unsigned integer used to map any system object to a unique
value
class: any relational object: table, index, view, sequence...
attribute: basically, a table field
The answer is 42
The parser stage
When a query is sent for processing, PostgreSQL first performs a syntax analysis
using the query parser.
Any error in this phase stops the execution and throws a syntax error.
As this stage doesn’t require access to the system catalogue, no xid is wasted.
If the syntax is correct the parser returns a parse tree ready for the next step.
The answer is 42
The query tree
The second stage is still managed by the parser, which accesses the system catalogue
and generates a query tree from the parse tree.
This is a logical representation of the query in which every object and attribute is
uniquely identified.
The answer is 42
The query tree
Figure: A simple query tree
The answer is 42
The query tree
To generate the query tree the parser accesses the system catalogue and retrieves the
corresponding OID for each class and attribute in the query.
Ambiguous names generate an error.
The optional filtering elements are translated into the query tree as well.
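The name lookup can be pictured with a toy catalogue. The Python sketch below is not how the parser is written: `CATALOGUE`, the table names, the OID values and `resolve_column` are all invented for the example. It only shows the idea that a column name must map to exactly one class.

```python
# Toy name resolution against a fake catalogue; not PostgreSQL's parser.
# Assumption: the tables, OIDs and function name are made up for the example.
CATALOGUE = {
    "orders":    {"oid": 16384, "attributes": ["id", "customer_id", "total"]},
    "customers": {"oid": 16390, "attributes": ["id", "name"]},
}

def resolve_column(column, tables):
    """Return (class OID, attribute name) for a column used in a query."""
    matches = [
        (CATALOGUE[t]["oid"], column)
        for t in tables
        if column in CATALOGUE[t]["attributes"]
    ]
    if not matches:
        raise LookupError(f'column "{column}" does not exist')
    if len(matches) > 1:
        # The same name exists in more than one class of the query.
        raise LookupError(f'column reference "{column}" is ambiguous')
    return matches[0]
```

Here `total` resolves cleanly to the orders class, while `id` exists in both tables and is rejected until it is qualified with a table name.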
The answer is 42
The planner stage
The query tree is then sent to the query planner, which traverses the tree and
generates the possible execution plans, each with a cost, in arbitrary units, estimated from
the statistics collected by the database.
The plan with the minimum estimated cost is chosen for processing and sent to
the executor.
Let me stress again the word estimate.
A database with old or missing statistics will generate inefficient plans, resulting
in slow queries.
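Why stale statistics hurt can be shown with a deliberately naive cost model. The formulas below are invented for illustration and have nothing to do with PostgreSQL's real cost functions; `candidate_plans` and `choose_plan` are hypothetical names. The only point is that the choice depends on the estimated row count, not on the real one.

```python
# Sketch of cost-based plan selection; the cost formulas are invented for
# illustration and are not PostgreSQL's cost model.
def candidate_plans(estimated_rows, selectivity):
    """Return (plan name, estimated cost) pairs for a filtered table read."""
    seq_scan_cost = estimated_rows * 1.0                       # read every row
    index_scan_cost = 50 + estimated_rows * selectivity * 4.0  # startup + fetches
    return [("Seq Scan", seq_scan_cost), ("Index Scan", index_scan_cost)]

def choose_plan(estimated_rows, selectivity):
    """Pick the candidate with the lowest estimated cost."""
    return min(candidate_plans(estimated_rows, selectivity), key=lambda p: p[1])
```

If the statistics still say the table holds 10 rows, `choose_plan(10, 0.01)` picks the sequential scan. With fresh statistics for a table that has grown to a million rows, `choose_plan(1_000_000, 0.01)` picks the index scan instead.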
The answer is 42
The executor
The planner then returns the execution plan, a sequence of steps to retrieve the
requested data, to manipulate the data or to change the database structure.
The last stage is the executor. The execution plan steps are executed, then any
output is returned to the backend.
The answer is 42
EXPLAIN (or EXTERMINATE)
The EXPLAIN statement returns the estimated execution plan for the
query that follows it.
The optional ANALYZE clause actually executes the query, discards the
results and returns the real execution plan.
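Because EXPLAIN ANALYZE really runs the query, a statement that modifies data is usually wrapped in a transaction that is rolled back. The Python helper below only builds the SQL text and leaves sending it to whatever client is in use; `explain_statements` and its parameters are illustrative names, not part of any library.

```python
# Helper that builds EXPLAIN statements as plain strings.
# Assumption: the function name and its flags are made up for the example.
def explain_statements(query, analyze=False, modifies_data=False):
    """Return the list of SQL statements to run, in order."""
    prefix = "EXPLAIN ANALYZE" if analyze else "EXPLAIN"
    statements = [f"{prefix} {query}"]
    if analyze and modifies_data:
        # EXPLAIN ANALYZE executes the query, so undo its side effects.
        statements = ["BEGIN"] + statements + ["ROLLBACK"]
    return statements
```

For a plain SELECT the helper returns a single statement; for a DELETE with `analyze=True` and `modifies_data=True` it returns BEGIN, the EXPLAIN ANALYZE statement and ROLLBACK.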
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

A couple of things about PostgreSQL...

  • 1. A couple of things to know about PostgreSQL... (Before start coding) Federico Campoli 9 July 2013 Federico Campoli () A couple of things to know about PostgreSQL... 9 July 2013 1 / 58
  • 5. Introduction. What is blue, bigger on the inside and with time travel capabilities? If your answer is the TARDIS, then yes, you’re close enough, but the correct answer is PostgreSQL. And regarding the couple of things... I lied.
  • 7. Introduction. PostgreSQL is a wild beast. We’ll talk about the common mistakes, the confusing jargon, the online manual’s lost pages and the best practices to spare your DBA headaches (if you have one, of course). The major version used in this talk is 9.2. So, let’s start with the TOC.
  • 14. Table of contents. A byte it’s a byte, it’s a byte it’s a byte (the database physical storage). The magic of the MVCC (how PostgreSQL keeps things consistent). TOAST Please, and don’t forget the Marmite (the power of out-of-line storage, up to 1 GB and free of charge). It’s bigger in the inside (the database memory, how to stick an elephant in a smart car). The answer is 42 (explaining the unexplainable, the CBO and the execution plan). Why do we fall? (crashing the most advanced open source database is easy...). And I thought my jokes were bad (and then I’ll need a back door to escape...).
  • 15. A byte it’s a byte, it’s a byte it’s a byte. PostgreSQL stores its data in a dedicated directory, also known as the cluster, identified by the environment variable $PGDATA on Unix and %PGDATA% on Windows. The location is initialized by the utility initdb and contains various subdirectories, each with a specific function. Figure: PGDATA
  • 19. A byte it’s a byte, it’s a byte it’s a byte: The global directory. Contains the cluster’s shared objects like pg_database, pg_tablespace... and a small 8 kB file, pg_control, probably the most important file in the entire system. pg_control tracks the database’s vital activities; with a corrupted or missing pg_control the database cannot start.
  • 20. A byte it’s a byte, it’s a byte it’s a byte: The base directory. The default location when a new database is created without the TABLESPACE clause. Contains numeric subfolders, one for each database; the number is the database’s object id and is stored in the pg_database system table. Contains an optional folder, pgsql_tmp, used for external sorts and temporary files. The location is mapped in the pg_tablespace system table with the name pg_default.
  • 28. A byte it’s a byte, it’s a byte it’s a byte: The base directory. Each subdirectory contains... just guess... numeric files. Each file can grow to at most 1 GB, then a new chunk is generated with a sequence number suffix. The data files are organized in fixed-size blocks, by default 8192 bytes. Any change to the block size requires a build from source and a new initdb. The data files are called nodes and are mapped to the relations in the pg_class system table... And yes, we are dealing with an object-relational database system.
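As a rough illustration of the file layout described above, the sketch below works out which segment file holds a given block and at which byte offset, assuming the default 8192-byte block size and 1 GB segments. The function name `locate_block` and the file node 16384 are hypothetical and used only for illustration.

```python
from typing import Tuple

BLOCK_SIZE = 8192                                  # default block size, in bytes
SEGMENT_SIZE = 1024 ** 3                           # a data file grows up to 1 GB
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE    # blocks that fit in one file


def locate_block(filenode: int, block_number: int) -> Tuple[str, int]:
    """Return (file name, byte offset) for one block of a relation.

    The first segment is named after the file node itself; the following
    segments carry a sequence number suffix (.1, .2, ...).
    """
    if block_number < 0:
        raise ValueError("block_number must not be negative")
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    name = str(filenode) if segment == 0 else f"{filenode}.{segment}"
    return name, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    # Hypothetical file node, first block of the second 1 GB chunk.
    print(locate_block(16384, BLOCKS_PER_SEGMENT))
```

With these defaults a segment holds 131072 blocks, so any table larger than 1 GB is spread over several files in the same database folder.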
  • 35. A byte it’s a byte, it’s a byte it’s a byte: The pg_tblspc directory. Contains the symbolic links to the tablespaces. A tablespace is a logical location for physical objects, useful to spread tables and indices across different physical devices. Combined with logical volume management it can dramatically improve performance... or drive the project to complete failure. An object’s tablespace location can be safely changed, but this requires an exclusive lock on the affected object. The system table pg_tablespace maps the tablespaces’ names and identifiers.
  • 46. A byte it’s a byte, it’s a byte it’s a byte: The pg_xlog directory. WARNING: INCOMING AIRSTRIKE. Also known as the write-ahead log (WAL) directory, it is the most important and critical directory in the cluster. It contains 16 MB segments used by the database to save the block changes; each segment contains the blocks changed in volatile memory. It is not needed when the database has been stopped cleanly, but it is absolutely critical when a crash or unclean shutdown happens: a single block corruption results in an unrecoverable instance. The number of segments is managed automatically by the database. Putting the location on a dedicated and highly reliable device is vital.
  • 47. A byte it’s a byte, it’s a byte it’s a byte: Pages. Voyage to the centre of the datafile. Each block is structured in almost the same way for tables and indices. Figure: Page schema
  • 48. A byte it’s a byte, it’s a byte it’s a byte: Pages. Each page starts with a 24-byte header. Right after the header, in the upper section, reside the tuple pointers, 4 bytes each. The physical tuples are stored at the page’s end. The optional bitmap used to track nulls is kept with each tuple, right after the tuple’s own header. Figure: Page header
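The page arithmetic above can be turned into a quick back-of-the-envelope estimate. The sketch below assumes the default 8192-byte block, the 24-byte page header and one 4-byte pointer per tuple; the helper `max_tuples_per_page` is a hypothetical name, and the calculation deliberately ignores alignment padding and any special space at the end of the page.

```python
def max_tuples_per_page(tuple_bytes: int,
                        block_size: int = 8192,
                        page_header: int = 24,
                        pointer_bytes: int = 4) -> int:
    """Upper bound on how many tuples of a given size fit in one page.

    tuple_bytes is the full on-disk size of one tuple, its own header
    included. Every tuple also costs one pointer after the page header.
    """
    if tuple_bytes <= 0:
        raise ValueError("tuple_bytes must be positive")
    usable = block_size - page_header
    return usable // (tuple_bytes + pointer_bytes)


if __name__ == "__main__":
    for size in (24, 32, 128, 1024):
        print(size, max_tuples_per_page(size))
```

The estimate is only an upper bound: real pages usually hold fewer rows, because tuples are padded to alignment boundaries and free space is left for later updates.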
  • 53. A byte it’s a byte, it’s a byte it’s a byte: Page header. The page header contains a couple of interesting things... pd_lsn is the most recent WAL sequence number for the page. pd_tli is the page’s timeline id. Yes, PostgreSQL has timelines: when a point-in-time recovery is performed, a new timeline is created to avoid conflicts and paradoxes.
  • 54. A byte it’s a byte, it’s a byte it’s a byte: Page header. People assume that transactions in PostgreSQL are a strict progression of xid, but actually, from a non-linear, non-subjective viewpoint and thanks to the timelines, it’s more like a big ball of wibbly wobbly... timey wimey... stuff. Figure: Would you like a jelly baby?
  • 55. A byte it’s a byte, it’s a byte it’s a byte: The tuples. Now finally we can look at the physical tuples and discover another header. The fields in the documented layout add up to 27 bytes when listed one by one, but t_xvac shares its storage with t_cid, so the fixed header occupies 23 bytes on most machines. The numbers are the bytes used by the single values. Each tuple, even one holding a simple boolean value, carries this overhead. The user data can be the actual data stream or the pointer to the out-of-line data stream. Figure: Tuple structure
  • 56. The magic of the MVCC: PostgreSQL consistency. Statements in PostgreSQL happen through transactions. By default, when a single statement completes successfully, the database automatically commits the transaction. It’s possible to wrap multiple statements in a single transaction using the keywords BEGIN; ... COMMIT; (or ROLLBACK;). The lowest transaction isolation level available is READ COMMITTED: only committed changes become visible to other sessions. Any error or ROLLBACK statement during the transaction cancels the entire operation, leaving the data in a consistent state at any time during the database activity. PostgreSQL supports savepoints to partially roll back a long transaction.
  • 57. The magic of the MVCC: How PostgreSQL keeps things consistent. To keep everything consistent PostgreSQL uses Multi Version Concurrency Control, also known as MVCC. The base logic seems simple. A 4-byte unsigned integer called xid is incremented by 1 and assigned to the current transaction. Every committed xid lower than the current xid is in the past and therefore visible to the current session. Every xid greater than the current xid is in the future and therefore invisible to the current session. The commit status is managed in $PGDATA using the directory pg_clog, where small files made of 8 kB pages track the transaction statuses. The xid match is performed on the tuple’s header seen before.
  • 60. The magic of the MVCC. t_xmin contains the xid generated at tuple insert. t_xmax contains the xid generated at tuple delete. t_cid contains the internal command id used to track the sequence inside the same transaction. There’s something missing, isn’t there? Where is the field to store the UPDATE xid? Figure: Tuple structure
  • 61. The magic of the MVCC. Well, PostgreSQL actually NEVER performs an update in place. When an UPDATE statement is issued, the updated rows are inserted as new versions with t_xmin set to the current xid. The old row versions are marked as dead by writing the current transaction id into their t_xmax field. The database manages the tuple’s visibility using this simple routine.
  • 62. The magic of the MVCC. Source code comment in src/backend/utils/time/tqual.c:
        /*
         * The satisfaction of "now" requires the following:
         *
         * ((Xmin == my-transaction &&          inserted by the current transaction
         *   Cmin < my-command &&               before this command, and
         *   (Xmax is null ||                   the row has not been deleted, or
         *    (Xmax == my-transaction &&        it was deleted by the current transaction
         *     Cmax >= my-command)))            but not before this command,
         * ||                                   or
         *  (Xmin is committed &&               the row was inserted by a committed transaction, and
         *   (Xmax is null ||                   the row has not been deleted, or
         *    (Xmax == my-transaction &&        the row is being deleted by this transaction
         *     Cmax >= my-command) ||           but it’s not deleted "yet", or
         *    (Xmax != my-transaction &&        the row was deleted by another transaction
         *     Xmax is not committed))))        that has not been committed
         */
  • 63. The magic of the MVCC. Source code comment in src/backend/utils/time/tqual.c:
         * HeapTupleSatisfiesNow
         *      True iff heap tuple is valid "now".
         *
         * Here, we consider the effects of:
         *      all committed transactions (as of the current instant)
         *      previous commands of this transaction
         *
         * Note we do _not_ include changes made by the current command. This
         * solves the "Halloween problem" wherein an UPDATE might try to re-update
         * its own output tuples, http://en.wikipedia.org/wiki/Halloween_Problem.
         *
         * Note:
         *      Assumes heap tuple is valid.
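The rule quoted in the source comment can be read as a small boolean function. What follows is a minimal Python sketch of that rule, not the actual C routine: the `TupleHeader` class, the `satisfies_now` function and the `committed` set standing in for the commit log are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class TupleHeader:
    """Only the header fields the visibility rule looks at (illustrative)."""
    xmin: int                     # xid that inserted the row version
    cmin: int                     # command id of the insert
    xmax: Optional[int] = None    # xid that deleted it, None if not deleted
    cmax: Optional[int] = None    # command id of the delete


def satisfies_now(t: TupleHeader, my_xid: int, my_cid: int,
                  committed: Set[int]) -> bool:
    """Transcribe the 'satisfaction of now' comment, clause by clause."""
    not_deleted = t.xmax is None
    # Deleted by this transaction, but not before the current command.
    deleted_later_by_me = (t.xmax == my_xid and t.cmax is not None
                           and t.cmax >= my_cid)
    # Deleted by another transaction that has not been committed.
    deleted_by_uncommitted = (t.xmax is not None and t.xmax != my_xid
                              and t.xmax not in committed)

    inserted_by_me = (t.xmin == my_xid and t.cmin < my_cid
                      and (not_deleted or deleted_later_by_me))
    inserted_by_committed = (t.xmin in committed
                             and (not_deleted or deleted_later_by_me
                                  or deleted_by_uncommitted))
    return inserted_by_me or inserted_by_committed


if __name__ == "__main__":
    committed = {100}
    old = TupleHeader(xmin=100, cmin=0, xmax=104)   # being updated by xid 104
    new = TupleHeader(xmin=104, cmin=0)             # new version, uncommitted
    # Session 105 still sees the old version only.
    print(satisfies_now(old, 105, 0, committed),
          satisfies_now(new, 105, 0, committed))
```

The usage example mirrors an in-flight UPDATE: until transaction 104 commits, other sessions keep seeing the old row version and ignore the new one.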
  • 64. The magic of the MVCC. The dead tuples are not immediately reclaimed and add overhead to any IO operation, because the whole block is read to determine which tuples are visible. To free the space the VACUUM command should be used. The command is absolutely safe and is designed to have minimal impact on the database’s normal activity. VACUUM scans the relation and its indices for dead tuples no longer visible to any open transaction. It is absolutely vital to run VACUUM on every database in the cluster at least once every 2 billion transactions.
  • 69. The magic of the MVCC. XID is a 4 byte unsigned integer. Every 4 billion transactions the value wraps around. PostgreSQL uses the modulo-2^31 comparison method: for each value, 2 billion XIDs are in the future and 2 billion are in the past. When an xid’s age gets too close to 2 billion, VACUUM freezes the value to a hardcoded xid that is in the past by definition.
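The circular comparison can be illustrated with a few lines of Python. This is a sketch under one assumption: that "in the past" is decided by interpreting the 32-bit difference of the two xids as a signed number. `xid_precedes` is a hypothetical helper name, and the sketch ignores the special handling of reserved xids such as the frozen one.

```python
def xid_precedes(a: int, b: int) -> bool:
    """Return True if xid a is 'in the past' relative to xid b on the 32-bit circle."""
    diff = (a - b) & 0xFFFFFFFF   # difference as an unsigned 32-bit value
    if diff >= 0x80000000:        # reinterpret it as a signed 32-bit value
        diff -= 0x100000000
    return diff < 0


# Plain case: 100 was assigned before 200.
print(xid_precedes(100, 200))

# Wraparound case: an xid just below 2^32 is still "before" a small xid
# that was assigned after the counter wrapped.
print(xid_precedes(0xFFFFFFF0, 10))
```

With this rule an xid that falls more than about 2 billion transactions behind suddenly looks like it is in the future, which is exactly the situation freezing is meant to prevent.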
  • 72. The magic of the MVCC. If for any reason an xid gets within 10 million transactions of the wraparound failure, the database starts emitting scary messages:
      WARNING: database "mydb" must be vacuumed within 177009986 transactions
      HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
      If an xid gets within 1 million transactions of the wraparound failure, the database simply shuts down and can be started only in single user mode to perform the VACUUM. Anyway, the autovacuum daemon, even if turned off, starts the required VACUUM long before this catastrophic scenario happens.
  • 77. TOAST. Please, and don’t forget the Marmite. TOAST, the best thing since sliced bread. Funny people indeed. TOAST is the acronym for The Oversized-Attribute Storage Technique. The attribute is also known as field. TOAST can store up to 1 GB in out-of-line storage (free of charge).
  • 81. TOAST. Please, and don’t forget the Marmite. Fixed length data types like integer, date and timestamp are not TOASTable. Their data is stored right after the tuple header. Varlena data types, such as character varying without an upper bound, text or bytea, are stored in line or out of line. The storage technique used depends on the data size and on the storage method assigned to the attribute. Depending on the chosen strategy, the data can be stored in an external relation, compressed using a fast LZ-family algorithm, or both.
  • 82. TOAST. Please, and don’t forget the Marmite. TOAST permits four storage strategies (shamelessly copied from the on line manual):
      PLAIN prevents either compression or out-of-line storage. This is the only possible strategy for columns of non-TOAST-able data types.
      EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
      EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster at the penalty of increased storage space.
      MAIN allows compression but not out-of-line storage. Actually, out-of-line storage will still be performed for such columns, but only as a last resort.
  • 83. TOAST. Please, and don’t forget the Marmite. When out-of-line storage is used the data is encoded as bytea and, if needed, split into multiple chunks. A unique index over chunk_id and chunk_seq both prevents duplicate data and speeds up the lookups. Figure: Toast table
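The chunking scheme can be pictured with a small Python sketch. Everything here is illustrative: `toast_chunks` and `toast_fetch` are hypothetical helpers, the rows are plain tuples, and `CHUNK_SIZE` is an assumed round number rather than a value read from PostgreSQL.

```python
from typing import List, Tuple

CHUNK_SIZE = 2000  # assumed, illustrative chunk size in bytes

ToastRow = Tuple[int, int, bytes]  # (chunk_id, chunk_seq, chunk_data)


def toast_chunks(chunk_id: int, data: bytes, chunk_size: int = CHUNK_SIZE) -> List[ToastRow]:
    """Split one oversized value into rows keyed by (chunk_id, chunk_seq)."""
    return [
        (chunk_id, seq, data[offset:offset + chunk_size])
        for seq, offset in enumerate(range(0, len(data), chunk_size))
    ]


def toast_fetch(rows: List[ToastRow], chunk_id: int) -> bytes:
    """Reassemble a value: select its chunks and concatenate them in chunk_seq order."""
    parts = sorted((seq, chunk) for cid, seq, chunk in rows if cid == chunk_id)
    return b"".join(chunk for _, chunk in parts)


value = bytes(range(256)) * 20        # 5120 bytes of sample data
rows = toast_chunks(chunk_id=42, data=value)
print(len(rows), [len(chunk) for _, _, chunk in rows])
print(toast_fetch(rows, 42) == value)
```

The (chunk_id, chunk_seq) pair is unique per row, which is what lets a unique index both reject duplicates and find the chunks of one value in order.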
  • 84. It’s bigger in the inside. A PostgreSQL instance is a memory segment shared between multiple processes accessing the data directory. When a new connection arrives, a new postgres process is forked and attached to the shared memory, also known as the shared buffer. PostgreSQL is a multiprocess database system, not a multithreaded one: each process can use only one processor or core. To keep things consistent every single block, whether read or written, must pass through the shared buffer. As the shared buffer is smaller than the database, and often smaller than a single table, the blocks in memory must be managed and the space allocation must adapt to the actual usage.
  • 85. It’s bigger in the inside. Jargon:
      backend process: a postgres process attached to the shared buffer
      heap page: a table’s data page
      index page: an index data page
      buffer: a page, index or heap, loaded in the shared buffer
      dirty buffer: a buffer WAL logged but not yet written to disk
      clean buffer: a buffer already written to disk
      pinned buffer: a buffer held by a backend process
      unpinned buffer: a buffer released and available to be pinned again
  • 86. It’s bigger in the inside. In the early days of PostgreSQL 7.x a simple least recently used (LRU) buffer list was used. After the unpin, this simple algorithm moved the buffer to the top of a simple FIFO list. During the revolutionary 8.0 development a powerful new algorithm was introduced: the Adaptive Replacement Cache, capable of self-adapting the size of two pools dedicated to the most recently used and the most frequently used buffers. This algorithm was removed a few weeks before the stable release because of a software patent. An emergency two queue algorithm was adopted, making the memory management less brilliant than expected. The next year, release 8.1 adopted the clock sweep memory manager. The algorithm is still in use with a few improvements: simple, flexible and free.
  • 87. It’s bigger in the inside. The buffer manager’s main goal is to keep the most recently used blocks cached in memory and to adapt dynamically to the most frequently used blocks. To do this a small portion of memory is used as a free list of the buffers available for eviction. Figure: Free list
  • 88. It’s bigger in the inside. Each buffer has a reference counter (pin count) and a usage counter. Every time a buffer is pinned the usage counter is incremented by one, up to a small limit. Figure: Block usage counter
  • 89. It’s bigger in the inside. Shamelessly copied from the file src/backend/storage/buffer/README: There is a "free list" of buffers that are prime candidates for replacement. In particular, buffers that are completely free (contain no valid page) are always in this list. To choose a victim buffer to recycle when there are no free buffers available, we use a simple clock-sweep algorithm, which avoids the need to take system-wide locks during common operations.
  • 90. It’s bigger in the inside. It works like this: Each buffer header contains a usage counter, which is incremented (up to a small limit value) whenever the buffer is pinned. (This requires only the buffer header spinlock, which would have to be taken anyway to increment the buffer reference count, so it’s nearly free.) The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly through all the available buffers. NextVictimBuffer is protected by the BufFreelistLock.
  • 91. It’s bigger in the inside. The algorithm for a process that needs to obtain a victim buffer is:
      1. Obtain BufFreelistLock.
      2. If buffer free list is nonempty, remove its head buffer. If the buffer is pinned or has a nonzero usage count, it cannot be used; ignore it and return to the start of step 2. Otherwise, pin the buffer, release BufFreelistLock, and return the buffer.
      3. Otherwise, select the buffer pointed to by NextVictimBuffer, and circularly advance NextVictimBuffer for next time.
      4. If the selected buffer is pinned or has a nonzero usage count, it cannot be used. Decrement its usage count (if nonzero) and return to step 3 to examine the next buffer.
      5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
      (Note that if the selected buffer is dirty, we will have to write it out before we can recycle it; if someone else pins the buffer meanwhile we will have to give up and try another buffer. This however is not a concern of the basic select-a-victim-buffer algorithm.)
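The five steps above can be simulated in a few dozen lines of Python. This is a single-process sketch, so BufFreelistLock and the handling of dirty buffers are left out. `ClockSweep`, `Buffer` and `MAX_USAGE` are hypothetical names, the limit of 5 is an assumed "small limit value", and the bounded loop that raises when every buffer stays pinned is an addition for the sake of the toy model.

```python
from dataclasses import dataclass
from typing import List

MAX_USAGE = 5  # assumed small limit for the usage counter


@dataclass
class Buffer:
    pin_count: int = 0    # reference count: how many backends hold the buffer
    usage_count: int = 0  # bumped on every pin, up to MAX_USAGE


class ClockSweep:
    def __init__(self, n_buffers: int) -> None:
        self.buffers: List[Buffer] = [Buffer() for _ in range(n_buffers)]
        self.free_list: List[int] = list(range(n_buffers))  # all buffers start free
        self.next_victim = 0                                 # the "clock hand"

    def pin(self, idx: int) -> None:
        buf = self.buffers[idx]
        buf.pin_count += 1
        buf.usage_count = min(buf.usage_count + 1, MAX_USAGE)

    def unpin(self, idx: int) -> None:
        self.buffers[idx].pin_count -= 1

    def get_victim(self) -> int:
        # Step 2: try the free list first, skipping unusable entries.
        while self.free_list:
            idx = self.free_list.pop(0)
            buf = self.buffers[idx]
            if buf.pin_count == 0 and buf.usage_count == 0:
                buf.pin_count = 1
                return idx
        # Steps 3-5: sweep the clock hand, decrementing usage counts on the way.
        n = len(self.buffers)
        for _ in range(n * (MAX_USAGE + 1)):
            idx = self.next_victim
            self.next_victim = (self.next_victim + 1) % n
            buf = self.buffers[idx]
            if buf.pin_count == 0 and buf.usage_count == 0:
                buf.pin_count = 1  # step 5: pin and return the victim
                return idx
            if buf.usage_count > 0:
                buf.usage_count -= 1  # step 4: give the buffer one less chance
        raise RuntimeError("every buffer is pinned")


pool = ClockSweep(3)
first_three = [pool.get_victim() for _ in range(3)]  # served from the free list
print(first_three)
```

Buffers that are pinned often accumulate a higher usage count and therefore survive more passes of the clock hand, which is how the sweep approximates "keep the frequently used blocks".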
  • 92. It’s bigger in the inside. Figure: The NextVictimBuffer
  • 93. It’s bigger in the inside. Since version 8.3 the buffer manager has the ring buffer strategy. Operations which require a large number of buffers in memory, like VACUUM or sequential scans of large tables, get a dedicated 256 kB ring buffer, small enough to fit in the processor’s L2 cache. The strategy improves buffer load and eviction and protects the rest of the shared buffer.
  • 94. The answer is 42. How PostgreSQL executes a query. After the physical storage and the memory, let’s take a look at how the database interacts with the backends from the logical point of view. Jargon:
      OID: Object ID, a 4 byte unsigned integer used to map any system object to a unique value
      class: any relational object: table, index, view, sequence...
      attribute: basically a table field
  • 95. The answer is 42. The parser stage. When a query is sent for processing, PostgreSQL first performs a syntax analysis using the query parser. Any error in this phase stops the execution with a syntax error. As this stage doesn’t require access to the system catalogue, no xid is wasted. If the syntax is correct the parser returns a parse tree ready for the next step.
  • 96. The answer is 42. The query tree. The second stage is still managed by the parser, which accesses the system catalogue and generates a query tree from the parse tree. This is a logical representation of the query in which every object and attribute is uniquely identified.
  • 97. The answer is 42. The query tree. Figure: A simple query tree
  • 98. The answer is 42. The query tree. To generate the query tree the parser accesses the system catalogue and retrieves the corresponding OID for each class and attribute in the query. Ambiguous names generate an error. The optional filtering elements are translated into the query tree as well.
  • 100. The answer is 42. The planner stage. The query tree is then sent to the query planner, which traverses the tree and generates all the possible execution plans, each with an arbitrary cost estimated from the statistics collected by the database. The plan with the minimum estimated cost is chosen and sent to the executor. Let me stress again the word estimate. A database with old or missing statistics will generate inefficient plans, resulting in slow queries.
  • 101. The answer is 42. The executor. The planner then returns the execution plan: a sequence of steps to retrieve the requested data, manipulate the data or change the database structure. The last stage is the executor. The execution plan steps are executed, then any output is returned to the backend.
  • 105. The answer is 42. EXPLAIN (or EXTERMINATE). The EXPLAIN statement returns the estimated execution plan for the query that follows it. The optional clause ANALYZE actually executes the query, discards the results and returns the real execution plan. DML queries run with EXPLAIN ANALYZE will change the data, so they should be wrapped between BEGIN; and ROLLBACK; to avoid unwanted results. Let’s see EXPLAIN in action.
  • 106. The answer is 42. For our purpose we’ll create a test table with two fields: an identifier, a 4 byte integer with an auto incremental value, and a character varying in which to store md5 values. The serial pseudo type is short for CREATE SEQUENCE t_test_i_id_seq (self generated name), then integer NOT NULL DEFAULT nextval('t_test_i_id_seq'::regclass).
      Listing 1: Create table
      test=# CREATE TABLE t_test
             (
                 i_id serial,
                 v_value character varying(50)
             );
      NOTICE:  CREATE TABLE will create implicit sequence "t_test_i_id_seq" for serial column "t_test.i_id"
      CREATE TABLE
  • 107. The answer is 42. Now let’s add some rows to our table.
      Listing 2: Insert in table
      test=# INSERT INTO t_test (v_value)
             SELECT v_value
             FROM (
                 SELECT generate_series(1,1000) AS i_cnt,
                        md5(random()::text) AS v_value
             ) t_gen;
      INSERT 0 1000
  • 108. The answer is 42. Let’s generate the estimated plan for a one row result.
      Listing 3: EXPLAIN
      test=# EXPLAIN SELECT * FROM t_test WHERE i_id=20;
                             QUERY PLAN
      --------------------------------------------------------
       Seq Scan on t_test  (cost=0.00..21.50 rows=1 width=37)
         Filter: (i_id = 20)
      (2 rows)
      As the table has no indices, the only possible action is a sequential scan of the table. The cost is an arbitrary value. The first number is the startup cost to deliver the first row to the next operator or the backend. The second number is the total cost needed to complete the operation. The rows value says how many rows the database expects to get from the operation, and the width is the estimated average row width in bytes.
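The total cost of 21.50 in Listing 3 can be reproduced by hand from the planner's default cost parameters. The Python sketch below shows the arithmetic. The three constants are PostgreSQL's documented defaults, but the page count of 9 is an assumption chosen because it is consistent with 1000 short rows and with the 21.50 shown (the real value would come from the table's statistics), and `seq_scan_cost` is a hypothetical helper name.

```python
SEQ_PAGE_COST = 1.0         # default seq_page_cost
CPU_TUPLE_COST = 0.01       # default cpu_tuple_cost
CPU_OPERATOR_COST = 0.0025  # default cpu_operator_cost


def seq_scan_cost(pages: int, rows: int, filter_operators: int = 1) -> float:
    """Estimated total cost of a sequential scan with a simple filter."""
    disk = pages * SEQ_PAGE_COST               # read every page once
    cpu = rows * CPU_TUPLE_COST                # process every row
    cpu += rows * filter_operators * CPU_OPERATOR_COST  # evaluate the WHERE clause per row
    return disk + cpu


# Assumed: the 1000 test rows occupy 9 pages.
total = seq_scan_cost(pages=9, rows=1000)
print(f"{total:.2f}")
```

The startup cost is 0.00 because a sequential scan can hand over its first row without any preparatory work.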
  • 109. The answer is 42. Let’s generate the real execution plan for a one row result.
      Listing 4: EXPLAIN ANALYZE
      test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id=20;
                                                  QUERY PLAN
      --------------------------------------------------------------------------------------------------
       Seq Scan on t_test  (cost=0.00..21.50 rows=1 width=37) (actual time=0.021..0.198 rows=1 loops=1)
         Filter: (i_id = 20)
         Rows Removed by Filter: 999
       Total runtime: 0.235 ms
      (4 rows)
      The second group of values gives the real time, in milliseconds, for the startup and for the whole operation. The loops value shows how many times the operator is executed. At the bottom, the total runtime gives the real execution time of the query.
  • 110. The answer is 42. Let’s add an index on the id field...
      Listing 5: CREATE INDEX
      test=# CREATE INDEX idx_i_id ON t_test (i_id);
      CREATE INDEX
  • 111. The answer is 42. ...and generate a new execution plan.
      Listing 6: EXPLAIN ANALYZE WITH INDEX
      test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id=20;
                                                         QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------
       Index Scan using idx_i_id on t_test  (cost=0.00..8.27 rows=1 width=37) (actual time=0.019..0.020 rows=1 loops=1)
         Index Cond: (i_id = 20)
       Total runtime: 0.055 ms
      (3 rows)
      The scan itself is about ten times faster (0.020 ms against 0.198 ms) and the total runtime about four times faster (0.055 ms against 0.235 ms).
  • 112. The answer is 42. The cost based optimizer keeps getting cleverer. For example, if we ask for more than half of the estimated table, the database will choose the cheaper execution plan, here the sequential scan, even though an index exists.
      Listing 7: EXPLAIN ANALYZE WITH INDEX
      test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id>20;
                                                    QUERY PLAN
      ------------------------------------------------------------------------------------------------------
       Seq Scan on t_test  (cost=0.00..21.50 rows=980 width=37) (actual time=0.013..0.148 rows=980 loops=1)
         Filter: (i_id > 20)
         Rows Removed by Filter: 20
       Total runtime: 0.209 ms
      (4 rows)
      test=# SET enable_seqscan='off';
      SET
      test=# EXPLAIN ANALYZE SELECT * FROM t_test WHERE i_id>20;
                                                         QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------
       Index Scan using idx_i_id on t_test  (cost=0.00..49.40 rows=980 width=37) (actual time=0.042..0.390 rows=980 loops=1)
         Index Cond: (i_id > 20)
       Total runtime: 0.507 ms
      (3 rows)
      With sequential scans disabled the planner falls back to the index scan, which has a higher estimated cost (49.40 against 21.50) and a longer runtime (0.507 ms against 0.209 ms).
  • 115. The answer is 42. Scan nodes:
      seq scan: reads sequentially all the blocks in the table and discards the non-matching rows
      index scan: reads the index tree with random disk reads; it returns ordered data
      bitmap index/heap scan: reads the index sequentially, generating a bitmap to recheck against the table; it doesn’t return ordered data and is a good compromise between a seq scan and a full index scan
  • 118. The answer is 42. Join nodes:
      nested loop: for each row of the relation on the left, apply the filter to the relation on the right
      hash join:
      merge join:
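The three join nodes named above can be compared with a toy Python sketch working on lists of dictionaries. Only the nested loop is described on the slide. The hash and merge variants below follow the textbook algorithms and are an assumption about what the headings refer to, not a description of PostgreSQL's executor. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]
Pairs = List[Tuple[Row, Row]]


def nested_loop_join(left: List[Row], right: List[Row], key: str) -> Pairs:
    """For each left row, scan the whole right relation and keep the matches."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(left: List[Row], right: List[Row], key: str) -> Pairs:
    """Build a hash table on the right relation, then probe it once per left row."""
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    return [(l, r) for l in left for r in table.get(l[key], [])]


def merge_join(left: List[Row], right: List[Row], key: str) -> Pairs:
    """Sort both inputs on the key, then walk them in step."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out: Pairs = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ... and pair it with every left row sharing the same key.
            while i < len(left) and left[i][key] == lk:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


left_rows = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 2, "a": "z"}]
right_rows = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}, {"id": 2, "b": "r"}]
for join in (nested_loop_join, hash_join, merge_join):
    print(join.__name__, len(join(left_rows, right_rows, "id")))
```

All three return the same pairs. They differ in how much work they do to find them: the nested loop rescans the right side for every left row, the hash join pays once to build a lookup table, and the merge join pays once to sort both sides.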
  • 119. Why do we fall?
  • 120. And I thought my jokes were bad
  • 121. A couple of things to know about PostgreSQL... (Before start coding) Federico Campoli 9 July 2013