A couple of things about PostgreSQL...

A couple of things to know about PostgreSQL...
(Before you start coding)
Federico Campoli
9 July 2013
Federico Campoli () A couple of things to know about PostgreSQL... 9 July 2013 1 / 58

Introduction
What is blue, bigger on the inside and has time travel capabilities?

Introduction
If your answer is the TARDIS, then, yes you’re close enough, but the correct
answer is PostgreSQL.

Introduction
and regarding the couple of things....
I lied.

Introduction
PostgreSQL is a wild beast.
We’ll talk about the common mistakes, the confusing jargon, the online manual’s
lost pages and the best practices to spare your DBA headaches (if you have
one, of course).
The major version used in this talk is 9.2.
So, let’s start with the TOC

Table of contents
A byte is a byte, is a byte, is a byte
The database physical storage.

Table of contents
The magic of the MVCC
How PostgreSQL keeps things consistent.

Table of contents
TOAST Please, and don’t forget the Marmite
The power of the out of line storage up to 1 GB and free of charge.

Table of contents
It’s bigger on the inside
The database memory, how to stick an elephant in a smart car.

Table of contents
The answer is 42
Explaining the unexplainable, the CBO and the execution plan.

Table of contents
Why do we fall?
Crashing the most advanced open source database is easy...
And I thought my jokes were bad
And then I’ll need a back door to escape....

PostgreSQL stores its data in a dedicated directory, also known as the cluster,
identified by the environment variable $PGDATA on Unix and %PGDATA% on Windows.
The location is initialized by the initdb utility and contains various subdirectories.
Each folder has a specific function.
Figure: PGDATA

The global directory
Contains the cluster’s shared objects like pg_database, pg_tablespace...

and a small 8 kB file, pg_control, probably the most important file in the entire
system.

pg_control tracks the database’s vital activities.
With a corrupted or missing pg_control the database cannot start.

The base directory
The default location when a new database is created without the TABLESPACE
clause.
Contains numeric subfolders, one for each database.
The number is the database’s object id and is stored in the pg_database system
table.
Contains an optional folder, pgsql_tmp, used for external sorts and temporary files.
The location is mapped in the pg_tablespace system table with the name
pg_default.

The base directory
Each subdirectory contains.....just guess.... numeric files

The base directory
Each file can grow up to 1 GB; then a new chunk is generated with a sequence
number suffix.
The data files are organized in fixed-size blocks, 8192 bytes by default.
Any change to the block size requires a build from source and a new initdb.
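
The chunk and block arithmetic above can be sketched in a few lines. This is only an illustration under the default sizes quoted on the slide (8192-byte blocks, 1 GB chunks); the function name locate_block and the filenode value used in the usage note are hypothetical.

```python
# Sketch: locate a block inside a relation's chunk files.
# Assumes the defaults quoted above: 8192-byte blocks and 1 GB chunks.
BLOCK_SIZE = 8192
SEGMENT_SIZE = 1024 ** 3
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE  # 131072 blocks per chunk


def locate_block(filenode, block_number):
    """Return (file name, byte offset) for a block of a relation.

    `filenode` is the numeric file name; the first chunk has no suffix,
    the following ones are named <filenode>.1, <filenode>.2, ...
    """
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    name = str(filenode) if segment == 0 else f"{filenode}.{segment}"
    return name, block_in_segment * BLOCK_SIZE
```

For a made-up filenode 16384, block 131072 is the first block of the second chunk, so locate_block(16384, 131072) points at offset 0 of the file 16384.1.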

The base directory
The data files are called nodes and are mapped to the relations in the pg_class system
table...
And yes, we are dealing with an object-relational database system.

The pg_tblspc directory
Contains the symbolic links to the tablespaces.

A tablespace is a logical location for physical objects

Useful to spread tables and indices across different physical devices

Combined with logical volume management, tablespaces can dramatically improve
performance...

or drive the project to a complete failure

An object’s tablespace location can be safely changed, but this requires an
exclusive lock on the affected object.
The pg_tablespace system table maps the tablespaces’ names and identifiers.

The pg_xlog directory

WARNING INCOMING AIRSTRIKE

Also known as the write ahead log directory, WAL

Is the most important and critical directory in the cluster

Contains 16 MB segments used by the database to save the block changes

Each segment contains the blocks changed in volatile memory

Not used when the database is stopped cleanly

Is absolutely critical when a crash or unclean shutdown happens

A single block corruption results in an unrecoverable instance

The number of segments is automatically managed by the database
Putting the location on a dedicated and highly reliable device is vital

Pages
Voyage to the centre of the datafile
Each block is structured almost in the same way, for tables and indices.
Figure: Page schema

Pages
Each page starts with a 24-byte header.
Right after the header, in the upper section, reside the tuple pointers, 4 bytes
each.
The physical tuples are stored at the page’s end.
The optional bitmap that tracks nulls is part of each tuple’s header, not of the page header.
Figure: Page header
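
As a rough illustration of the layout described above, the sketch below decodes a 24-byte page header from a synthetic page. The slides name only pd_lsn and pd_tli; the other field names and the field order follow the page layout documented for PostgreSQL 9.2, and the little-endian byte order, the helper name parse_page_header and the sample values are assumptions made for this example.

```python
import struct

# 24-byte page header: pd_lsn (two 4-byte halves), then six 2-byte fields
# and a final 4-byte field. Little-endian is assumed for this sketch; a
# real page uses the byte order of the server that wrote it.
PAGE_HEADER = struct.Struct("<IIHHHHHHI")
FIELDS = ("pd_lsn_xlogid", "pd_lsn_xrecoff", "pd_tli", "pd_flags",
          "pd_lower", "pd_upper", "pd_special", "pd_pagesize_version",
          "pd_prune_xid")
TUPLE_POINTER_SIZE = 4


def parse_page_header(page):
    """Decode the header and derive the pointer count and the free space."""
    header = dict(zip(FIELDS, PAGE_HEADER.unpack_from(page)))
    # pd_lower marks the end of the tuple pointer array
    header["tuple_pointers"] = (
        (header["pd_lower"] - PAGE_HEADER.size) // TUPLE_POINTER_SIZE)
    # the gap between pd_lower and pd_upper is the free space
    header["free_space"] = header["pd_upper"] - header["pd_lower"]
    return header


# A synthetic 8192-byte page carrying three tuple pointers (made-up values).
page = bytearray(8192)
PAGE_HEADER.pack_into(page, 0, 0, 1234, 1, 0, 24 + 3 * 4, 8000, 8192,
                      8192 | 4, 0)
info = parse_page_header(bytes(page))
```

The tuple pointers grow downwards from the header while the tuples grow upwards from the page’s end, so the free space is whatever is left between the two.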

Page header
The page header contains a couple of interesting things...

Page header
pd_lsn is the most recent WAL sequence number for the page

Page header
pd_tli is the page’s timeline id

Page header
Yes, PostgreSQL has timelines...
When a point-in-time recovery is performed, a new timeline is created to avoid
conflicts and paradoxes.

Page header
People assume that transactions in PostgreSQL are a strict progression of xid, but
actually from a non-linear, non-subjective viewpoint and thanks to the timelines,
it’s more like a big ball of wibbly wobbly... timey wimey... stuff
Figure: Would you like a jelly baby?

The tuples
Now we can finally look at the physical tuples and discover another header.
Summing the documented field sizes gives 27 bytes, but t_cid and t_xvac share the
same 4 bytes, so the fixed header occupies 23 bytes on most machines.
The numbers are the bytes used by the single fields.
Each tuple, even a single boolean value, carries this header overhead.
The user data can be the actual data stream or the pointer to the out-of-line
data stream.
Figure: Tuple structure

PostgreSQL consistency
Statements in PostgreSQL happen through transactions.
By default, when a single statement completes successfully the database
automatically commits the transaction.
It’s possible to wrap multiple statements in a single transaction using the
keywords BEGIN; ... COMMIT; or ROLLBACK;
The lowest transaction isolation level available is READ COMMITTED.
Only the committed changes become visible to other sessions.
Any error or ROLLBACK statement during the transaction cancels the entire
operation, leaving the data in a consistent state at any time during the database
activity.
PostgreSQL supports savepoints to partially roll back a long transaction.
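
The BEGIN, SAVEPOINT, COMMIT and ROLLBACK flow can be tried without a running server. The sketch below uses Python’s built-in sqlite3 module with an in-memory database purely as a stand-in for PostgreSQL: the transaction keywords used here are accepted by both engines, but everything else about SQLite differs. The table and savepoint names are made up.

```python
import sqlite3

# SQLite in memory stands in for PostgreSQL; only the transaction flow
# (BEGIN / SAVEPOINT / ROLLBACK TO / COMMIT / ROLLBACK) is the point here.
conn = sqlite3.connect(":memory:", isolation_level=None)  # no implicit BEGIN
conn.execute("CREATE TABLE t_demo (id INTEGER PRIMARY KEY, note TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO t_demo VALUES (1, 'kept')")
conn.execute("SAVEPOINT before_risky_step")
conn.execute("INSERT INTO t_demo VALUES (2, 'discarded')")
conn.execute("ROLLBACK TO SAVEPOINT before_risky_step")  # undoes only row 2
conn.execute("COMMIT")

conn.execute("BEGIN")
conn.execute("INSERT INTO t_demo VALUES (3, 'never visible')")
conn.execute("ROLLBACK")  # the whole transaction is cancelled

rows = conn.execute("SELECT id, note FROM t_demo ORDER BY id").fetchall()
```

Only the row inserted before the savepoint and then committed survives; the partial rollback and the full rollback leave no trace in the table.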

How PostgreSQL keeps things consistent
To keep everything consistent PostgreSQL uses Multi Version Concurrency
Control, also known as MVCC.
The base logic seems simple.
A 4-byte unsigned integer called xid is incremented by 1 and assigned to the
current transaction.
Every committed xid lower than the current xid is in the past and therefore visible to
the current session.
Every xid greater than the current xid is in the future and therefore invisible to the
current session.
The commit status is managed in $PGDATA, in the pg_clog directory, where
small 8k files track the transaction statuses.
The xid match is performed against the tuple’s header seen before.

t_xmin contains the xid generated at tuple insert
t_xmax contains the xid generated at tuple delete
t_cid contains the internal command id to track the sequence inside the same
transaction

There’s something missing, isn’t there? Where is the field to store the UPDATE xid?
Figure: Tuple structure

Well, PostgreSQL actually NEVER performs an update.
When an UPDATE statement is issued, the updated rows are inserted as new tuples
with t_xmin set to the current xid value.
The old row versions are marked as dead by writing the current transaction id
into the t_xmax field.
The database manages the tuple’s visibility using this simple routine

Source code comment in src/backend/utils/time/tqual.c:
/*
 *
 * The satisfaction of "now" requires the following:
 *
 * ((Xmin == my-transaction &&      inserted by the current transaction
 *   Cmin < my-command &&           before this command, and
 *   (Xmax is null ||               the row has not been deleted, or
 *    (Xmax == my-transaction &&    it was deleted by the current transaction
 *     Cmax >= my-command)))        but not before this command,
 * ||                               or
 *  (Xmin is committed &&           the row was inserted by a committed transaction, and
 *   (Xmax is null ||               the row has not been deleted, or
 *    (Xmax == my-transaction &&    the row is being deleted by this transaction
 *     Cmax >= my-command) ||       but it's not deleted "yet", or
 *    (Xmax != my-transaction &&    the row was deleted by another transaction
 *     Xmax is not committed))))    that has not been committed
 *
 */

Source code comment in src/backend/utils/time/tqual.c:
* HeapTupleSatisfiesNow
* True iff heap tuple is valid "now".
*
* Here, we consider the effects of:
* all committed transactions (as of the current instant)
* previous commands of this transaction
*
* Note we do _not_ include changes made by the current command. This
* solves the "Halloween problem" wherein an UPDATE might try to re-update
* its own output tuples, http://en.wikipedia.org/wiki/Halloween_Problem.
*
* Note:
* Assumes heap tuple is valid.
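
The "now" rule quoted above can be transcribed almost literally. The sketch below is not the PostgreSQL implementation: the TupleHeader class, the satisfies_now function and the plain set of committed xids are hypothetical stand-ins for the real tuple header and for the commit status lookup, and the sketch assumes the current transaction is not itself in the committed set.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class TupleHeader:
    xmin: int                   # xid that inserted the tuple
    cmin: int                   # command id of the insert
    xmax: Optional[int] = None  # xid that deleted the tuple, if any
    cmax: Optional[int] = None  # command id of the delete, if any


def satisfies_now(t: TupleHeader, my_xid: int, my_command: int,
                  committed: Set[int]) -> bool:
    """Literal transcription of the "now" rule quoted above."""
    # deleted by the current transaction, but not before this command
    deleted_later_by_me = (
        t.xmax == my_xid and t.cmax is not None and t.cmax >= my_command)
    if t.xmin == my_xid:
        # inserted by the current transaction, before this command
        return t.cmin < my_command and (t.xmax is None or deleted_later_by_me)
    if t.xmin in committed:
        # inserted by a committed transaction
        deleted_by_uncommitted_other = (
            t.xmax is not None and t.xmax != my_xid
            and t.xmax not in committed)
        return (t.xmax is None or deleted_later_by_me
                or deleted_by_uncommitted_other)
    return False


# An UPDATE by transaction 105: the old version gets xmax = 105 and a new
# version is inserted with xmin = 105. Transaction 106 is looking at both.
committed = {100}
old_version = TupleHeader(xmin=100, cmin=0, xmax=105, cmax=0)
new_version = TupleHeader(xmin=105, cmin=0)
before_commit = (satisfies_now(old_version, 106, 0, committed),
                 satisfies_now(new_version, 106, 0, committed))
committed.add(105)
after_commit = (satisfies_now(old_version, 106, 0, committed),
                satisfies_now(new_version, 106, 0, committed))
```

Before transaction 105 commits, the other session still sees the old version only; after the commit the visibility flips to the new version, without any tuple being overwritten in place.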

The dead tuples are not immediately reclaimed, and they add overhead to any IO
operation because the whole block is read to determine which tuples are visible.
To free the space the VACUUM command should be used.
The command is absolutely safe.
It’s designed to have minimal impact on the database’s normal activity.
VACUUM scans the relation and the indices for dead tuples no longer visible to
open transactions.
It is absolutely vital to run VACUUM on each database in the cluster at least every 2
billion transactions.

XID is a 4 byte unsigned integer.

Every 4 billion transactions the value wraps around

PostgreSQL uses the modulo-2^31
comparison method

For each value, 2 billion xids are in the future and 2 billion in the past.
When an xid’s age gets too close to 2 billion, VACUUM freezes the value
to a hardcoded xid that is in the past by definition.
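
The circular comparison can be sketched as follows. This is a minimal illustration of the modulo-2^31 idea, not the server code: the function name xid_precedes is hypothetical, and the special handling of reserved or frozen xids is deliberately left out.

```python
XID_MODULUS = 2 ** 32  # a 4-byte counter wraps at this value
HALF_RANGE = 2 ** 31   # about 2 billion xids on each side


def xid_precedes(a, b):
    """True if xid `a` is in the past of xid `b` on the xid circle."""
    diff = (a - b) % XID_MODULUS
    # Read the difference as a signed 32-bit number: a value in the upper
    # half of the range is negative, which means `a` is older than `b`.
    return diff >= HALF_RANGE
```

An xid issued just before the counter wraps, such as 2^32 - 10, still precedes a small xid like 5 issued right after the wrap, which is why old values must be frozen before they drift 2 billion transactions away.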

If for any reason an xid gets within 10 million transactions of the wraparound
failure, the database starts emitting scary messages
WARNING: database "mydb" must be vacuumed within 177009986 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb".

If an xid’s age gets within 1 million transactions of the wraparound failure, the
database simply shuts down and can be started only in single-user mode to perform
the VACUUM.
Anyway, the autovacuum daemon, even if turned off, starts the required VACUUM
long before this catastrophic scenario happens.

TOAST, the best thing since sliced bread

Funny people indeed

Funny people indeed
TOAST is the acronym for The Oversized-Attribute Storage Technique

Funny people indeed
An attribute is also known as a field
TOAST can store up to 1 GB in out-of-line storage (free of charge)

Fixed-length data types like integer, date and timestamp are not TOASTable.
The data is stored after the tuple header.

Varlena data types, such as character varying without an upper bound, text or bytea,
are stored in line or out of line.

The storage technique used depends on the data stream size and on the storage
method assigned to the attribute.
Depending on the chosen strategy, the data can be stored in external
relations or compressed using a fast LZ-family algorithm (pglz, not zlib).

TOAST permits four storage strategies (shamelessly copied from the online
manual).
PLAIN prevents either compression or out-of-line storage. This is the only
possible strategy for columns of non-TOAST-able data types.
EXTENDED allows both compression and out-of-line storage. This is the
default for most TOAST-able data types. Compression will be attempted
first, then out-of-line storage if the row is still too big.
EXTERNAL allows out-of-line storage but not compression. Use of
EXTERNAL will make substring operations on wide text and bytea columns
faster at the penalty of increased storage space.
MAIN allows compression but not out-of-line storage. Actually, out-of-line
storage will still be performed for such columns, but only as a last resort.

When out-of-line storage is used, the data is encoded as bytea and, if necessary,
split into multiple chunks.
A unique index on chunk_id and chunk_seq both prevents duplicate data and
speeds up the lookups
Figure: Toast table
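
The chunking idea can be sketched as below. The 2000-byte chunk size, the function names and the sample chunk id are assumptions chosen for illustration; the real chunk size depends on how the server was built, and compression is ignored here.

```python
CHUNK_SIZE = 2000  # assumed for illustration; the real limit depends on the build


def toast_chunks(chunk_id, data, chunk_size=CHUNK_SIZE):
    """Split an out-of-line value into (chunk_id, chunk_seq, chunk_data) rows."""
    return [
        (chunk_id, seq, data[start:start + chunk_size])
        for seq, start in enumerate(range(0, len(data), chunk_size))
    ]


def detoast(rows):
    """Rebuild the value by reading the chunks in chunk_seq order."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))


value = bytes(range(256)) * 20        # a 5120-byte sample value
rows = toast_chunks(16400, value)     # 16400 is a made-up chunk id
```

Each row is identified by the pair (chunk_id, chunk_seq), which is exactly what the unique index covers: the pair can never repeat, and reading the chunks back in chunk_seq order rebuilds the original value.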

A PostgreSQL instance is a memory segment shared between multiple processes
accessing the data directory.
When a new connection is made, a new postgres process is forked and attached to the
shared memory, also known as the shared buffer.
PostgreSQL is a multiprocess database system, not a multithreaded one.
Each process can use only one processor or core.
To keep things consistent every single block, whether for reading or for writing, must
pass through the shared buffer.
As the shared buffer is smaller than the database, and often smaller than a
single table, the blocks in memory must be managed and the space allocation
must adapt to the required usage.

Jargon
backend process: a postgres process attached to the shared buffer
heap page: a table’s data page
index page: an index data page
buffer: a page, index or heap loaded in the shared buffer
dirty buffer: a buffer WAL-logged but not yet written to disk
clean buffer: a buffer whose content has been written to disk
pinned buffer: buffer held by a backend process
unpinned buffer: buffer released and available to be pinned again

In the early days of PostgreSQL 7.x a simple least recently used (LRU) buffer list was used.
After the unpin, this simple algorithm moved the buffers to the top of a simple
FIFO list.
During the revolutionary 8.0 development, a new powerful algorithm was
introduced:
the Adaptive Replacement Cache, capable of self-adapting the size of two pools
dedicated to the most recently used and the most frequently used buffers.
This algorithm was removed a few weeks before the stable release because of a
software patent.
An emergency two-queue algorithm was adopted, making the memory
management not as brilliant as expected.
The next year, release 8.1 adopted the clock sweep memory manager.
The algorithm, simple, flexible and free, is still in use with a few improvements.

The buffer manager’s main goal is to keep the most recently used blocks cached in
memory and to adapt dynamically to the most frequently used blocks.
To do this a small portion of memory is used as a free list of the buffers available for
eviction.
Figure: Free list

The buffers have a usage counter. Every time a buffer is pinned
the counter is incremented by one, up to a small limit value.
Figure: Block usage counter

Shamelessly copied from the file src/backend/storage/buffer/README
There is a "free list" of buffers that are prime candidates for replacement. In
particular, buffers that are completely free (contain no valid page) are always in
this list.
To choose a victim buffer to recycle when there are no free buffers available, we
use a simple clock-sweep algorithm, which avoids the need to take system-wide
locks during common operations.

It works like this:
Each buffer header contains a usage counter, which is incremented (up to a small
limit value) whenever the buffer is pinned. (This requires only the buffer header
spinlock, which would have to be taken anyway to increment the buffer reference
count, so it’s nearly free.)
The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
through all the available buffers. NextVictimBuffer is protected by the
BufFreelistLock.

The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain BufFreelistLock.
2. If buffer free list is nonempty, remove its head buffer. If the buffer is pinned
or has a nonzero usage count, it cannot be used; ignore it and return to the
start of step 2. Otherwise, pin the buffer, release BufFreelistLock, and return
the buffer.
3. Otherwise, select the buffer pointed to by NextVictimBuffer, and circularly
advance NextVictimBuffer for next time.
4. If the selected buffer is pinned or has a nonzero usage count, it cannot be
used. Decrement its usage count (if nonzero) and return to step 3 to
examine the next buffer.
5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
(Note that if the selected buffer is dirty, we will have to write it out before we can
recycle it; if someone else pins the buffer meanwhile we will have to give up and
try another buffer. This however is not a concern of the basic
select-a-victim-buffer algorithm.)
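
The five steps above can be condensed into a small single-process model. It is only a sketch: the class and method names are hypothetical, the limit of 5 for the usage counter is an assumption, BufFreelistLock and the buffer header spinlock are left out because nothing runs concurrently here, and dirty buffers are not written out.

```python
from dataclasses import dataclass

MAX_USAGE_COUNT = 5  # the "small limit value"; the exact number is assumed


@dataclass
class Buffer:
    usage_count: int = 0
    pinned: bool = False


class ClockSweep:
    def __init__(self, n_buffers):
        self.buffers = [Buffer() for _ in range(n_buffers)]
        self.free_list = list(range(n_buffers))  # at start every buffer is free
        self.next_victim = 0                     # the "clock hand"

    def pin(self, index):
        buf = self.buffers[index]
        buf.pinned = True
        buf.usage_count = min(buf.usage_count + 1, MAX_USAGE_COUNT)

    def unpin(self, index):
        self.buffers[index].pinned = False

    def get_victim(self):
        # Step 2: try the free list first, skipping buffers that are in use.
        while self.free_list:
            index = self.free_list.pop(0)
            buf = self.buffers[index]
            if not buf.pinned and buf.usage_count == 0:
                self.pin(index)
                return index
        if all(buf.pinned for buf in self.buffers):
            raise RuntimeError("no unpinned buffers available")
        # Steps 3-5: sweep the clock hand until an unused buffer is found.
        while True:
            index = self.next_victim
            self.next_victim = (self.next_victim + 1) % len(self.buffers)
            buf = self.buffers[index]
            if buf.pinned or buf.usage_count > 0:
                if buf.usage_count > 0:
                    buf.usage_count -= 1  # the buffer gets one more lap
                continue
            self.pin(index)
            return index
```

A buffer that is pinned often keeps a high usage count and survives several laps of the clock hand, while a buffer touched only once is recycled after a single full sweep.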

Figure: The NextVictimBuffer

Since version 8.3 the buffer manager has a ring buffer strategy.
Operations which require a large amount of buffers in memory, like VACUUM or
sequential scans of large tables, have a dedicated 256 kB ring buffer, small enough to
fit in the processor’s L2 cache.
The strategy improves buffer loading and eviction and protects the rest of the
shared buffer.

The answer is 42
How PostgreSQL executes a query
After the physical storage and the memory, let’s take a look at how the database
interacts with the backends from the logical point of view.
Jargon
OID: Object ID, a 4-byte unsigned integer used to map any system object to a unique
value
class: any relational object, table, index, view, sequence...
attribute: basically the table fields

The answer is 42
The parser stage
When a query is sent for processing PostgreSQL first performs a syntax analysis
using the query parser.
Any error in this phase stops the execution with a syntax error.
As this stage doesn’t require access to the system catalogue, no xid is wasted.
If the syntax is correct the parser returns a parse tree ready for the next step.

The answer is 42
The query tree
The second stage is still managed by the parser, which accesses the system catalogue
and generates a query tree from the parse tree.
This is a logical representation of the query where any object and attribute is
unique.

The answer is 42
The query tree
Figure: A simple query tree

The answer is 42
The query tree
To generate the query tree the parser accesses the system catalogue and retrieves the
corresponding OID for each class and attribute in the query.
Ambiguous names will generate an error.
The optional filtering elements are translated in the query tree as well.

The answer is 42
The planner stage
The query tree is then sent to the query planner, which traverses the tree and
generates all the possible execution plans, each with an arbitrary cost estimated from
the statistics collected by the database.
The plan with the minimum estimated cost is chosen for processing and sent to
the executor.
Let me stress again the word estimate.
A database with old or missing statistics will generate inefficient plans, resulting
in slow queries.
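
A toy version of the cost comparison shows why the word estimate matters. The three constants are the documented default values of the planner settings with the same names; the sequential scan formula is the basic one without any filter cost, while the index scan formula is deliberately crude and, like the function names and the sample numbers, only an assumption for this sketch.

```python
# Documented defaults of the planner settings with the same names.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01


def seq_scan_cost(pages, tuples):
    # read every page sequentially and process every tuple
    return pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST


def index_scan_cost(tuples, selectivity):
    # crude assumption: one random page read for each matching tuple
    matching = tuples * selectivity
    return matching * (RANDOM_PAGE_COST + CPU_TUPLE_COST)


def cheapest_plan(pages, tuples, selectivity):
    """Pick the plan with the minimum estimated cost."""
    plans = {
        "Seq Scan": seq_scan_cost(pages, tuples),
        "Index Scan": index_scan_cost(tuples, selectivity),
    }
    return min(plans, key=plans.get)


# A table of 1000 pages and 100000 tuples.
plan_with_stale_stats = cheapest_plan(1000, 100000, selectivity=0.001)
plan_with_fresh_stats = cheapest_plan(1000, 100000, selectivity=0.5)
```

If the statistics claim that the filter matches 0.1% of the rows while it really matches half of them, the index scan is chosen on the basis of the stale estimate although the sequential scan would be far cheaper.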

The answer is 42
The executor
The planner then returns the execution plan, a sequence of steps to retrieve the
requested data, to manipulate the data or to change the database structure.
The last stage is the executor. The execution plan’s steps are executed, then
any output is returned to the backend.

The answer is 42
EXPLAIN (or EXTERMINATE)
The EXPLAIN statement returns the estimated execution plan for the
query that follows it.
The optional ANALYZE clause actually executes the query, discards the
results and returns the real execution plan.
