Huy Nguyen
CTO, Cofounder - Holistics Software
Cofounder, Grokking Vietnam
PostgreSQL Internals 101
/post:gres:Q:L/
About Me
Education:
● Pho Thong Nang Khieu, Tin 04-07
● National University of Singapore (NUS), Computer Science Major.
Work:
● Software Engineer Intern, SenseGraphics (Stockholm, Sweden)
● Software Engineer Intern, Facebook (California, US)
● Data Infrastructure Engineer, Viki (Singapore)
Now:
● Co-founder & CTO, Holistics Software
● Co-founder, Grokking Vietnam
huy@holistics.io facebook.com/huy bit.ly/huy-linkedin
● This talk covers only a small part of PostgreSQL concepts and internals.
● As with any RDBMS, PostgreSQL is a complex system, and it’s still evolving.
● The talk mainly revolves around explaining Uber’s “MySQL vs PostgreSQL” article.
● Not covered: memory management, query planning, replication, etc.
Agenda
● Uber’s Article
● Table Heap
● B-Tree Index
● MVCC
● MySQL Structure
● PostgreSQL vs MySQL (Uber Use-case)
● Index-only Scan
● Heap-only Tuple (HOT)
Uber migrating from PostgreSQL to MySQL
Uber’s Use Case
● Tables with lots of indexes (covering all or almost all columns)
● Lots of UPDATEs
⇒ MySQL handles this better than PostgreSQL
● Read more: https://eng.uber.com/mysql-migration/
Physical Structure
● Everything is under the base directory ($PGDATA), e.g. /var/lib/postgresql/9.x/main
● Each database is a folder named after its oid
http://www.interdb.jp/pg/pgsql01.html
Physical Structure
Each table’s data is stored in one or more files (at most 1GB each), named after the table’s relfilenode.
demodb=# select oid, relname, relfilenode
from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416854
(1 row)
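TRUNCATE table;
vs
DELETE FROM table;
TRUNCATE gives the table a new, empty data file (a new relfilenode), while DELETE only marks the existing tuples as deleted in the same file.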
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416854
(1 row)
demodb=# truncate test;
TRUNCATE TABLE
INSERT 0 1
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416857
(1 row)
Tuple Address (ctid)
ctid   | id | name
-------+----+---------
(0, 2) | 1  | Alice
(0, 5) | 2  | Bob
(1, 3) | 3  | Charlie
ctid (tuple ID): a pair (block number, position within the block) that locates the tuple in the data file.
Heap Table Structure
Page: a fixed-size block of the data file, 8KB by default.
Line pointers: 4-byte entries at the start of the page, each pointing to one tuple inside the page.
For tuples larger than about 2KB, a special storage method called TOAST is used.
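To connect the 1GB file limit with the 8KB page size, here is a minimal Python sketch that maps a block number (the first half of a ctid) to a data file and a byte offset. It assumes the default sizes quoted in the slides; `locate_page` is a hypothetical helper written for this illustration, not a PostgreSQL function.

```python
BLOCK_SIZE = 8 * 1024                 # default page size (8KB)
SEGMENT_SIZE = 1024 ** 3              # each data file holds at most 1GB
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE   # 131072 pages per file


def locate_page(relfilenode: int, block: int) -> tuple[str, int]:
    """Return (file name, byte offset) of a heap page. Hypothetical helper."""
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    # The first file is named after relfilenode; later ones get ".1", ".2", ...
    name = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return name, block_in_segment * BLOCK_SIZE


print(locate_page(416854, 1))         # page holding ctid (1, 3)
print(locate_page(416854, 131072))    # first page of the second file
```

The line pointer number (the second half of the ctid) is then resolved inside that 8KB page.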
MVCC - Multi-version Concurrency Control
● Problem: someone is reading data while someone else is writing to it
● The reader might see an inconsistent piece of data
● MVCC: allows reads and writes to happen concurrently
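MVCC - Table
xmin | xmax | id | name
-----+------+----+---------
1    | 5    | 1  | Alice
2    | 3    | 2  | Bob
3    |      | 2  | Robert
4    |      | 3  | Charlie
Operations (each runs in the transaction with the same number):
1. INSERT Alice
2. INSERT Bob
3. UPDATE Bob → Robert
4. INSERT Charlie
5. DELETE Alice
● xmin: ID of the transaction that inserted this tuple
● xmax: ID of the transaction that deleted this tuple or replaced it with a newer version (empty while the tuple is live)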
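The table above can be reproduced with a toy model. This is a minimal sketch, assuming every transaction commits immediately and that a reader with snapshot N sees exactly the transactions 1..N; real PostgreSQL visibility rules also handle in-progress and aborted transactions. `ToyTable` and `HeapTuple` are names made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeapTuple:
    xmin: int               # transaction that inserted this version
    xmax: Optional[int]     # transaction that deleted/replaced it; None while live
    id: int
    name: str


class ToyTable:
    def __init__(self) -> None:
        self.heap: list[HeapTuple] = []

    def _live(self, row_id: int) -> HeapTuple:
        return next(t for t in self.heap if t.id == row_id and t.xmax is None)

    def insert(self, txid: int, row_id: int, name: str) -> None:
        self.heap.append(HeapTuple(txid, None, row_id, name))

    def delete(self, txid: int, row_id: int) -> None:
        self._live(row_id).xmax = txid     # only marks the tuple, nothing is erased

    def update(self, txid: int, row_id: int, name: str) -> None:
        self.delete(txid, row_id)          # old version gets xmax
        self.insert(txid, row_id, name)    # new version gets xmin

    def visible(self, snapshot: int) -> list[str]:
        """Names visible to a reader that sees transactions 1..snapshot."""
        return [t.name for t in self.heap
                if t.xmin <= snapshot and (t.xmax is None or t.xmax > snapshot)]


table = ToyTable()
table.insert(1, 1, "Alice")
table.insert(2, 2, "Bob")
table.update(3, 2, "Robert")
table.insert(4, 3, "Charlie")
table.delete(5, 1)

for snapshot in range(1, 6):
    print(snapshot, table.visible(snapshot))
```

Note that the heap ends up with four tuples for three logical rows: the old "Bob" version stays in the heap until it is cleaned up.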
INSERT (figure: http://www.interdb.jp/pg/pgsql05.html)
DELETE (figure: http://www.interdb.jp/pg/pgsql05.html)
UPDATE (figure: http://www.interdb.jp/pg/pgsql05.html)
Table Bloat
Because each UPDATE creates a new tuple (and only marks the old tuple as deleted), a table with lots of UPDATEs quickly grows in physical size, even when the number of rows stays the same.
Index (B-tree)
Balanced search tree.
Root node and inner nodes contain keys and pointers to lower-level nodes.
Leaf nodes contain keys and pointers to the heap (ctid).
When a new tuple is added to the table, a new entry pointing to it is added to the index tree.
[Figure: B-tree whose leaf entries point to tuples in the heap by ctid]
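Write Amplifications
● Each UPDATE inserts a new tuple into the heap.
● Each index on the table gets a new index tuple pointing to the new heap tuple.
● ⇒ One logical update causes multiple physical writes.
● Extra overhead in the write-ahead log (WAL), which is carried over the network and applied on the slave (replica).
[Figure: B-tree whose leaf entries point to tuples in the heap by ctid]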
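The effect can be counted with a small model. This is a sketch only: it ignores WAL and the HOT optimization covered later in the deck, it uses a list position as a stand-in for the ctid, and the class name, column names and sample row are all hypothetical.

```python
class ToyPgTable:
    """Heap plus indexes that map a key to a list of ctids."""

    def __init__(self, indexed_columns: list[str]) -> None:
        self.heap: list[dict] = []                  # position in list = ctid
        self.indexes: dict[str, dict] = {c: {} for c in indexed_columns}
        self.writes = 0                             # heap writes + index writes

    def _append(self, row: dict) -> int:
        self.heap.append(row)
        self.writes += 1                            # one heap write
        ctid = len(self.heap) - 1
        for column, index in self.indexes.items():
            index.setdefault(row[column], []).append(ctid)
            self.writes += 1                        # one write per index
        return ctid

    def insert(self, row: dict) -> int:
        return self._append(dict(row))

    def update(self, ctid: int, changes: dict) -> int:
        # The new version lives at a new ctid, so every index needs a new
        # entry, even the indexes on columns that did not change.
        return self._append({**self.heap[ctid], **changes})


t = ToyPgTable(["id", "name", "city"])
first = t.insert({"id": 1, "name": "Alice", "city": "Hanoi", "visits": 0})
before = t.writes
t.update(first, {"visits": 1})                      # "visits" is not indexed
print("writes for one UPDATE:", t.writes - before)
```

With three indexes, changing a single non-indexed column still costs one heap write plus three index writes in this model.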
MySQL / InnoDB
● MVCC: in-place update of tuples
● Table layout: B+ tree on the primary key
● Secondary index: entries point to the primary key
MySQL Table (B+ tree)
MySQL table data is stored as a B+ tree on the primary key.
Leaf nodes contain the actual row data.
[Figure: B+ tree keyed by primary key, with row data in the leaf nodes]
MySQL Index
● MySQL: a secondary index entry stores the primary key value of the row.
● A lookup on a secondary index requires 2 index traversals: the secondary index, then the primary index.
[Figure: secondary index whose leaf entries point to the table by primary key]
https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/
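A minimal sketch of the two traversals, with Python dicts standing in for the B+ trees. `ToyInnoDBTable`, its columns and the `traversals` counter are hypothetical names for this illustration; nothing here is InnoDB code.

```python
class ToyInnoDBTable:
    """Rows live in the primary-key index; the secondary index stores primary keys."""

    def __init__(self) -> None:
        self.primary: dict[int, dict] = {}     # pk -> full row (clustered)
        self.by_name: dict[str, int] = {}      # secondary index: name -> pk
        self.traversals = 0

    def insert(self, pk: int, name: str, visits: int) -> None:
        self.primary[pk] = {"id": pk, "name": name, "visits": visits}
        self.by_name[name] = pk

    def find_by_name(self, name: str) -> dict:
        pk = self.by_name[name]                # traversal 1: secondary index
        self.traversals += 1
        row = self.primary[pk]                 # traversal 2: primary index
        self.traversals += 1
        return row

    def update_visits(self, pk: int, visits: int) -> None:
        # The row changes in place and keeps its primary key, so the
        # secondary index entry stays valid and is not touched.
        self.primary[pk]["visits"] = visits


t = ToyInnoDBTable()
t.insert(1, "Alice", 0)
t.update_visits(1, 5)
print(t.find_by_name("Alice"), "traversals:", t.traversals)
```

The read pays for an extra traversal, but an update to a non-indexed column leaves `by_name` alone, which is the trade-off the Uber use case benefits from.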
PostgreSQL vs MySQL (Uber case)
             | PostgreSQL                     | MySQL
-------------+--------------------------------+---------------------------------------------------
MVCC         | New tuple per UPDATE           | In-place update of tuple (with rollback segments)
Index lookup | Stores physical address (ctid) | By primary key
Table layout | Heap-table structure           | Primary-key table structure
PostgreSQL vs MySQL (Uber case)
                      | PostgreSQL                       | MySQL
----------------------+----------------------------------+------------------------------------------------------
select on primary key | log(n) + heap read               | log(n) + direct read
update                | Update all indexes; 1 data write | Do not update indexes on unchanged columns; 2 data writes
select on index key   | log(n) + O(1) heap read          | log(n) + log(n) primary index read
sequential scan       | Page sequential scan             | Index-order scan
Index-only Scan (Covering Index)
Index on (product_id, revenue)
SELECT SUM(revenue) FROM table WHERE product_id = 123
If the index itself contains all the columns the query needs, no heap table lookup is required.
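Visibility Map
One bit per page of the table.
VM[i] is set: all tuples in page i are visible to all current transactions.
VACUUM sets the bits; any INSERT, UPDATE or DELETE that touches page i clears VM[i].
An index-only scan can skip the heap only for pages whose bit is set.
https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid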
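How the visibility map and the index-only scan fit together can be sketched as follows. This is an illustrative model, not PostgreSQL's executor: `index_only_sum`, the sample index entries and the `dead` set are assumptions made for the example.

```python
def index_only_sum(index_entries, visibility_map, heap_is_visible):
    """Sum revenue from index entries, visiting the heap only when needed.

    index_entries   : list of (revenue, (page, offset)) for the matching key
    visibility_map  : dict page -> True if every tuple on the page is all-visible
    heap_is_visible : callback that checks one tuple in the heap
    """
    total, heap_fetches = 0, 0
    for revenue, (page, offset) in index_entries:
        if visibility_map.get(page, False):
            total += revenue                 # trust the index, skip the heap
        else:
            heap_fetches += 1                # must check the tuple's visibility
            if heap_is_visible(page, offset):
                total += revenue
    return total, heap_fetches


entries = [(10, (0, 1)), (20, (0, 2)), (30, (1, 1)), (40, (1, 2))]
vm = {0: True, 1: False}                     # page 1 was modified since VACUUM
dead = {(1, 2)}                              # a dead tuple that is still indexed
print(index_only_sum(entries, vm, lambda page, offset: (page, offset) not in dead))
```

The more pages have their bit set, the closer the scan gets to touching the index alone.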
Heap-only Tuple (HOT)
● The new tuple version is added to the heap only; no index needs to be updated
Conditions:
● The UPDATE must not change any indexed column
● The new tuple must fit in the same page as the old one
http://slideplayer.com/slide/9883483/
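The two conditions can be written as a small predicate. This is a sketch of the rule as stated on the slide, with sizes in bytes; `can_use_hot` and its parameters are hypothetical and do not correspond to a PostgreSQL function.

```python
def can_use_hot(changed_columns, indexed_columns, new_tuple_size, page_free_space):
    """Return True if an UPDATE could be done as a heap-only tuple update."""
    touches_index = bool(set(changed_columns) & set(indexed_columns))
    fits_in_page = new_tuple_size <= page_free_space
    return not touches_index and fits_in_page


indexed = ["id", "name"]
print(can_use_hot(["visits"], indexed, 64, 200))   # no indexed column, fits
print(can_use_hot(["name"], indexed, 64, 200))     # changes an indexed column
print(can_use_hot(["visits"], indexed, 64, 32))    # page has no room left
```

For the Uber use case, where almost every column is indexed, the first condition rarely holds, so most updates cannot take this shortcut.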
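VACUUM
● Cleans up dead tuples
● Freezes old tuples (prevents transaction ID wraparound)
● Plain VACUUM only marks the space of dead tuples as reusable inside the table; the data files usually do not shrink
● VACUUM FULL rewrites the table and returns the disk space to the operating system, but holds an exclusive lock that blocks both reads and writes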
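Safe & Unsafe Operations In PostgreSQL
(“Unsafe” here means the operation can hold a lock that blocks other queries for a long time on a large table.)
● Add a new column (safe)
● Add a column with a default (unsafe before PostgreSQL 11, where it rewrites the whole table)
● Add a column that is non-nullable (unsafe)
● Drop a column (safe)
● Add a default value to an existing column (safe)
● Add an index (unsafe: blocks writes, unless CREATE INDEX CONCURRENTLY is used)
http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql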
References
● Why Uber Engineering switched from PostgreSQL to MySQL - https://eng.uber.com/mysql-migration/
● PostgreSQL Documentation - https://www.postgresql.org/docs/current/static/
● The Internals of PostgreSQL - http://www.interdb.jp/pg/
● http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
● http://slideplayer.com/slide/9883483/
● https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
Physical Structure
https://www.postgresql.org/docs/current/static/storage-file-layout.html
Transaction Isolation
BEGIN TRANSACTION;
SELECT * FROM table;
SELECT pg_sleep(10);
SELECT * FROM table;
COMMIT;
Under READ COMMITTED, each statement sees the data committed before that statement started. If a concurrent transaction updates, deletes or inserts rows and commits during the sleep, the second SELECT sees those changes.
Under REPEATABLE READ, the second SELECT is guaranteed to see the rows from the first SELECT unchanged. The SQL standard still allows new rows inserted by a concurrent transaction to appear; PostgreSQL’s snapshot-based implementation does not show them either. Concurrent transactions are not blocked from changing or deleting the rows; this transaction simply does not see their changes.
Under SERIALIZABLE, the second SELECT is guaranteed to see exactly the same rows as the first: no changed, deleted or newly inserted rows from a concurrent transaction are visible. In addition, the result must be consistent with running the transactions one after another.
https://stackoverflow.com/questions/4034976/difference-between-read-commit-and-repeatable-read
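The difference between the first two levels comes down to when the snapshot is taken. The sketch below models only that aspect, under simplifying assumptions: only committed inserts exist, a counter plays the role of commit order, and SERIALIZABLE is left out. `ToyDatabase` and `ToyTransaction` are hypothetical names.

```python
class ToyDatabase:
    def __init__(self) -> None:
        self.committed: list[tuple[int, str]] = []   # (commit number, row)
        self.clock = 0

    def commit_insert(self, row: str) -> None:
        self.clock += 1
        self.committed.append((self.clock, row))

    def rows_as_of(self, snapshot: int) -> list[str]:
        return [row for seq, row in self.committed if seq <= snapshot]


class ToyTransaction:
    def __init__(self, db: ToyDatabase, level: str) -> None:
        self.db, self.level = db, level
        self.snapshot = None                          # taken at the first statement

    def select_all(self) -> list[str]:
        # READ COMMITTED takes a fresh snapshot for every statement;
        # REPEATABLE READ keeps the one taken by its first statement.
        if self.level == "READ COMMITTED" or self.snapshot is None:
            self.snapshot = self.db.clock
        return self.db.rows_as_of(self.snapshot)


db = ToyDatabase()
db.commit_insert("a")
rc = ToyTransaction(db, "READ COMMITTED")
rr = ToyTransaction(db, "REPEATABLE READ")
print(rc.select_all(), rr.select_all())   # first SELECT in each transaction
db.commit_insert("b")                     # concurrent commit during the sleep
print(rc.select_all(), rr.select_all())   # second SELECT in each transaction
```

In this model the READ COMMITTED transaction picks up row "b" on its second SELECT, while the REPEATABLE READ transaction keeps returning the result of its first one.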
PostgreSQL Processes
PostgreSQL runs as multiple processes, each handling a different job.
● postmaster process: the parent process; handles database cluster management and starts the other processes.
● Backend processes: one for each client connection.
● Background processes: stats collector, autovacuum, checkpointer, WAL writer, etc.
http://www.interdb.jp/pg/pgsql02.html
Database Cluster
● Database cluster: a collection of databases managed by one PostgreSQL server instance on a single machine.
● A database contains many database objects (schemas, tables, indexes, views, functions, etc.).
● Each object is identified by an oid (object identifier).
[Figure: a database cluster containing Database 1, Database 2, ... Database n. Objects shown: tables, indexes, views, materialized views, functions, schema, sequences, ..., role (user/group ...)]

Grokking TechTalk #20: PostgreSQL Internals 101
