The document discusses several advanced features of PostgreSQL, including:
1) Transactional DDL, which lets DDL statements run inside a transaction and roll back on failure.
2) Cost-based query optimization, which picks the cheapest execution plan, and graphical EXPLAIN output for inspecting that plan.
3) Partial indexes, function indexes, k-nearest search, views, recursive queries, and window functions, which provide powerful ways to query and analyze data.
2. Why PostgreSQL?
The world’s most advanced open source database.
Features!
Transactional DDL
Cost-based query optimizer + Graphical explain
Partial indexes
Function indexes
K-nearest search
Views
Recursive Queries
Window Functions
3. Transactional DDL
class CreatePostsMigration < ActiveRecord::Migration
  def change
    create_table :posts do |t|
      t.string :name, null: false
      t.text :body, null: false
      t.references :author, null: false
      t.timestamps null: false
    end
    add_index :posts, :title, unique: true
  end
end
Where is the problem?
4. Transactional DDL
Column title does not exist!
Table is created, index is not. Oops!
Transactional DDL FTW! The failed migration rolls back as a whole, so no half-created table is left behind.
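The all-or-nothing behaviour can be sketched with a toy in-memory catalog. This is only a model under stated assumptions: ToyCatalog, migrate, and broken_migration are hypothetical names, and nothing here reflects how PostgreSQL or ActiveRecord work internally.

```ruby
# Toy in-memory "catalog" (hypothetical); it only models all-or-nothing DDL.
class ToyCatalog
  attr_reader :tables

  def initialize
    @tables = {}
  end

  def create_table(name, columns)
    @tables[name] = { columns: columns, indexes: [] }
  end

  def add_index(table, column)
    unless @tables.fetch(table)[:columns].include?(column)
      raise ArgumentError, "column #{column} does not exist"
    end
    @tables[table][:indexes] << column
  end

  # Runs the block. With transactional: true a failure restores the old state.
  def migrate(transactional:)
    snapshot = Marshal.load(Marshal.dump(@tables)) # deep copy for rollback
    yield self
    true
  rescue ArgumentError
    @tables = snapshot if transactional
    false
  end
end

# Same mistake as on the slide: the index refers to a column that was never created.
def broken_migration(catalog)
  catalog.create_table(:posts, %i[name body author_id])
  catalog.add_index(:posts, :title)
end

plain = ToyCatalog.new
plain.migrate(transactional: false) { |c| broken_migration(c) }
puts plain.tables.key?(:posts) # table is left behind without its index

safe = ToyCatalog.new
safe.migrate(transactional: true) { |c| broken_migration(c) }
puts safe.tables.empty? # nothing is left behind
```

Without the rollback, the migration has to be repaired by hand before it can be run again.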
5. Cost-based query optimizer
What is the best plan to execute a given query?
Cost = I/O + CPU operations needed
Sequential vs. random seek
Join order
Join type (nested loop, hash join, merge join)
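A rough sketch of how such a cost comparison works. The constants are PostgreSQL's documented default planner settings. The sequential scan formula follows the planner's basic model for a single filter condition. The index scan formula is a deliberate simplification (one random page fetch per matching row) and is an assumption of this sketch, not the planner's real estimate.

```ruby
# PostgreSQL's default planner cost constants.
SEQ_PAGE_COST        = 1.0
RANDOM_PAGE_COST     = 4.0
CPU_TUPLE_COST       = 0.01
CPU_INDEX_TUPLE_COST = 0.005
CPU_OPERATOR_COST    = 0.0025

# Sequential scan: read every page in order, evaluate the filter on every row.
def seq_scan_cost(pages:, rows:)
  pages * SEQ_PAGE_COST + rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
end

# Simplified index scan (assumption): one random page fetch per matching row.
def index_scan_cost(rows:, selectivity:)
  matching = rows * selectivity
  matching * (RANDOM_PAGE_COST + CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST)
end

# Pick the plan with the lowest estimated cost.
def cheapest_plan(pages:, rows:, selectivity:)
  costs = {
    seq_scan: seq_scan_cost(pages: pages, rows: rows),
    index_scan: index_scan_cost(rows: rows, selectivity: selectivity)
  }
  costs.min_by { |_, cost| cost }.first
end

puts cheapest_plan(pages: 10_000, rows: 1_000_000, selectivity: 0.001) # selective filter
puts cheapest_plan(pages: 10_000, rows: 1_000_000, selectivity: 0.5)   # half the table
```

The same index wins for a selective filter and loses when half the table matches, because random seeks cost more than sequential reads.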
7. Partial indexes
Conditional indexes
Problem: Async job/queue table, find failed jobs
Create index on failed_at column
99% of index is never used
Solution:
CREATE INDEX idx_dj_only_failed ON delayed_jobs (failed_at)
WHERE failed_at IS NOT NULL;
smaller index
faster updates
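Why the partial index is smaller can be shown with a toy model. The queue contents below are hypothetical (1,000 jobs, 10 of them failed), and plain arrays stand in for index entries.

```ruby
# Hypothetical queue: 1,000 jobs, every 100th one has failed (failed_at is set).
Job = Struct.new(:id, :failed_at)
JOBS = (1..1000).map { |id| Job.new(id, (id % 100).zero? ? Time.at(id) : nil) }

# Full index on failed_at: one entry per row, NULLs included.
FULL_INDEX = JOBS.map { |job| [job.failed_at, job.id] }

# Partial index: only rows that satisfy failed_at IS NOT NULL.
PARTIAL_INDEX = JOBS.reject { |job| job.failed_at.nil? }
                    .map { |job| [job.failed_at, job.id] }
                    .sort

puts "full: #{FULL_INDEX.size} entries, partial: #{PARTIAL_INDEX.size} entries"
```

Rows that never fail add nothing to the partial index, which is where the smaller size and cheaper maintenance come from.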
10. Function Indexes
Problem: Suffix search
SELECT … WHERE code LIKE '%123'
“Solution”:
Add reverse_code column, populate, add triggers for updates,
create index on reverse_code column,
reverse queries WHERE reverse_code LIKE '321%'
PostgreSQL solution:
CREATE INDEX idx_reversed ON projects
  (reverse((code)::text) text_pattern_ops);
SELECT … WHERE reverse(code) LIKE reverse('%123')
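The idea behind the reversed index can be sketched in plain Ruby: a suffix match on code becomes a prefix match on reverse(code), and a prefix match is a range lookup in sorted data. The codes below are hypothetical, and a sorted array with binary search stands in for the B-tree.

```ruby
# Hypothetical project codes.
CODES = %w[A-100123 B-200456 C-300123 D-400789].freeze

# Stand-in for the expression index: entries sorted by reverse(code).
REVERSED_INDEX = CODES.map { |code| [code.reverse, code] }.sort.freeze

# Suffix search on code == prefix search on the reversed key.
def suffix_search(index, suffix)
  prefix = suffix.reverse
  # Binary search for the first key >= prefix, then read while the prefix matches.
  start = index.bsearch_index { |key, _| key >= prefix }
  return [] if start.nil?

  index.drop(start)
       .take_while { |key, _| key.start_with?(prefix) }
       .map { |_, code| code }
end

p suffix_search(REVERSED_INDEX, "123")
```

No extra column or trigger is needed, because the index itself stores the reversed value.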
13. K-nearest search
Problem: Fuzzy string matching
900K rows
Solution: Ngram/Trigram search
johno = {"  j"," jo","hno","joh","no ","ohn"}
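A minimal Ruby sketch of trigram extraction and trigram distance. It assumes the padding rule used by pg_trgm (each word is lowercased and padded with two spaces in front and one behind) and takes the distance as 1 minus the share of common trigrams. The function names are hypothetical.

```ruby
require "set"

# Set of trigrams for a string; each word is padded with "  " before and " " after.
def trigrams(text)
  text.downcase.scan(/[[:alnum:]]+/).flat_map do |word|
    padded = "  #{word} "
    (0..padded.length - 3).map { |i| padded[i, 3] }
  end.to_set
end

# Distance = 1 - |shared trigrams| / |all distinct trigrams|.
def trigram_distance(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  union_size = (ta | tb).size
  return 1.0 if union_size.zero?

  1.0 - (ta & tb).size.to_f / union_size
end

p trigrams("johno").to_a.sort
puts trigram_distance("Michl Brla", "Michal Barla").round(6) # 0.588235
```

A short distance means many shared trigrams, so misspelled names still rank close to the intended one.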
CREATE INDEX idx_trgm_name ON subjects
  USING gist (name gist_trgm_ops);
SELECT name, name <-> 'Michl Brla' AS dist
  FROM subjects ORDER BY dist ASC LIMIT 10; (312ms)
"Michal Barla" ; 0.588235
"Michal Bula" ; 0.647059
"Michal Broz" ; 0.647059
"Pavel Michl" ; 0.647059
"Michal Brna" ; 0.647059
15. Views
Query constraints are pushed down into the view's underlying SELECTs
CREATE VIEW edges AS
  SELECT subject_id AS source_id,
         connected_subject_id AS target_id FROM raw_connections
  UNION ALL
  SELECT connected_subject_id AS source_id,
         subject_id AS target_id FROM raw_connections;
SELECT * FROM edges WHERE source_id = 123;
SELECT * FROM edges WHERE source_id < 500
  ORDER BY source_id LIMIT 10;
No materialization, 2x indexed select + 1x append/merge
17. Recursive Queries
Problem: Find paths between two nodes in graph
WITH RECURSIVE search_graph(source, target, distance, path) AS
(
  SELECT source_id, target_id, 1,
         ARRAY[source_id, target_id]
  FROM edges WHERE source_id = 552506
  UNION ALL
  SELECT sg.source, e.target_id, sg.distance + 1,
         path || ARRAY[e.target_id]
  FROM search_graph sg
  JOIN edges e ON sg.target = e.source_id
  WHERE NOT e.target_id = ANY(path) AND distance < 4
)
SELECT * FROM search_graph LIMIT 100;
19. Recursive Queries
Problem: Find paths between two nodes in graph
WITH RECURSIVE search_graph(source, target, distance, path) AS
(
  SELECT source_id, target_id, 1,
         ARRAY[source_id, target_id]
  FROM edges WHERE source_id = 552506
  UNION ALL
  SELECT sg.source, e.target_id, sg.distance + 1,
         path || ARRAY[e.target_id]
  FROM search_graph sg
  JOIN edges e ON sg.target = e.source_id
  WHERE NOT e.target_id = ANY(path) AND distance < 4
)
SELECT * FROM search_graph WHERE target = 530556 LIMIT 100;
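What the recursive query computes can be traced in plain Ruby. The sketch below mirrors its two parts, the starting term and the repeated extension step, on a small hypothetical graph. RAW_CONNECTIONS, PathRow, and the node ids are made up for illustration.

```ruby
# Hypothetical undirected connections (subject_id, connected_subject_id).
RAW_CONNECTIONS = [[1, 2], [2, 3], [1, 4], [4, 3], [3, 5]].freeze

# Mirrors the edges view: every connection is visible in both directions.
EDGES = RAW_CONNECTIONS.flat_map { |a, b| [[a, b], [b, a]] }.freeze

PathRow = Struct.new(:source, :target, :distance, :path)

def search_graph(start, max_distance: 4)
  # Non-recursive term: direct neighbours of the start node.
  working = EDGES.select { |source, _| source == start }
                 .map { |source, target| PathRow.new(source, target, 1, [source, target]) }
  rows = []
  until working.empty?
    rows.concat(working)
    # Recursive term: extend rows with distance < max_distance by one edge,
    # skipping nodes already on the path (NOT target_id = ANY(path)).
    extendable = working.select { |row| row.distance < max_distance }
    working = extendable.flat_map do |row|
      EDGES.select { |source, target| source == row.target && !row.path.include?(target) }
           .map { |_, target| PathRow.new(row.source, target, row.distance + 1, row.path + [target]) }
    end
  end
  rows
end

# Same shape as the final SELECT: keep only paths that end in the wanted node.
search_graph(1).select { |row| row.target == 3 }.each { |row| p row.path }
```

The path array does double duty: it is the result and the cycle guard.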
23. Recursive queries
Graph with ~1M edges (61ms)
source; target; distance; path
530556; 552506; 2; {530556,185423,552506}
JUDr. Robert Kaliňák -> FoodRest s.r.o. -> Ing. Ján Počiatek
530556; 552506; 2; {530556,183291,552506}
JUDr. Robert Kaliňák -> FoRest s.r.o. -> Ing. Ján Počiatek
530556; 552506; 4; {530556,183291,552522,185423,552506}
JUDr. Robert Kaliňák -> FoodRest s.r.o. -> Lena Sisková -> FoRest s.r.o. -> Ing. Ján Počiatek
24. Window functions
“Aggregate functions without grouping”
avg, count, sum, rank, row_number, ntile…
Problem: Find closest nodes to a given node
Order by sum of path scores
Path score = 0.9^<distance> / log(1 + <number of paths>)
SELECT source, target FROM (
  SELECT source, target, path, distance,
         0.9 ^ distance / log(1 +
           COUNT(*) OVER (PARTITION BY distance, target)
         ) AS score
  FROM ( … ) AS paths
) AS scored_paths
GROUP BY source, target ORDER BY SUM(score) DESC;
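The scoring query can be traced step by step in Ruby. The sketch assumes that log is the base-10 logarithm, as PostgreSQL's single-argument log() is. The path rows and the rank_targets name are hypothetical.

```ruby
# Hypothetical path rows, as the inner "paths" subquery might return them.
PATHS = [
  { source: 1, target: 2, distance: 1 },
  { source: 1, target: 3, distance: 2 },
  { source: 1, target: 3, distance: 2 },
  { source: 1, target: 5, distance: 3 }
].freeze

def rank_targets(paths)
  # COUNT(*) OVER (PARTITION BY distance, target): a count per partition...
  counts = paths.group_by { |row| [row[:distance], row[:target]] }
                .transform_values(&:size)
  # ...attached to every row, so no rows are collapsed at this stage.
  scored = paths.map do |row|
    n = counts[[row[:distance], row[:target]]]
    row.merge(score: 0.9**row[:distance] / Math.log10(1 + n))
  end
  # GROUP BY source, target ORDER BY SUM(score) DESC
  scored.group_by { |row| [row[:source], row[:target]] }
        .map { |key, rows| [key, rows.sum { |r| r[:score] }] }
        .sort_by { |_, total| -total }
        .map(&:first)
end

p rank_targets(PATHS)
```

The window function differs from GROUP BY in that every path row survives and carries its partition count, which is why the outer query can still sum per-path scores.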
29. Window functions
Example: Closest to Róbert Kaliňák
"Bussines Park Bratislava a.s."
"JARABINY a.s."
"Ing. Robert Pintér"
"Ing. Ján Počiatek"
"Bratislava trade center a.s."
…
1M edges, 41ms
30. Additional resources
www.postgresql.org
Read the docs, seriously
www.explainextended.com
SQL guru blog
explain.depesz.com
First aid for slow queries
www.wikivs.com/wiki/MySQL_vs_PostgreSQL
MySQL vs. PostgreSQL comparison