Start PostgreSQL Query Optimization from Zero with a Sample Example in Your Local Environment
1. Overview
If you work on software that processes a huge amount of data, you’re highly likely to face performance issues. In most cases, DB performance depends on the distribution of the stored data, so it’s hard to practice performance tuning on your local machine. However, without any tuning experience, you may never get the chance to work on performance tuning, and without such opportunities you’ll never get better at it. I think this is one of the reasons most developers, including myself, are not very good at performance tuning.
So today I’ll share a simple example for practicing query performance tuning that you can run on your local machine.
2. Prepare DB and data in your local env
2.1. Prepare the schema
Let’s consider two very basic entities, items and categories, and a many-to-many relationship between them, item_categories:
The table schema is as follows (test_tables.sql):
DROP TABLE IF EXISTS item_categories;
DROP TABLE IF EXISTS items;
CREATE TABLE IF NOT EXISTS items
(
    id text NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    status SMALLINT NOT NULL,
    created_at TIMESTAMP NOT NULL
);
CREATE INDEX IF NOT EXISTS items_created_at ON items(created_at);
DROP TABLE IF EXISTS categories;
CREATE TABLE IF NOT EXISTS categories
(
    id text NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    created_at TIMESTAMP NOT NULL
);
CREATE TABLE item_categories
(
    id character varying NOT NULL,
    category_id text NOT NULL,
    item_id text NOT NULL,
    created_at timestamp NOT NULL,
    updated_at timestamp NOT NULL,
    CONSTRAINT fk_item FOREIGN KEY(item_id) REFERENCES items(id),
    CONSTRAINT fk_category FOREIGN KEY(category_id) REFERENCES categories(id)
);
CREATE INDEX IF NOT EXISTS item_categories_cateogory_id_item_id_idx ON item_categories (category_id, item_id);
I just randomly added some indexes.
2.2. Prepare PostgreSQL container
Let’s use the following docker-compose file (docker-compose.postgres.yml) to run a PostgreSQL container.
version: '3'
services:
  db:
    container_name: postgres
    image: postgres:14
    volumes:
      - ./docker/pg:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: test_db
    ports:
      - '5432:5432'
docker-compose -f docker-compose.postgres.yml up -d
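You can confirm the database is accepting connections before moving on (pg_isready is included in the postgres image):

docker exec postgres pg_isready -U postgres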
2.3. Load the schema to the database
Let’s load the schema prepared in the above step:
docker exec -i postgres psql -U postgres test_db < test_tables.sql
2.4. Generate data
I made a simple script (generate_data.py) to generate random data; you can find it in my repo: https://github.com/nakamasato/postgresql-training
The script generates random data for the items, categories, and item_categories relations.
You can specify
- the number of items (default: 10000)
- the number of categories (default: 3)
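If you’d rather not clone the repo, here is a minimal sketch of what such a generator could look like. This is an illustration, not the actual script: the connection parameters, value distributions, and one-category-per-item assignment below are assumptions.

# generate_data_sketch.py: a minimal sketch of a data generator for the schema
# above. The real generate_data.py in the repo may differ.
import argparse
import random
import uuid
from datetime import datetime, timedelta

import psycopg2
from psycopg2.extras import execute_values

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--items", type=int, default=10000)
parser.add_argument("-c", "--categories", type=int, default=3)
args = parser.parse_args()

def random_ts():
    # random timestamp between 2021-01-01 and roughly two years later
    return datetime(2021, 1, 1) + timedelta(seconds=random.randint(0, 2 * 365 * 24 * 3600))

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="postgres", dbname="test_db")
with conn, conn.cursor() as cur:
    # categories: category-0 .. category-(c-1)
    categories = [(f"category-{n}", f"category-{n}", random_ts()) for n in range(args.categories)]
    execute_values(cur, "INSERT INTO categories (id, name, created_at) VALUES %s", categories)

    # items: random uuid ids, status in 1..3
    # (for very large -i values you would insert in batches instead of one list)
    items = [(str(uuid.uuid4()), f"name-{n}", random.randint(1, 3), random_ts())
             for n in range(args.items)]
    execute_values(cur, "INSERT INTO items (id, name, status, created_at) VALUES %s", items)

    # item_categories: assign each item to one random category
    rels = [(str(uuid.uuid4()), random.choice(categories)[0], item[0], random_ts(), random_ts())
            for item in items]
    execute_values(cur,
                   "INSERT INTO item_categories (id, category_id, item_id, created_at, updated_at) VALUES %s",
                   rels)
conn.close()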
Next, let’s prepare the Python environment to run the script:
python -m venv venv
source venv/bin/activate
pip install psycopg2
Finally, run the script:
python generate_data.py -i 100000 -c 20
Rough data-generation times on an M1 Mac with 16GB memory: about 10s for 100,000 items, 20s for 1,000,000 items, and 1,000s for 10,000,000 items.
For the example case below, I generated 10 million items, which took around 20 minutes.
2.5. Check the data
You can check the data in the container by passing SQL via psql -c:
docker exec -i postgres psql -U postgres test_db -c "<sql>"
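For example, to confirm how many rows were generated:

docker exec -i postgres psql -U postgres test_db -c "select count(*) from items"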
items:
docker exec -i postgres psql -U postgres test_db -c "select * from items limit 10"
id | name | status | created_at
--------------------------------------+--------+--------+---------------------
2c0bb163-6bf1-475c-b4ef-0a0a6e9f91e0 | name-0 | 1 | 2021-02-12 01:23:12
e6d22047-0196-464a-908e-b4a07db53b47 | name-1 | 3 | 2022-08-01 16:57:56
4abe9fa4-abc6-42b6-9426-9e6260b855d5 | name-2 | 3 | 2021-04-15 23:28:35
028c91c2-93eb-480a-a80f-3d184811fef7 | name-3 | 3 | 2021-12-14 13:35:12
c83ca045-321c-4f52-94a3-9de02d890490 | name-4 | 3 | 2022-09-28 15:35:37
acbe49bd-74e8-4be2-90cb-1bf855769328 | name-5 | 3 | 2021-11-23 14:11:26
53ad62cc-5033-47ea-b462-2cbfd707b6e7 | name-6 | 2 | 2022-05-13 22:29:55
f2fb4124-5891-4fde-9704-8423f51187fa | name-7 | 1 | 2022-07-18 14:22:23
fe0da270-b0fb-4846-b3cf-54d5abfcb66f | name-8 | 3 | 2021-09-20 18:19:58
d57765cd-c8e2-41a8-ad15-82a3359f18c1 | name-9 | 2 | 2021-02-19 02:15:43
(10 rows)
categories:
docker exec -i postgres psql -U postgres test_db -c "select * from categories limit 10"
id | name | created_at
------------+------------+---------------------
category-0 | category-0 | 2021-08-05 07:06:01
category-1 | category-1 | 2021-11-18 19:02:48
category-2 | category-2 | 2021-04-07 15:23:46
category-3 | category-3 | 2021-05-12 15:21:52
category-4 | category-4 | 2021-06-07 13:15:27
category-5 | category-5 | 2022-01-02 13:52:22
category-6 | category-6 | 2021-12-10 15:47:41
category-7 | category-7 | 2022-01-19 18:19:37
category-8 | category-8 | 2022-03-13 17:51:36
category-9 | category-9 | 2021-11-18 14:44:20
(10 rows)
item_categories:
docker exec -i postgres psql -U postgres test_db -c "select * from item_categories limit 10"
id | category_id | item_id | created_at | updated_at
--------------------------------------+-------------+--------------------------------------+---------------------+---------------------
c8d28294-4059-428a-97a6-6dc6cad7f2a5 | category-12 | 9dd4680f-9e09-4272-ad1a-12a1f6a1324b | 2022-05-25 04:22:56 | 2022-06-30 12:59:56
54b17fef-c9c1-45eb-bbac-419e7402a355 | category-13 | d5aff8ce-3324-417e-8f91-2b113288e8cf | 2022-01-04 04:46:37 | 2022-10-27 11:29:48
2e401c19-c22e-4edd-ae5f-a83265144dd3 | category-10 | df6f9175-86ba-4bd2-a29e-caa76dc47e38 | 2021-07-29 13:19:20 | 2021-10-02 01:57:56
cd991c35-7d98-4257-9964-b12d2faae4e4 | category-12 | 1373c90f-6914-40f0-a69f-e68bbae5955d | 2022-09-25 14:46:46 | 2022-10-10 06:15:34
87e5717c-3413-4b44-ba73-9c68a247d0de | category-0 | c1c8943a-e590-4155-8e6a-518e24d58bd2 | 2021-12-14 12:52:18 | 2022-04-20 10:47:02
a731eaed-bd61-43bb-b2e4-1aedd6e06a5d | category-10 | 3b2ce1fe-32fe-4eb0-a39c-8e2a5b47113b | 2022-06-06 23:16:58 | 2022-07-25 14:40:12
f2a0b3c0-2f03-45c8-81d6-0c8e07049308 | category-2 | c644b3c4-8884-4d1a-8879-acc90575154f | 2022-08-20 13:29:04 | 2022-09-20 04:17:05
8d1219f1-0914-4aed-b29a-cea3b8c59120 | category-3 | 97cbff50-e5f9-4d57-ad16-aea42ef979c7 | 2021-04-08 12:39:54 | 2021-06-30 02:18:41
8291e37f-e90a-4d0f-aa99-25598ee1637e | category-8 | f3e3da3c-4dc8-4931-ab34-5988e4e764e9 | 2021-11-16 03:24:32 | 2022-03-18 14:00:24
bd60a8fc-2444-455e-87b0-99ebcbcc694f | category-5 | 62838307-fb28-4205-94cd-a7730df99f10 | 2022-04-10 04:22:15 | 2022-10-23 06:58:37
(10 rows)
Now everything’s ready to practice tuning PostgreSQL query performance!
3. Explain a query (baseline: 21 sec)
You might have heard that you should check the result of EXPLAIN before actually running a heavy query, but the output is not that easy for beginners to read. I won’t dive into all the details of EXPLAIN today, but we can learn some of the basics through the following example.
Let’s consider the following query:
SELECT i.id, i.created_at
FROM items i
JOIN item_categories ic
ON i.id = ic.item_id
WHERE i.status IN (1)
AND ic.category_id IN ('category-1', 'category-2', 'category-3')
AND (i.created_at < '2022-01-01' OR (i.created_at = '2022-01-01' AND i.id < 'test'))
ORDER BY i.created_at DESC, i.id DESC
LIMIT 30
This query fetches items that belong to the specified categories and were created before a specific point in time; for items created exactly at that point, the item id is used as a tie-breaker. It looks fairly complicated, but this kind of query is typical for keyset (cursor-based) pagination.
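For context, here is how such a keyset pagination query typically arises: the constants ('2022-01-01' and 'test' in our query) are the created_at and id values of the last row on the previous page. A sketch of the two-step flow (simplified to the items table only):

-- page 1: no cursor yet
SELECT id, created_at FROM items
ORDER BY created_at DESC, id DESC LIMIT 30;

-- page 2: the last row of page 1 becomes the cursor,
-- e.g. created_at = '2022-01-01' and id = 'test'
SELECT id, created_at FROM items
WHERE created_at < '2022-01-01'
   OR (created_at = '2022-01-01' AND id < 'test')
ORDER BY created_at DESC, id DESC LIMIT 30;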
We can run EXPLAIN ANALYZE on the above query:
docker exec -i postgres psql -U postgres test_db -c "EXPLAIN ANALYZE SELECT i.id, i.created_at FROM items i join item_categories ic ONi.id = ic.item_id WHERE i.status in (1) AND ic.category_id IN('category-1', 'category-2', 'category-3') AND (i.created_at < '2022-01-01' OR (i.created_at = '2022-01-01' AND i.id < 'test') ) ORDER BYi.created_at DESC, i.id DESC LIMIT30"
This query takes around 20s to complete. Bear in mind that EXPLAIN ANALYZE actually executes the query, so make sure the target query is safe to run before using it against a production database. If you don’t want to execute the query, use EXPLAIN without ANALYZE, which only gives you estimated costs.
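For example, the same query with plain EXPLAIN returns the plan and its estimates immediately, without touching the data:

docker exec -i postgres psql -U postgres test_db -c "EXPLAIN SELECT i.id, i.created_at FROM items i JOIN item_categories ic ON i.id = ic.item_id WHERE i.status IN (1) AND ic.category_id IN ('category-1', 'category-2', 'category-3') AND (i.created_at < '2022-01-01' OR (i.created_at = '2022-01-01' AND i.id < 'test')) ORDER BY i.created_at DESC, i.id DESC LIMIT 30"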
The EXPLAIN ANALYZE command above gives you both the estimated costs and the actual times:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=37.54..1134.78 rows=30 width=45) (actual time=21210.468..21210.477 rows=30 loops=1)
-> Incremental Sort (cost=37.54..4368770.52 rows=119446 width=45) (actual time=21210.467..21210.470 rows=30 loops=1)
Sort Key: i.created_at DESC, i.id DESC
Presorted Key: i.created_at
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 27kB Peak Memory: 27kB
-> Nested Loop (cost=1.00..4363395.45 rows=119446 width=45) (actual time=21185.634..21210.297 rows=31 loops=1)
-> Index Scan Backward using items_created_at on items i (cost=0.43..846283.83 rows=1762594 width=45) (actual time=21184.908..21190.561 rows=362 loops=1)
Filter: ((status = 1) AND ((created_at < '2022-01-01 00:00:00'::timestamp without time zone) OR ((created_at = '2022-01-01 00:00:00'::timestamp without time zone) AND (id < 'test'::text))))
Rows Removed by Filter: 4675811
-> Index Only Scan using item_categories_cateogory_id_item_id_idx on item_categories ic (cost=0.56..1.99 rows=1 width=37) (actual time=0.051..0.053 rows=0 loops=362)
Index Cond: ((category_id = ANY ('{category-1,category-2,category-3}'::text[])) AND (item_id = i.id))
Heap Fetches: 0
Planning Time: 4.444 ms
Execution Time: 21210.729 ms
(14 rows)
Some basics for reading EXPLAIN results:
- (cost=<start-up cost>..<total cost>): estimated costs, measured in arbitrary cost units (conventionally, disk page fetches)
- (actual time=<start-up time>..<total time>): actual time in milliseconds
In both cases, the start-up value is the cost (or time) before the first row can be returned, and the total value covers returning all the rows.
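Applying this to the top node of the plan above:

Limit (cost=37.54..1134.78 rows=30 width=45) (actual time=21210.468..21210.477 rows=30 loops=1)
-- estimated: start-up cost 37.54, total cost 1134.78, 30 rows expected
-- actual: first row after 21210.468 ms, last row after 21210.477 ms, 30 rows returned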
In the example above, the query took 21210.729 ms to complete, which is unacceptably slow as a single query.
You can easily find the most expensive node:
-> Index Scan Backward using items_created_at on items i (cost=0.43..846283.83 rows=1762594 width=45) (actual time=21184.908..21190.561 rows=362 loops=1)
Filter: ((status = 1) AND ((created_at < '2022-01-01 00:00:00'::timestamp without time zone) OR ((created_at = '2022-01-01 00:00:00'::timestamp without time zone) AND (id < 'test'::text))))
Rows Removed by Filter: 4675811
The Index Scan Backward node uses the items_created_at index, and a large number of rows are then removed by the filter condition, which makes the node inefficient: it reads a lot of data and throws most of it away.
We can easily think of an improvement: add a new composite index on the status and created_at columns of the items table.
4. Query improvements with the explain result
4.1. Add an index (21s → 6s)
We can add an index by the following command:
docker exec -i postgres psql -U postgres test_db -c "CREATE INDEX IF NOT EXISTS items_status_created_at_idx ON items(status, created_at);"
We can also check the existing indexes:
docker exec -i postgres psql -U postgres test_db -c "select tablename, indexname, indexdef from pg_indexes where schemaname = 'public' order by tablename, indexname"
tablename | indexname | indexdef
-----------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------
categories | categories_pkey | CREATE UNIQUE INDEX categories_pkey ON public.categories USING btree (id)
item_categories | item_categories_cateogory_id_item_id_idx | CREATE INDEX item_categories_cateogory_id_item_id_idx ON public.item_categories USING btree (category_id, item_id)
items | items_created_at | CREATE INDEX items_created_at ON public.items USING btree (created_at)
items | items_pkey | CREATE UNIQUE INDEX items_pkey ON public.items USING btree (id)
items | items_status_created_at_idx | CREATE INDEX items_status_created_at_idx ON public.items USING btree (status, created_at)
(5 rows)
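By the way, indexes aren’t free: each one consumes disk space and adds write overhead. psql’s \di+ meta-command also shows the size of each index:

docker exec -i postgres psql -U postgres test_db -c "\di+"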
Let’s run EXPLAIN ANALYZE again.
docker exec -i postgres psql -U postgres test_db -c "EXPLAIN ANALYZE SELECT i.id, i.created_at FROM items i join item_categories ic ON i.id = ic.item_id WHERE i.status in (1) AND ic.category_id IN ('category-1', 'category-2', 'category-3') AND (i.created_at < '2022-01-01' OR (i.created_at = '2022-01-01' AND i.id < 'test')) ORDER BY i.created_at DESC, i.id DESC LIMIT 30"
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=35.16..1061.27 rows=30 width=45) (actual time=6413.838..6413.844 rows=30 loops=1)
-> Incremental Sort (cost=35.16..4085502.22 rows=119446 width=45) (actual time=6413.837..6413.839 rows=30 loops=1)
Sort Key: i.created_at DESC, i.id DESC
Presorted Key: i.created_at
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 27kB Peak Memory: 27kB
-> Nested Loop (cost=1.00..4080127.15 rows=119446 width=45) (actual time=6400.943..6413.732 rows=31 loops=1)
-> Index Scan Backward using items_status_created_at_idx on items i (cost=0.43..563015.53 rows=1762594 width=45) (actual time=6400.772..6402.429 rows=362 loops=1)
Index Cond: (status = 1)
Filter: ((created_at < '2022-01-01 00:00:00'::timestamp without time zone) OR ((created_at = '2022-01-01 00:00:00'::timestamp without time zone) AND (id < 'test'::text)))
Rows Removed by Filter: 1559539
-> Index Only Scan using item_categories_cateogory_id_item_id_idx on item_categories ic (cost=0.56..1.99 rows=1 width=37) (actual time=0.029..0.030 rows=0 loops=362)
Index Cond: ((category_id = ANY ('{category-1,category-2,category-3}'::text[])) AND (item_id = i.id))
Heap Fetches: 0
Planning Time: 2.416 ms
Execution Time: 6413.997 ms
(15 rows)
Now the execution time is down to about 6 seconds, which is a huge improvement. What’s different in the explain result?
-> Index Scan Backward using items_status_created_at_idx on items i (cost=0.43..563015.53 rows=1762594 width=45) (actual time=6400.772..6402.429 rows=362 loops=1)
Index Cond: (status = 1)
Filter: ((created_at < '2022-01-01 00:00:00'::timestamp without time zone) OR ((created_at = '2022-01-01 00:00:00'::timestamp without time zone) AND (id < 'test'::text)))
Rows Removed by Filter: 1559539
The newly added index items_status_created_at_idx is used, and Rows Removed by Filter dropped from 4675811 to 1559539 (roughly one third), which means much less data is read and thrown away.
But 1.5 million rows removed by the filter is still a lot. Let’s improve it further.
4.2. Use row-wise comparison (6s → 34ms)
From the explain result, the slow part is the following OR condition in the WHERE clause:
(i.created_at < '2022-01-01' OR (i.created_at = '2022-01-01' AND i.id < 'test') )
We can confirm this by breaking the OR down into two separate queries:
- one with only i.created_at < '2022-01-01'
- one with only (i.created_at = '2022-01-01' AND i.id < 'test')
The first one (35.976 ms):
docker exec -i postgres psql -U postgres test_db -c "EXPLAIN ANALYZE SELECT i.id, i.created_at FROM items i join item_categories ic ON i.id = ic.item_id WHERE i.status in (1) AND ic.category_id IN ('category-1', 'category-2', 'category-3') AND i.created_at < '2022-01-01' ORDER BY i.created_at DESC, i.id DESC LIMIT 30"
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=34.56..1042.56 rows=30 width=45) (actual time=35.836..35.841 rows=30 loops=1)
-> Incremental Sort (cost=34.56..4013407.91 rows=119446 width=45) (actual time=35.835..35.837 rows=30 loops=1)
Sort Key: i.created_at DESC, i.id DESC
Presorted Key: i.created_at
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 27kB Peak Memory: 27kB
-> Nested Loop (cost=1.00..4008032.84 rows=119446 width=45) (actual time=0.413..35.743 rows=31 loops=1)
-> Index Scan Backward using items_status_created_at_idx on items i (cost=0.43..490921.22 rows=1762594 width=45) (actual time=0.213..9.452 rows=362 loops=1)
Index Cond: ((status = 1) AND (created_at < '2022-01-01 00:00:00'::timestamp without time zone))
-> Index Only Scan using item_categories_cateogory_id_item_id_idx on item_categories ic (cost=0.56..1.99 rows=1 width=37) (actual time=0.068..0.070 rows=0 loops=362)
Index Cond: ((category_id = ANY ('{category-1,category-2,category-3}'::text[])) AND (item_id = i.id))
Heap Fetches: 0
Planning Time: 3.352 ms
Execution Time: 35.976 ms
(13 rows)
The second one (0.220 ms):
docker exec -i postgres psql -U postgres test_db -c "EXPLAIN ANALYZE SELECT i.id, i.created_at FROM items i join item_categories ic ON i.id = ic.item_id WHERE i.status in (1) AND ic.category_id IN ('category-1', 'category-2', 'category-3') AND i.created_at = '2022-01-01' AND i.id < 'test' ORDER BY i.created_at DESC, i.id DESC LIMIT 30"
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22.20..22.20 rows=1 width=45) (actual time=0.129..0.130 rows=0 loops=1)
-> Sort (cost=22.20..22.20 rows=1 width=45) (actual time=0.129..0.129 rows=0 loops=1)
Sort Key: i.id DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.00..22.19 rows=1 width=45) (actual time=0.088..0.088 rows=0 loops=1)
-> Index Scan using items_status_created_at_idx on items i (cost=0.43..8.46 rows=1 width=45) (actual time=0.087..0.087 rows=0 loops=1)
Index Cond: ((status = 1) AND (created_at = '2022-01-01 00:00:00'::timestamp without time zone))
Filter: (id < 'test'::text)
-> Index Only Scan using item_categories_cateogory_id_item_id_idx on item_categories ic (cost=0.56..13.72 rows=1 width=37) (never executed)
Index Cond: ((category_id = ANY ('{category-1,category-2,category-3}'::text[])) AND (item_id = i.id))
Heap Fetches: 0
Planning Time: 2.150 ms
Execution Time: 0.220 ms
(13 rows)
Each query on its own runs very fast, but combining the conditions with OR makes the query very slow: as the plan shows, the planner can only push status = 1 into the index condition and has to evaluate the OR as a filter on every row it reads in sort order.
Now we can focus on improving the OR expression. For this particular pattern, PostgreSQL supports row-wise comparison (row constructor comparison):
A comparison like ROW(a,b) < ROW(c,d) was implemented as a < c AND b < d whereas the correct behavior is equivalent to a < c OR (a = c AND b < d).
This means the OR part in the example can be replaced by a row-wise comparison:
Before:
(i.created_at < '2022-01-01' OR (i.created_at = '2022-01-01' AND i.id < 'test') )
After:
(i.created_at, i.id) < ('2022-01-01', 'test')
a < c OR (a = c AND b < d) is equivalent to ROW(a,b) < ROW(c,d), where:
- a: i.created_at
- b: i.id
- c: '2022-01-01'
- d: 'test'
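You can verify the row-wise comparison semantics directly in psql:

docker exec -i postgres psql -U postgres test_db -c "SELECT ROW('2022-01-01'::timestamp, 'abc') < ROW('2022-01-01'::timestamp, 'test') AS tie_broken_by_id, ROW('2021-12-31'::timestamp, 'zzz') < ROW('2022-01-01'::timestamp, 'aaa') AS decided_by_first_column"

Both return true: the second column is compared only when the first columns are equal.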
Let’s try the new query:
docker exec -i postgres psql -U postgres test_db -c "EXPLAIN ANALYZE SELECT i.id, i.created_at FROM items i JOIN item_categories ic ON i.id = ic.item_id WHERE i.status IN (1) AND ic.category_id IN ('category-1', 'category-2', 'category-3') AND (i.created_at, i.id) < ('2022-01-01', 'test') ORDER BY i.created_at DESC, i.id DESC LIMIT 30"
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=34.63..1044.84 rows=30 width=45) (actual time=34.256..34.261 rows=30 loops=1)
-> Incremental Sort (cost=34.63..4022220.90 rows=119446 width=45) (actual time=34.255..34.257 rows=30 loops=1)
Sort Key: i.created_at DESC, i.id DESC
Presorted Key: i.created_at
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 27kB Peak Memory: 27kB
-> Nested Loop (cost=1.00..4016845.83 rows=119446 width=45) (actual time=0.283..34.173 rows=31 loops=1)
-> Index Scan Backward using items_status_created_at_idx on items i (cost=0.43..499734.21 rows=1762594 width=45) (actual time=0.099..8.806 rows=362 loops=1)
Index Cond: ((status = 1) AND (created_at <= '2022-01-01 00:00:00'::timestamp without time zone))
Filter: (ROW(created_at, id) < ROW('2022-01-01 00:00:00'::timestamp without time zone, 'test'::text))
-> Index Only Scan using item_categories_cateogory_id_item_id_idx on item_categories ic (cost=0.56..1.99 rows=1 width=37) (actual time=0.066..0.068 rows=0 loops=362)
Index Cond: ((category_id = ANY ('{category-1,category-2,category-3}'::text[])) AND (item_id = i.id))
Heap Fetches: 0
Planning Time: 2.184 ms
Execution Time: 34.374 ms
(14 rows)
Now you can see the query completes in 34 ms!!
You can also see the change in Index Cond and Filter: there’s no longer any Rows Removed by Filter.
-> Index Scan Backward using items_status_created_at_idx on items i (cost=0.43..499734.21 rows=1762594 width=45) (actual time=0.099..8.806 rows=362 loops=1)
Index Cond: ((status = 1) AND (created_at <= '2022-01-01 00:00:00'::timestamp without time zone))
Filter: (ROW(created_at, id) < ROW('2022-01-01 00:00:00'::timestamp without time zone, 'test'::text))
References
For more details about row constructor comparison, please read the PostgreSQL documentation.
The query performance has been improved from 20 sec to 30 ms!
5. Summary
In this post, I shared a simple example of query optimization in your local environment: creating a schema in a local PostgreSQL container, generating a sample dataset, executing EXPLAIN, breaking down the slow part of the sample query, and finally improving its performance from 20s to 30ms.
I hope you can also try query optimization with the sample example or any other queries in your local environment.
This is just the beginning of the query performance improvement. I’ll keep trying to find a better way to get more familiar with query optimization.
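If you want to clean up the local environment afterwards, stop the container and delete the data directory (the compose file and volume path from section 2.2):

docker-compose -f docker-compose.postgres.yml down
rm -rf ./docker/pg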