Query Optimization Tips – PostgreSQL 12 High Performance Guide (Part 9/12)

PostgreSQL Query Optimizations

Key Takeaways

  • Question the semantic correctness of a statement first, before attacking the performance problem
  • Avoid SELECT *, ORDER BY, and DISTINCT unless you really need them and there is no other way
  • Explore PostgreSQL features such as the CASE syntax and parallel queries to reduce query execution time

Understanding and fixing joins

Joins are used on a regular basis, and they are therefore highly relevant to maintaining good performance. In this section, we will learn how to write joins well.

Doing joins right

Before we get into optimizing joins, it is essential to look at some of the most common problems that come up with joins and which of them should ring a bell for you.

Here is an example of a simple table structure to demonstrate how joins work:

test=# CREATE TABLE t1 (aid int);
CREATE TABLE
 
test=# CREATE TABLE t2 (bid int);
CREATE TABLE
   
test=# INSERT INTO t1 VALUES (1), (2), (3);
INSERT 0 3
   
test=# INSERT INTO t2 VALUES (2), (3), (4);
INSERT 0 3

Now we are doing a simple outer join:

test=# SELECT * FROM t1 LEFT JOIN t2 ON (aid = bid);
 aid | bid 
-----+-----
   1 |    
   2 |   2
   3 |   3
(3 rows)

You can see that PostgreSQL takes all rows from the left-hand side and, from the right-hand side, lists only the rows matching the join condition.

However, consider the following example which might come as a surprise:

test=# SELECT * FROM t1 LEFT JOIN t2 ON (aid = bid AND bid = 2);
 aid | bid 
-----+-----
   1 |    
   2 |   2
   3 |    
(3 rows)

Almost everyone assumes that only one row will be returned, but the number of rows does not decrease, and if you are not aware of that, it can lead to hidden issues. Consider, for example, the following query, which is meant to apply an aggregate to a single row.

test=# SELECT avg(aid), avg(bid) 
           FROM t1 LEFT JOIN t2
                ON (aid = bid AND bid = 2);
        avg         |        avg         
--------------------+--------------------
 2.0000000000000000 | 2.0000000000000000
(1 row)

Most people assume that the aggregates are calculated over a single row. However, as stated earlier, this is not the case. Queries like this can also turn into performance problems, but performance aside, what we are really facing here is a semantic issue. It happens on a regular basis that people writing outer joins don’t really know what they are asking PostgreSQL to do. Consequently, the advice is to always double-check the semantic correctness of an outer join before attacking the performance problem reported by the client.
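
If the intention really was to compute the averages over a single row, the filter belongs in the WHERE clause rather than in the join condition. A minimal sketch of that variant (not part of the original listing):

-- filtering after the join: the rows with a NULL bid are removed,
-- so the aggregates really are computed over a single row
SELECT avg(aid), avg(bid)
FROM t1 LEFT JOIN t2 ON (aid = bid)
WHERE bid = 2;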

Forcing Join Order

As the number of joins increases, overall query cost goes up rapidly and the number of possible join orders grows exponentially. Controlling that complexity is a major component both of query tuning and of ongoing improvement in the PostgreSQL optimizer. Every use of a sequential scan or an index lookup is magnified as those results are then joined to additional tables.

During the planning process, PostgreSQL tries to check all possible join orders. In many cases, this can be quite expensive because there can be a great many permutations, which naturally slows planning down.

Consider the following example:

test=# SELECT * FROM tab1, tab2, tab3 WHERE tab1.id = tab2.id
        AND tab2.id = tab3.id; 
test=# SELECT * FROM tab1 CROSS JOIN tab2 CROSS JOIN tab3
        WHERE tab1.id = tab2.id
            AND tab2.id = tab3.id;
test=# SELECT * FROM tab1 JOIN (tab2 JOIN tab3
        ON (tab2.id = tab3.id)) ON (tab1.id = tab2.id);

Semantically, these three queries are identical. The first query uses implicit joins, the second spells them out as CROSS JOINs with the join conditions in the WHERE clause, and the last one uses only explicit joins.

PostgreSQL searches for an optimal plan, which is time-consuming and can still end in a bad decision. If you know the right way to join the tables efficiently and want to reduce planning time, you can force the order of execution: you can tell the optimizer to use exactly the order in which you wrote a series of explicit JOIN operations.

Also, you will need to reduce the parameter join_collapse_limit from its default value of eight. By lowering this parameter to one, you will prevent all join reordering. The default value is reasonably good for normal queries. However, if your query contains a very high number of joins, playing around with this setting can reduce planning time considerably. Reducing planning time can be essential to maintaining good throughput.

test=# SET join_collapse_limit = 1;

Alternative plans won’t be considered. This can be useful in two main contexts. First, if query planning time is long for a particularly complicated join, discovering the usual order of execution and making it explicit in the query can save significant planning time. Second, in cases where the order selected by the optimizer was poor, you can use this form of ordering as a hint.

To see how the join_collapse_limit variable changes the plan you can test via this simple query:

test=# EXPLAIN WITH x AS
       (
           SELECT *
           FROM generate_series(1, 1000) AS id
       )
       SELECT *
       FROM x AS a
           JOIN x AS b ON (a.id = b.id)
           JOIN x AS c ON (b.id = c.id)
           JOIN x AS d ON (c.id = d.id);

Try running the query with different settings and see how the plan changes.
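
A sketch of that comparison (the exact plans depend on your version and settings):

-- with reordering disabled, the joins are executed exactly as written
SET join_collapse_limit = 1;
EXPLAIN WITH x AS (SELECT * FROM generate_series(1, 1000) AS id)
SELECT *
FROM x AS a
    JOIN x AS b ON (a.id = b.id)
    JOIN x AS c ON (b.id = c.id)
    JOIN x AS d ON (c.id = d.id);

-- back to the default of 8, letting the planner reorder joins again
RESET join_collapse_limit;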

Queries to avoid

Using SELECT *

We should not use SELECT * unless we really need all the columns of the table. We can reduce the response time by explicitly listing only the columns we need.

Consider the following example. In both cases, an index is used. However, in the case where SELECT * is used, the index scan is followed by a heap fetch. In the other case, PostgreSQL can get the data from the index alone and does not have to access the table.

Let’s create a test example.

test=# CREATE TABLE orders ( 
        order_no serial NOT NULL,
        descr varchar,
        created_ts timestamp,
        modified_ts timestamp);
CREATE TABLE

test=# INSERT into orders(order_no, descr,
             created_ts, modified_ts)
             SELECT generate_series(1,5000000) order_no, 
             repeat('The description for the product',10), now(), now();
INSERT 0 5000000

test=# CREATE INDEX or_idx ON orders(order_no);

And now we can test the difference:

test=# EXPLAIN ANALYZE SELECT * FROM orders WHERE order_no < 1000;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using or_idx on orders  (cost=0.43..74.81 rows=936 width=334) (actual time=0.020..0.696 rows=1003 loops=1)
   Index Cond: (order_no < 1000)
 Planning Time: 0.127 ms
 Execution Time: 0.863 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT order_no FROM orders WHERE order_no < 1000;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using or_idx on orders  (cost=0.43..28.81 rows=936 width=4) (actual time=0.018..0.221 rows=1003 loops=1)
   Index Cond: (order_no < 1000)
   Heap Fetches: 0
 Planning Time: 0.109 ms
 Execution Time: 0.364 ms
(5 rows)

The plan clearly shows an Index Only Scan, as seen above. It might be necessary to VACUUM ANALYZE the table so that index-only scans work as expected: the visibility map has to be reasonably up to date, otherwise heap fetches are still needed.
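
For the table above, that amounts to something like this (a minimal sketch; autovacuum will usually take care of it eventually):

-- refresh the statistics and the visibility map so that
-- index-only scans can report Heap Fetches: 0
VACUUM ANALYZE orders;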

Using ORDER BY

Sometimes the output has to be ordered on one or more columns, for example the transactions we see when we access our bank accounts online: the data is usually sorted by transaction date. However, there are cases when the data needs to be sorted on some other columns. When we sort the result of a query, the entire dataset that was selected has to be sorted. Usually this happens in memory; when memory is not sufficient, it happens on disk, with a huge performance penalty. So, we should not use ORDER BY unless we really have to.

Having an index on the ORDER BY columns can reduce the sort overhead, as the data will be retrieved in the sorted order from the index.     

In the following example, there is no index on the x column. In the first query, we use ORDER BY on x; as expected, PostgreSQL does a sequential scan plus a sort. We then create an index on the y column, and in the second query PostgreSQL uses that index, even though we are not filtering on any column.
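
The test_order_by table itself is not created anywhere in this listing; a possible setup, sized roughly to match the row counts in the plans below (an assumption, not part of the original example):

-- hypothetical setup: two integer columns, 50 million rows, no indexes yet
CREATE TABLE test_order_by (x int, y int);
INSERT INTO test_order_by
SELECT (random() * 100000)::int, id
FROM generate_series(1, 50000000) AS id;
ANALYZE test_order_by;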

test=# EXPLAIN SELECT x,y FROM test_order_by
         ORDER BY x LIMIT 10;
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Limit  (cost=880773.41..880774.58 rows=10 width=8)
   ->  Gather Merge  (cost=880773.41..5742225.49 rows=41666680 width=8)
         Workers Planned: 2
         ->  Sort  (cost=879773.39..931856.74 rows=20833340 width=8)
               Sort Key: x
               ->  Parallel Seq Scan on test_order_by  (cost=0.00..429572.40 rows=20833340 width=8)

test=# CREATE INDEX y_idx ON test_order_by (y);

test=# EXPLAIN SELECT x,y FROM test_order_by
          ORDER BY y LIMIT 10;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Limit  (cost=0.56..1.00 rows=10 width=8)
   ->  Index Scan using y_idx on test_order_by  (cost=0.56..2183351.23 rows=50000000 width=8)
(2 rows)

Using DISTINCT

Sometimes we are not sure about the joins and filters we have used, so, to make sure that there are no duplicates, we add the DISTINCT clause. However, this adds quite a bit of overhead: when we add DISTINCT to a query, PostgreSQL adds an extra node to the plan, as in the fragment below, and the actual time goes up significantly.

Unique (cost=0.29..402.33 rows=2 width=6) (actual time=0.030..44.960 rows=1 loops=1) 
      Output: first_name

Using DISTINCT to avoid duplicates might just be hiding a problem. The right approach is to check whether there really are duplicates and to make sure that the joins are properly formed.
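
A quick way to check for duplicates before reaching for DISTINCT is a grouped count. The sketch below assumes a hypothetical table t containing the first_name column from the plan fragment above:

-- t is a placeholder for the table in question;
-- list only the values that really occur more than once
SELECT first_name, count(*)
FROM t
GROUP BY first_name
HAVING count(*) > 1;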

Reducing query time

Reducing the number of SQL statements

Let’s say we want to generate a report on bank transactions that categorizes them into groups: how many transactions had amounts ranging from 0 to 100, how many had amounts ranging from 101 to 1,000, and so on. One option to get the data is to write multiple SQL statements with different filter criteria, as shown here:

pgbench=# EXPLAIN ANALYZE SELECT COUNT(aid) bl_100 FROM pgbench_accounts WHERE abalance <= 100
pgbench-# UNION
pgbench-# SELECT COUNT(aid) bl_100 FROM pgbench_accounts WHERE abalance > 100 AND abalance <= 1000 
pgbench-# UNION
pgbench-# SELECT COUNT(aid) bl_100 FROM pgbench_accounts WHERE abalance > 1000;

- - - -

Planning Time: 0.307 ms
Execution Time: 652.270 ms

If we want to group the transactions into three categories, we would need to execute three queries. While this approach fetches the data we want, it involves scanning the table three times. We can avoid this by writing the following query:

pgbench=# EXPLAIN ANALYZE SELECT COUNT(CASE WHEN abalance <= 100 THEN 1 ELSE null END) bl_100,
             COUNT(CASE WHEN abalance BETWEEN 101 AND 1000 THEN 1 ELSE null END) bg_100,
             COUNT(CASE WHEN abalance > 1000 THEN 1 ELSE null END) bg_1000
          FROM pgbench_accounts;
          
- - - -
                                                             
 Planning Time: 0.127 ms
 Execution Time: 356.175 ms
(8 rows)

Here, we can see that the counts can be retrieved correctly in both cases. However, the query using the CASE syntax has to scan the table only once, and the execution time goes down accordingly.
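
As a side note, PostgreSQL (since version 9.4) also offers the aggregate FILTER clause, which expresses the same idea a bit more compactly; a sketch of the equivalent query:

-- one scan of the table, one count per bucket
SELECT count(*) FILTER (WHERE abalance <= 100)                     AS bl_100,
       count(*) FILTER (WHERE abalance > 100 AND abalance <= 1000) AS bg_100,
       count(*) FILTER (WHERE abalance > 1000)                     AS bg_1000
FROM pgbench_accounts;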

Parallel queries

Starting with version 9.6, PostgreSQL supports parallel queries. In this section, we will take a look at how parallelism works and what can be done to speed things up. As an example, we will continue to use the pgbench database.

pgbench=# EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=106894.31..106894.32 rows=1 width=8)
   ->  Gather  (cost=106894.10..106894.31 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=105894.10..105894.11 rows=1 width=8)
               ->  Parallel Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.43..100685.77 rows=2083333 width=0)
(5 rows)

Let us take a look at the execution plan of the query. First, PostgreSQL performs a parallel index-only scan. This implies that PostgreSQL will use more than one CPU core to process the table, and each worker produces a partial aggregate. The job of the gather node is to collect the data and pass it on to the parent node; the gather node is therefore the end of parallelism. It is important to mention that parallelism is currently never nested: there can never be a gather node inside a gather node. In this example, PostgreSQL has decided on two worker processes. Why only two?

The main configuration variable to handle parallelism is max_parallel_workers_per_gather.

pgbench=# SHOW max_parallel_workers_per_gather;
 max_parallel_workers_per_gather 
---------------------------------
 2
(1 row)

max_parallel_workers_per_gather limits the number of worker processes allowed. However, for a small table, PostgreSQL will never use parallelism. The size of a table has to be at least 8 MB as defined by the following config setting:

pgbench=# SHOW min_parallel_table_scan_size;
 min_parallel_table_scan_size 
------------------------------
 8MB
(1 row)

On top of that, the size of the table has to triple for PostgreSQL to add one more worker process. In other words, to get three additional workers you need at least 27 times as much data. The reasoning behind this rule is that if the size of your database grows 100-fold, the storage system usually does not become 100 times faster, so the number of cores worth using is somewhat limited.

To check the size of the table we can run:

pgbench=# \d+
                            List of relations
 Schema |       Name       | Type  |  Owner   |  Size   | Description 
--------+------------------+-------+----------+---------+-------------
 public | pgbench_accounts | table | postgres | 649 MB  | 
 public | pgbench_branches | table | postgres | 40 kB   | 
 public | pgbench_history  | table | postgres | 0 bytes | 
 public | pgbench_tellers  | table | postgres | 56 kB   | 
(4 rows)

So, if we want to increase the number of parallel workers, the first thing we can do is increase the max_parallel_workers_per_gather parameter.

pgbench=# SET max_parallel_workers_per_gather TO 10;
SET

In this case, the query will still run with only five workers, based on the size of the table (649 MB is roughly 8 MB × 3^4, which yields 1 + 4 = 5 workers). However, there are cases in which you want the number of cores used for a certain table to be a lot higher. Imagine a database of 1 TB and only a single user: this single user could use up all the CPUs without harming anyone else.

To set the number of workers for a table explicitly, ALTER TABLE can be used to override the rule just described. However, max_parallel_workers_per_gather will still serve as the upper limit.

pgbench=# ALTER TABLE pgbench_accounts  SET (parallel_workers = 9);
ALTER TABLE
pgbench=# EXPLAIN ANALYZE SELECT count(*) FROM pgbench_accounts;
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=87797.81..87797.82 rows=1 width=8) (actual time=378.612..378.612 rows=1 loops=1)
   ->  Gather  (cost=87796.88..87797.79 rows=9 width=8) (actual time=378.528..381.309 rows=8 loops=1)
         Workers Planned: 9
         Workers Launched: 7
         ->  Partial Aggregate  (cost=86796.88..86796.89 rows=1 width=8) (actual time=366.228..366.228 rows=1 loops=8)
               ->  Parallel Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.43..85407.99 rows=555556 width=0) (actual time=0.670..319.404 rows=625000 loops=8)
                     Heap Fetches: 0
 Planning Time: 0.095 ms
 Execution Time: 381.379 ms
(9 rows)

As you can see, only seven workers were launched, although nine were planned. What is the reason? In this example, two more variables come into play:

pgbench=# SHOW max_worker_processes;
 max_worker_processes 
----------------------
 8
(1 row)

pgbench=# SHOW max_parallel_workers;
 max_parallel_workers 
----------------------
 8
(1 row)

The first parameter tells PostgreSQL how many background worker processes are available overall, while max_parallel_workers says how many of those are available for parallel queries. Why are there two parameters? Background workers are not used only for parallel queries; they can serve other purposes as well, which is why the developers decided on two separate parameters.
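
If those limits turn out to be too low, they can be raised; a sketch (the values are only an example, and max_worker_processes only takes effect after a server restart):

-- raise the global worker limits; max_worker_processes requires a restart,
-- max_parallel_workers can be changed without one
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 16;
SELECT pg_reload_conf();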

Conclusion

In this section, we looked at a number of tips for optimizing PostgreSQL queries and for handling joins. We also focused on understanding what makes a query perform poorly.
