Explaining the inexplicable. Part 5

We continue to prepare for PG Day’16 and to introduce you to interesting features.

In the previous posts of this series I talked about how to read EXPLAIN output and what each line (operation/node) means.

In the final post, I will try to explain why Postgres chooses "Operation X" and not "Operation Y".



You may have heard that the PostgreSQL planner chooses operations based on statistics. What are these statistics?

Let's imagine the simplest possible scenario:

SELECT * FROM table WHERE column = some_value;

If all rows in the table have the same value some_value, then using an index on that column (even if one exists) makes no sense.

On the other hand, if the values in the column are unique (or nearly unique), using an index is a great idea.

Let's see what happens:

create table test ( all_the_same int4, almost_unique int4 );
CREATE TABLE

insert into test ( all_the_same, almost_unique )
select 123, random() * 1000000 from generate_series(1,100000);
INSERT 0 100000

So, I have a table with 100,000 rows, in which the column all_the_same always has the same value (123), and the column almost_unique is, as the name implies, almost unique:

select count(*), count(distinct almost_unique) from test;
count | count 
--------+-------
100000 | 95142
(1 row)

Now, to level the playing field, I'll create two simple indexes:

create index i1 on test (all_the_same);
CREATE INDEX

create index i2 on test (almost_unique);
CREATE INDEX

OK, the test setup is ready. So what about the plans?

explain select * from test where all_the_same = 123;
QUERY PLAN 
------------------------------------------------------------
Seq Scan on test (cost=0.00..1693.00 rows=100000 width=8)
Filter: (all_the_same = 123)
(2 rows)

explain select * from test where almost_unique = 123;
QUERY PLAN 
---------------------------------------------------------------
Index Scan using i2 on test (cost=0.29..8.31 rows=1 width=8)
Index Cond: (almost_unique = 123)
(2 rows)

As you can see, Postgres made a wise choice. But the "rows=" estimate is interesting. How does it know how many rows the query is likely to return?
The answer lies in ANALYZE (or VACUUM ANALYZE).
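If you are following along and the table has not been analyzed yet (autovacuum usually takes care of this eventually, but it may not have run on a freshly loaded table), you can gather the statistics by hand; a minimal example for our test table:

analyze test;
ANALYZE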

When you run ANALYZE on a table, Postgres takes a "random sample" (more on that in a moment) and gathers some statistics. What statistics, where are they stored, and can we see them? Of course we can:

select * from pg_statistic where starelid = 'test'::regclass;
-[ RECORD 1 ]-----------------------------------------------------------------------------
starelid | 16882
staattnum | 1
stainherit | f
stanullfrac | 0
stawidth | 4
stadistinct | 1
stakind1 | 1
stakind2 | 3
stakind3 | 0
stakind4 | 0
stakind5 | 0
staop1 | 96
staop2 | 97
staop3 | 0
staop4 | 0
staop5 | 0
stanumbers1 | {1}
stanumbers2 | {1}
stanumbers3 | [null]
stanumbers4 | [null]
stanumbers5 | [null]
stavalues1 | {123}
stavalues2 | [null]
stavalues3 | [null]
stavalues4 | [null]
stavalues5 | [null]
-[ RECORD 2 ]-----------------------------------------------------------------------------
starelid | 16882
staattnum | 2
stainherit | f
stanullfrac | 0
stawidth | 4
stadistinct | -0.92146
stakind1 | 1
stakind2 | 2
stakind3 | 3
stakind4 | 0
stakind5 | 0
staop1 | 96
staop2 | 97
staop3 | 97
staop4 | 0
staop5 | 0
stanumbers1 | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
stanumbers2 | [null]
stanumbers3 | {-0.000468686}
stanumbers4 | [null]
stanumbers5 | [null]
stavalues1 | {21606,27889,120502,289914,417495,951355,283,1812,3774,6028,6229,10372,12234,13291,18309,18443,21758,22565,26634,28392,28413,31208,32890,36563,39277,40574,44527,49954,53344,53863,56492,56715,60856,62993,64294,65275,65355,68353,71194,74718,77205,82096,82783,84764,85301,87498,90990,94043,97304,98779,101181,103700,103889,106288,108562,110796,113154,117850,121578,122643,123874,126299,129236,129332,129512,134430,134980,136987,137368,138175,139001,141519,142934,143432,143707,144501,148633,152481,154327,157067,157799,162437,164072,164337,165942,167611,170319,171047,177383,184134,188702,189005,191786,192718,196330,197851,199457,202652,202689,205983}

stavalues3 | [null]
stavalues4 | [null]
stavalues5 | [null]

This table (pg_statistic) is, of course, described in the documentation, but it is still rather cryptic. You can find a very precise explanation in the source code, but that is (usually) not the best option.

Fortunately, there is a view over this table that presents the same data in a more readable form:

select * from pg_stats where tablename = 'test';
-[ RECORD 1 ]----------+------------------------------------------------------------------
schemaname | public
tablename | test
attname | all_the_same
inherited | f
null_frac | 0
avg_width | 4
n_distinct | 1
most_common_vals | {123}
most_common_freqs | {1}
histogram_bounds | [null]
correlation | 1
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]
-[ RECORD 2 ]----------+------------------------------------------------------------------
schemaname | public
tablename | test
attname | almost_unique
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -0.92146
most_common_vals | {21606,27889,120502,289914,417495,951355,283,1812,3774,6028,6229,10372,12234,13291,18309,18443,21758,22565,26634,28392,28413,31208,32890,36563,39277,40574,44527,49954,53344,53863,56492,56715,60856,62993,64294,65275,65355,68353,71194,74718,77205,82096,82783,84764,85301,87498,90990,94043,97304,98779,101181,103700,103889,106288,108562,110796,113154,117850,121578,122643,123874,126299,129236,129332,129512,134430,134980,136987,137368,138175,139001,141519,142934,143432,143707,144501,148633,152481,154327,157067,157799,162437,164072,164337,165942,167611,170319,171047,177383,184134,188702,189005,191786,192718,196330,197851,199457,202652,202689,205983}
most_common_freqs | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
histogram_bounds | {2,10560,20266,31061,40804,50080,59234,69240,79094,89371,99470,109557,119578,130454,140809,152052,162656,173855,183914,194263,204593,214876,224596,233758,243246,253552,264145,273855,283780,294475,303972,314544,324929,335008,346169,356505,367395,376639,387302,397004,407093,416615,426646,436146,445701,455588,466463,475910,485228,495434,505425,515853,525374,534824,545387,554794,563591,573721,584021,593368,602935,613238,623317,633947,643431,653397,664177,673976,684042,694791,703922,714113,724602,735848,745596,754477,764171,772535,781924,791652,801703,812487,822196,831618,841665,850722,861532,872067,881570,891654,901595,910975,921698,931785,940716,950623,960551,970261,979855,989540,999993}
correlation | -0.000468686
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]

Great. So what can we learn from all this?

The schemaname, tablename and attname columns seem obvious. inherited simply tells whether the statistics also cover values from tables that inherit from this one.

So, if I created a table:

create table z () inherits (test);

and then added some data to this z table, the statistics for test would also contain rows with inherited = true.
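For illustration (a sketch, assuming the z table created above), after adding some rows to z and re-analyzing, pg_stats should contain a second set of rows for test's columns, this time with inherited = t, covering the whole inheritance tree:

insert into z (all_the_same, almost_unique)
select 123, random() * 1000000 from generate_series(1,1000);
analyze test;

select attname, inherited, n_distinct from pg_stats where tablename = 'test';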

The remaining columns mean the following:
  • null_frac: what fraction of the rows in this column is NULL. It is a fraction, so the value is between 0 and 1.
  • avg_width: the average width (in bytes) of the data in this column. Not very interesting when the width is constant (as with int4 here), but for variable-width types (text/varchar/numeric) it can be useful.
  • n_distinct: a very interesting value. If it is positive (1+), it is simply an estimate of the number (not a fraction!) of distinct values, as we see for the all_the_same column, where n_distinct really is 1. If it is negative, the meaning changes: it is the fraction of rows that have a unique value. So for almost_unique the statistics claim that 92.146% of rows hold a unique value, slightly less than the 95.142% implied by the count(distinct) we saw earlier. The value can be off because of the "random sample" business I already mentioned and will explain in detail later.
  • most_common_vals: an array of the most common values in the column.
  • most_common_freqs: how frequent the values from most_common_vals are. Again a fraction, so the maximum is 1 (though that would mean there is only one value in most_common_vals). For almost_unique we see that Postgres "thinks" the values 21606, 27889, 120502, 289914, 417495 and 951355 are the most common, which is not really true. Once again, the "random sample" effect is to blame.
  • histogram_bounds: an array of values that divides (or should divide; again, it all depends on the "random sample") the whole dataset into groups with the same number of rows. That is, the number of rows with almost_unique between 2 and 10560 is (more or less) the same as the number of rows with almost_unique between 931785 and 940716.
  • correlation: a very interesting statistic; it shows whether there is a correlation between the physical ordering of rows on disk and the column values. It ranges from -1 to 1, and the closer it is to -1/1, the stronger the correlation. For example, after running "CLUSTER test USING i2", which rewrites the table in almost_unique order, I got a correlation of 0.919358, much better than the previous -0.000468686.
  • most_common_elems, most_common_elem_freqs and elem_count_histogram: the same as most_common_vals, most_common_freqs and histogram_bounds, but for non-scalar data types (i.e. arrays, tsvectors and the like).

Based on this data, PostgreSQL can roughly estimate how many rows any given part of a query will return and, based on that, decide whether it is better to use a seq scan, an index scan or a bitmap index scan. And when joining tables, which join method should be faster: Hash Join, Merge Join, or perhaps Nested Loop.
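As a rough illustration of the arithmetic (a simplified sketch of the idea, not the planner's actual code path), you can reproduce the rows=1 estimate from the earlier index scan plan using pg_stats and pg_class alone. For an equality condition on a value that is not in most_common_vals, the planner takes the fraction of rows not covered by the most common values and spreads it evenly over the remaining distinct values:

select round(
        c.reltuples * (1 - mcv.freq_sum)
        / (abs(s.n_distinct) * c.reltuples - array_length(s.most_common_freqs, 1))
    ) as estimated_rows
from pg_stats s
-- for this sketch we ignore schemas and assume a single table named test
join pg_class c on c.relname = s.tablename
-- sum of most_common_freqs, i.e. the fraction of rows covered by the MCV list
cross join lateral (
    select sum(f) as freq_sum from unnest(s.most_common_freqs) f
) mcv
where s.tablename = 'test'
  and s.attname = 'almost_unique'
  and not s.inherited;
-- n_distinct is negative here, so abs(n_distinct) * reltuples approximates
-- the number of distinct values

On this data the result should come out as 1, which is, roughly, why the planner said rows=1 for almost_unique = 123.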

If you looked at the output above carefully, you may have wondered why the most_common_vals/most_common_freqs/histogram_bounds arrays contain so many values. Why so many?

The reason is simple: it's a setting. In postgresql.conf you will find the variable default_statistics_target, which tells Postgres how many values to store in these arrays. In my case (the default) it is 100, but you can easily change it, either in postgresql.conf or even per column, like this:

alter table test alter column almost_unique set statistics 5;
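If you want to check which per-column targets are currently in effect, one way (a sketch; in the Postgres versions this article is based on, -1 means "use default_statistics_target") is to look at pg_attribute:

select attname, attstattarget
from pg_attribute
where attrelid = 'test'::regclass and attnum > 0;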

After the ALTER (and a fresh ANALYZE), the data in pg_stats gets much shorter:

select * from pg_stats where tablename = 'test' and not inherited and attname = 'almost_unique';
-[ RECORD 1 ]----------+---------------------------------------------------------
schemaname | public
tablename | test
attname | almost_unique
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -0.92112
most_common_vals | {114832,3185,3774,6642,11984}
most_common_freqs | {0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
histogram_bounds | {2,199470,401018,596414,798994,999964}
correlation | 1
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]

Changing the statistics target also has another effect.

Let me show you. First, I'll roll back the statistics-target change I made with the ALTER TABLE statement:

alter table test alter column almost_unique set statistics -1;

Now do the following:

$ analyze verbose test;
INFO: analyzing "public.test"
INFO: "test": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, estimated total of 100,000 rows
ANALYZE

$ alter table test alter column almost_unique set statistics 10;
ALTER TABLE

$ alter table test alter column all_the_same set statistics 10;
ALTER TABLE

$ analyze verbose test;
INFO: analyzing "public.test"
INFO: "test": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 3000 rows in sample, estimated total of 100,000 rows
ANALYZE

Note that the second ANALYZE sampled only 3000 rows, not 30000 like the first one.

This is the "random sample".

Analyzing all rows would be prohibitively expensive for any medium or large table.

So Postgres does something smarter.

First, it reads a random subset of the table's pages (remember: each page is 8kB of data). How many? 300 * statistics_target.

This means that in my case, with default_statistics_target = 100, it would read up to 30000 pages (my table doesn't have that many, so Postgres read them all).

From these pages ANALYZE collects only information about live and dead rows. It then takes a random sample of rows, again 300 * statistics target, and computes the column statistics from that sample.
In my case the table has 100,000 rows, but with default_statistics_target = 100 only about a third of them were sampled. And with the per-column target lowered to 10, the number of sampled rows was even smaller: just 3,000.
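You can check the page and (estimated) row counts Postgres keeps for the table yourself; a small sketch using pg_class:

select relpages, reltuples from pg_class where relname = 'test';

For this table it should report the same 443 pages and roughly 100000 rows that ANALYZE VERBOSE printed above.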

You could say: OK, but then these statistics are not exact. It could happen that some super-common value never shows up in any of the scanned rows. Of course, you're right, it's possible. Although not very likely: you are taking a random sample of the data, and the chances of picking a slice of the table that happens to contain no rows with a value present throughout the rest of the table are negligible.

It also means that occasionally running ANALYZE will "break" your queries. For example, you will get statistics based on a different set of pages and rows, so that some values are missed (or the other way around: most_common_vals will contain values that are not actually common, simply because Postgres happened to pick the pages/rows where they occur). And based on such statistics Pg will generate suboptimal plans.

If you hit such a situation, the fix is quite simple: increase the statistics target. This makes ANALYZE work harder and scan more rows, so the chance of it happening again becomes even smaller.
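For example, to raise the target only for the problematic column (1000 here is just an illustrative value, pick whatever fits your data) and refresh the statistics:

alter table test alter column almost_unique set statistics 1000;
ALTER TABLE

analyze test;
ANALYZE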

But setting the statistics target to a large value has its downsides. First, ANALYZE has to do more work, but that is an operational concern, so we (usually) don't worry about it too much. The main problem is that the more data there is in pg_statistic, the more data the Pg planner has to consider. So while it may be tempting to set default_statistics_target to its maximum of 10,000, in practice I have never seen a database with a value that high.

The current default of 100 has been in place since version 8.4. In earlier versions the default was 10, and on IRC you would often see advice to increase it. With 100, things are more or less fine out of the box.

The last thing I should tell you about, though I don't really want to, are the settings that make the Postgres planner use a different operation.

First, let me explain why I don't want to talk about them: I know they can easily be abused. So remember: these settings are for finding problems, not for solving them. Any application that uses them in normal operation should, at the very least, be suspected of being broken. And yes, I know it is sometimes necessary. But that "sometimes" happens very rarely.

Now that I've warned you, let's see what they can do.

In postgresql.conf you have several settings:

enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_indexonlyscan = on
enable_material = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on

These settings are used to disable the selected operations.

For example, setting enable_seqscan to false (you can do this with the SET command in an SQL session, no need to modify postgresql.conf) will make the planner do everything it can to avoid a sequential scan.

And because a sequential scan sometimes cannot be avoided (for example, when the table has no indexes), these settings don't actually disable the operation; they just attach a huge cost to it.
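If you do experiment with them, it is safer to scope the change to a single transaction with SET LOCAL, so it cannot leak into the rest of the session; a minimal sketch:

begin;
set local enable_seqscan = false;
explain select * from test where all_the_same = 123;
rollback;

The setting disappears as soon as the transaction ends.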

Here is an example. For our test table we know that a search on "all_the_same = 123" will use a sequential scan, because that is the cheapest option:

explain select * from test where all_the_same = 123;
QUERY PLAN 
------------------------------------------------------------
Seq Scan on test (cost=0.00..1693.00 rows=100000 width=8)
Filter: (all_the_same = 123)
(2 rows)

But if we disable seq scan:

set enable_seqscan = false;
SET
explain select * from test where all_the_same = 123;
QUERY PLAN 
-----------------------------------------------------------------------
Index Scan using i1 on test (cost=0.29..3300.29 rows=100000 width=8)
Index Cond: (all_the_same = 123)
(2 rows)

We see that the estimated cost of getting the same data via an index scan is roughly twice as high (3300.29 vs. 1693).

If I remove the index i1:

drop index i1;
DROP INDEX
set enable_seqscan = false;
SET
explain select * from test where all_the_same = 123;
QUERY PLAN 
-----------------------------------------------------------------------------
Seq Scan on test (cost=10000000000.00..10000001693.00 rows=100000 width=8)
Filter: (all_the_same = 123)
(2 rows)

And we see that when there is no option other than a sequential scan (interestingly, Postgres did not choose to do an index scan on i2, even though that index contains pointers to all rows in the table), the cost jumps to over 10,000,000,000, which is exactly what enable_* = false does.
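If you have been following along, remember to clean up afterwards: reset the setting and recreate the dropped index (a sketch):

reset enable_seqscan;
RESET

create index i1 on test (all_the_same);
CREATE INDEX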

I think that's all. If you have read the whole series, you should now have enough knowledge to understand what is happening and, more importantly, why.
