### Setup
I am building on **@Jack's setup** to make it easier for people to follow and compare. Tested with **PostgreSQL 9.1.4**.
CREATE TABLE lexikon (
lex_id serial PRIMARY KEY
, word text
, frequency int NOT NULL -- we'd need to do more if NULL was allowed
, lset int
);
INSERT INTO lexikon(word, frequency, lset)
SELECT 'w' || g -- shorter with just 'w'
, (1000000 / row_number() OVER (ORDER BY random()))::int
, g
FROM generate_series(1,1000000) g;
From here on I take a different route:
ANALYZE lexikon;
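Optionally, you can eyeball the skew of the generated frequencies before going on (just a sanity check, not part of the solution):

```
SELECT frequency, count(*) AS row_ct
FROM   lexikon
GROUP  BY 1
ORDER  BY 1 DESC
LIMIT  10;
```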
### Auxiliary table
This solution does not add columns to the original table; it just needs a tiny helper table. I placed it in the schema `public`; use any schema of your choice.
CREATE TABLE public.lex_freq AS
WITH x AS (
SELECT DISTINCT ON (f.row_min)
f.row_min, c.row_ct, c.frequency
FROM (
SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
FROM lexikon
GROUP BY 1
) c
JOIN ( -- list of steps in recursive search
VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
) f(row_min) ON c.row_ct >= f.row_min -- match next greater number
ORDER BY f.row_min, c.row_ct, c.frequency DESC
)
, y AS (
SELECT DISTINCT ON (frequency)
row_min, row_ct, frequency AS freq_min
, lag(frequency) OVER (ORDER BY row_min) AS freq_max
FROM x
ORDER BY frequency, row_min
-- if one frequency spans multiple ranges, pick the lowest row_min
)
SELECT row_min, row_ct, freq_min
, CASE freq_min <= freq_max
WHEN TRUE THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
WHEN FALSE THEN 'frequency = ' || freq_min
ELSE 'frequency >= ' || freq_min
END AS cond
FROM y
ORDER BY row_min;
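Display the helper table with a plain `SELECT`:

```
SELECT * FROM public.lex_freq ORDER BY row_min;
```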
Table looks like this:
row_min | row_ct  | freq_min | cond
--------+---------+----------+---------------------------------------
    400 |     400 |     2500 | frequency >= 2500
   1600 |    1600 |      625 | frequency >= 625 AND frequency < 2500
   6400 |    6410 |      156 | frequency >= 156 AND frequency < 625
  25000 |   25000 |       40 | frequency >= 40 AND frequency < 156
 100000 |  100000 |       10 | frequency >= 10 AND frequency < 40
 200000 |  200000 |        5 | frequency >= 5 AND frequency < 10
 400000 |  500000 |        2 | frequency >= 2 AND frequency < 5
 600000 | 1000000 |        1 | frequency = 1
As the column `cond` is going to be used in dynamic SQL further down, you have to make this table ***secure***. Always schema-qualify the table if you cannot be sure of an appropriate current `search_path`, and revoke write privileges from `public` (and any other untrusted role):
REVOKE ALL ON public.lex_freq FROM public;
GRANT SELECT ON public.lex_freq TO public;
The table `lex_freq` serves three purposes:
- Create needed [partial indexes][1] automatically.
- Provide steps for iterative function.
- Meta information for tuning.
### Indexes
This `DO` statement creates *all* needed indexes:
DO
$$
DECLARE
_cond text;
BEGIN
FOR _cond IN
SELECT cond FROM public.lex_freq
LOOP
IF _cond LIKE 'frequency =%' THEN
EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
ELSE
EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
END IF;
END LOOP;
END
$$;
All of these **partial** indexes together span the table once. They are about the same size as one basic index on the whole table:
SELECT pg_size_pretty(pg_relation_size('lexikon')); -- 50 MB
SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB
Only 21 MB of indexes for a 50 MB table so far.
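If you want to verify the per-index numbers on your own installation, this catalog query lists every index on `lexikon` with its size (purely a check, not part of the solution):

```
SELECT c.relname AS index_name
     , pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM   pg_class c
JOIN   pg_index i ON i.indexrelid = c.oid
WHERE  i.indrelid = 'lexikon'::regclass
ORDER  BY c.relname;
```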
I create most of the partial indexes on `(lset, frequency DESC)`. The second column only helps in special cases. But since both columns are of type `integer`, the second column does not make the index any bigger, due to the specifics of data [alignment in combination with MAXALIGN][2] in PostgreSQL. It's a small win for hardly any cost.
There is no point in doing that for partial indexes that only span a single frequency. Those are just on `(lset)`. Created indexes look like this:
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
-- ...
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
CREATE INDEX ON lexikon(lset) WHERE frequency = 1;
### Function
The function is somewhat similar in style to @Jack's solution:
CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
RETURNS SETOF lexikon AS
$func$
DECLARE
_n int;
_rest int := _limit; -- init with _limit param
_cond text;
BEGIN
FOR _cond IN
SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
LOOP
-- RAISE NOTICE '_cond: %, _limit: %', _cond, _rest; -- for debugging
RETURN QUERY EXECUTE '
SELECT *
FROM public.lexikon
WHERE ' || _cond || '
AND lset >= $1
AND lset <= $2
ORDER BY frequency DESC
LIMIT $3'
USING _lset_min, _lset_max, _rest;
GET DIAGNOSTICS _n = ROW_COUNT;
_rest := _rest - _n;
EXIT WHEN _rest < 1;
END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;
Key differences:
- **dynamic SQL** with `RETURN QUERY EXECUTE`.
As we loop through the steps, a different query plan may be beneficial. The query plan for static SQL is generated once and then reused, which can save some overhead. But in this case the query is simple and the values are very different, so dynamic SQL will be a big win. (A sketch for inspecting the per-step plans follows below this list.)
- **Dynamic `LIMIT`** for every query step.
This helps in multiple ways: First, rows are only fetched as needed. In combination with dynamic SQL this may also generate different query plans to begin with. Second: No need for an additional `LIMIT` in the function call to trim the surplus.
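If you want to see which plan each dynamic step actually gets, the `auto_explain` module can log the nested statements. A sketch, purely for testing: it assumes you have the privileges to `LOAD` the module, and the plans go to the server log.

```
LOAD 'auto_explain';                           -- needs appropriate privileges
SET auto_explain.log_min_duration      = 0;    -- log every statement
SET auto_explain.log_nested_statements = on;   -- include queries run inside the function
SELECT * FROM f_search(20000, 30000, 5);       -- per-step plans show up in the server log
```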
## Benchmark
### Setup
I picked four examples and ran three different tests with each. For each, I compared the best of five runs with a warm cache:
1. The raw SQL query of the form:
SELECT *
FROM lexikon
WHERE lset >= 20000
AND lset <= 30000
ORDER BY frequency DESC
LIMIT 5;
2. The same query after creating this index:
CREATE INDEX ON lexikon(lset);
Needs about the same space as all my partial indexes together:
SELECT pg_size_pretty(pg_total_relation_size('lexikon'));  -- 93 MB
3. The function call:
SELECT * FROM f_search(20000, 30000, 5);
### Results
```
SELECT * FROM f_search(20000, 30000, 5);
```
```
1: Total runtime: 315.458 ms
2: Total runtime: 36.458 ms
3: Total runtime: 0.330 ms
```
```
SELECT * FROM f_search(60000, 65000, 100);
```
```
1: Total runtime: 294.819 ms
2: Total runtime: 18.915 ms
3: Total runtime: 1.414 ms
```
```
SELECT * FROM f_search(10000, 70000, 100);
```
```
1: Total runtime: 426.831 ms
2: Total runtime: 217.874 ms
3: Total runtime: 1.611 ms
```
```
SELECT * FROM f_search(1, 1000000, 5);
```
```
1: Total runtime: 2458.205 ms
2: Total runtime: 2458.205 ms -- for large ranges of lset, seq scan is faster than index
3: Total runtime: 0.266 ms
```
### Conclusion
As expected, the benefit from the function grows with bigger ranges of `lset` and smaller `LIMIT`.
With **very small ranges of `lset`**, the raw query in combination with the index is actually *faster*. You'll want to test and maybe branch: raw query for small ranges of `lset`, else function call. You could even just **build that into the function** for a "best of both worlds" - that's what I would do.
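That branching could look something like this. A sketch only: `f_search_smart` and the threshold of 5000 are made-up names / values to illustrate the idea; tune the threshold for your data.

```
CREATE OR REPLACE FUNCTION f_search_smart(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
BEGIN
   IF _lset_max - _lset_min < 5000 THEN     -- small range: plain query wins (threshold is an assumption)
      RETURN QUERY
      SELECT *
      FROM   public.lexikon
      WHERE  lset BETWEEN _lset_min AND _lset_max
      ORDER  BY frequency DESC
      LIMIT  _limit;
   ELSE                                     -- large range: stepped search from above
      RETURN QUERY
      SELECT * FROM f_search(_lset_min, _lset_max, _limit);
   END IF;
END
$func$ LANGUAGE plpgsql STABLE;
```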
Depending on your data distribution and typical queries, more steps in `lex_freq` may help performance. Test to find the sweet spot. With the tools presented here, it should be easy to test.
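Purely as a convenience for such testing, here is one way to tear everything down again before rebuilding `lex_freq` with a different `VALUES` list. An assumption-laden sketch: it drops *every* index on `lexikon` except the primary key (including the plain `lexikon(lset)` index from the benchmark), so only run it on the test setup.

```
DO
$$
DECLARE
   _idx regclass;
BEGIN
   FOR _idx IN
      SELECT indexrelid::regclass
      FROM   pg_index
      WHERE  indrelid = 'lexikon'::regclass
      AND    NOT indisprimary           -- keep the primary key
   LOOP
      EXECUTE 'DROP INDEX ' || _idx;
   END LOOP;
END
$$;

DROP TABLE public.lex_freq;
-- then re-run the CREATE TABLE public.lex_freq ... query with a different VALUES list
-- and the index-creating DO statement from above
```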
[1]: https://www.postgresql.org/docs/current/static/indexes-partial.html
[2]: https://stackoverflow.com/questions/2966524/calculating-and-saving-space-in-postgresql/7431468#7431468