### Setup
I am building on **@Jack's setup** to make it easier for people to follow and compare. Tested with **PostgreSQL 9.1.4**.
CREATE TABLE lexikon (
lex_id serial PRIMARY KEY
, word text
, frequency int NOT NULL -- we'd need to do more if NULL was allowed
, lset int
);
INSERT INTO lexikon(word, frequency, lset)
SELECT 'w' || g -- shorter with just 'w'
, (1000000 / row_number() OVER (ORDER BY random()))::int
, g
FROM generate_series(1,1000000) g;
From here on I take a different route:
ANALYZE lexikon;
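Optionally, you can eyeball the skew of the generated frequencies before going on (just a sanity check, not part of the solution):

```
SELECT frequency, count(*) AS row_ct
FROM   lexikon
GROUP  BY 1
ORDER  BY 1 DESC
LIMIT  10;
```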
### Auxiliary table
This solution does not add columns to the original table; it just needs a tiny helper table. I placed it in the schema `public`; use any schema of your choice.
CREATE TABLE public.lex_freq AS
WITH x AS (
SELECT DISTINCT ON (f.row_min)
f.row_min, c.row_ct, c.frequency
FROM (
SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
FROM lexikon
GROUP BY 1
) c
JOIN ( -- list of steps in recursive search
VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
) f(row_min) ON c.row_ct >= f.row_min -- match next greater number
ORDER BY f.row_min, c.row_ct, c.frequency DESC
)
, y AS (
SELECT DISTINCT ON (frequency)
row_min, row_ct, frequency AS freq_min
, lag(frequency) OVER (ORDER BY row_min) AS freq_max
FROM x
ORDER BY frequency, row_min
-- if one frequency spans multiple ranges, pick the lowest row_min
)
SELECT row_min, row_ct, freq_min
, CASE freq_min <= freq_max
WHEN TRUE THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
WHEN FALSE THEN 'frequency = ' || freq_min
ELSE 'frequency >= ' || freq_min
END AS cond
FROM y
ORDER BY row_min;
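Display the helper table with a plain `SELECT`:

```
SELECT * FROM public.lex_freq ORDER BY row_min;
```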
Table looks like this:
row_min | row_ct  | freq_min | cond
--------+---------+----------+---------------------------------------
    400 |     400 |     2500 | frequency >= 2500
   1600 |    1600 |      625 | frequency >= 625 AND frequency < 2500
   6400 |    6410 |      156 | frequency >= 156 AND frequency < 625
  25000 |   25000 |       40 | frequency >= 40 AND frequency < 156
 100000 |  100000 |       10 | frequency >= 10 AND frequency < 40
 200000 |  200000 |        5 | frequency >= 5 AND frequency < 10
 400000 |  500000 |        2 | frequency >= 2 AND frequency < 5
 600000 | 1000000 |        1 | frequency = 1
As the column `cond` is going to be used in dynamic SQL further down, you have to make this table ***secure***. Always schema-qualify the table if you cannot be sure of an appropriate current `search_path`, and revoke write privileges from `public` (and any other untrusted role):
REVOKE ALL ON public.lex_freq FROM public;
GRANT SELECT ON public.lex_freq TO public;
The table `lex_freq` serves three purposes:
- Create needed [partial indexes][1] automatically.
- Provide steps for iterative function.
- Meta information for tuning.
### Indexes
This `DO` statement creates *all* needed indexes:
DO
$$
DECLARE
_cond text;
BEGIN
FOR _cond IN
SELECT cond FROM public.lex_freq
LOOP
IF _cond LIKE 'frequency =%' THEN
EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
ELSE
EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
END IF;
END LOOP;
END
$$;
All of these **partial** indexes together span the table once. They are about the same size as one basic index on the whole table:
SELECT pg_size_pretty(pg_relation_size('lexikon')); -- 50 MB
SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB
Only 21 MB of indexes for a 50 MB table so far.
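If you want to verify the per-index numbers on your own installation, this catalog query lists every index on `lexikon` with its size (purely a check, not part of the solution):

```
SELECT c.relname AS index_name
     , pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM   pg_class c
JOIN   pg_index i ON i.indexrelid = c.oid
WHERE  i.indrelid = 'lexikon'::regclass
ORDER  BY c.relname;
```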
I create most of the partial indexes on `(lset, frequency DESC)`. The second column only helps in special cases. But since both columns are of type `integer`, the second column does not make the index any bigger, due to the specifics of data [alignment in combination with MAXALIGN][2] in PostgreSQL. It's a small win for hardly any cost.
There is no point in doing that for partial indexes that only span a single frequency. Those are just on `(lset)`. Created indexes look like this:
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
-- ...
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
CREATE INDEX ON lexikon(lset) WHERE frequency = 1;
### Function
The function is somewhat similar in style to @Jack's solution:
CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
RETURNS SETOF lexikon AS
$func$
DECLARE
_n int;
_rest int := _limit; -- init with _limit param
_cond text;
BEGIN
FOR _cond IN
SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
LOOP
-- RAISE NOTICE '_cond: %, _limit: %', _cond, _rest; -- for debugging
RETURN QUERY EXECUTE '
SELECT *
FROM public.lexikon
WHERE ' || _cond || '
AND lset >= $1
AND lset <= $2
ORDER BY frequency DESC
LIMIT $3'
USING _lset_min, _lset_max, _rest;
GET DIAGNOSTICS _n = ROW_COUNT;
_rest := _rest - _n;
EXIT WHEN _rest < 1;
END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;
Key differences:
- **dynamic SQL** with `RETURN QUERY EXECUTE`.
As we loop through the steps, a different query plan may be beneficial. The query plan for static SQL is generated once and then reused, which can save some overhead. But in this case the query is simple and the values are very different, so dynamic SQL will be a big win. (A sketch for inspecting the per-step plans follows below this list.)
- **Dynamic `LIMIT`** for every query step.
This helps in multiple ways: First, rows are only fetched as needed. In combination with dynamic SQL this may also generate different query plans to begin with. Second: No need for an additional `LIMIT` in the function call to trim the surplus.
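If you want to see which plan each dynamic step actually gets, the `auto_explain` module can log the nested statements. A sketch, purely for testing: it assumes you have the privileges to `LOAD` the module, and the plans go to the server log.

```
LOAD 'auto_explain';                           -- needs appropriate privileges
SET auto_explain.log_min_duration      = 0;    -- log every statement
SET auto_explain.log_nested_statements = on;   -- include queries run inside the function
SELECT * FROM f_search(20000, 30000, 5);       -- per-step plans show up in the server log
```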
## Benchmark
### Setup
I picked four examples and ran three different tests with each. For each, I compared the best of five runs with a warm cache:
1. The raw SQL query of the form:
SELECT *
FROM lexikon
WHERE lset >= 20000
AND lset <= 30000
ORDER BY frequency DESC
LIMIT 5;
2. The same query after creating this index:
CREATE INDEX ON lexikon(lset);
Needs about the same space as all my partial indexes together:
SELECT pg_size_pretty(pg_total_relation_size('lexikon'));  -- 93 MB
3. The function call:
SELECT * FROM f_search(20000, 30000, 5);
### Results
```
SELECT * FROM f_search(20000, 30000, 5);
```
```
1: Total runtime: 315.458 ms
2: Total runtime: 36.458 ms
3: Total runtime: 0.330 ms
```
```
SELECT * FROM f_search(60000, 65000, 100);
```
```
1: Total runtime: 294.819 ms
2: Total runtime: 18.915 ms
3: Total runtime: 1.414 ms
```
```
SELECT * FROM f_search(10000, 70000, 100);
```
```
1: Total runtime: 426.831 ms
2: Total runtime: 217.874 ms
3: Total runtime: 1.611 ms
```
```
SELECT * FROM f_search(1, 1000000, 5);
```
```
1: Total runtime: 2458.205 ms
2: Total runtime: 2458.205 ms -- for large ranges of lset, seq scan is faster than index
3: Total runtime: 0.266 ms
```
### Conclusion
As expected, the benefit from the function grows with bigger ranges of `lset` and smaller `LIMIT`.
With **very small ranges of `lset`**, the raw query in combination with the index is actually *faster*. You'll want to test and maybe branch: raw query for small ranges of `lset`, else function call. You could even just **build that into the function** for a "best of both worlds" - that's what I would do.
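That branching could look something like this. A sketch only: `f_search_smart` and the threshold of 5000 are made-up names / values to illustrate the idea; tune the threshold for your data.

```
CREATE OR REPLACE FUNCTION f_search_smart(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
BEGIN
   IF _lset_max - _lset_min < 5000 THEN     -- small range: plain query wins (threshold is an assumption)
      RETURN QUERY
      SELECT *
      FROM   public.lexikon
      WHERE  lset BETWEEN _lset_min AND _lset_max
      ORDER  BY frequency DESC
      LIMIT  _limit;
   ELSE                                     -- large range: stepped search from above
      RETURN QUERY
      SELECT * FROM f_search(_lset_min, _lset_max, _limit);
   END IF;
END
$func$ LANGUAGE plpgsql STABLE;
```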
Depending on your data distribution and typical queries, more steps in `lex_freq` may help performance. Test to find the sweet spot. With the tools presented here, it should be easy to test.
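Purely as a convenience for such testing, here is one way to tear everything down again before rebuilding `lex_freq` with a different `VALUES` list. An assumption-laden sketch: it drops *every* index on `lexikon` except the primary key (including the plain `lexikon(lset)` index from the benchmark), so only run it on the test setup.

```
DO
$$
DECLARE
   _idx regclass;
BEGIN
   FOR _idx IN
      SELECT indexrelid::regclass
      FROM   pg_index
      WHERE  indrelid = 'lexikon'::regclass
      AND    NOT indisprimary           -- keep the primary key
   LOOP
      EXECUTE 'DROP INDEX ' || _idx;
   END LOOP;
END
$$;

DROP TABLE public.lex_freq;
-- then re-run the CREATE TABLE public.lex_freq ... query with a different VALUES list
-- and the index-creating DO statement from above
```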
[1]: https://www.postgresql.org/docs/current/static/indexes-partial.html
[2]: https://stackoverflow.com/questions/2966524/calculating-and-saving-space-in-postgresql/7431468#7431468