How do I take a DISTINCT ON subquery that is ordered by a separate column, and make it fast?

Question

(In other words: with a query and data very similar to the question "Selecting rows ordered by some column and distinct on another", how can I get it to run fast?) Postgres 11.

I have a table prediction with columns (article_id, prediction_date, predicted_as, article_published_date) that represents the output from a classifier over a set of articles.

New articles are frequently added to a separate table (Represented by the FK article_id), and new predictions are added as we tune our classifier.

Sample data:

| id      | article_id |  predicted_as | prediction_date | article_published_date
| 1009381 | 362718     |  negative     | 2018-07-27      | 2018-06-26
| 1009382 | 362718     |  positive     | 2018-08-12      | 2018-06-26
| 1009383 | 362719     |  positive     | 2018-08-13      | 2010-09-22
| 1009384 | 362719     |  positive     | 2018-09-28      | 2010-09-22
| 1009385 | 362719     |  negative     | 2018-10-01      | 2010-09-22

Create table script:

create table prediction
(
    id serial not null
        constraint prediction_pkey
            primary key,
    article_id integer not null
        constraint prediction_article_id_fkey
            references article,
    predicted_as classifiedas not null,
    prediction_date date not null,
    article_published_date date not null
);

create index prediction_article_id_prediction_date_idx
    on prediction (article_id asc, prediction_date desc);

We frequently want to view the most recent classification for each article. In order to do so we use:

SELECT DISTINCT ON (article_id) article_id, id, article_published_date
FROM prediction
ORDER BY article_id, prediction_date desc

which returns something like:

| id     | article_id |  predicted_as | prediction_date | article_published_date
| 120950 | 1          | negative      | 2018-06-29      | 2018-03-25
| 120951 | 2          | negative      | 2018-06-29      | 2018-03-19

With an index on (article_id, prediction_date desc), this query runs very quickly (~15ms). This is the explain plan:

Unique  (cost=0.56..775374.53 rows=1058394 width=20)
  ->  Index Scan using prediction_article_id_prediction_date_id_idx on prediction  (cost=0.56..756071.98 rows=7721023 width=20)

So far so good.

The problem occurs when I want to sort this result by article_published_date. For example:

explain (analyze, buffers)
select *
  from (
         select distinct on (article_id) article_id, id, article_published_date
         from prediction
         order by article_id, prediction_date desc
       ) most_recent_predictions
  order by article_published_date desc
  limit 3;

This works, but the query takes ~3-4 seconds to run, making it too slow to use directly to respond to a web request.

Here is the explain plan:

Limit  (cost=558262.52..558262.53 rows=3 width=12) (actual time=4748.977..4748.979 rows=3 loops=1)
  Buffers: shared hit=7621849 read=9051
  ->  Sort  (cost=558262.52..560851.50 rows=1035593 width=12) (actual time=4748.975..4748.976 rows=3 loops=1)
        Sort Key: most_recent_predictions.article_published_date DESC
        Sort Method: top-N heapsort  Memory: 25kB
        Buffers: shared hit=7621849 read=9051
        ->  Subquery Scan on most_recent_predictions  (cost=0.43..544877.67 rows=1035593 width=12) (actual time=0.092..4508.464 rows=1670807 loops=1)
              Buffers: shared hit=7621849 read=9051
              ->  Result  (cost=0.43..534521.74 rows=1035593 width=16) (actual time=0.092..4312.916 rows=1670807 loops=1)
                    Buffers: shared hit=7621849 read=9051
                    ->  Unique  (cost=0.43..534521.74 rows=1035593 width=16) (actual time=0.090..4056.644 rows=1670807 loops=1)
                          Buffers: shared hit=7621849 read=9051
                          ->  Index Scan using prediction_article_id_prediction_date_idx on prediction  (cost=0.43..515295.09 rows=7690662 width=16) (actual time=0.089..3248.250 rows=7690662 loops=1)
                                Buffers: shared hit=7621849 read=9051
Planning Time: 0.130 ms
Execution Time: 4749.007 ms

Is there any way to make this query run more quickly, or will I have to resort to refreshing a materialized view or setting up a trigger system to get this data quickly?

For reference:

the prediction table has 7.7M rows
there are 1.7M distinct article_ids in the prediction table
there is an index on (article_id, prediction_date desc) as well as one on article_published_date desc
VACUUM ANALYSE has been run
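
For comparison, the materialized-view fallback mentioned above might look roughly like the sketch below. The view name most_recent_prediction is hypothetical, and how often to refresh is left open; this only illustrates the shape of that approach, not a recommendation.

```sql
-- Hypothetical name; stores the latest prediction per article.
CREATE MATERIALIZED VIEW most_recent_prediction AS
SELECT DISTINCT ON (article_id) article_id, id, article_published_date
FROM   prediction
ORDER  BY article_id, prediction_date DESC;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON most_recent_prediction (article_id);

-- Supports the final ORDER BY ... LIMIT.
CREATE INDEX ON most_recent_prediction (article_published_date DESC);

-- Run periodically; CONCURRENTLY avoids blocking readers during the refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY most_recent_prediction;

-- The web request then reads from the precomputed result.
SELECT article_id, id, article_published_date
FROM   most_recent_prediction
ORDER  BY article_published_date DESC
LIMIT  3;
```

The trade-off is staleness: results are only as fresh as the last refresh.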

About `limit 3`: is that for testing, or do you actually just want the top 3? Your explanation is not in sync with the query. Also, `EXPLAIN` is good, `EXPLAIN (ANALYZE, BUFFERS)` is *much* better to help us understand. And I assume there is also a table `article` holding all relevant (unique) articles? (You mentioned an FK ...) — Erwin Brandstetter, Mar 17 '19 at 02:45
You provided good information, better than most. Still (as *always*), actual `CREATE TABLE` scripts would be more helpful. For starters, it matters whether columns are defined `NOT NULL`. — Erwin Brandstetter, Mar 17 '19 at 03:11
Hi @ErwinBrandstetter - I've updated the `explain` with `(ANALYZE, BUFFERS)` and added the create table script. — mjames, Mar 18 '19 at 19:12
There is a separate article table that is accessed through the FK, but it is not actually accessed for this query. The `article_id` on the table is sufficient. — mjames, Mar 18 '19 at 19:12
We actually need the top ~3-10. 3 would suffice. The updated query + explain reflects a limit of 3 — mjames, Mar 19 '19 at 17:56
OK, my idea should work for any trivially small `LIMIT`. (But not for large parts of the result.) — Erwin Brandstetter, Mar 19 '19 at 18:34

score 1 · Answer 1 · answered Mar 17 '19 at 02:23

One thing that you could try is to use the window function ROW_NUMBER() OVER(...) instead of DISTINCT ON() (which imposes constraints on the ORDER BY clause). This method is functionally equivalent to your second query, and might be able to take advantage of existing indexes:

SELECT *
FROM (
    SELECT 
        article_id, 
        id, 
        article_published_date,
        ROW_NUMBER() OVER(PARTITION BY article_id ORDER BY prediction_date DESC) rn
    FROM prediction 
) x WHERE rn = 1
ORDER BY article_published_date DESC
LIMIT 3;

Demo on DB Fiddle.

answered Mar 17 '19 at 02:23

GMB

Hey @GMB - the above query is about 3x slower than the distinct on. The `EXPLAIN (ANALYZE, BUFFERS)` is up here: https://pastebin.com/b6fZy5nP – mjames Mar 18 '19 at 19:18
@mjames: thanks for the interesting feedback. I guess that somehow Postgres is able to optimize `DISTINCT ON()` better than the standard `ROW_NUMBER()`, which is not that surprising after all, since the former is a bit limited compared to the latter (see my answer). – GMB Mar 18 '19 at 21:38
@mjames: I can see that you accepted an answer, but did not upvote any, while you have enough rep for that. Wasn't any of these answers useful to you? – GMB Mar 18 '19 at 21:39
Thanks @GMB - didn't realize that that was proper etiquette. All of the answers were well thought out - is the best practice to upvote each of them, or just the solution? – mjames Mar 18 '19 at 23:56

score 1 · Accepted Answer · answered Mar 17 '19 at 04:22

I wonder if you can make this work:

select article_id, id, article_published_date
from prediction p
where p.prediction_date = (select max(p2.prediction_date)
                           from prediction p2
                           where p2.article_id = p.article_id
                          )
order by article_published_date desc;

Then use these two indexes:

(article_published_date desc, prediction_date, article_id, id)
(article_id, prediction_date desc).
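
Spelled out as DDL, the two indexes could be created as in the sketch below. The first index name is hypothetical; the second is the name already used in the question. The LIMIT 3 is taken from the question and is not part of the query above.

```sql
-- Hypothetical index name; column list as given above.
CREATE INDEX IF NOT EXISTS prediction_published_covering_idx
    ON prediction (article_published_date DESC, prediction_date, article_id, id);

-- Already exists in the question under this name.
CREATE INDEX IF NOT EXISTS prediction_article_id_prediction_date_idx
    ON prediction (article_id, prediction_date DESC);

-- Check the plan with the LIMIT from the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT article_id, id, article_published_date
FROM   prediction p
WHERE  p.prediction_date = (SELECT max(p2.prediction_date)
                            FROM   prediction p2
                            WHERE  p2.article_id = p.article_id)
ORDER  BY article_published_date DESC
LIMIT  3;
```

Note that if an article has two predictions on the same latest prediction_date, this query returns both rows.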

answered Mar 17 '19 at 04:22

Gordon Linoff

Hey Gordon - this appears to do the job! Thanks – mjames Mar 18 '19 at 19:22
This does not remove duplicates on `(article_published_date, prediction_date)` like the original does. – Erwin Brandstetter Mar 19 '19 at 13:37

Erwin Brandstetter · Answer 3 · 2019-03-19T14:14:01.443

As long as you only want a trivially small number of result rows (LIMIT 3 in your example), and if there is any positive correlation between article_published_date and prediction_date, this query should be radically faster, as it only has to scan a few tuples from the top of the added index (and recheck with the second index).

Have these two indexes:

CREATE INDEX ON prediction (article_published_date DESC, prediction_date DESC, article_id DESC);

CREATE INDEX ON prediction (article_id, prediction_date DESC);

Recursive Query:

WITH RECURSIVE cte AS (
   (
   SELECT p.article_published_date, p.article_id, p.prediction_date, ARRAY[p.article_id] AS a_ids
   FROM   prediction p
   WHERE  NOT EXISTS (  -- no later row for same article
      SELECT FROM prediction
      WHERE  article_id = p.article_id
      AND    prediction_date > p.prediction_date
      )
   ORDER  BY p.article_published_date DESC, p.prediction_date DESC, p.article_id DESC
   LIMIT  1
   )
   UNION ALL
   SELECT p.article_published_date, p.article_id, p.prediction_date, a_ids || p.article_id
   FROM   cte c, LATERAL (
      SELECT p.article_published_date, p.article_id, p.prediction_date
      FROM   prediction p
      WHERE (p.article_published_date, p.prediction_date, p.article_id)
          < (c.article_published_date, c.prediction_date, c.article_id)
      AND    p.article_id <> ALL(a_ids)   -- different article
      AND    NOT EXISTS (                 -- no later row for same article
         SELECT FROM prediction
         WHERE  article_id = p.article_id
         AND    prediction_date > p.prediction_date
         )
      ORDER  BY p.article_published_date DESC, p.prediction_date DESC, p.article_id DESC
      LIMIT  1
      ) p
   )
SELECT article_published_date, article_id, prediction_date
FROM   cte
LIMIT  3;

Here is a plpgsql solution doing the same, probably slightly faster:

CREATE OR REPLACE FUNCTION f_top_n_predictions(_n int = 3)
  RETURNS TABLE (_article_published_date date, _article_id int, _prediction_date date) AS
$func$
DECLARE
   a_ids int[];
BEGIN
   FOR _article_published_date, _article_id, _prediction_date IN
      SELECT article_published_date, article_id, prediction_date
      FROM   prediction
      ORDER  BY article_published_date DESC, prediction_date DESC, article_id DESC
   LOOP
      IF _article_id = ANY(a_ids)
      OR EXISTS (SELECT FROM prediction p
                 WHERE  p.article_id = _article_id
                 AND    p.prediction_date > _prediction_date) THEN
         -- do nothing         
      ELSE
         RETURN NEXT;
         a_ids := a_ids || _article_id;
         EXIT WHEN cardinality(a_ids) >= _n;
      END IF;
   END LOOP;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_top_n_predictions();
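
To sanity-check the function against the original DISTINCT ON query, something like the following sketch could be used. It assumes that article_published_date is the same for every row of a given article, and it adds the same tie-breaking sort columns to the slow query so that both sides are deterministic. An empty result means the two approaches agree for the top 10.

```sql
WITH fast AS (
   SELECT _article_published_date AS article_published_date
        , _article_id             AS article_id
        , _prediction_date        AS prediction_date
   FROM   f_top_n_predictions(10)
   )
, slow AS (
   SELECT article_published_date, article_id, prediction_date
   FROM  (
      SELECT DISTINCT ON (article_id)
             article_published_date, article_id, prediction_date
      FROM   prediction
      ORDER  BY article_id, prediction_date DESC
      ) sub
   ORDER  BY article_published_date DESC, prediction_date DESC, article_id DESC
   LIMIT  10
   )
-- Symmetric difference: rows found by one approach but not the other.
(TABLE fast EXCEPT TABLE slow)
UNION ALL
(TABLE slow EXCEPT TABLE fast);
```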

I'll add explanation if it works for you, since the explanation is more work than the query itself.

Apart from that, with more than a few predictions per article, and with an additional table article, this query becomes a contender:

SELECT p.*
FROM   article a
CROSS  JOIN LATERAL (
   SELECT p.article_published_date, p.article_id, p.prediction_date
   FROM   prediction p
   WHERE  p.article_id = a.id
   ORDER  BY p.prediction_date DESC
   LIMIT  1
   ) p
ORDER  BY p.article_published_date DESC;

But you don't need this if the query above does the job. It gets interesting for a bigger LIMIT or none at all.

db<>fiddle here, demonstrating all.

Hi @ErwinBrandstetter - the first query is much slower than my naive approach - it takes 43 seconds to run (vs 4). I added both indices, and analyzed the table. The EXPLAIN (ANALYZE, BUFFERS) is up here: https://pastebin.com/8D5rGQDE — mjames, Mar 18 '19 at 19:16
Ah! The index was wrong. The crucial detail is to have `article_published_date` first. Can you try once more. I am aiming for microseconds, not seconds. Also, how does my second query compare to your original? With just 5 to 6 rows per article, `DISTINCT ON` might still be faster than my second query. — Erwin Brandstetter, Mar 19 '19 at 14:04
@mjames: Did you find time to try with the fixed index? Or the plpgsql alternative? — Erwin Brandstetter, Mar 22 '19 at 13:34
