Counting distinct rows using recursive cte over non-distinct index

Question

Given the following schema:

CREATE TABLE identifiers (
  id TEXT PRIMARY KEY
);

CREATE TABLE days (
  day DATE PRIMARY KEY
);

CREATE TABLE data (
  id TEXT REFERENCES identifiers
  , day DATE REFERENCES days
  , values NUMERIC[] 
); 
CREATE INDEX ON data (id, day);

What is the best way to count all distinct days between two timestamps? I've tried the following two methods:

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT day) 
FROM data 
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=200331.32..200331.33 rows=1 width=4) (actual time=1647.574..1647.575 rows=1 loops=1)
   ->  Index Only Scan using data_day_sid_idx on data  (cost=0.56..196942.12 rows=1355678 width=4) (actual time=0.348..1180.566 rows=1362532 loops=1)
         Index Cond: ((day >= '2010-01-01'::date) AND (day <= '2011-01-01'::date))
         Heap Fetches: 0
 Total runtime: 1647.865 ms
(5 rows)

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT day) 
FROM days
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=18.95..18.96 rows=1 width=4) (actual time=0.481..0.481 rows=1 loops=1)
   ->  Index Only Scan using days_pkey on days  (cost=0.28..18.32 rows=252 width=4) (actual time=0.093..0.275 rows=252 loops=1)
         Index Cond: ((day >= '2010-01-01'::date) AND (day <= '2011-01-01'::date))
         Heap Fetches: 252
 Total runtime: 0.582 ms
(5 rows)

The COUNT(DISTINCT day) against days runs well, but it requires me to keep a secondary table (days) to keep the performance reasonable. In a general sense, I'd like to test if a recursive cte will allow me to achieve similar performance without maintaining a secondary table. My query looks like this, but doesn't run yet:

EXPLAIN ANALYZE
WITH RECURSIVE cte AS (
   (SELECT day FROM data ORDER BY 1 LIMIT 1)
   UNION ALL
   (  -- parentheses required
   SELECT d.day
   FROM   cte  c
   JOIN   data d ON d.day > c.day
   ORDER  BY 1 LIMIT 1
   )
)
SELECT day 
FROM cte
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';

Updates

Thanks to everyone for the ideas. Looks like maintaining a trigger-based table of distinct days is the best way to go, both storage and performance-wise. Thanks to @Erwin's update, the recursive CTE is back in the running. Very useful.

WITH RECURSIVE cte AS (
   (  -- parentheses required because of LIMIT
   SELECT day
   FROM   data
   WHERE  day >= '2010-01-01'::date  -- exclude irrelevant rows early
   ORDER  BY 1
   LIMIT  1
   )

   UNION ALL
   SELECT (SELECT day FROM data
           WHERE  day > c.day
           AND    day < '2011-01-01'::date  -- see comments below
           ORDER  BY 1
           LIMIT  1)
   FROM   cte c
   WHERE  day IS NOT NULL  -- necessary because corr. subq. always returns row
   )
SELECT count(*) AS ct
FROM   cte
WHERE  day IS NOT NULL;

                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=53.35..53.36 rows=1 width=0) (actual time=18.217..18.217 rows=1 loops=1)
   CTE cte
     ->  Recursive Union  (cost=0.43..51.08 rows=101 width=4) (actual time=0.194..17.594 rows=253 loops=1)
           ->  Limit  (cost=0.43..0.46 rows=1 width=4) (actual time=0.191..0.192 rows=1 loops=1)
                 ->  Index Only Scan using data_day_idx on data data_1  (cost=0.43..235042.00 rows=8255861 width=4) (actual time=0.189..0.189 rows=1 loops=1)
                       Index Cond: (day >= '2010-01-01'::date)
                       Heap Fetches: 0
           ->  WorkTable Scan on cte c  (cost=0.00..4.86 rows=10 width=4) (actual time=0.066..0.066 rows=1 loops=253)
                 Filter: (day IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.43..0.47 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=252)
                         ->  Index Only Scan using data_day_idx on data  (cost=0.43..1625.59 rows=52458 width=4) (actual time=0.060..0.060 rows=1 loops=252)
                               Index Cond: ((day > c.day) AND (day < '2011-01-01'::date))
                               Heap Fetches: 0
   ->  CTE Scan on cte  (cost=0.00..2.02 rows=100 width=0) (actual time=0.199..18.066 rows=252 loops=1)
         Filter: (day IS NOT NULL)
         Rows Removed by Filter: 1
 Total runtime: 19.355 ms
(19 rows)

And the also discussed EXISTS query

EXPLAIN ANALYZE
SELECT count(*) AS ct
FROM   generate_series('2010-01-01'::date, '2010-12-31'::date, '1d'::interval) d(day)
WHERE  EXISTS (SELECT 1 FROM data WHERE day = d.day::date);

                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=674.32..674.33 rows=1 width=0) (actual time=95.049..95.049 rows=1 loops=1)
   ->  Nested Loop Semi Join  (cost=0.45..673.07 rows=500 width=0) (actual time=12.438..94.749 rows=252 loops=1)
         ->  Function Scan on generate_series d  (cost=0.01..10.01 rows=1000 width=8) (actual time=9.248..9.669 rows=365 loops=1)
         ->  Index Only Scan using data_day_idx on data  (cost=0.44..189.62 rows=6023 width=4) (actual time=0.227..0.227 rows=1 loops=365)
               Index Cond: (day = (d.day)::date)
               Heap Fetches: 0
 Total runtime: 95.620 ms
(7 rows)

Why do you have a `text` primary key? Shouldn't you have an integer that references the text in the other table? — Gordon Linoff, Mar 21 '15 at 01:58
They really should be, but they were implemented as MD5s throughout the processing code and accordingly left as TEXT entries in the db. — Justin, Mar 21 '15 at 02:39
This is a follow up to the previous question: http://stackoverflow.com/questions/29171623/create-a-unique-index-on-a-non-unique-column — Erwin Brandstetter, Mar 21 '15 at 04:16
I think you deleted the results for the `EXISTS` solution by accident. — Erwin Brandstetter, Mar 22 '15 at 03:29
The previous timing for the EXISTS variant was 3 ms. Now it's 95 ms. What changed? — Erwin Brandstetter, Mar 22 '15 at 05:36
My guess is something was cached. Just reran that query several times and saw 87ms, 7ms, and 4ms. The lower bound is about the same as for the recursive query. I usually try to run these type of queries several times before I post the explains, but in general, does it make more sense to post the "uncached" or "cached"? In my case I'm more concerned with the cached, and try to post accordingly. — Justin, Mar 23 '15 at 05:13
Depends on the actual use case, of course. But typically, the result with warm cache is the more interesting one. I would take the best of 5 to compare performance. — Erwin Brandstetter, Mar 23 '15 at 07:30

Erwin Brandstetter · Accepted Answer · 2020-07-19T12:27:22.467

Several notes:

Simple query on table `day`

SELECT COUNT(DISTINCT day) 
FROM   days
WHERE  day BETWEEN '2010-01-01' AND '2011-01-01';

While day is defined as PK, DISTINCT is just expensive noise.

Recursive CTE with correlated suquery

This is the alternative if there is no day table with unique entries. The technique pays if there are multiple to many rows per day, so that the equivalent of a loose index scan is actually faster than a simple DISTINCT on the base table:

WITH RECURSIVE cte AS (
   (  -- parentheses required because of LIMIT
   SELECT day
   FROM   data
   WHERE  day >= '2010-01-01'  -- exclude irrelevant rows early
   ORDER  BY 1
   LIMIT  1
   )
  
   UNION ALL
   SELECT (SELECT day FROM data
           WHERE  day > c.day
           AND    day < '2011-01-01'  -- see below
           ORDER  BY 1
           LIMIT  1)
   FROM   cte c
   WHERE  day IS NOT NULL  -- necessary because corr. subq. always returns row
   )
SELECT count(*) AS ct
FROM   cte
WHERE  day IS NOT NULL;

Index

Only makes sense in combination with a matching index on data:

CREATE INDEX data_day_idx ON data (day);

day must be the leading column. The index you have in the question on (id, day) can be used too, but is far less efficient:

Notes

It is much cheaper to exclude irrelevant rows early. I integrated your predicate into the query.

Detailed explanation:

Optimize GROUP BY query to retrieve latest row per user

The case at hand is even simpler - the simplest possible actually.

Your original time frame was day BETWEEN '2010-01-01' AND '2011-01-01'. But BETWEEN .. AND .. includes upper and lower bound, so you'd get all of 2010 plus 2011-01-01. You probably want to exclude the upper bound. Use d.day < '2011-01-01' (not <=). See:

How to add a day/night indicator to a timestamp column?

`EXISTS` for this special case

Since you are testing for a range of enumerable days (as opposed to a range with an infinite number of possible values), you can test this alternative with an EXISTS semi-join:

SELECT count(*) AS ct
FROM   generate_series(timestamp '2010-01-01'
                     , timestamp '2010-12-31'
                     , interval  '1 day') AS d(day)
WHERE  EXISTS (SELECT FROM data WHERE day = d.day::date);

Why is this form of generate_series() optimal?

Generating time series between two dates in PostgreSQL

The same simple index is essential again.

db<>fiddle here demonstrating both with big test table.
_{Old sqlfiddle}

Tried running the provided recursive cte to no avail (995s). The EXISTS case worked very well as well (3ms), but from a storage perspective, simply maintaining the relational `days` table looks like the winner. Posted the query explains in the question as edits. Thanks! — Justin, Mar 21 '15 at 09:14
Since the `EXISTS` variant performs so well, I would consider to stick with that. Maintaining an additional table with triggers comes at a cost, too. If performance of this query is the paramount requirement, consider a materialized view like EvilPuppetMaster mentions in his answer. Performance of the rCTE was *totally unexpected*. I investigated, found and fixed a performance bug in my solution - thanks to your feedback! Also updated my reference answer and added a fiddle to prove it. Please test the new rCTE solution. Probably not as fast as `EXISTS` but close, and more versatile. — Erwin Brandstetter, Mar 22 '15 at 03:12
Works as expected. Posted the explain in my update above. Thanks! — Justin, Mar 22 '15 at 03:22

score 1 · Answer 2 · answered Mar 21 '15 at 01:43

1

Try creating an index on data(day) and then running the first query:

SELECT COUNT(DISTINCT day) 
FROM data 
WHERE day BETWEEN '2010-01-01' AND '2011-01-01';

You might find the performance sufficient for your purposes.

answered Mar 21 '15 at 01:43

Gordon Linoff

1,242,037
58
646
786

Slightly better performance, but still three orders of magnitude slower than the distinct against days. (Total runtime: 1197.124 ms) – Justin Mar 21 '15 at 01:55

EvilPuppetMaster · Answer 3 · 2015-03-21T11:32:18.900

I'm not really sure why the index on data(day) is slower, that would seem the simplest option. But if that's too slow, you could try creating a materialised view of your days. Basically just:

create materialized view days as
select day 
from data 
group by day;

I don't believe postgres updates materialised views automatically, but at least then all the maintenance you need to do is periodically refresh it. Or perhaps create a trigger on data which refreshes the view. Bear in mind of course that refreshing this view might take some time depending on the size of the data table, you might only want to do it hourly or nightly if you can get away with it.

Alternatively if this table gets a lot of updates and you need the distinct day count to be consistent at all times, you could consider going back to your original separate days table, but reduce the maintenance overhead by creating a trigger on the data table to update it.

Counting distinct rows using recursive cte over non-distinct index

3 Answers3

Simple query on table `day`

Recursive CTE with correlated suquery

Index

Notes

`EXISTS` for this special case

Linked

Counting distinct rows using recursive cte over non-distinct index

3 Answers3

Simple query on table day

Recursive CTE with correlated suquery

Index

Notes

EXISTS for this special case

Linked

Simple query on table `day`

`EXISTS` for this special case