I have a database with a few tables, each has a few millions rows (tables do have indexes). I need to count rows in a table, but only those whose foreign key field points to a subset from another table.
Here is the query:
WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  ( top.production_year >= 1982 
                  AND top.production_year <= 1984 
                  AND top.kind_id IN( 1, 2 ) 
                   OR EXISTS(SELECT 1 
                             FROM   title sub 
                             WHERE  sub.episode_of_id = top.id 
                                    AND sub.production_year >= 1982 
                                    AND sub.production_year <= 1984 
                                    AND sub.kind_id IN( 1, 2 )) )) 
SELECT Count(*) 
FROM   cast_info 
WHERE  EXISTS(SELECT 1 
              FROM   filtered_title 
              WHERE  cast_info.movie_id = filtered_title.id) 
       AND cast_info.role_id IN( 3, 8 ) 
I use CTE because there are more COUNT queries down there for other tables, which use the same subqueries. But I have tried to get rid of CTE and the results were the same: the first time I execute the query it runs... runs... and runs for more than ten minutes. The second time I execute the query, it's down to 4 seconds, which is acceptable for me.
The result of EXPLAIN ANALYZE:
Aggregate  (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
  CTE filtered_title
    ->  Seq Scan on title top  (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
          Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
          Rows Removed by Filter: 2832906
          SubPlan 1
            ->  Index Scan using title_idx_epof on title sub  (cost=0.43..16.16 rows=1 width=0) (never executed)
                  Index Cond: (episode_of_id = top.id)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
          SubPlan 2
            ->  Seq Scan on title sub_1  (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
                  Rows Removed by Filter: 2832906
  ->  Nested Loop  (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
        ->  HashAggregate  (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
              ->  CTE Scan on filtered_title  (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
        ->  Index Scan using cast_info_idx_mid on cast_info  (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
              Index Cond: (movie_id = filtered_title.id)
              Filter: (role_id = ANY ('{3,8}'::integer[]))
              Rows Removed by Filter: 15
Total runtime: 127729.100 ms
Now to my question. What am I doing wrong and how can I fix it?
I tried a few variants of the same query: exclusive joins, joins/exists. On one hand this one seems to require the least time to do the job (10x faster), but it's still 60 seconds on average. On the other hand, unlike my first query which needs 4-6 seconds on the second run, it always requires 60 seconds.
WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  top.production_year >= 1982 
                AND top.production_year <= 1984 
                AND top.kind_id IN( 1, 2 ) 
                 OR EXISTS(SELECT 1 
                           FROM   title sub 
                           WHERE  sub.episode_of_id = top.id 
                                  AND sub.production_year >= 1982 
                                  AND sub.production_year <= 1984 
                                  AND sub.kind_id IN( 1, 2 ))) 
SELECT Count(*) 
FROM   cast_info 
       join filtered_title 
         ON cast_info.movie_id = filtered_title.id 
WHERE  cast_info.role_id IN( 3, 8 ) 
 
     
    