Count distinct based on another column

Question

I have the following table:

CREATE TABLE tbl (
  id          int NOT NULL
, date        date NOT NULL
, cid         int NOT NULL
, birth_place text NOT NULL
, location    text NOT NULL
);
 
INSERT INTO tbl VALUES
  (1 , '2022-01-01', 1, 'France' , 'Germany')
, (2 , '2022-01-30', 1, 'France' , 'France')
, (3 , '2022-01-25', 2, 'Spain'  , 'Spain')
, (4 , '2022-01-12', 3, 'France' , 'France')
, (5 , '2022-02-01', 4, 'England', 'Italy')
, (6 , '2022-02-12', 1, 'France' , 'France')
, (7 , '2022-03-05', 5, 'Spain'  , 'England')
, (8 , '2022-03-08', 2, 'Spain'  , 'Spain')
, (9 , '2022-03-15', 2, 'Spain'  , 'Spain')
, (10, '2022-03-30', 5, 'Spain'  , 'Italy')
, (11, '2022-03-22', 4, 'England', 'England')
, (12, '2022-03-22', 3, 'France' , 'England');

I need to count distinct customers (= cid) per month and location - with a special twist:
If a customer they went back to their birth place (location = birth_place) during any given month, give priority to that location. Else pick one location per month and customer arbitrarily.

My desired output:

date         location   count
2022-01-01   France     2
2022-01-01   Spain      1
2022-02-01   Italy      1
2022-02-01   France     1
2022-03-01   Spain      1
2022-03-01   England    3

cid 1 in 2022-01-01 had a location = birth_place, and no other customer had Germany as location in that time period, hence there is no Germany in my desired output location.

This is my current query:

with
  t as (
    select id, date_trunc('month', date)::date AS date, cid, birth_place, location
    from tbl),
  t1 as (
    select date, cid, location
    from t
    where birth_place = location),
  t2 as (
    select date, cid, location, row_number() over (partition by date, cid order by date) as row
    from t
    where birth_place <> location),
  t3 as (
    select t.*, 
        case
            when t1.location is not null then t1.location
            else t2.location
        end as new_loc
    from t
    left join t1
    on t.cid = t1.cid and t.date = t1.date
    left join t2
    on t.cid = t2.cid and t.date = t2.date and t2.row = 1)
select date, new_loc, count(distinct cid)
from t3
group by 1, 2
order by 1, 2

It works, but it seems inefficient for 100 million rows.
I am looking for a more efficient way.

Why is Italy in the sample results? None of your rows have `birth_place = 'Italy' AND birth_place = location` — Dai, Jun 18 '22 at 02:27
Check my answer it at https://stackoverflow.com/a/75068192/8614314 — Palash Mondal, Jan 10 '23 at 09:53

score 1 · Answer 1 · answered Jun 18 '22 at 02:29

WITH q1 AS (

    SELECT
        EXTRACT( YEAR  FROM t."date" ) AS "Year",
        EXTRACT( MONTH FROM t."date" ) AS "Month",

        t.cid,
        t.birth_place,
        t.location
    FROM
        theTable AS t
    WHERE
        t.location = t.birth_place
)
SELECT
    "Year",
    "Month",
    "location",
    COUNT( DISTINCT cId ) AS "COUNT( DISTINCT cId )",
    COUNT( * ) AS "CountAll"
FROM
    q1
GROUP BY
    "Year",
    "Month",
    "location"
ORDER BY
    "Year",
    "Month",
    "location"

SqlFiddle example.

I guess this works for the condition I mentioned, but it completely ignores all rows like `5 2022-02-01 4 England Italy`, where the condition is not matched. In this case, I'd like to have the count of distinct `cid` based on `location`. — tmul, Jun 18 '22 at 11:12

Erwin Brandstetter · Accepted Answer · 2022-06-19T22:06:20.070

Assuming this objective:

Truncate dates to the month.
Pick one location per (month, cid), the home location getting priority.
Then count rows per (month, location).

SELECT date, location, count(*)
FROM  (
   SELECT DISTINCT ON (1, 2)  --  choose **one** location per (month, cid)
          date_trunc('month', date)::date AS date, cid, location
   FROM   tbl
   ORDER  BY 1, 2, birth_place = location DESC  -- priority to home location, else **arbitrary**
   ) sub
GROUP  BY 1, 2
ORDER  BY 1, 2;  -- optional

db<>fiddle here

Be aware that the arbitrary pick in case of "not been home" can give varying results! You may want to define a stable (opportune) pick instead.

There may be faster query variants, depending on undisclosed details.

About DISTINCT ON and performance:

Select first row in each GROUP BY group?

About the sort order:

SQL select query order by day and month

If there can be NULL values:

Sort by column ASC, but NULL values first?

score 0 · Answer 3 · answered Jun 18 '22 at 05:59

My approach would be using case...when within a count. That way it works with or without a where filter and therefore allows other aggregates of the data in the same query in the future.

SELECT
    date_trunc('month', date)::date AS date, t.location
    , count(distinct (case when t.location=t.birth_place then t.cid else null end)) as "count"
FROM theTable AS t
WHERE t.location=t.birth_place
GROUP BY date_trunc('month', date)::date, t.location

Count distinct based on another column

3 Answers3