Using Postgres 9.5. Test data:
create temp table rental (
    customer_id smallint
    ,rental_date timestamp without time zone
    ,customer_name text
);
insert into rental values
    (1, '2006-05-01', 'james'),
    (1, '2006-06-01', 'james'),
    (1, '2006-07-01', 'james'),
    (1, '2006-07-02', 'james'),
    (2, '2006-05-02', 'jacinta'),
    (2, '2006-05-03', 'jacinta'),
    (3, '2006-05-04', 'juliet'),
    (3, '2006-07-01', 'juliet'),
    (4, '2006-05-03', 'julia'),
    (4, '2006-06-01', 'julia'),
    (5, '2006-05-05', 'john'),
    (5, '2006-06-01', 'john'),
    (5, '2006-07-01', 'john'),
    (6, '2006-07-01', 'jacob'),
    (7, '2006-07-02', 'jasmine'),
    (7, '2006-07-04', 'jasmine');
I am trying to understand the behaviour of existing customers. I am trying to answer this question:
What is the likelihood of a customer to order again based on when their last order was (current month, previous month (m-1)...to m-12)?
Likelihood is calculated as:
distinct count of people who ordered in current month /
distinct count of people in their cohort.
Thus, I need to generate a table that lists a count of the people who ordered in the current month, who belong in a given cohort.
Thus, what are the rules for being in a cohort?
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of cohort rules and aggregations, based on July being the
"month's orders being analysed" (please notice: the "month's orders being analysed" column is the first column in the 'Desired output' table below):
customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james       | 1  1  | 1     | 1     | <- member of jul cohort, made order in jul
jasmine     | 1  1  |       |       | <- member of jul cohort, made order in jul
jacob       | 1     |       |       | <- member of jul cohort, did NOT make order in jul
john        | 1     | 1     | 1     | <- member of jun cohort, made order in jul
julia       |       | 1     | 1     | <- member of jun cohort, did NOT make order in jul
juliet      | 1     |       | 1     | <- member of may cohort, made order in jul
jacinta     |       |       | 1 1   | <- member of may cohort, did NOT make order in jul
This data would output the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16         |5               |1                     |                  |                        |                  |                        |
jun-16         |                |                      |5                 |3                       |                  |                        |
jul-16         |3               |2                     |2                 |1                       |2                 |1                       |
My attempts so far have been on variations of:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
I haven't been able to get everything to come together yet (I've tried for many hours and haven't yet solved it).
For readability, I think posting my work in parts is better (if anyone wants me to post the sql query in its entirety please comment - and I'll add it).
series query:
(select
    generate_series(date_trunc(‘month’,min(rental_date)),date_trunc(‘month’,max(rental_date)),’1 month)) as month_being_analysed
from
    rental) as series
rank query:
(select
    *,
    row_number() over (partition by customer_id order by rental_id desc) as rnk
from
    rental
where
    date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
I want to do something like: run the orders_ranked query for every row returned by the series query, and then base aggregations on each return of orders_ranked.
Something like:
(--this query counts the customers in cohort m-1
select
    count(distinct customer_id)
from
    (--this query ranks the orders that have occured <= to the date in the row of the 'series' table
    select
        *,
        row_number() over (partition by customer_id order by rental_id desc) as rnk
    from
        rental
    where
        date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
    (rnk=1 between series.month_being_analysed - interval ‘2 months’ and series.month_being_analysed - interval ‘1 months’)
    OR
    (rnk=2 between series.month_being_analysed - interval ‘2 months’ and series.month_being_analysed - interval ‘1 months’)
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
    count(distinct customer_id)
from
    (--this query returns the orders by customers in cohort m-1
    select
        count(distinct customer_id)
    from
        (--this query ranks the orders that have occured <= to the date in the row of the 'series' table
        select
            *,
            row_number() over (partition by customer_id order by rental_id desc) as rnk
        from
            rental
        where
            date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
    where
        (rnk=1 between series.month_being_analysed - interval ‘2 months’ and series.month_being_analysed - interval ‘1 months’)
        OR
        (rnk=2 between series.month_being_analysed - interval ‘2 months’ and series.month_being_analysed - interval ‘1 months’)
where
    rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
    (select
        generate_series(date_trunc(‘month’,min(rental_date)),date_trunc(‘month’,max(rental_date)),’1 month)) as month_being_analysed
    from
        rental) as series