I have an events table that has an user_id and created_at columns.
What I'd like to figure out is the percentage of users having day-over-day repeat entries in the events table.
So, if day d1 has user_ids [a,b,c], and day d2 has user_ids [b,d,e], then b is the only repeat user_id of the three (from d1), and there is 33% overlap between d1 and d2.
I'd like to be able to extend this for an arbitrary number of days.
The schema in question:
CREATE TABLE events (
  events_id serial PRIMARY KEY
, user_id VARCHAR(255) NOT NULL
, created_at datetime NOT NULL
);
This is a large table, having ~25MM rows per day, with ~4.5MM distinct user_ids per day.
Example data set:
+---------+---------------------+ | user_id | created_at | +---------+---------------------+ | bob | 2014-12-02 11:11:11 | | sally | 2014-12-02 12:12:11 | | zed | 2014-12-02 12:22:11 | | | ... | | chris | 2014-12-03 11:13:11 | | mark | 2014-12-03 11:11:13 | | zed | 2014-12-03 11:11:33 | | | ... | | sydney | 2014-12-04 11:14:11 | | zed | 2014-12-04 11:44:11 | | chris | 2014-12-04 11:44:11 | | | ... | | sydney | 2014-12-05 11:15:11 | | zed | 2014-12-05 11:55:11 | | chris | 2014-12-05 11:55:15 | | sandy | 2014-12-05 11:55:51 | | sydney | 2014-12-05 11:55:55 | +---------+---------------------+
Expected output:
+------------+---------------------------+ | day | returning_user_percentage | +------------+---------------------------+ | 2014-12-02 | NULL | | 2014-12-03 | 33 | | 2014-12-04 | 66 | | 2014-12-05 | 75 | +------------+---------------------------+
Additionally, and quite probably far more simple, part 2:  I'd like to know how many new users there are each day, where "new" means the user_id has not previously been seen.