I have a MySQL table where there are many rows for each person, and I want to write a query which aggregates rows with special constraint. (one per person)
For example, lets say the table is consist of following data.
name date reason
---------------------------------------
John 2013-04-01 14:00:00 Vacation
John 2013-03-31 18:00:00 Sick
Ted 2012-05-06 20:00:00 Sick
Ted 2012-02-20 01:00:00 Vacation
John 2011-12-21 00:00:00 Sick
Bob 2011-04-02 20:00:00 Sick
I want to see the distribution of 'reason' column. If I just write a query like below
select reason, count(*) as count from table group by reason
then I will be able to see number of reasons for this table overall.
reason count
------------------
Sick 4
Vacation 2
However, I am only interested in single reason from each person. The reason that should be counted should be from a row with latest date from the person's records. For example, John's latest reason would be Vacation while Ted's latest reason would be Sick. And Bob's latest reason (and the only reason) is Sick.
The expected result for that query should be like below. (Sum of count will be 3 because there are only 3 people)
reason count
-----------------
Sick 2
Vacation 1
Is it possible to write a query such that single latest reason will be counted when I want to see distribution(count) of reasons?
Here are some facts about the table.
- The table has tens of millions of rows
- For most of times, each person has one reason.
- Some people have multiple reasons, but 99.99% of people have fewer than 5 reasons.
- There are about 30 different reasons while there are millions of distinct names.
- The table is partitioned based on date range.