(first of all - apologies for the title but I couldn't come up with a better one)
Here is my problem - I have a table with four columns: entity::INT, entry::TEXT, state::INT and day::INT.
There could be anywhere from 50 to 1,000 entities. Each entity can have over 100 million entries. Each entry can have one or more states, which change when the data stored in the entry changes, but only one state can be written for any particular day. The day counter starts at one and is incremented each day.
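For reference, here is a rough sketch of the table as described above (the table name is a placeholder and I've left out keys/constraints):

CREATE TABLE states (
    entity INT  NOT NULL,  -- 50 to 1,000 distinct entities
    entry  TEXT NOT NULL,  -- up to 100+ million entries per entity
    state  INT  NOT NULL,  -- a new state only when the entry's data changes
    day    INT  NOT NULL   -- starts at 1, incremented each day
);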
Example:
entity | entry  | state     | day
-------------------------------------
1      | ABC123 | 1         | 1
1      | ABC124 | 2         | 1
1      | ABC125 | 3         | 1
...
1      | ABC999 | 999       | 1
2      | BCD123 | 1000      | 1
...
1      | ABC123 | 1001      | 2
2      | BCD123 | 1002      | 3
The index is set to (entity, day, state).
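In DDL form that would be something like (the index name is a placeholder):

CREATE INDEX states_entity_day_state_idx ON states (entity, day, state);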
What I want to achieve is to efficiently select the most current state of each entry on day N.
Currently, every week I write all the entries with their latest state to the table in order to minimize the number of days we need to scan. However, given the total number of entries (worst case, 1,000 entities times 100,000,000 entries is a lot of rows to write each week), the table slowly but surely bloats and everything becomes really slow.
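Roughly, that weekly write does something like this (a sketch of the idea, not the actual job; <current_day> stands for the day the snapshot is written for):

INSERT INTO states (entity, entry, state, day)
SELECT DISTINCT ON (entity, entry) entity, entry, state, <current_day>
FROM states
WHERE day < <current_day>
ORDER BY entity, entry, day DESC;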
I need to be able to stop writing this "full" version weekly and instead have a setup that is still fast enough. I considered using DISTINCT ON with a different index on (entity, entry, day DESC, state) so that I could:
SELECT DISTINCT ON (entity, entry) entry, state
FROM table
WHERE entity = <entity> AND day <= <day>
ORDER BY entity, entry, day DESC, state;
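with the supporting index created along these lines (the index name is a placeholder):

CREATE INDEX states_entity_entry_day_idx ON states (entity, entry, day DESC, state);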
Would that be the most efficient way to do it, or are there better ways? Or does the fact that entry can have hundreds of millions of unique values make it a poor choice for the second column in the index, so that performance will eventually come to a halt?