We are using Apache Druid to do time series based analytics on a dataset and are stuck trying to calculate mode for one of the analytics.
Statistical mode tells us about the data point that is most frequently repeated in the dataset.
Example: Given a dataset like below, try to Calculate the most representative sale bucket for each product
Dataset:
Product ID  Month   Sale Bucket
Pid1    January 10-20
Pid1    February    0-10
Pid1    March   10-20
Pid2    January 0-10
Pid2    February    10-20
Pid2    March   0-10
Expected output:
Product ID  Sale Bucket
Pid1    10-20
Pid2    0-10
Tried a simple google search to find if there was something in-built or an extension to do the same. Couldn't find one.
Then I started trying my own naive approach to perform the mode over this dataset using below steps:
Step 1. Getting the relevant frequencies of the data point (Group By over Product ID and Sale bucket) Resulting in
Product ID  Sale Bucket Frequency
Pid1    10-20   2
Pid1    0-10    1
Pid2    10-20   1
Pid2    0-10    2
Step 2. Getting the relevant data point with maximum Frequency per Product (out of the Frequencies calculated in 1). I am stuck at this point 2, having tried 2 approaches:
- Concatenate the column to be chosen as a string and then remove it later (Explain here on stackoverflow: SQL Query to get column values that correspond with MAX value of another column?. But druid doesn't seem to support a MAX/MIN operation on Strings. :-/
- Trying a rank based approach to rank the items by frequency and then pick the highest ranks in a further select query. Druid doesn't support Rank operator. :-/
Does anyone have any ideas on how to proceed further?
 
    