I'm using Spark 2.0.0 with DataFrames. Here is my input DataFrame:
| id | year        | qty    |
|----|-------------|--------|
| a  | 2012        | 10     |
| b  | 2012        | 12     |
| c  | 2013        | 5      |
| b  | 2014        | 7      |
| c  | 2012        | 3      |
What I want is
| id | year_2012 | year_2013 | year_2014 |
|----|-----------|-----------|-----------|
| a  | 10        | 0         | 0         |
| b  | 12        | 0         | 7         |
| c  | 3         | 5         | 0         |
or
| id | yearly_qty  |
|----|---------------|
| a  | [10, 0, 0]    |
| b  | [12, 0, 7]    |
| c  | [3, 5, 0]     |
The closest solution I found is `collect_list()`, but that function doesn't guarantee the order of the resulting list. In my mind the solution should look like:
```python
data.groupBy('id').agg(collect_function)
```
Is there a way to generate this without filtering each id out in a loop?