I would like to aggregate some columns into an array or a JSON object in Redash. The data table lives in a Presto database, and I need to query it from PySpark Hive. The table is large, and I need to keep its size as small as possible so that I can save the dataframe to S3 quickly and then read it back from S3 as parquet efficiently.
I am not sure what the best data structure for this would be (a JSON object? an array of arrays?).
The original table (more than 10^9 rows; some columns, e.g. obj_desc, may contain more than 30 English words):
id   cat_name   cat_desc       obj_name   obj_desc      obj_num
1    furniture  living office  desk       4 corners     1.5
1    furniture  living office  chair      4 legs        0.8
1    furniture  restroom       tub        white wide    2.7
1    cloth      fashion        T-shirt    black large   1.1
I want something like this (it may not be the best data structure):

id   cat_item_aggregation
1    [['furniture',
       ['living office', ['desk', '4 corners', '1.5'], ['chair', '4 legs', '0.8']],
       ['restroom', ['tub', 'white wide', '2.7']]],
      ['cloth',
       ['fashion', ['T-shirt', 'black large', '1.1']]]]
I have tried array_agg from PostgreSQL (as in "Efficiently aggregate array columns as part of a group by" and "Postgres - aggregate two columns into one item"), and also json_build_object (as in "Return as array of JSON objects in SQL (Postgres)" and "How to group multiple columns into a single array or similar?"), but neither works in Redash.
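
Since Redash is talking to Presto here, maybe Presto's own aggregate functions (array_agg, map_agg, CAST(... AS JSON)) would work instead of the Postgres ones. A rough, untested sketch of what I have in mind (my_table is a placeholder for the real table name):

```sql
-- Untested Presto SQL sketch; my_table is a placeholder.
-- Level 1: collect [obj_name, obj_desc, obj_num] per (id, cat_name, cat_desc).
WITH objs AS (
    SELECT id, cat_name, cat_desc,
           array_agg(ARRAY[obj_name, obj_desc, CAST(obj_num AS varchar)]) AS obj_list
    FROM my_table
    GROUP BY id, cat_name, cat_desc
),
-- Level 2: map each cat_desc to its list of objects.
cats AS (
    SELECT id, cat_name,
           map_agg(cat_desc, obj_list) AS desc_map
    FROM objs
    GROUP BY id, cat_name
)
-- Level 3: map each cat_name to its cat_desc map, serialized as a JSON string.
SELECT id,
       json_format(CAST(map_agg(cat_name, desc_map) AS JSON)) AS cat_item_aggregation
FROM cats
GROUP BY id
```

If this is roughly right, json_format should return a single varchar column per id, which seems like the easiest thing to write out to parquet.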
Could anybody let me know what the best data structure for this kind of table would be, and how to build it?
JSON may be better than an array of arrays, since it is hard to decompose individual elements from nested arrays?
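
For example (a sketch under assumptions: the aggregated column is stored as a JSON string, and the parquet is registered in Spark as agg_table, a hypothetical name), Spark SQL can pull elements back out by JSON path:

```sql
-- Spark SQL sketch; agg_table and the column layout are assumptions.
SELECT id,
       get_json_object(cat_item_aggregation, '$.furniture.restroom') AS restroom_objs
FROM agg_table
-- With a plain array of arrays there is no path syntax like this;
-- I would have to index positions, e.g. cat_item_aggregation[0][1],
-- and know the exact layout in advance.
```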
Thanks!
