I have a table with two columns: json, a string containing a JSON array, and id, an Int. If I select just the json column, convert the resulting DataFrame to a Dataset<String>, and then parse it with DataFrameReader.json(), each JSON array is exploded into multiple rows of the resulting DataFrame. At that point I can see no way to correlate id back to the correct exploded rows. In Java this looks something like:
Dataset<Row> df = reader.table("myTable"); // DataFrame containing json and id
df = df.selectExpr("json"); // DataFrame containing just json
// Dataset<String> containing just json (Encoders is org.apache.spark.sql.Encoders)
Dataset<String> ds = df.as(Encoders.STRING());
df = reader.schema(jsonDdl).json(ds); // exploded DataFrame, one row per array element
...now what? I have no way to correlate id back to the rows coming from the JSON parse operation. How can I do this differently such that the relationship between id and the parsed JSON columns is preserved?
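One direction that might preserve the relationship is to parse the column in place rather than handing it to the reader: functions.from_json turns the string into an array of structs on the same row as id, and functions.explode then produces one row per element with id carried along. The following is only a sketch under assumptions: the class name, the method name explodeKeepingId, and the sample columns name and qty are all hypothetical, and it assumes jsonDdl describes a single array element, so StructType.fromDDL(jsonDdl) could be passed as elementSchema.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class KeepIdWithParsedJson {

    // Hypothetical helper: expects columns "id" and "json" in df.
    // elementSchema describes ONE element of the JSON array (assumption).
    static Dataset<Row> explodeKeepingId(Dataset<Row> df, StructType elementSchema) {
        // The json column holds an array, so wrap the element schema in an ArrayType.
        ArrayType arraySchema = DataTypes.createArrayType(elementSchema);
        return df
                // Parse the string in place, then emit one row per array element;
                // id stays on every emitted row.
                .withColumn("element", explode(from_json(col("json"), arraySchema)))
                // Flatten the struct fields next to id.
                .select("id", "element.*");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("keep-id-with-parsed-json")
                .master("local[*]")
                .getOrCreate();

        // Made-up sample rows standing in for myTable.
        Dataset<Row> df = spark.sql(
                "SELECT 1 AS id, '[{\"name\":\"a\",\"qty\":1},{\"name\":\"b\",\"qty\":2}]' AS json "
              + "UNION ALL "
              + "SELECT 2 AS id, '[{\"name\":\"c\",\"qty\":3}]' AS json");

        // Made-up element schema standing in for jsonDdl.
        StructType elementSchema = StructType.fromDDL("name STRING, qty INT");

        explodeKeepingId(df, elementSchema).show();
        spark.stop();
    }
}
```

Note that explode drops rows whose array is null or empty; if those ids need to survive, functions.explode_outer could be swapped in, at the cost of null-filled element columns.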