I have a list of maps that contains something like this:
fields = [{"trials": 1.0, "name": "Alice", "score": 8.0}, {"trials": 2.0, "name": "Bob", "score": 10.0"}]
The list of maps comes back as a JSON blob from an API call. When I convert it to a DataFrame in PySpark, I get the following:
+-------------------------------------------+---------+
|fields                                     |key      |
+-------------------------------------------+---------+
|[1.0, Alice, 8.0]                          |key1     |
|[2.0, Bob, 10.0]                           |key2     |
|[1.0, Charlie, 8.0]                        |key3     |
|[2.0, Sue, 10.0]                           |key4     |
|[1.0, Clark, 8.0]                          |key5     |
|[3.0, Sarah, 10.0]                         |key6     |
+-------------------------------------------+---------+
I would like to get it into this form:
+------+-------+-----+---------+
|trials|name   |score|key      |
+------+-------+-----+---------+
|1.0   |Alice  |8.0  |key1     |
|2.0   |Bob    |10.0 |key2     |
|1.0   |Charlie|8.0  |key3     |
|2.0   |Sue    |10.0 |key4     |
|1.0   |Clark  |8.0  |key5     |
|3.0   |Sarah  |10.0 |key6     |
+------+-------+-----+---------+
What is the best way of going about this? This is what I have so far:
import json

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# results is the list of records returned by the API call.
# read.json expects an RDD of JSON strings, so serialize each dict first,
# otherwise Spark sees Python repr strings and produces corrupt records.
rdd = sc.parallelize([json.dumps(r) for r in results])
df = sqlContext.read.json(rdd)
df.show()
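
One idea I had (a minimal sketch, assuming the inferred schema makes fields a struct column with trials, name, and score as subfields, which the output above suggests) is to expand the struct with star notation:

# "fields.*" expands each subfield of the struct into its own top-level
# column; "key" is carried along unchanged
flat = df.select("fields.*", "key")
flat.show()

This seems to give the column order I want (trials, name, score, key), but I'm not sure it's the idiomatic approach.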