I have the following data (you can reproduce it by copying and pasting):
from pyspark.sql import Row
l = [Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None)]
l_df = spark.createDataFrame(l)
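As a quick sanity check (just restating what the data above gives), the row count of l_df should match len(l):

len(l)        # 100
l_df.count()  # should also be 100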
Let's take a look at the schema of l_df:
l_df.printSchema()
root
|-- value: boolean (nullable = true)
Now I want to use cube() to count the frequency of each distinct value in the value column:
l_df.cube("value").count().show()
But I see two types of null values!
+-----+-----+
|value|count|
+-----+-----+
| true| 67|
| null| 100|
| null| 33|
+-----+-----+
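For comparison, here is the same count done with a plain groupBy instead of cube (I would expect only a single null row from this):

l_df.groupBy("value").count().show()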
To verify that I don't actually have two different kinds of null values, I looked at the distinct values:
l_df.select("value").distinct().collect()
And there is indeed only one type of null:
[Row(value=None), Row(value=True)]
Just to double-check:
l_df.select("value").distinct().count()
And it returns 2.
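To dig a bit further, here are two extra checks I can think of (sketches only; isNull, isNotNull and grouping_id are standard pyspark.sql functions, nothing specific to this data):

from pyspark.sql import functions as F

# Count the genuine nulls and non-nulls directly.
l_df.filter(F.col("value").isNull()).count()
l_df.filter(F.col("value").isNotNull()).count()

# Tag each cube row with grouping_id(), which (as I understand it) should
# distinguish rows where "value" is a real group value from rows where it is
# an aggregation placeholder.
l_df.cube("value").agg(F.grouping_id().alias("gid"), F.count("*").alias("n")).show()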
I also noticed that len(l) is 100 and that the count shown for the first null row is also 100. Why is this happening?
System info: Spark 2.1.0; Python 2.7.8 [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2
