There is one more member of the "family" which explains them all: GROUPING SETS. There is no dedicated method for it in PySpark/Scala, but it is available in the SQL API.
GROUPING SETS lets you specify exactly the combinations of grouping columns you need. The others (cube, rollup, groupBy) return predefined sets of combinations:
cube("id", "x", "y") will return (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y).
(Every possible combination of the columns.)
rollup("id", "x", "y") will only return (), (id), (id, x), (id, x, y).
(Only the prefixes of the provided column sequence.)
groupBy("id", "x", "y") will only return the (id, x, y) combination.
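The three sets of combinations above can be sketched in plain Python (an illustration of the logic, not Spark code):

```python
from itertools import combinations

cols = ["id", "x", "y"]

# cube: every subset (the power set) of the grouping columns
cube_sets = [c for r in range(len(cols) + 1) for c in combinations(cols, r)]

# rollup: only the prefixes of the column sequence, from () to the full tuple
rollup_sets = [tuple(cols[:r]) for r in range(len(cols) + 1)]

# groupBy: just the full tuple itself
groupby_sets = [tuple(cols)]

print(len(cube_sets), len(rollup_sets), len(groupby_sets))  # 8 4 1
```

So for n columns, cube produces 2^n grouping sets, rollup produces n + 1, and groupBy produces exactly one.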
Examples
Input df:
df = spark.createDataFrame(
    [("a", "foo", 1),
     ("a", "foo", 2),
     ("a", "bar", 2),
     ("a", "bar", 2)],
    ["id", "x", "y"])
df.createOrReplaceTempView("df")
cube
df.cube("id", "x", "y").count()
is the same as...
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
(),
(id),
(x),
(y),
(id, x),
(id, y),
(x, y),
(id, x, y)
)
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null| 2| 3|
|null|null|null| 4|
| a|null| 2| 3|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null| 1| 1|
|null| foo|null| 2|
| a|null|null| 4|
|null|null| 1| 1|
|null| foo| 2| 1|
|null| foo| 1| 1|
| a| foo| 2| 1|
|null| bar|null| 2|
|null| bar| 2| 2|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
rollup
df.rollup("id", "x", "y").count()
is the same as... GROUPING SETS ((), (id), (id, x), (id, x, y))
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
(),
(id),
--(x), <- (not used)
--(y), <- (not used)
(id, x),
--(id, y), <- (not used)
--(x, y), <- (not used)
(id, x, y)
)
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null|null| 4|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null|null| 4|
| a| foo| 2| 1|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
groupBy
df.groupBy("id", "x", "y").count()
is the same as... GROUPING SETS ((id, x, y))
spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY
GROUPING SETS (
--(), <- (not used)
--(id), <- (not used)
--(x), <- (not used)
--(y), <- (not used)
--(id, x), <- (not used)
--(id, y), <- (not used)
--(x, y), <- (not used)
(id, x, y)
)
""")
+---+---+---+-----+
| id| x| y|count|
+---+---+---+-----+
| a|foo| 2| 1|
| a|foo| 1| 1|
| a|bar| 2| 2|
+---+---+---+-----+
Note. All of the above return only combinations that actually exist in the data. In the example dataframe there is no row with id="a", x="bar", y=1, so even cube does not return it. To get every possible combination (existent or not), cross-join the distinct values of each column and then join the result with the cube counts:
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols := ["id", "x", "y"]):
    df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())
df_final = (df_cartesian.drop("_tmp")
            .join(df.cube(*cols).count(), cols, 'full'))
df_final.show()
# +----+----+----+-----+
# | id| x| y|count|
# +----+----+----+-----+
# |null|null|null| 4|
# |null|null| 1| 1|
# |null|null| 2| 3|
# |null| bar|null| 2|
# |null| bar| 2| 2|
# |null| foo|null| 2|
# |null| foo| 1| 1|
# |null| foo| 2| 1|
# | a|null|null| 4|
# | a|null| 1| 1|
# | a|null| 2| 3|
# | a| bar|null| 2|
# | a| bar| 1| null|
# | a| bar| 2| 2|
# | a| foo|null| 2|
# | a| foo| 1| 1|
# | a| foo| 2| 1|
# +----+----+----+-----+