I'm working with Spark 2.2.0.
I have a DataFrame with more than 20 columns. In the example below, PERIOD is a week number and TYPE is a type of store (Hypermarket or Supermarket).
table.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE| etc......
+--------------------+-------------------+-----------------+  
|                  W1|                 HM| 
|                  W2|                 SM|
|                  W3|                 HM|
etc...
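
For reference, a minimal DataFrame with the same kind of columns can be built like this (the values and the extra CITY column are made up purely for illustration; the real table has 20+ columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with the same relevant columns (values are invented)
toy = spark.createDataFrame(
    [("W1", "HM", "STORE A", "CITY A"),
     ("W1", "HM", "STORE B", "CITY B"),
     ("W2", "SM", "STORE C", "CITY C")],
    ["PERIOD", "TYPE", "STORE_DESC", "CITY"])
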
I want to do a simple groupBy (here in PySpark, but Scala and Spark SQL give the same result):
from pyspark.sql.functions import countDistinct

total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...|              BORGO|                1|
|        C ATHIS MONS|   ATHIS MONS CEDEX|                1|
|    CMA BOSC LE HARD|       BOSC LE HARD|                1|
The problem is not with the calculation itself: the columns get mixed up. PERIOD contains store names, TYPE contains cities, and so on.
I have no clue why. Everything else works fine.
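
For comparison, here is the same aggregation on the toy DataFrame above, written with alias() instead of withColumnRenamed. This is only a sketch of what I would expect; I have not confirmed whether it behaves any differently on the real data:

# Same groupBy/countDistinct, naming the aggregate directly with alias()
expected = (toy.groupby("PERIOD", "TYPE")
               .agg(countDistinct("STORE_DESC").alias("NB STORES (TOTAL)")))
expected.show()
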