I have a PySpark DataFrame that contains a date column, "Reported Date" (type: string). I would like to sum another column, "Offence Count", grouped by the year extracted from that date.
The aggregation works if I group by the string date column directly:
crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()
and I get this output:
+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
|   13/08/2010|               342|
|    6/10/2011|               334|
|   27/11/2011|               269|
|   12/01/2012|               303|
|   22/02/2012|               286|
|   31/07/2012|               276|
|   25/04/2013|               222|
+-------------+------------------+
To extract the year from "Reported Date", I converted it to a date type (using this approach) and named the new column "Date". However, when I run the same aggregation grouped by the year of the new column, I get an error:
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None
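The traceback mentions Python's strptime, so I suspect the conversion step runs through a Python UDF. A minimal sketch of that kind of conversion (hypothetical, paraphrasing the linked approach; the name to_date_udf and the %d/%m/%Y format are my assumptions based on the sample dates), which raises exactly this TypeError whenever "Reported Date" is null:

from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

# strptime(None, ...) raises "TypeError: strptime() argument 1 must be str,
# not None"; because Spark evaluates lazily, the failure only surfaces when
# show() forces the computation on a row with a null "Reported Date".
to_date_udf = udf(lambda s: datetime.strptime(s, "%d/%m/%Y").date(), DateType())
crimeFile_date = crimeFile_date.withColumn("Date", to_date_udf("Reported Date"))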
This is the data schema:
root
 |-- Offence Count: integer (nullable = true)
 |-- Reported Date: string (nullable = true)
 |-- Date: date (nullable = true)
Is there a way to fix this error, or to extract the year using another method? For example, would something like Spark's built-in to_date work here (a sketch; the d/M/yyyy format is my guess from the sample output), since it returns null for unparsable input instead of raising?
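from pyspark.sql.functions import to_date, year, col

# to_date(..., "d/M/yyyy") yields null (rather than an exception) for rows
# that fail to parse, so year("Date") can then be used safely in groupBy.
crimeFile_date = crimeFile_date.withColumn("Date", to_date(col("Reported Date"), "d/M/yyyy"))
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()

Thank you.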