I have data like this:
+---+------+                                                                    
| id|   col|
+---+------+
|  1|210927|
|  2|210928|
|  3|210929|
|  4|210930|
|  5|211001|
+---+------+
I want the output to look like this:
+---+------+----------+
| id|   col|   t_date1|
+---+------+----------+
|  1|210927|27-09-2021|
|  2|210928|28-09-2021|
|  3|210929|29-09-2021|
|  4|210930|30-09-2021|
|  5|211001|01-10-2021|
+---+------+----------+   
I was able to get this using pandas and strptime. Below is my code:
from datetime import datetime

# Collect the Spark DataFrame to the driver as pandas
pDF = df.toPandas()
valuesList = pDF['col'].to_list()  # 'col' holds yyMMdd strings
modifiedList = []

# Reparse each value as a date and reformat it as dd-MM-YYYY
for i in valuesList:
    modifiedList.append(datetime.strptime(i, "%y%m%d").strftime('%d-%m-%Y'))

pDF['t_date1'] = modifiedList

df = spark.createDataFrame(pDF)
Now, the main problem is that I want to avoid pandas and Python lists, since I will be dealing with millions or even billions of rows, and toPandas() collects everything to the driver, which slows the process right down at that scale.
I tried various methods in Spark such as from_unixtime, to_date, and to_timestamp with the format I need, but no luck, and since strptime only works on strings I can't use it directly on a column. I am not willing to create a UDF since they are slow too. The gist of those attempts is sketched below.
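Roughly, the attempts looked like this (a sketch rather than my exact code; I tried several pattern variants):
from pyspark.sql import functions as F

# Attempt 1: parse the yyMMdd string with to_date, then reformat it
df = df.withColumn(
    "t_date1",
    F.date_format(F.to_date(F.col("col"), "yyMMdd"), "dd-MM-yyyy")
)

# Attempt 2: go through a unix timestamp instead
df = df.withColumn(
    "t_date1",
    F.from_unixtime(F.unix_timestamp(F.col("col"), "yyMMdd"), "dd-MM-yyyy")
)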
The main problem is identifying the exact four-digit year (for example, turning the 21 in 210927 into 2021), which I wasn't able to do in Spark, but I am looking to implement this using Spark only. What needs to be changed? Where am I going wrong?
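To make the year requirement concrete: strptime's %y resolves the two-digit year using Python's documented pivot (00-68 become 2000-2068, 69-99 become 1969-1999), and that is the mapping I want Spark to reproduce:
from datetime import datetime

# %y resolves a two-digit year with a fixed pivot:
# 00-68 -> 2000-2068, 69-99 -> 1969-1999
print(datetime.strptime("210927", "%y%m%d").strftime("%d-%m-%Y"))  # 27-09-2021
print(datetime.strptime("690101", "%y%m%d").strftime("%d-%m-%Y"))  # 01-01-1969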