Below is a sample dataset representing employees' in_date and out_date. I need to obtain the latest in_date for each employee.
Spark is running on a 4-node standalone cluster.
Initial Dataset:
EmployeeID-----in_date-----out_date
1111111        2017-04-20  2017-09-14
1111111        2017-11-02  null
2222222        2017-09-26  2017-09-26
2222222        2017-11-28  null
3333333        2016-01-07  2016-01-20
3333333        2017-10-25  null
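For reference, the sample above can be reproduced with something like the following Scala sketch. In the real job the DataFrame comes from Cassandra, so the literal rows and the session name spark are just stand-ins, and dates are kept as strings for brevity:

    // Assumes an active SparkSession in scope as `spark`.
    import spark.implicits._

    // Illustrative stand-in for the Cassandra-backed DataFrame;
    // column names match the tables above.
    val df = Seq(
      ("1111111", "2017-04-20", "2017-09-14"),
      ("1111111", "2017-11-02", null),
      ("2222222", "2017-09-26", "2017-09-26"),
      ("2222222", "2017-11-28", null),
      ("3333333", "2016-01-07", "2016-01-20"),
      ("3333333", "2017-10-25", null)
    ).toDF("EmployeeID", "in_date", "out_date")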
Dataset after df.sort(col("in_date").desc()):
EmployeeID-----in_date-----out_date
1111111        2017-11-02  null
1111111        2017-04-20  2017-09-14
2222222        2017-09-26  2017-09-26
2222222        2017-11-28  null
3333333        2017-10-25  null
3333333        2016-01-07  2016-01-20
Output after df.dropDuplicates("EmployeeID"):
EmployeeID-----in_date-----out_date
1111111        2017-11-02  null
2222222        2017-09-26  2017-09-26
3333333        2016-01-07  2016-01-20
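So the full chain producing the output above is essentially the following (the variable name result is mine, for illustration):

    import org.apache.spark.sql.functions.col

    val result = df
      .sort(col("in_date").desc)      // global sort, newest in_date first
      .dropDuplicates("EmployeeID")   // keep a single row per employee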
Expected Dataset:
EmployeeID-----in_date-----out_date
1111111        2017-11-02  null
2222222        2017-11-28  null
3333333        2017-10-25  null
However, when I sorted the initial Dataset with sortWithinPartitions and then deduplicated, I got the expected dataset.
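That is, the following variant gave the expected rows:

    import org.apache.spark.sql.functions.col

    // Sorting within each partition (instead of a global sort) before deduplicating.
    val result = df
      .sortWithinPartitions(col("in_date").desc)
      .dropDuplicates("EmployeeID")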
Am I missing anything big or small here? Any help is appreciated.
Additional Information:
The expected output above was achieved when df.sort was executed with Spark in local mode.
I have not done any explicit partitioning or repartitioning.
The initial dataset is obtained from an underlying Cassandra database.
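The load goes through the Spark Cassandra connector, roughly as below (the keyspace and table names are placeholders, not the real ones):

    // Read the source table through the DataStax Spark Cassandra connector.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "employee_dates"))
      .load()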