I need to understand how I can remove duplicate rows from a DataFrame on the basis of a single column in Spark SQL using Java.
In plain SQL I would use ROW_NUMBER() OVER (PARTITION BY col ORDER BY col DESC). How can I translate this step into Spark SQL in Java?
You can remove duplicates from a DataFrame using dataFrame.dropDuplicates("col1"). It removes all but one of the rows that share the same value in col1; note that which surviving row is kept is arbitrary. This API is available from Spark 2.x onwards.
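A minimal sketch of that call, assuming spark-sql is on the classpath and using a local session; the column names `id` and `value` and the sample rows are made up for illustration:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DropDuplicatesExample {
    public static void main(String[] args) {
        // Local session for illustration only; configure master/appName
        // differently for a real cluster job.
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("DropDuplicatesExample")
                .getOrCreate();

        // Hypothetical schema and data.
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("value", DataTypes.StringType);

        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, "a"),
                RowFactory.create(1, "b"),   // duplicate id
                RowFactory.create(2, "c")), schema);

        // Keep one (arbitrary) row per distinct value of "id".
        Dataset<Row> deduped = df.dropDuplicates("id");
        deduped.show();   // two rows remain

        spark.stop();
    }
}
```

Because the surviving row is arbitrary, this alone does not answer the "keep the latest record" part of the question; for that, see the window-function approach below.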
You are on the right track. Use a window function, then filter the DataFrame down to the rows where row_number = 1 to keep the latest record in each partition (the ORDER BY column determines which row gets row_number 1).
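A sketch of that approach in Java, assuming spark-sql on the classpath; the column names `id` and `ts` and the helper name `latestPerKey` are illustrative, not from the original question:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

public class LatestPerKey {

    // Equivalent of ROW_NUMBER() OVER (PARTITION BY keyCol ORDER BY orderCol DESC),
    // keeping only the row numbered 1 in each partition.
    public static Dataset<Row> latestPerKey(Dataset<Row> df, String keyCol, String orderCol) {
        WindowSpec w = Window.partitionBy(keyCol).orderBy(col(orderCol).desc());
        return df.withColumn("rn", row_number().over(w))
                 .filter(col("rn").equalTo(1))
                 .drop("rn");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("LatestPerKey")
                .getOrCreate();

        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("ts", DataTypes.IntegerType);

        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, 10),
                RowFactory.create(1, 20),   // latest row for id = 1
                RowFactory.create(2, 5)), schema);

        // One row per id: the one with the highest ts.
        latestPerKey(df, "id", "ts").show();

        spark.stop();
    }
}
```

If you prefer the SQL string form, you can register the DataFrame as a temp view and run the same ROW_NUMBER() query through spark.sql(...).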
See this post on Spark window functions for DataFrames:
http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html