In PySpark, suppose we have three columns: Start_date, duration, End_date.
How can I compare the first row's End_date with the second row's Start_date? If the second row's Start_date is greater than the first row's End_date, do nothing. Otherwise, replace the second row's Start_date with the first row's End_date, add the second row's duration to that new Start_date, and replace the second row's End_date with the resulting value. This should be done row by row for one complete group of ID.
        Milad Bahmanabadi
        
        pallav kumar
        
- It would help others answer your question if you could provide a reproducible example for your dataframe and required output. – murtihash Apr 25 '20 at 15:48
- @MohammadMurtazaHashmi - True, but since I am new to Stack Overflow, attaching an image is not allowed for me as of now. I tried attaching an image just now; see if you can see it in my post. – pallav kumar Apr 25 '20 at 16:04
- [Please do not post images of code/data as they can't be copied](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). It would help if you create a reproducible example. Take a look at [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – anky Apr 25 '20 at 16:35
1 Answer
Use the window functions `lag`/`lead` with `partitionBy` on id and `orderBy` on start_date to compare the first row's end_date with the second row's start_date.
- Use a `when`/`otherwise` statement with the `datediff` function to calculate the difference between dates for the duration column.
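
A single `lag` only sees the previous row's original end_date, so an update made to one row does not cascade to the rows after it. As a minimal sketch of the sequential update the question describes, here is the per-group logic in plain Python. It assumes the rows of one ID are already sorted by start date, that duration is a whole number of days, and that a row is shifted when its start date is not later than the previous row's (possibly updated) end date; flip the comparison if the opposite rule is intended. The function name `shift_overlapping_rows` and the sample data are hypothetical.

```python
from datetime import date, timedelta


def shift_overlapping_rows(rows):
    """Sequentially adjust (start_date, duration_days, end_date) rows of one ID.

    Assumption: rows are sorted by start_date and duration is in days.
    """
    result = []
    prev_end = None  # end_date of the previous row, after any update
    for start, duration, end in rows:
        # Assumed rule: start later than previous end -> leave the row alone;
        # otherwise move the start to the previous end and recompute the end.
        if prev_end is not None and start <= prev_end:
            start = prev_end
            end = start + timedelta(days=duration)
        result.append((start, duration, end))
        prev_end = end  # carry the (possibly updated) end to the next row
    return result


# Hypothetical rows for a single ID
rows = [
    (date(2020, 1, 1), 5, date(2020, 1, 6)),
    (date(2020, 1, 4), 3, date(2020, 1, 7)),    # overlaps row 1, gets shifted
    (date(2020, 1, 8), 2, date(2020, 1, 10)),   # overlaps only the updated row 2
    (date(2020, 1, 20), 1, date(2020, 1, 21)),  # no overlap, left unchanged
]
adjusted = shift_overlapping_rows(rows)
for row in adjusted:
    print(row)
```

The third row shows why the update has to be sequential: it does not overlap the second row's original end date, only the updated one. In PySpark, one option (Spark 3.0+) would be to run equivalent logic per group with `groupBy("ID").applyInPandas(...)`, which hands each ID's rows to a Python function as a pandas DataFrame.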
 
    
    
        notNull
        
- Can I look at both rows at once? Using the lag function I know I can define a new column, but here I want to update the end date sequentially, so can I do this operation in one statement? Can I write something like: when lag(End_date, 1) over the window < Start_date, set Start_date = lag(End_date, 1) over the window and End_date = the updated Start_date + duration; else do nothing. – pallav kumar Apr 25 '20 at 16:24