I want to go through a table row by row, using the value calculated in the previous row when computing the current row. It seems a window function should be able to do this.
```python
from pyspark.sql import SparkSession
from pyspark.sql import Window
import pyspark.sql.functions as F

# Create a sample DataFrame
data = [
    (1, 1, 3),
    (2, 2, 4),
    (3, 3, 5),
    (4, 4, 6),
    (5, 8, 10),
]
columns = ["id", "start", "stop"]

spark = SparkSession.builder.master("local").appName("SelfJoin").getOrCreate()
df = spark.createDataFrame(data, columns)

# 1. Sort the DataFrame by 'start' and seed prev_start with the current start
df = df.sort("start").withColumn("prev_start", F.col("start"))

# 2. Window ordered by 'start'; lag(..., 1) looks back exactly one row
window = Window.orderBy("start")

df = df.withColumn("prev_start", F.lag("prev_start", 1).over(window))
df.toPandas().style.hide_index()
```
At row 1:
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
At row 2:
Extra: I also need to specify that if the value in the previous row is missing, then the value from the current row should be used instead (1.0 here); I sketch a `coalesce` for this after the table below.
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
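For that extra requirement, I am reading it as: when there is no previous row (row 1), fall back to the current row's `start`. I assume something like `F.coalesce` would express that (a sketch only):

```python
# Sketch (my assumption of the rule): fill a missing lagged value with the
# current row's start, so row 1 gets prev_start = 1.0 instead of NaN.
df = df.withColumn(
    "prev_start",
    F.coalesce(F.lag("start", 1).over(window), F.col("start")),
)
```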
At row 3:
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
| 3 | 3 | 5 | 2.0 | 
Looking back, prev_start should be assigned the previous row's prev_start value, which is now 1:
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
| 3 | 3 | 5 | 1.0 | 
At row 4:
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
| 3 | 3 | 5 | 1.0 | 
| 4 | 4 | 6 | 3.0 | 
Assign prev_start the value the previous row's prev_start now holds (1), not the value it had originally (2):
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
| 3 | 3 | 5 | 1.0 | 
| 4 | 4 | 6 | 1.0 | 
So now, the actual application: I also do the same thing with `prev_stop = max(prev_stop, stop)`. As soon as this `prev_stop` is less than the current row's `start` (and the data is ordered correctly), I have hit a non-contiguous time frame: the contiguous block runs from `prev_start` to `prev_stop`.
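Written as a plain Python loop over the sorted rows (just to make the intended row-by-row logic concrete, not how I want to implement it), what I am after is roughly:

```python
# Plain-Python sketch of the intended logic (not a Spark solution):
# carry prev_start / prev_stop forward until a gap, then emit the block.
rows = sorted(data, key=lambda r: r[1])  # order by start
blocks = []
prev_start = prev_stop = None
for _id, start, stop in rows:
    if prev_stop is not None and prev_stop < start:
        # non-contiguous: the previous block ran from prev_start to prev_stop
        blocks.append((prev_start, prev_stop))
        prev_start, prev_stop = start, stop
    else:
        prev_start = start if prev_start is None else prev_start
        prev_stop = stop if prev_stop is None else max(prev_stop, stop)
blocks.append((prev_start, prev_stop))
print(blocks)  # [(1, 6), (8, 10)] for the sample data
```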
Result (what the lag approach actually gives):
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
| 3 | 3 | 5 | 2.0 | 
| 4 | 4 | 6 | 3.0 | 
| 5 | 8 | 10 | 4.0 | 
But the result I want is:
| id | start | stop | prev_start | 
|---|---|---|---|
| 1 | 1 | 3 | NaN | 
| 2 | 2 | 4 | 1.0 | 
| 3 | 3 | 5 | 1.0 | 
| 4 | 4 | 6 | 1.0 | 
| 5 | 8 | 10 | 1.0 | 
Because in this case the values are calculated row by row: once prev_start has been set to 1, every subsequent row's prev_start would also become 1.
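For the simple `prev_start` case above, I think this row-by-row propagation is equivalent to taking a running minimum of `start` over all earlier rows, so something like this sketch should reproduce the table I want:

```python
# Sketch (my assumption): a running min over all preceding rows gives the
# desired prev_start column (NaN, 1.0, 1.0, 1.0, 1.0) without any iteration.
running = Window.orderBy("start").rowsBetween(Window.unboundedPreceding, -1)
df = df.withColumn("prev_start", F.min("start").over(running))
```

But I do not see how to extend that to the `prev_stop` / gap logic, since there each row depends on the value just calculated for the previous row, which is exactly what I cannot express with a window function.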