Is there a way in PySpark to merge two tables with respect to a date, where one table contains events linked to a single date, and the other contains additional information that is valid over a period defined by a start and an end date?
There are similar topics for Python, but none for PySpark, such as the NumPy-based approach presented in this answer. My goal is not to retrieve a single value but all the information available in my right table.
In the example below, for each row of df1 I would like to get, based on the id, all the information available in df2 for that id where the event_date falls between start_period and end_period.
import datetime

df1 = spark.createDataFrame([
    (1, 'a', datetime.datetime(2021, 1, 1)),
    (1, 'b', datetime.datetime(2021, 1, 5)),
    (1, 'c', datetime.datetime(2021, 1, 24)),
    (2, 'd', datetime.datetime(2021, 1, 10)),
    (2, 'e', datetime.datetime(2021, 1, 15))], ['id', 'event', 'event_date'])

df2 = spark.createDataFrame([
    (1, 'Xxz45', 'XX013', datetime.datetime(2021, 1, 1), datetime.datetime(2021, 1, 10)),
    (1, 'Xasz', 'XX014', datetime.datetime(2021, 1, 11), datetime.datetime(2021, 1, 22)),
    (1, 'Xbbd', 'XX015', datetime.datetime(2021, 1, 23), datetime.datetime(2021, 1, 26)),
    (1, 'Xaaq', 'XX016', datetime.datetime(2021, 1, 27), datetime.datetime(2021, 1, 31))], ['id', 'info1', 'info2', 'start_period', 'end_period'])
[EDIT] The expected output would be (merging on id and on event_date falling within the period):
df_results = spark.createDataFrame([
    (1, 'a', datetime.datetime(2021, 1, 1), 'Xxz45', 'XX013'),
    (1, 'b', datetime.datetime(2021, 1, 5), 'Xxz45', 'XX013'),
    (1, 'c', datetime.datetime(2021, 1, 24), 'Xbbd', 'XX015'),
    (2, 'd', datetime.datetime(2021, 1, 10), None, None),
    (2, 'e', datetime.datetime(2021, 1, 15), None, None)], ['id', 'event', 'event_date', 'info1', 'info2'])
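For reference, what I have in mind is something like the following left join with a range condition. This is only a rough, untested sketch using the DataFrames defined above; I am assuming a non-equi join condition like this is supported, but I don't know if it is the idiomatic or efficient way to do it in PySpark:

from pyspark.sql import functions as F

# Sketch: left-join df2 onto df1 on matching id where event_date
# falls within [start_period, end_period]; unmatched events (id=2)
# would keep nulls for info1/info2.
df_results = (
    df1.alias('a')
    .join(
        df2.alias('b'),
        on=(F.col('a.id') == F.col('b.id'))
           & (F.col('a.event_date') >= F.col('b.start_period'))
           & (F.col('a.event_date') <= F.col('b.end_period')),
        how='left')
    .select('a.id', 'a.event', 'a.event_date', 'b.info1', 'b.info2'))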