Spark Dataframe filldown

Question

I would like to do a "filldown" type operation on a dataframe in order to remove nulls and make sure the last row is a kind of summary row, containing the last known values for each column based on the timestamp, grouped by the itemId. As I'm using Azure Synapse Notebooks the language can be Scala, Pyspark, SparkSQL or even c#. However the problem here is that the real solution has up to millions of rows and hundreds of columns, so I need a dynamic solution that can take advantage of Spark. We can provision a big cluster to how to make sure we take good advantage of it?

Sample data:

// Assign sample data to dataframe
val df = Seq(
    ( 1, "10/01/2021", 1, "abc", null ),
    ( 2, "11/01/2021", 1, null, "bbb" ),
    ( 3, "12/01/2021", 1, "ccc", null ),
    ( 4, "13/01/2021", 1, null, "ddd" ),

    ( 5, "10/01/2021", 2, "eee", "fff" ),
    ( 6, "11/01/2021", 2, null, null ),
    ( 7, "12/01/2021", 2, null, null )
    ).
    toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")

df.show

Expected results with rows 4 and 7 as summary rows:

+-------+----------+------+-------+-------+
|eventId| timestamp|itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
|      1|10/01/2021|     1|    abc|   null|
|      2|11/01/2021|     1|    abc|    bbb|
|      3|12/01/2021|     1|    ccc|    bbb|
|      4|13/01/2021|     1|    ccc|    ddd|
|      5|10/01/2021|     2|    eee|    fff|
|      6|11/01/2021|     2|    eee|    fff|
|      7|12/01/2021|     2|    eee|    fff|
+-------+----------+------+-------+-------+

I have reviewed this option but had trouble adapting it for my use case.

Spark / Scala: forward fill with last observation

I have a kind of working SparkSQL solution but it will be very verbose for the high volume of columns, hoping for something easier to maintain:

%%sql
WITH cte (
SELECT
    eventId,
    itemId,
    ROW_NUMBER() OVER( PARTITION BY itemId ORDER BY timestamp ) AS rn,
    attrib1,
    attrib2
FROM df
)
SELECT
    eventId,
    itemId,
    CASE rn WHEN 1 THEN attrib1 
        ELSE COALESCE( attrib1, LAST_VALUE(attrib1, true) OVER( PARTITION BY itemId ) ) 
    END AS attrib1_xlast,
    CASE rn WHEN 1 THEN attrib2 
        ELSE COALESCE( attrib2, LAST_VALUE(attrib2, true) OVER( PARTITION BY itemId ) ) 
    END AS attrib2_xlast
    
FROM cte
ORDER BY eventId

koiralo · Accepted Answer · 2021-04-14T15:32:32.927

1

For many columns you could create an expression as below

val window = Window.partitionBy($"itemId").orderBy($"timestamp")

// Instead of selecting columns you could create a list of columns 
val expr = df.columns
  .map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))

df.select(expr: _*).show(false)

Update:

val mainColumns = df.columns.filterNot(_.startsWith("attrib"))
val aggColumns = df.columns.diff(mainColumns).map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))

df.select(( mainColumns.map(col) ++ aggColumns): _*).show(false)

Result:

+-------+----------+------+-------+-------+
|eventId|timestamp |itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
|1      |10/01/2021|1     |abc    |null   |
|2      |11/01/2021|1     |abc    |bbb    |
|3      |12/01/2021|1     |ccc    |bbb    |
|4      |13/01/2021|1     |ccc    |ddd    |
|5      |10/01/2021|2     |eee    |fff    |
|6      |11/01/2021|2     |eee    |fff    |
|7      |12/01/2021|2     |eee    |fff    |
+-------+----------+------+-------+-------+

edited Apr 14 '21 at 15:32

answered Apr 13 '21 at 06:14

koiralo

22,594
6
51
72

Is there a way to exclude certain columns? – wBob Apr 13 '21 at 17:14
You can `drop` the columns as `df.columns.drop(0)` with index or can use `filterNot` `df.columns.filterNot(_.endsWith("eventId"))` – koiralo Apr 14 '21 at 05:28
Ah ok, I mean I want to run the fill down only on certain columns and create a resulting data frame where some columns are untouched and some have had the filldown applied, make sense? If you look at my SQL example you can see that eventId and timestamp and itemId are untouched and only attrib1 and 2 are changed. So something like 'fill down only on columns[attrib1, attrib2]' or 'fill down only on columns starting with attrib'. Make sense? – wBob Apr 14 '21 at 11:35
1

Please check the update, You can also define custom list of fields instead of preparing this way. – koiralo Apr 14 '21 at 15:34
Thanks for the update. Don't be afraid to upvote the question if you thought it was an interesting question! – wBob Apr 14 '21 at 22:08

Spark Dataframe filldown

1 Answers1