I currently have an R-based algorithm that sorts a data.table by date within each id and then takes the most recent non-NA / non-null value from each column. I've had some success adapting the approach from the following StackOverflow question to implement this backfilling algorithm for some relatively large datasets:
Computing the first non-missing value from each column in a DataFrame
I've implemented a solution in both Python and R, but my Python version seems to run much, much slower.
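To make the intended output concrete, here is a toy example with made-up values (separate from the benchmark setup below):

# Toy illustration of the target output: for each id and each value column,
# take the value from the most recent (latest-dated) row that is not NA.
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01',
                            '2016-01-15', '2016-02-15']),
    'v1':   [0.5, np.nan, np.nan, np.nan, 2.0],
    'v2':   [np.nan, 1.2, 3.4, np.nan, np.nan],
})

# Desired result, written out by hand:
#  id   v1   v2
#   1  0.5  3.4   <- v1 from 2016-01-01, v2 from 2016-03-01
#   2  2.0  NaN   <- id 2 has no non-NA v2, so it stays NA

First, the R implementation and benchmark: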
library(data.table)
library(microbenchmark)
test_values <- rnorm(100000)
test_values[sample(1:length(test_values), size = 10000)] <- NA
test_values_2 <- rnorm(100000)
test_values_2[sample(1:length(test_values_2), size = 10000)] <- NA
test_ids <- rpois(100000, lambda = 100)
random_timestamp <- sample(x = seq(as.Date('2000-01-01'), as.Date('2017-01-01'), by = 1), size = 100000, replace = TRUE)
dt <- data.table(
    'id' = test_ids,
    'date' = random_timestamp,
    'v1' = test_values,
    'v2' = test_values_2
)
# Simple functions for backfilling
backfillFunction <- function(vector) {
    # find the vector class
    colClass <- class(vector)
    if (all(is.na(vector))) {
        # return the NA of the same class as the vector
        NA_val <- NA
        class(NA_val) <- colClass
        return(NA_val)
    } else {
        # return the first non-NA value
        return(vector[min(which(!is.na(vector)))])
    }
}
print(microbenchmark(
    dt[order(-random_timestamp), c(lapply(.SD, backfillFunction), list(.N)), by = 'id', .SDcols = c('v1', 'v2')]
))
Unit: milliseconds
 expr: dt[order(-random_timestamp), c(lapply(.SD, backfillFunction), list(.N)), by = "id", .SDcols = c("v1", "v2")]
      min       lq    mean   median       uq      max neval
 9.976708 12.29137 15.4554 14.47858 16.75997 112.9467   100
And the Python solution:
import timeit
setup_statement = """
import numpy as np
import pandas as pd
import datetime
start_date = datetime.datetime(2000, 1, 1)
end_date = datetime.datetime(2017, 1, 1)
step = datetime.timedelta(days=1)
current_date = start_date
dates = []
while current_date < end_date:
    dates.append(current_date)
    current_date += step
date_vect = np.random.choice(dates, size=100000, replace=True)
test_values = np.random.normal(size=100000)
test_values_2 = np.random.normal(size=100000)
# knock out ~10k entries in each value column, mirroring the R setup
na_loc = np.random.randint(0, 100000, size=10000)
na_loc_2 = np.random.randint(0, 100000, size=10000)
id_vector = np.random.poisson(100, size=100000)
test_values[na_loc] = np.nan
test_values_2[na_loc_2] = np.nan
DT = pd.DataFrame(
    data={
        'id': id_vector,
        'date': date_vect,
        'v1': test_values,
        'v2': test_values_2
    }
)
# sort newest-first within each id and group once, outside the timed statement
GT = DT.sort_values(['id', 'date'], ascending=[True, False]).groupby('id')
"""
print(timeit.timeit('{col: GT[col].apply(lambda series: series[series.first_valid_index()] if series.first_valid_index() is not None else None) for col in DT.columns}', number=100, setup=setup_statement)*1000/100)
66.5085821699904
My average time in Python is about 67 ms, but in R it is only about 15 ms, even though the two approaches look broadly similar (apply a function to each column within each group). Why is my R code so much quicker than my Python code, and how can I achieve similar performance in Python?
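For reference, a possible vectorized route I'm aware of but haven't verified against the apply-based output or benchmarked on the full data is pandas' GroupBy.first(), which as far as I know skips NaN values per column. A minimal sketch (the helper name most_recent_non_na is just for illustration):

import pandas as pd

def most_recent_non_na(df):
    # Illustrative helper: sort newest-first within each id, then let
    # GroupBy.first() take the first non-NA value per column, which is
    # the same rule the apply-based version implements.
    return (df.sort_values(['id', 'date'], ascending=[True, False])
              .groupby('id')[['v1', 'v2']]
              .first())

# e.g. result = most_recent_non_na(DT) on the benchmark frame built above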