Speeding up loop over dataframes

Question

I have written the code below. There are two Pandas dataframes: df contains the columns timestamp_milli and pressure, and df2 contains the columns timestamp_milli and acceleration_z. Both dataframes have around 100'000 rows. For each row of df, the code searches df2 for the rows whose timestamp lies within a fixed range of that row's timestamp and, among those, keeps the rows with the smallest time difference.

Unfortunately, the code is extremely slow. Moreover, I'm getting the following message, originating from the line df_temp["timestamp_milli"] = df_temp["timestamp_milli"] - row["timestamp_milli"]:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

How can I speed up the code and resolve the warning?

acceleration = []
pressure = []

for index, row in df.iterrows():
    mask = (df2["timestamp_milli"] >= (row["timestamp_milli"] - 5)) & (df2["timestamp_milli"] <= (row["timestamp_milli"] + 5))
    df_temp = df2[mask]

    # Select closest point
    if len(df_temp) > 0:
        df_temp["timestamp_milli"] = df_temp["timestamp_milli"] - row["timestamp_milli"]
        df_temp["timestamp_milli"] = df_temp["timestamp_milli"].abs()

        df_temp = df_temp.loc[df_temp["timestamp_milli"] == df_temp["timestamp_milli"].min()]

        for index2, row2 in df_temp.iterrows():
            pressure.append(row["pressure"])
            acc = row2["acceleration_z"]
            acceleration.append(acc)

For the warning, I think doing `df_temp = df2[mask].copy()` should prevent it — Ben.T, May 29 '18 at 14:54
If you only want to find the single closest match, then [`pandas.merge_asof`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html) can accomplish this. If you provide sample data and the expected output you can get some more directed help. — ALollz, May 29 '18 at 15:03
For starters, dispense with iterrows, which is much slower than itertuples... — juanpa.arrivillaga, May 29 '18 at 15:04
Take a look at using namedtuples. https://stackoverflow.com/a/47149876/6361531 — Scott Boston, May 29 '18 at 15:29
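
The `pandas.merge_asof` suggestion from the comments could look roughly like the sketch below. The sample values are made up, since the question includes no data, and the tolerance of 5 is taken from the ±5 window in the question's code. Note one difference in behaviour: `merge_asof` returns a single closest row per timestamp, whereas the original loop appends every row that ties for the minimum difference.

```python
import pandas as pd

# Hypothetical sample data (not from the question).
df = pd.DataFrame({"timestamp_milli": [100, 200, 300],
                   "pressure": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"timestamp_milli": [98, 104, 250, 303],
                    "acceleration_z": [9.8, 9.7, 9.6, 9.9]})

# merge_asof requires both frames to be sorted by the key column.
matched = pd.merge_asof(
    df.sort_values("timestamp_milli"),
    df2.sort_values("timestamp_milli"),
    on="timestamp_milli",
    direction="nearest",  # closest timestamp on either side
    tolerance=5,          # discard matches further than 5 away
)

# Rows of df without a partner inside the tolerance get NaN; drop them
# to mirror the `if len(df_temp) > 0` check in the question.
matched = matched.dropna(subset=["acceleration_z"])

pressure = matched["pressure"].tolist()
acceleration = matched["acceleration_z"].tolist()
```

Because the matching happens in one vectorised call, there is no per-row Python loop and no intermediate slice to assign into, so the SettingWithCopyWarning does not arise either.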

score 2 · Accepted Answer · answered May 29 '18 at 15:07

I have faced a similar problem; using itertuples instead of iterrows gave a significant reduction in run time. For background, see "why iterrows have issues". Hope this helps.
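
A minimal sketch of that change applied to the loop from the question, using made-up sample data (the question provides none). It also computes the time difference in a separate Series rather than assigning into the filtered slice, which is one way to avoid the SettingWithCopyWarning; the names `window`, `dist` and `closest` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data (not from the question).
df = pd.DataFrame({"timestamp_milli": [100, 200, 300],
                   "pressure": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"timestamp_milli": [98, 104, 250, 303],
                    "acceleration_z": [9.8, 9.7, 9.6, 9.9]})

acceleration = []
pressure = []

# itertuples yields lightweight namedtuples instead of building a
# Series per row, which is what makes iterrows slow.
for row in df.itertuples(index=False):
    low, high = row.timestamp_milli - 5, row.timestamp_milli + 5
    # between() is inclusive on both ends, like the >= / <= mask.
    window = df2[df2["timestamp_milli"].between(low, high)]
    if window.empty:
        continue

    # Keep the distance in its own Series; nothing is written back
    # into the slice, so no SettingWithCopyWarning is raised.
    dist = (window["timestamp_milli"] - row.timestamp_milli).abs()
    closest = window[dist == dist.min()]

    for acc in closest["acceleration_z"]:
        pressure.append(row.pressure)
        acceleration.append(acc)
```

This still scans df2 once per row of df, so it only removes the per-row overhead of iterrows; it does not change the overall cost of the nested search.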

Madhur Yadav
