I have a relatively large DataFrame (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By that I mean: within each group, compute the 5% and 95% quantiles of each column and clip any values that fall outside that range.
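To make the definition concrete, here is a minimal sketch of the clipping step on a single column of a single group. The values are made up for illustration, and `Series.clip` is used here only to show the intended result, not the code I'm currently running.

```python
import pandas as pd

# Hypothetical values for one column within one group
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Bounds are the 5% and 95% quantiles of that column
lower = s.quantile(0.05)
upper = s.quantile(0.95)

# Values below `lower` become `lower`, values above `upper` become `upper`
clipped = s.clip(lower=lower, upper=upper)
```

Values already inside the quantile range are left unchanged; only the extremes are pulled in to the bounds.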
Here's the setup I'm currently using:
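import pandas as pd

def winsorize_series(s):
    # 5% and 95% quantiles of this column within the current group
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        # Overwrite values outside the quantile range with the bounds (modifies s in place)
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    # Apply the clipping column by column
    return df.apply(winsorize_series, axis=0)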
Then, with my DataFrame called features and indexed by DATE, I can do:
grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)
This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?
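For reference, here is a rough sketch of the "all columns at once" direction, under some assumptions: every column is numeric, the index has a level named DATE, and `winsorize_by_group` is a hypothetical helper name. It computes one table of quantiles per group, repeats each group's bounds so they line up row-for-row with the original frame, and clips with NumPy. It has not been benchmarked on data of the size described above, so treat it as a starting point rather than a confirmed fix.

```python
import numpy as np
import pandas as pd

def winsorize_by_group(df, level="DATE", lower_q=0.05, upper_q=0.95):
    """Clip every column to its per-group quantile range (hypothetical helper)."""
    keys = df.index.get_level_values(level)
    grouped = df.groupby(level=level)

    # One row of bounds per group, one column per feature
    lower = grouped.quantile(lower_q)
    upper = grouped.quantile(upper_q)

    # Repeat each group's bounds so they match df row-for-row
    lower_rows = lower.reindex(keys).to_numpy()
    upper_rows = upper.reindex(keys).to_numpy()

    # Element-wise clip; each cell is compared against its own group's bounds
    clipped = np.clip(df.to_numpy(), lower_rows, upper_rows)
    return pd.DataFrame(clipped, index=df.index, columns=df.columns)
```

With the setup above this would be called as `result = winsorize_by_group(features)`. Unlike the nested apply version, it returns a new DataFrame and leaves the input untouched.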