I have a Pandas DataFrame containing 3 categorical grouping variables and 1 numerical outcome variable. Within each group, there is an n = 6, where one of these values may be an outlier (as defined by the distribution within each group: an outlier can either exceed quartile 3 by 1.5 times the inter-quartile range, or be less than quartile 1 by 1.5 times the inter-quartile range).
An example of the DataFrame is shown below:
# Making the df without our outcome variable
import numpy as np
import pandas as pd
G1 = np.repeat(['E', 'F'], 24)
G2 = np.tile(np.repeat(['C', 'D'], 6), 4)
G3 = np.tile(np.repeat(['A', 'B'], 12), 2)
dummy_data = pd.DataFrame({'G1' : G1, 'G2' : G2, 'G3': G3})
# Defining a function to generate a numpy array with n = 6, where one of these values is an outlier # by our previous definition
np.random.seed(0)
def outlier_arr(low, high):
norm_arr = np.random.randint(low, high, 5)
IQR = np.percentile(norm_arr, 75) - np.percentile(norm_arr, 25)
upper_fence = np.percentile(norm_arr, 75) + (IQR * 1.5)
lower_fence = np.percentile(norm_arr, 25) - (IQR * 1.5)
rand_decision = np.random.randint(0, 2, 1)[0]
if rand_decision == 1:
high_outlier = np.round(upper_fence * 3, decimals = 0)
final_arr = np.hstack([norm_arr, high_outlier])
else:
low_outlier = np.round(lower_fence * (1/3), decimals = 0)
final_arr = np.hstack([norm_arr, low_outlier])
return final_arr.astype(int)
# Making a list to add into the dataframe to represent our values
abund_arr = []
for i in range(0, 8):
abund_arr = abund_arr + outlier_arr(700, 800).tolist()
abund_arr = np.array(abund_arr)
# Appending this list as a new row
dummy_data['V1'] = abund_arr
This should generate a DataFrame with 3 grouping variables G1, G2, and G3, and a single outcome variable V1 where each group should have one outlier that needs to be removed. We can look at the first 6 rows (a single group) with dummy_data.head(6) below to see that one of these values (the last row) is an outlier that we would like to filter out.
G1 G2 G3 V1
0 E C A 744
1 E C A 747
2 E C A 764
3 E C A 767
4 E C A 767
5 E C A 2391 <--- outlier
From what I understand, a good approach may be to use df.groupby().filter(), and to group by variables G1, G2, and G3 and implement a user-defined function to filter() that returns T/F based on the outlier criteria discusses above.
I have tried this, where the function for detecting outliers (returns array of True or False) within an array is found below:
def is_outlier(x):
IQR = np.percentile(x, 75) - np.percentile(x, 25)
upper_fence = np.percentile(x, 75) + (IQR * 1.5)
lower_fence = np.percentile(x, 25) - (IQR * 1.5)
return (x > upper_fence) | (x < lower_fence)
Which correctly detects an outlier as shown below:
test_arr = outlier_arr(300, 500)
is_outlier(test_arr)
# returns an array of [False, False, False, False, False, True]
However, when using the method described above on a pandas object, the following code throws no errors, but also does not filter any of the outliers:
dummy_data.groupby(['G1', 'G2', 'G3']).filter(lambda x: (is_outlier(x['V1'])).any())
NOTE: I actually found a way to do this here, where you use apply() instead of filter().
Running dummy_data[~dummy_data.groupby(['G1', 'G2', 'G3'])['V1'].apply(is_outlier)] produced the desired result.
However, just for the sake of doing it with this method, what needs to be tweaked to get this to work using filter()? If it's possible, which of the two ways is correct/preferred?
Thanks in advance.