Is there a way to remove some rows in the training set based on values in another column

Question

I have a dataframe and I split it into training and testing (80:20). It looks like this:

V1  V2  V3  V4  V5 Target
5   2   34  12  9   1
1   8   24  14  12  0
12  27  4   12  9   0

Then I built a simple regression model and made predictions.

The code works, but my question is this: after splitting the data into training and testing sets, I need to remove (or exclude) some rows from the training set (specific rows in X_train and their corresponding y_train values) based on some condition or on values in another column.

For example, I need to remove any row in the training set if V1 > 10.

As a result, this row in X_train and its corresponding y_train value should be deleted:

V1  V2  V3  V4  V5 Target
12  27  4   12  9   0

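from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data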
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

# Train and fit the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Make prediction
y_pred = regressor.predict(X_test)

I think the way to do it is to extract the indexes of the rows to remove using the required condition, and then remove them from X_train and y_train.

The suggested questions did not answer mine because this is a different scenario: they do not consider the training and testing sets. I need to delete some rows in X_train and their corresponding y_train values.

Does this answer your question? [How to delete rows from a pandas DataFrame based on a conditional expression](https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression) — Ynjxsjmh, May 26 '22 at 14:15
Do you want to remove the rows in which `V1>10`, or do you just want to ensure that they end up in the test set? — Salvatore Daniele Bianco, May 26 '22 at 14:35
I want to remove any row in the training set (X_train and its y_train) if V1>10 — MohammedE, May 26 '22 at 14:39

Salvatore Daniele Bianco · Accepted Answer · 2022-05-26T15:12:27.957

If X_train and y_train are NumPy arrays, as I suppose, you can simply do:

y_train = y_train[X_train[:,0]<=10]
X_train = X_train[X_train[:,0]<=10]
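A minimal sketch of the same idea with the mask computed once, so the order of the two assignments no longer matters. The toy arrays are hypothetical and simply reuse the three rows from the question, with V1 assumed to be the first column:

```python
import numpy as np

# Hypothetical toy data: the first column plays the role of V1.
X_train = np.array([[5, 2, 34, 12, 9],
                    [1, 8, 24, 14, 12],
                    [12, 27, 4, 12, 9]])
y_train = np.array([1, 0, 0])

# Build the boolean mask once, before either array is overwritten.
keep = X_train[:, 0] <= 10

# Apply the same mask to both arrays so rows and targets stay paired.
X_train = X_train[keep]
y_train = y_train[keep]

print(X_train.shape, y_train.shape)  # (2, 5) (2,)
```

Storing the mask in a variable avoids filtering y_train against an already-filtered X_train by accident.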

EDIT

If X_train is a pandas DataFrame and y_train is a pandas Series:

y_train = y_train[X_train["V1"]<=10]
X_train = X_train.loc[X_train["V1"]<=10]
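As a small self-contained illustration of the pandas case: the frames below are hypothetical stand-ins for a training split, with a shuffled index like the one train_test_split typically leaves behind. Because X_train and y_train share the same index labels, one boolean mask filters both consistently:

```python
import pandas as pd

# Hypothetical stand-ins for the training split; the shuffled index
# labels mimic what a random split leaves behind.
X_train = pd.DataFrame(
    {"V1": [12, 5, 1], "V2": [27, 2, 8], "V3": [4, 34, 24],
     "V4": [12, 12, 14], "V5": [9, 9, 12]},
    index=[2, 0, 1],
)
y_train = pd.Series([0, 1, 0], index=[2, 0, 1], name="Target")

# Boolean mask indexed like X_train (and therefore like y_train).
keep = X_train["V1"] <= 10

# Filter both objects with the same mask.
X_train = X_train.loc[keep]
y_train = y_train.loc[keep]

print(X_train.index.tolist())  # [0, 1]
print(y_train.tolist())        # [1, 0]
```

X_test and y_test are left untouched, so rows with V1 > 10 can still appear in the test set.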

edited May 26 '22 at 15:12

answered May 26 '22 at 14:54


TypeError: '(slice(None, None, None), 0)' is an invalid key – MohammedE May 26 '22 at 15:03
OK, so it is not a NumPy array but a pandas DataFrame. Please convert your data to a NumPy array or provide a reproducible example. – Salvatore Daniele Bianco May 26 '22 at 15:07
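The error reported in the comments is consistent with X_train being a DataFrame, which does not accept NumPy-style `[:, 0]` indexing. A sketch of two possible ways around it, using hypothetical toy data: select the column by position with `.iloc`, or convert both objects with `.to_numpy()` first:

```python
import pandas as pd

# Hypothetical toy training data (only two feature columns for brevity).
X_train = pd.DataFrame({"V1": [5, 1, 12], "V2": [2, 8, 27]})
y_train = pd.Series([1, 0, 0], name="Target")

# Option 1: stay in pandas and select the first column by position.
keep = X_train.iloc[:, 0] <= 10

# Option 2: convert to NumPy arrays, after which X[:, 0] is valid.
X_arr = X_train.to_numpy()
y_arr = y_train.to_numpy()
keep_arr = X_arr[:, 0] <= 10

X_arr, y_arr = X_arr[keep_arr], y_arr[keep_arr]
print(X_arr.shape, y_arr.tolist())  # (2, 2) [1, 0]
```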
