I want to drop the columns in a PySpark DataFrame whose names contain any of the words in the banned_columns list, and form a new DataFrame out of the remaining columns.
banned_columns = ["basket","cricket","ball"]
drop_these = [c for c in df.columns if c in banned_columns]
df_new = df.drop(*drop_these)
The idea behind banned_columns is to drop any column whose name starts with basket or cricket, as well as any column that contains the word ball anywhere in its name.
The above is what I have so far, but it does not work: the new DataFrame still contains those columns.
Example of the DataFrame's column names:
sports1basketjump | sports
In the example above, the code should drop the column sports1basketjump because its name contains the word basket, and keep sports.
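For reference, here is a minimal sketch of the behaviour I expect, assuming plain substring matching for all three banned words (the toy DataFrame is just my own illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame whose column names mirror the example above.
df = spark.createDataFrame([(1, 2)], ["sports1basketjump", "sports"])

banned_columns = ["basket", "cricket", "ball"]

# Drop a column if any banned word appears anywhere in its name.
drop_these = [c for c in df.columns if any(word in c for word in banned_columns)]
df_new = df.drop(*drop_these)

print(df_new.columns)  # expected: ['sports']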
Moreover, does using the filter and/or reduce functions offer any optimization over building a list with a for loop or comprehension?
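For comparison, this is what I understand the filter/reduce variant would look like (folding df.drop over the matches with functools.reduce is my own sketch):

from functools import reduce

# Same selection as above, expressed with filter() instead of a comprehension.
drop_these = list(filter(lambda c: any(word in c for word in banned_columns), df.columns))

# Fold drop() over the matches, one column at a time.
df_new = reduce(lambda acc, c: acc.drop(c), drop_these, df)

My understanding is that either way the loop only runs on the driver over the list of column names, so filter/reduce should not be meaningfully faster than a comprehension: the real work happens in Spark's query plan, not in this Python loop.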