The sample code from Florian
-----------+-----------+-----------+
|ball_column|keep_the   |hall_column|
+-----------+-----------+-----------+
|          0|          7|         14|
|          1|          8|         15|
|          2|          9|         16|
|          3|         10|         17|
|          4|         11|         18|
|          5|         12|         19|
|          6|         13|         20|
+-----------+-----------+-----------+
The first part of the code drops columns name in the banned list
#first part of the code
banned_list = ["ball","fall","hall"]
condition = lambda col: any(word in col for word in banned_list)
new_df = df.drop(*filter(condition, df.columns))
So the above piece of code should drop the ball_column and hall_column.
The second part of the code buckets specific columns in the list. For this example, we will bucket the only one remaining, keep_column.
bagging = 
    Bucketizer(
        splits=[-float("inf"), 10, 100, float("inf")],
        inputCol='keep_the',
        outputCol='keep_the')
Now bagging the columns using pipeline was as follows
model = Pipeline(stages=bagging).fit(df)
bucketedData = model.transform(df)
How can I add the first block of the code (banned list, condition, new_df) to the ml pipeline as a stage?
 
     
    