I'd like to have a single object that handles label encoding (in my case with a LabelEncoder), transformation and estimation. It is important to me that all these functions can be executed through only one object.
I've tried using a pipeline this way:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
# mock training dataset
X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])
le = LabelEncoder()
ss = StandardScaler()
clf = MyClassifier()  # my custom classifier, defined elsewhere
pl = Pipeline([('encoder', le),
               ('scaler', ss),
               ('clf', clf)])
pl.fit(X, y)
Which gives:
File "sklearn/pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
TypeError: fit_transform() takes exactly 2 arguments (3 given)
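The mismatch can be reproduced outside the pipeline (a minimal check on my side; Pipeline passes both X and y to every transformer's fit_transform, while LabelEncoder.fit_transform accepts only the labels):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

X = np.random.rand(5, 2)
y = np.array(["a", "b", "a", "b", "a"])

# StandardScaler follows the transformer API: fit_transform(X, y=None),
# so Pipeline can safely pass it both X and y.
X_scaled = StandardScaler().fit_transform(X, y)

# LabelEncoder follows the label API: fit_transform(y) takes a single
# positional argument, so the extra argument Pipeline passes raises the
# TypeError shown in the traceback above.
try:
    LabelEncoder().fit_transform(X, y)
except TypeError as e:
    print("TypeError:", e)
```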
Clarifications:
- X and y are my training dataset, X being the values and y the targeted labels.
- X is a numpy.ndarray of shape (n_samples, n_features) and of type float, with values ranging from 0 to 1.
- y is a numpy.ndarray of shape (n_samples,) and of type string.
- I expect LabelEncoder to encode y, not X.
- I need y only for MyClassifier, and I need it encoded to integers for MyClassifier to work.
After some thought, and facing the error above, I feel it was naive to think that Pipeline could handle this. I figured out that Pipeline could very well handle my transformation and classifier together, but it is the label-encoding step that fails.
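For reference, the split approach I'd fall back on looks like this (just a sketch; I'm using LogisticRegression here as a stand-in for MyClassifier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression  # stand-in for MyClassifier

X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

# Encode y outside the pipeline, since Pipeline only transforms X
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # "label1"/"label2"/"label3" -> 0/1/2

# The pipeline then only has to deal with X
pl = Pipeline([('scaler', StandardScaler()),
               ('clf', LogisticRegression())])
pl.fit(X, y_encoded)
```

This works, but it leaves me with two objects (le and pl) instead of the single one I'm after.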
What is the correct way to achieve what I want? By correct I mean something that allows reusability and some kind of consistency with sklearn. Is there a class in the sklearn library that does what I want?
I'm pretty surprised I haven't found an answer browsing the web, because what I'm doing doesn't seem uncommon. I might be missing something here.