I would like to train a DecisionTreeClassifier using an sklearn Pipeline. My goal is to predict the 'language' column, using character n-grams of the 'tweet' column as features. However, I am not able to make the LabelEncoder transformation work for the 'language' column inside a pipeline. I saw that this is a common error, but even when I try the suggested fix of reshaping the data, I am still not able to overcome the problem. This is my df:
   tweet   language
0   kann sein grund europ regulierung rat tarif bu...   ge
1   willkommen post von zdfintendant schächter ein...   ge
2   der neue formel1weltmeister kann es selbst noc...   ge
3   ruf am besten mal die hotline an unter 0800172...   ge
4   ups musikmontag verpasst hier die deutsche lis...   ge
... ... ...
9061    hoe smaakt je kerstdiner nog lekkerder sms uni...   nl
9062    femke halsema een partijvernieuwer met lef thi...   nl
9063    een lijst van alle vngerelateerde twitteraars   nl
9064    vanmiddag vanaf 1300 uur delen we gratis warme...   nl
9065    staat hier het biermeisje van 2011  nl
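from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

target_features = ['language']
text_features = ['tweet']
ngram_size = 2

# Encode the 'language' column and build character n-gram counts from 'tweet'
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(), 'language'),
        ('vect', CountVectorizer(ngram_range=(ngram_size, ngram_size), analyzer='char'), text_features)])

# d is the DataFrame shown above
X_train, X_test, y_train, y_test = train_test_split(d.tweet,
                                                    d.language,
                                                    test_size=0.3,
                                                    random_state=42)

clf = DecisionTreeClassifier()
clf_ngram = Pipeline(steps=[('pre', preprocessor), ('clf', clf)])
clf_ngram.fit(X_train.values, y_train.values)

print('Test accuracy computed using cross validation:')
scores = cross_val_score(clf_ngram, X_test, y_test, cv=2)
print(scores)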
I also tried reshaping the inputs before fitting:
y_train = y_train.values.reshape(-1, 1) 
X_train = X_train.values.reshape(-1, 1) 
But the error is still the same. It is raised by the call to fit:
clf_ngram.fit()
IndexError: tuple index out of range
Many thanks!
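For comparison, here is a minimal sketch of one arrangement that may avoid the error. It relies on two points: Pipeline steps only transform X and never y, and ColumnTransformer expects 2D input such as a DataFrame while CountVectorizer expects a 1D sequence of strings. So the sketch encodes the target outside the pipeline and passes the 'tweet' Series straight to CountVectorizer, without a ColumnTransformer. The small DataFrame is hypothetical toy data standing in for the real `d`, and `stratify=y` and the fixed `random_state` on the tree are assumptions added to keep the toy example stable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real DataFrame `d`.
d = pd.DataFrame({
    "tweet": [
        "kann sein grund europ regulierung rat tarif",
        "willkommen post von zdfintendant",
        "der neue formel1weltmeister kann es selbst",
        "ruf am besten mal die hotline an",
        "ups musikmontag verpasst hier die deutsche",
        "hoe smaakt je kerstdiner nog lekkerder",
        "femke halsema een partijvernieuwer met lef",
        "een lijst van alle vngerelateerde twitteraars",
        "vanmiddag vanaf 1300 uur delen we gratis warme",
        "staat hier het biermeisje van 2011",
    ],
    "language": ["ge"] * 5 + ["nl"] * 5,
})

ngram_size = 2

# Encode the target outside the pipeline: pipeline transformers only see X.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(d["language"])

# Keep X as a 1D Series of strings, which is what CountVectorizer expects.
X_train, X_test, y_train, y_test = train_test_split(
    d["tweet"], y, test_size=0.3, random_state=42, stratify=y
)

clf_ngram = Pipeline(steps=[
    ("vect", CountVectorizer(ngram_range=(ngram_size, ngram_size), analyzer="char")),
    ("clf", DecisionTreeClassifier(random_state=42)),
])
clf_ngram.fit(X_train, y_train)

# Map the integer predictions back to the original language codes.
predicted = label_encoder.inverse_transform(clf_ngram.predict(X_test))
print(predicted)
```

Since scikit-learn classifiers also accept string labels directly, the LabelEncoder step could probably be dropped altogether and `d["language"]` passed as y unchanged.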
 
    