I'm reading two columns of a .csv file into a pandas DataFrame using pandas.read_csv(). The head of the DataFrame is shown below:
        Year    cleaned
    0   1909    acquaint hous receiv follow letter clerk crown...
    1   1909    ask secretari state war whether issu statement...
    2   1909    i beg present petit sign upward motor car driv...
    3   1909    i desir ask secretari state war second lieuten...
    4   1909    ask secretari state war whether would introduc...
Following this, I call df.dropna(inplace=True) (thanks to Brad Solomon) so that the subsequent fit/transform calls proceed without raising a MemoryError, as described in my previous question here.
Now that I have a memory-friendly DataFrame, I use scikit-learn's train_test_split() to create the four sets of data that I intend to fit and transform with a Pipeline.
X_train, X_test, y_train, y_test = train_test_split(df, df['Year'], test_size=0.25)
The shape of these variables is:
[IN] X_train.shape [OUT] (1785, 2)
[IN] X_test.shape  [OUT] (595, 2)
[IN] y_train.shape [OUT] (1785,)
[IN] y_test.shape  [OUT] (595,)
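As a sanity check on those shapes, here is a minimal sketch that reproduces them on a stand-in frame. The name toy_df and its contents are hypothetical; only the row count (1785 + 595 = 2380) and the two column names are taken from the output above. It assumes train_test_split is imported from sklearn.model_selection.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 2380 rows, same two columns.
toy_df = pd.DataFrame({
    "Year": [1909] * 2380,
    "cleaned": ["ask secretari state war"] * 2380,
})

# Same call shape as in the question: the whole frame is passed as X,
# so X keeps both columns (including the label column 'Year').
X_train, X_test, y_train, y_test = train_test_split(
    toy_df, toy_df["Year"], test_size=0.25, random_state=0
)

shapes = (X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(shapes)
```

The second dimension of X_train and X_test is 2 because both columns of the frame travel along as features, while y_train and y_test are one-dimensional.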
So, I have my data split into appropriate subsections for testing and training. I then create my Pipeline, which makes use of TfidfVectorizer, SelectKBest and LinearSVC as shown below:
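from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
pipeline = Pipeline(
    [('vectorizer', TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1,2), sublinear_tf=True)),
     ('chi2', SelectKBest(chi2, k=1000)),
     ('classifier', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False))
    ])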
Finally, I hit the error mentioned in the title when I call fit_transform() on the X and y training data:
model = pipeline.fit_transform(X_train, y_train)
...which in turn produces the error:
ValueError: Found input variables with inconsistent numbers of samples: [2, 1785]
The full Traceback can be viewed here.
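For anyone trying to reproduce this, the following is a minimal sketch of where the 2 in the message may come from. It is an assumption about the cause, not something confirmed by the traceback: iterating over a DataFrame yields its column labels rather than its rows, so a vectorizer handed the whole frame would see one "document" per column. The frame toy_df is hypothetical and reuses a few tokens from the head shown earlier.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-row stand-in with the same two columns.
toy_df = pd.DataFrame({
    "Year": [1909, 1909, 1909],
    "cleaned": ["acquaint hous receiv", "ask secretari state", "beg present petit"],
})

# Iterating a DataFrame yields column labels, not rows.
labels = list(toy_df)

# The vectorizer iterates over its input, so the whole frame looks like
# two documents ("Year" and "cleaned") regardless of the row count.
n_docs_frame = TfidfVectorizer().fit_transform(toy_df).shape[0]

# Passing only the text column gives one document per row.
n_docs_column = TfidfVectorizer().fit_transform(toy_df["cleaned"]).shape[0]

print(labels, n_docs_frame, n_docs_column)
```

If this is indeed what is happening, the mismatch would be between 2 vectorized "documents" and the 1785 labels in y_train, which matches the numbers in the error.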