I want to transform a dataset (or create a new one) from a dataframe with a text column and a labels column, so that each text is split into sequences of a pre-defined number of words, padded where necessary, with every sequence keeping the label of its source row. The example below demonstrates what I mean.
I was able to manually create a new dataframe based on ngrams. This is obviously computationally expensive and creates many rows with repeated words. The original data looks like this:
                                               text  labels
0  from dbl visual com david b lewis subject comp...       5
1  from johan blade stack urc tue nl johan wevers...      11
2  from mzhao magnus acs ohio state edu min zhao ...       6
3  from lhawkins annie wellesley edu r lee hawkin...      14
4  from seanmcd ac dal ca subject powerpc ruminat...       4
For example, with a sequence length of 4, I would like to turn the data above into something like this:
   text                                      labels
0  from dbl visual com                            5
1  david b lewis subject                          5
2  comp windows x frequently                      5
3  asked questions <PAD> <PAD>                    5
4  from johan blade stack                        11
5  urc tue nl johan                              11
6  wevers subject re <PAD>                       11
7  from mzhao magnus acs                          6
8  ohio state edu min                             6
9  zhao subject composite <PAD>                   6
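To make the target concrete, here is a minimal sketch of the non-overlapping chunking shown above. The helper names `chunk_words` and `to_fixed_length` and the toy data are placeholders made up for illustration, and the `<PAD>` token is simply taken from the example. It is only meant to pin down the intended behaviour, not to be an efficient solution.

```python
import pandas as pd

PAD = "<PAD>"  # padding token, taken from the example output above


def chunk_words(text, size, pad=PAD):
    """Split text into consecutive, non-overlapping chunks of `size` words,
    padding the last chunk so every chunk has exactly `size` tokens."""
    words = text.lower().split()
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if chunks and len(chunks[-1]) < size:
        chunks[-1] = chunks[-1] + [pad] * (size - len(chunks[-1]))
    return [" ".join(chunk) for chunk in chunks]


def to_fixed_length(df, size, text_col="text", label_col="labels"):
    """Return one row per chunk, repeating the label of the source row."""
    out = df[[text_col, label_col]].copy()
    out[text_col] = out[text_col].apply(chunk_words, size=size)
    # explode() turns each list element into its own row;
    # rows whose text was empty become NaN and are dropped
    out = out.explode(text_col).dropna(subset=[text_col])
    return out.reset_index(drop=True)


# Toy data: shortened stand-ins for the real texts
toy = pd.DataFrame({
    "text": [
        "from dbl visual com david b lewis subject comp windows x "
        "frequently asked questions",
        "from johan blade stack urc tue nl johan wevers subject re",
    ],
    "labels": [5, 11],
})
result = to_fixed_length(toy, size=4)
print(result)
```

Each label is repeated once per chunk, and only the last chunk of a text is ever padded.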
As explained, I was able to create a new dataframe based on ngrams. In theory, I could afterwards delete the overlapping rows again and keep only every n-th one.
    import pandas as pd
    from nltk.util import ngrams

    df = pd.read_csv('data.csv')
    rows = []
    for text, label in zip(df['text'], df['labels']):
        words = text.lower().split()
        # one row per overlapping 20-gram of the document
        rows.extend({'words': ng, 'labels': label} for ng in ngrams(words, 20))
    # build the frame once; DataFrame.append was removed in pandas 2.0
    longform = pd.DataFrame(rows, columns=['words', 'labels'])
    # join each ngram tuple back into a space-separated string
    longform['text_new'] = longform['words'].apply(' '.join)
This is really bad code, which is why I am quite confident that someone can come up with a better solution.
Thanks in advance!
 
    