You can try this pipeline:
First, tokenize the input tweet (located in the text column). This creates a new column, rawWords, holding the list of words extracted from the original text. Because .setGaps(false) is used, the pattern \w+ matches runs of alphanumeric characters as the tokens themselves, rather than being treated as a delimiter to split on:
import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("rawWords")
.setPattern("\\w+")
.setGaps(false)
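As a quick sanity check (a sketch assuming a SparkSession named spark is in scope, with a made-up sample tweet), you can run the tokenizer on a one-row DataFrame:
import spark.implicits._

// Note that RegexTokenizer lowercases its output by default
val sample = Seq("Check out https://example.com #Spark @user1!").toDF("text")
tokenizer.transform(sample).select("rawWords").show(truncate = false)
// Expected tokens: [check, out, https, example, com, spark, user1]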
Second, you may consider removing stop words to drop less significant words such as a, the, of, etc.:
import org.apache.spark.ml.feature.StopWordsRemover

val stopWordsRemover = new StopWordsRemover()
.setInputCol("rawWords")
.setOutputCol("words")
Now it's time to vectorize the words column. In this example I'm using the CountVectorizer, which is quite basic. There are others, such as TF-IDF weighting; a sketch using HashingTF and IDF follows the CountVectorizer example below. You can find more information in the Spark MLlib feature-extraction documentation.
I've configured the CountVectorizer so that it builds a vocabulary of at most 10,000 terms, where a term must appear in at least 5 different documents (setMinDF) and at least once within a given document to be counted there (setMinTF).
import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.setVocabSize(10000)
.setMinDF(5.0)
.setMinTF(1.0)
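If you prefer TF-IDF weighting over raw counts, one common substitute (a sketch keeping the same column names; the 10,000-feature size just mirrors the vocabulary size above) is HashingTF followed by IDF:
import org.apache.spark.ml.feature.{HashingTF, IDF}

// Hash each word into a fixed-size term-frequency vector
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("rawFeatures")
.setNumFeatures(10000)

// Downweight terms that occur in many documents
val idf = new IDF()
.setInputCol("rawFeatures")
.setOutputCol("features")
These two stages would replace countVectorizer in the pipeline below.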
Finally, create the pipeline, fit it on the training set, and use the resulting model to transform the test set:
import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
.setStages(Array(
tokenizer,
stopWordsRemover,
countVectorizer))
transformPipeline.fit(training).transform(test)
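If you also need to map feature indices back to words, keep a handle on the fitted model instead of chaining the calls (a sketch; stages.last works here because the CountVectorizer is the final stage):
import org.apache.spark.ml.feature.CountVectorizerModel

val model = transformPipeline.fit(training)
val vectorized = model.transform(test)

// The fitted CountVectorizer exposes the learned vocabulary,
// where index i in the feature vector corresponds to vocabulary(i)
val vocabulary = model.stages.last.asInstanceOf[CountVectorizerModel].vocabulary
println(vocabulary.take(20).mkString(", "))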
Hope it helps.