I am using R-tm-Rweka packages to do some text mining. Instead of building a tf-tdm on single words, which is not enough for my purposes, i have to extract ngrams. I used @Ben function TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
to extract trigrams. The output has an apparent error, see below. It picks up 4-, 3- and 2-word phrases. Ideally, it should have ONLY picked up the 4-word noun phrase and dropped the (3- and 2-word)rest. How do I force this solution, like Python NLTK has a backup tokenizer option? 
abstract strategy             ->this is incorrect>
  abstract strategy board        ->incorrect
  abstract strategy board game   -> this should be the correct output
accenture executive
  accenture executive simple
  accenture executive simple comment   
Many thanks.
 
     
    