Given a corpus of text, I want to use the tm (Text Mining) package in R for word stemming and stem completion to normalize the terms. However, the stemCompletion step has issues in the 0.6.x versions of the package. I am using R 3.3.1 with tm 0.6-2.
This question has been asked before, but I have not seen a complete answer that actually works. Here is complete code that demonstrates the issue.
 require(tm)
 txt <- c("Once we have a corpus we typically want to modify the documents in it",
          "e.g., stemming, stopword removal, et cetera.",
          "In tm, all this functionality is subsumed into the concept of a transformation.")
 myCorpus <- Corpus(VectorSource(txt))
 myCorpus <- tm_map(myCorpus, content_transformer(tolower))
 myCorpus <- tm_map(myCorpus, removePunctuation)
 myCorpusCopy <- myCorpus
 # Remove common word endings (e.g., "ing", "es")
 myCorpus <- tm_map(myCorpus, stemDocument, language = "english")
 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
 myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))
 tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
 print(tdm)
 print(dimnames(tdm)$Terms)
Here is the output:
<<TermDocumentMatrix (terms: 19, documents: 2)>>
Non-/sparse entries: 20/18
Sparsity           : 47%
Maximal term length: 9
Weighting          : term frequency (tf)
 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"     
Several of the terms have been stemmed: "modifi", "remov", "subsum", "typic", and "onc".
Next, I want to complete the stems.
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
At this stage, the documents in the corpus are no longer TextDocument objects, and creating a TermDocumentMatrix fails with the error: inherits(doc, "TextDocument") is not TRUE. The documented suggestion is to apply the PlainTextDocument() function next.
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)
Here is the output:
<<TermDocumentMatrix (terms: 2, documents: 2)>>
Non-/sparse entries: 4/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
[1] "content" "meta"   
Calling PlainTextDocument appears to have corrupted the corpus: the only terms left are "content" and "meta".
I expect the stemmed words to be completed, e.g. "modifi" => "modifier", "onc" => "once", etc.
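A possible workaround, offered as a sketch rather than a confirmed fix: stemCompletion() expects a character vector of individual stems, whereas tm_map() hands it a whole document. It may therefore help to split each document into words, complete them, and paste them back together inside content_transformer(), so that each result stays a TextDocument. The helper name completeStems and the two-sentence stand-in corpus below are hypothetical, and falling back to the bare stem when no completion is found is an assumption about the desired behaviour.

```r
library(tm)

# Hypothetical helper (not part of tm): completes the stems in each element of
# a character vector and returns one string per input element.
completeStems <- function(x, dictionary) {
  vapply(x, function(doc) {
    # Split the document text into individual stems
    stems <- unlist(strsplit(doc, " ", fixed = TRUE))
    stems <- stems[nzchar(stems)]
    if (length(stems) == 0L) return("")
    completed <- unname(stemCompletion(stems, dictionary = dictionary))
    # Assumption: keep the original stem when no completion is found
    unmatched <- is.na(completed) | !nzchar(completed)
    completed[unmatched] <- stems[unmatched]
    paste(completed, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Small stand-in for the corpus in the question (hypothetical data)
txt <- c("we typically want to modify the documents",
         "stemming and stopword removal")
original <- VCorpus(VectorSource(txt))
stemmed <- tm_map(original, stemDocument, language = "english")

# content_transformer() keeps each document a TextDocument, so building the
# term-document matrix afterwards should not need PlainTextDocument()
completed <- tm_map(stemmed, content_transformer(completeStems),
                    dictionary = original)
tdm <- TermDocumentMatrix(completed, control = list(wordLengths = c(3, Inf)))
print(Terms(tdm))
```

With the corpus from the question, the equivalent call would be tm_map(myCorpus, content_transformer(completeStems), dictionary = myCorpusCopy). Note that stemCompletion() matches by prefix, so a stem such as "modifi" cannot be completed to "modify" this way.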
 
     
    