I am unable to use tm_combine in R. Here are the version details
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          3.3                         
year           2017                        
month          03                          
day            06                          
svn rev        72310                       
language       R                           
version.string R version 3.3.3 (2017-03-06)
nickname       Another Canoe  
I would like to understand more on this. In case there is an issue with accessing this, my question is how do I combine two Document Term matrices D1 and D2 which have different number of columns?
> packageVersion("tm")
[1] ‘0.7.1’
> dim(s.tdm)
[1] 132 536
> dim(f.tdm)
[1] 132 674
> 
Thanks.
Here's the code that I was trying
library(tm)
library(SnowballC)
s.dir <- "AuthorIdentify\\Author1"
f.dir <- "AuthorIdentify\\Author2"
s.docs <- Corpus(DirSource(s.dir, encoding="UTF-8"))
f.docs <- Corpus(DirSource(f.dir, encoding="UTF-8"))
cleanCorpus<-function(corpus){
  # apply stemming
  corpus <-tm_map(corpus, stemDocument)
  # remove punctuation
  corpus.tmp <- tm_map(corpus,removePunctuation)
  # remove white spaces
  corpus.tmp <- tm_map(corpus.tmp,stripWhitespace)
  # remove stop words
  corpus.tmp <-
    tm_map(corpus.tmp,removeWords,stopwords("en"))
  return(corpus.tmp)
}
s.cldocs <- cleanCorpus(s.docs) # preprocessing
# forms document-term matrix
s.tdm <- DocumentTermMatrix(s.cldocs)
# removes infrequent terms
s.tdm <- removeSparseTerms(s.tdm,0.97)
dim(s.tdm) # [ #docs, #numterms ]
f.cldocs <- cleanCorpus(f.docs) # preprocessing
# forms document-term matrix
f.tdm <- DocumentTermMatrix(f.cldocs)
# removes infrequent terms
f.tdm <- removeSparseTerms(f.tdm,0.97)
dim(f.tdm) # [ #docs, #numterms ]
#how do I combine f.tdm and s.tdm
tm_combine???
I need to combine them (and eventually to a matrix or data.frame) so that I can have a column identifier for Author1 or Author2
With the approach referenced in the linked article, the output of the combined DTMs does not match the expected output. I have referenced the relevant details in the comments section.
