I am extracting text from PDF files, removing punctuation, and looking at which key words are repeated and how often they appear.
library(pdftools)
library(tm)
setwd("S:/Shared Folders/Impact Investing/Investment/Scripts/PDF")
# List every PDF in the working directory
files <- list.files(pattern = "pdf$")
# Extract the raw text of each PDF
opinions <- lapply(files, pdf_text)
# Build a corpus from the PDF files
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))
# Term-document matrix; keep only terms that occur in at least 3 documents
opinions.tdm <- TermDocumentMatrix(corp,
                                   control = list(removePunctuation = TRUE,
                                                  stopwords = TRUE,
                                                  tolower = TRUE,
                                                  stemming = TRUE,
                                                  removeNumbers = TRUE,
                                                  bounds = list(global = c(3, Inf))))
# Show the first 10 terms
inspect(opinions.tdm[1:10, ])
I am currently getting an error:
Error in `[.simple_triplet_matrix`(opinions.tdm, 1:10, ) : subscript out of bounds
My opinions.tdm object looks like this in the environment viewer:
opinions.tdm: list of length 6; nrow: integer [1]; ncol: [1]; dimnames: list [2]; attributes: [3]
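Since "subscript out of bounds" on `1:10` would be expected whenever the matrix has fewer than 10 rows, one check is to look at its dimensions before subsetting. Below is a minimal sketch of that check, assuming a toy three-document corpus in place of my PDFs; the texts and the names `docs`, `tdm`, and `n_show` are illustrative only and not part of my actual script.

```r
library(tm)

# Toy stand-in for the PDF corpus (hypothetical texts, not the real files)
docs <- c("impact investing fund report",
          "impact fund performance report",
          "impact investing strategy")
corp <- VCorpus(VectorSource(docs))

# Same global bound as in my script: a term must occur in at least 3 documents
tdm <- TermDocumentMatrix(corp,
                          control = list(bounds = list(global = c(3, Inf))))

# Number of terms (rows) and documents (columns) that survived the filters
print(dim(tdm))

# Only ask for as many rows as actually exist
n_show <- min(10, nrow(tdm))
if (n_show > 0) {
  inspect(tdm[seq_len(n_show), ])
} else {
  message("No terms survived the filters")
}
```

If `dim(opinions.tdm)` reports fewer than 10 rows for the real data, the `bounds = list(global = c(3, Inf))` setting may be worth revisiting, since it drops every term that does not appear in at least three documents.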
 
    