I believe this is a loop and gregexpr() issue. I'm trying to extract multi-line text from multiple standardized instances within multiple standardized .txt forms and put it into a data frame where each instance is a separate row. So far, I can extract the string data (though the match captures a little more than the gregexpr() pattern is meant to), but I can only export it to .txt as one undivided block of text.
- How can I create a data frame of the text extracted from the .txt files, where each instance of multi-line text has its own row? (Once the data is in a data.frame, I know how to export it as xlsx.)
- How can I extract only the text that matches the pattern I have set?
 
With help (particularly from Ben in the comments of this post), here is what I have so far:
# Txt Data Format
txt1 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
txt2 <-
"A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 A. The First:  abcdefg hijklmnop qrstuv wxyz. B. The Second: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz.
 C. The Third:  abcdefg hijklmnop qrstuv wxyz. D. The Fourth: abcdefg hijklmnop qrstuv wxyz.
    abcdefg hijklmnop qrstuv wxyz. abcdefg hijklmnop qrstuv wxyz abcdefg hijklmnop qrstuv wxyz."
#################################
# Directory and Text Extraction #
#################################
dest <- "C:/~"

## Assumes that all the content I want to extract is between "A." and "C." in
## the text, while ignoring the "C." and "D." content.
docs_list <- list.files(path = dest, pattern = "txt", full.names = TRUE)

## Read each file as whitespace-separated tokens, collapse them into one string,
## and pull out every match of the pattern.
docs_doc <- lapply(docs_list, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=A. The First).*?(?=C. The Third)", j, perl = TRUE))
})

## Write the matches for each file to its own text file.
lapply(seq_along(docs_doc), function(i) {
  write.table(docs_doc[i], file = paste(docs_list[i], " ", " ", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})
Current output looks like this, where all of the text is on one line and the match captures more than just the text between "A." and "C.":
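The extra capture can be reproduced on a shortened, made-up sample. In the sketch below, `x`, `loose`, and `tight` are illustrative names, not part of the original script. The lookbehind `(?<=A. The First)` ends before the colon, so every match begins with `:`, and the unescaped `.` matches any character rather than a literal period. Escaping the dots and moving the colon into the lookbehind is one possible tightening, assuming the labels always appear exactly as `A. The First:` and `C. The Third`:

```r
# Shortened, made-up sample in the same layout as the forms (illustrative only).
x <- paste("A. The First:  aaa. B. The Second: bbb.",
           "C. The Third:  ccc. D. The Fourth: ddd.",
           "A. The First:  eee. B. The Second: fff.",
           "C. The Third:  ggg.")

# Original pattern: dots are unescaped and the colon sits outside the lookbehind.
loose <- "(?<=A. The First).*?(?=C. The Third)"

# Tightened variant (an assumption about the intended match): literal dots,
# colon inside the lookbehind, and (?s) so "." can also span line breaks.
tight <- "(?s)(?<=A\\. The First:).*?(?=C\\. The Third)"

loose_hits <- regmatches(x, gregexpr(loose, x, perl = TRUE))[[1]]
# trimws() drops the padding left around each tightened match.
tight_hits <- trimws(regmatches(x, gregexpr(tight, x, perl = TRUE))[[1]])

print(loose_hits)
print(tight_hits)
```

`regmatches()` returns a list with one element per input string, which is why `[[1]]` is needed to get at the character vector of matches.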
Desired output would look like this, where the multi-line text between each instance of "A." and "C." is extracted and given its own row:
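A minimal sketch of one way to get there, not a definitive solution. It assumes each file can be read with `readLines()` and joined with newlines, so that line breaks survive until the match is made, and that the labels appear exactly as `A. The First:` and `C. The Third`. `extract_sections` is a hypothetical helper, and the temporary files with made-up content stand in for the real forms:

```r
# Hypothetical helper: returns one data-frame row per "A. ... C." block in a file.
extract_sections <- function(path) {
  # Read the whole file as a single string, keeping line breaks.
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  # (?s) lets "." cross line breaks; dots in the labels are escaped.
  pat <- "(?s)(?<=A\\. The First:).*?(?=C\\. The Third)"
  hits <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
  # Collapse runs of whitespace (including newlines) so each block fits one cell.
  hits <- trimws(gsub("\\s+", " ", hits))
  data.frame(file = rep(basename(path), length(hits)),
             instance = seq_along(hits),
             text = hits,
             stringsAsFactors = FALSE)
}

# Demo on two temporary files with made-up content (stand-ins for the real forms).
demo_dir <- tempfile("forms")
dir.create(demo_dir)
sample_txt <- c("A. The First:  aaa. B. The Second: bbb.",
                "    more bbb.",
                " C. The Third:  ccc. D. The Fourth: ddd.",
                " A. The First:  eee. B. The Second: fff.",
                " C. The Third:  ggg.")
writeLines(sample_txt, file.path(demo_dir, "form1.txt"))
writeLines(sample_txt, file.path(demo_dir, "form2.txt"))

# Anchor the file pattern so only names ending in ".txt" are picked up.
demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One data frame per file, stacked into a single data frame.
result <- do.call(rbind, lapply(demo_files, extract_sections))
print(result)
```

Each row of `result` holds one block, with the source file name and the block's position within that file. Because `result` is a plain data.frame, it can be exported to xlsx in the usual way.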
Any help would be tremendously appreciated!
I'm ultimately trying to develop an NLP model that can extract standardized form data from hundreds of large PDFs for a year-over-year repository. If this post suggests I'm not approaching the problem efficiently or effectively, I'm open to direction.
Thanks in advance!

