I have a large data frame of authors and corresponding texts (approximately 450,000 records). From the data frame I extracted two vectors, one for authors and one for texts, such as:
author <- c("Sallust",
"Tacitus",
"Justin",
"Cato the Elder",
"Claudius",
"Quintus Fabius Pictor",
"Justin",
"Claudius",
"Cato the Elder",
"Tacitus",
"Sallust")
text <- c("Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet")
My goal is to split the data set into chunks that are small enough to be text mined, while keeping all the records with the same author in the same chunk.
I noticed that extracting the vectors author and text from the original data frame is fast BUT combining the extracted vectors in a new data frame is extremely slow. So I guess I should avoid creating the data frame with all the records.
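One possible cause of the slowdown, offered only as an assumption since the post does not show how the data frame is built: before R 4.0.0, data.frame() converted character columns to factors by default, which is expensive when almost every text is unique. A minimal sketch on toy data that keeps both columns as plain character vectors:

```r
# Toy data standing in for the real vectors (hypothetical values)
author <- c("Sallust", "Tacitus", "Justin")
text   <- rep("Lorem ipsum dolor sit amet", 3)

# stringsAsFactors = FALSE keeps character columns as they are
# (this is already the default from R 4.0.0 onwards)
records <- data.frame(author, text, stringsAsFactors = FALSE)
str(records)
```

If the data frame is still slow to build with character columns, working on the two vectors directly, as described next, avoids the problem altogether.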
Probably the "smart" solution would be:
- Order the vector author alphabetically (to make sure records with the same author are contiguous);
- Order the vector text based on the ordering of the vector author;
- Create a logical vector (TRUE/FALSE) indicating whether the author is the same as the author of the previous value;
- Create a vector splitAt containing the indexes of the vectors author and text where to split;
- Split the vectors.
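The steps above can also be sketched in a vectorised form. This is only one possible reading of the procedure, not a definitive implementation: it replaces the logical vector with run lengths from rle(), and it treats max_length as a soft target, so a chunk can grow beyond it when an author's records straddle a boundary. The toy data and the names runs, chunk_of_author and chunk_id are made up for illustration.

```r
# Toy data standing in for the real vectors (hypothetical values)
author <- c("Sallust", "Tacitus", "Justin", "Tacitus", "Sallust", "Justin", "Cato")
text   <- paste("text", seq_along(author))
max_length <- 3   # assumed soft target for records per chunk

# Steps 1-2: sort both vectors by author so equal authors are contiguous
ord    <- order(author)
author <- author[ord]
text   <- text[ord]

# Step 3 (as run lengths): one run per author, with its number of records
runs <- rle(author)

# Step 4: assign each author to the chunk in which its last record falls
chunk_of_author <- ceiling(cumsum(runs$lengths) / max_length)

# Step 5: expand to one chunk id per record and split both vectors
chunk_id      <- rep(chunk_of_author, runs$lengths)
author_chunks <- split(author, chunk_id)
text_chunks   <- split(text, chunk_id)
```

Because the chunk id is computed once per author and then repeated for every record of that author, no author can end up in two chunks.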
In code, assuming my procedure makes sense, I got the first 3 steps working:
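# Sort vectors (the index is named ord to avoid masking the order() function)
ord <- order(author)
author <- author[ord]
text <- text[ord]
# On the sorted vector, TRUE means same author as the previous record
same_author <- duplicated(author)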
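On a sorted vector, duplicated() marks exactly the records whose author matches the previous record, so the positions where it is FALSE are the only places where a chunk may start. A small check on hypothetical toy data (the names prev_equal, group_start and group_size are introduced here for illustration):

```r
# Toy data, already sorted by author (hypothetical values)
author <- sort(c("Sallust", "Tacitus", "Justin", "Tacitus", "Sallust"))
same_author <- duplicated(author)

# Equivalent explicit comparison of each record with the previous one
prev_equal <- c(FALSE, author[-1] == author[-length(author)])
identical(same_author, prev_equal)

# Positions where a new author begins: the only legal split points
group_start <- which(!same_author)

# Number of records per author, in sorted order
group_size <- rle(author)$lengths
```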
But I don't know how to proceed further. It should probably be something like:
# Index for splitting
max_length <- 2
num_records <- length(author)
num_chunks <- num_records %/% max_length
# Start indexes of the chunks; the first chunk always starts at record 1
splitAt <- 1
for (n in seq_len(num_chunks)) {
  index <- n * max_length + 1
  # Move forward until the record starts a new author
  while (index <= num_records && same_author[index]) {
    index <- index + 1
  }
  # Store the boundary unless it is past the end or already stored
  if (index <= num_records && index > splitAt[length(splitAt)]) {
    splitAt <- append(splitAt, index)
  }
}
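For the last step, splitting the vectors, one option is to turn the split indexes into a chunk number per record with findInterval() and pass that to split(). This is a sketch under the assumption that splitAt holds the first index of each chunk in increasing order, starting at 1; the toy vectors and the name chunk_id are made up for illustration.

```r
# Sorted toy vectors and example split indexes (hypothetical values)
author  <- c("Cato", "Cato", "Claudius", "Justin", "Justin", "Sallust")
text    <- paste("text", seq_along(author))
splitAt <- c(1, 3, 6)   # first record of each chunk

# findInterval() returns, for each position, how many split indexes are <= it,
# which is the number of the chunk the record belongs to
chunk_id <- findInterval(seq_along(author), splitAt)

# split() returns a list with one element per chunk
author_chunks <- split(author, chunk_id)
text_chunks   <- split(text, chunk_id)

# Sanity check: no author appears in more than one chunk
stopifnot(!anyDuplicated(unlist(lapply(author_chunks, unique))))
```

Each element of text_chunks can then be text mined on its own, for example with lapply(), without ever building the full data frame.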