I have a very large (11GB) .json file (yeah, whoever thought that was a great idea?) that I need to sample (read k random lines).
I'm not very savvy with Java file I/O, but I have, of course, found this post: How to get a random line of a text file in Java?
I'm ruling out the accepted answer because reading every single line of an 11GB file just to select one (or rather k) of its roughly 100k lines is clearly far too slow.
Fortunately, there is a second suggestion posted there that I think might be of better use to me:
1. Use RandomAccessFile to seek to a random byte position in the file.
2. Seek left and right to the next line terminator. Let L be the line between them.
3. With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.
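For reference, here is how I read those three steps, as a minimal sketch rather than a definitive implementation. The class name, method name, and the minLineLength parameter are my own hypothetical choices. It assumes a non-empty file with '\n' terminators, and that minLineLength is no larger than the byte length (terminator included) of the shortest line; otherwise the result is no longer uniform. Instead of seeking right byte by byte, it lets readLine() consume the line and uses the file pointer to find the right end. Note that readLine() maps each byte to one char, so non-ASCII text would come back garbled.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ThreadLocalRandom;

class RandomLineSampler {
    /**
     * Returns one line of the file, each line being equally likely, provided
     * minLineLength (hypothetical parameter) does not exceed the byte length
     * of the shortest line, terminator included.
     */
    static String randomLine(RandomAccessFile raf, int minLineLength) throws IOException {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long size = raf.length();
        while (true) {
            // Step 1: pick a random byte position in the file.
            long start = rnd.nextLong(size);
            // Step 2a: seek left until the previous byte is '\n' or the file begins.
            while (start > 0) {
                raf.seek(start - 1);
                if (raf.read() == '\n') break;
                start--;
            }
            // Step 2b: readLine() consumes the line and its terminator, so the
            // file pointer afterwards marks the right end of the line.
            raf.seek(start);
            String line = raf.readLine();
            long lengthInBytes = raf.getFilePointer() - start;
            // Step 3: accept with probability minLineLength / lengthInBytes,
            // otherwise start over with a new random position.
            if (rnd.nextDouble() * lengthInBytes < minLineLength) return line;
        }
    }
}
```

Calling randomLine(raf, bound) k times would give k samples with replacement. The closer the bound is to the real minimum line length, the fewer retries are needed; with a bound of 1 and very long lines, almost every attempt is rejected. Reading single bytes through RandomAccessFile is also slow, so a real version would probably scan backwards in chunks.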
So far so good, but I was wondering about that "let L be the line between them".
I would have done something like this (untested):
RandomAccessFile raf = ...
long pos = ...
String line = getLine(raf,pos);
...
where
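// Returns the complete line that contains byte offset `start`.
private String getLine(RandomAccessFile raf, long start) throws IOException {
    long pos = start;
    // Step back one byte at a time until the byte before pos is '\n'
    // or the beginning of the file is reached. read() is used instead of
    // readChar(), which would consume two bytes as a single UTF-16 char.
    while (pos > 0) {
        raf.seek(pos - 1);
        if (raf.read() == '\n') break;
        pos--;
    }
    // pos is now the first byte of the line.
    raf.seek(pos);
    return raf.readLine();
}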
and then worked with line.length(), which removes the need to explicitly seek to the right end of the line.
So why "seek left and right to the next line terminator"? Is there a more convenient way to get the line from these two offsets?
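To make the last question concrete: if both offsets are already known, one option would be to read the bytes between them in a single call and decode them explicitly. This is only a sketch; the class and method names are hypothetical, UTF-8 is an assumption about the file, and `end` is taken to be the offset of the terminator (exclusive). Unlike readLine(), which turns every byte into one char, this decodes multi-byte characters properly.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class LineBetweenOffsets {
    /** Reads the bytes in [start, end) and decodes them as UTF-8 (assumed encoding). */
    static String read(RandomAccessFile raf, long start, long end) throws IOException {
        // toIntExact throws ArithmeticException if the span does not fit in an int.
        byte[] buf = new byte[Math.toIntExact(end - start)];
        raf.seek(start);
        raf.readFully(buf); // reads exactly buf.length bytes or throws EOFException
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```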