3

I have a huge log file of around 3.5 GB and would like to sample random sections of, say, 10 MB from the middle of it, to debug what my application is doing.

I could use the head or tail commands to get the beginning or end of the file, but how can I grab an arbitrary portion from the middle? I guess I could do something like head -n 1.75GB | tail -n 10MB, but that seems clumsy, and I'd need to work out the line numbers that correspond to the 1.75 GB midpoint and the 10 MB length.

WilliamKF
  • 8,058

3 Answers

6
$ dd if=big_file.bin skip=1750 ibs=1MB count=10 of=big_file.bin.part

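Here ibs=1MB sets the input block size, skip=1750 skips that many input blocks before copying, and count=10 copies ten blocks. With GNU dd, 1MB means 1,000,000 bytes (1M would be 1,048,576). You might want to spend some time reading and understanding dd.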
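For comparison, here is a minimal Python sketch of the same seek-then-copy idea, in case dd is not available. The function name extract_range and its parameters are made up for this illustration; it is not how dd is implemented.

```python
def extract_range(src, dst, offset, length, chunk_size=1024 * 1024):
    """Copy up to `length` bytes starting at 0-based byte `offset` from src to dst.

    Hypothetical helper for illustration; returns the number of bytes copied.
    """
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fin.seek(offset)  # jump straight to the start; earlier bytes are never read
        while copied < length:
            # Read in bounded chunks so the whole range is never held in memory.
            chunk = fin.read(min(chunk_size, length - copied))
            if not chunk:  # hit end of file before `length` bytes were copied
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied
```

Assuming decimal megabytes as in the dd command, the equivalent call would be extract_range("big_file.bin", "big_file.bin.part", 1750 * 1000**2, 10 * 1000**2).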

kmk
  • 86
5

You can use tail, but specify a byte offset instead of a line count.

tail -c +"$START_BYTE" "$file" | head -c "$LENGTH" > newfile

That way tail can jump directly to the starting point (without counting newlines), and once head has written the requested number of bytes, it exits. Note that the offset given to tail -c + is 1-based: +1 means "start at the first byte".
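Since the question asks for random sections, one way to choose the two values is sketched below. The name pick_window is hypothetical; it only computes a start byte and a length and leaves the actual extraction to tail and head.

```python
import os
import random

def pick_window(path, length):
    """Return (start_byte, length) for a random window inside the file.

    start_byte is 1-based, which is what `tail -c +N` expects.
    Hypothetical helper for illustration.
    """
    size = os.path.getsize(path)
    length = min(length, size)                 # never ask for more than the file holds
    offset = random.randint(0, size - length)  # 0-based offset; both ends inclusive
    return offset + 1, length
```

The two returned numbers can be passed to the command as START_BYTE and LENGTH.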

1

You just have to write a little program that seeks to a random spot and reads a number of lines from there.

An example in Python (reads one line, but you can modify it):

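import os
import random

def get_random_line(path="/some/file.txt"):
    """Return a randomly selected line from a file."""
    size = os.path.getsize(path)  # file objects have no .size attribute
    # Binary mode: Python 3 text files do not allow relative backward seeks.
    with open(path, "rb") as fo:
        fo.seek(random.randrange(size))  # raises ValueError on an empty file
        # Step backwards until just past the previous newline (or the file start).
        while fo.tell() > 0:
            fo.seek(-1, os.SEEK_CUR)
            if fo.read(1) == b"\n":
                break
            fo.seek(-1, os.SEEK_CUR)
        line = fo.readline().strip()
    return line.decode()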
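To get closer to the roughly 10 MB sample the question asks for, the same seek idea can be extended to collect whole lines until a byte budget is reached. This is only a sketch: sample_lines and approx_bytes are names invented here, and it skips forward to the next line boundary rather than scanning backwards.

```python
import os
import random

def sample_lines(path, approx_bytes):
    """Return a list of complete lines (as bytes) from a random spot in the file.

    Hypothetical helper: stops once about `approx_bytes` bytes are collected.
    """
    size = os.path.getsize(path)
    lines = []
    total = 0
    with open(path, "rb") as fo:
        # Choose a start that leaves room for the requested amount of data.
        start = random.randrange(max(size - approx_bytes, 1))
        fo.seek(start)
        if start > 0:
            fo.readline()  # discard the (probably partial) line we landed in
        for line in fo:
            lines.append(line)
            total += len(line)
            if total >= approx_bytes:
                break
    return lines
```

For example, sample_lines("/some/file.txt", 10 * 1000**2) would return about 10 MB of whole lines, or fewer if the end of the file is reached first.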
Keith
  • 8,293