Since the requirement was added that the lines selected from the file have a statistically uniform distribution, I offer this simple approach.
"""randsamp - extract a random subset of n lines from a large file"""
import random

def scan_linepos(path):
    """return a list of seek offsets of the beginning of each line"""
    linepos = []
    offset = 0
    with open(path) as inf:
        # WARNING: CPython 2.7 file.tell() is not accurate on file.next()
        for line in inf:
            linepos.append(offset)
            offset += len(line)
    return linepos

def sample_lines(path, linepos, nsamp):
    """return nsamp lines from path where line offsets are in linepos"""
    offsets = random.sample(linepos, nsamp)
    offsets.sort()  # this may make file reads more efficient
    lines = []
    with open(path) as inf:
        for offset in offsets:
            inf.seek(offset)
            lines.append(inf.readline())
    return lines

dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once
lines = sample_lines(dataset, linepos, nsamp)
print 'selecting %d lines from a file of %d' % (nsamp, len(linepos))
print ''.join(lines)
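Since the offset scan only needs to happen once per file, it may be worth saving the index to disk so later runs can skip it. A minimal sketch using pickle (the load_linepos helper and the cache filename are my own invention, not part of the code above):

import os
import pickle

def load_linepos(path, cache='big_data.linepos.pkl'):
    """return the offset index, reusing a pickled cache when one exists"""
    if os.path.exists(cache):
        with open(cache, 'rb') as f:
            return pickle.load(f)
    linepos = scan_linepos(path)        # the expensive full scan
    with open(cache, 'wb') as f:
        pickle.dump(linepos, f)
    return linepos

Any serialization would do here; pickle is just the path of least resistance.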
I tested it on a mock data file of 3 million lines comprising 1.7GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop.
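For reference, a mock file of roughly that shape (3 million lines of about 570 bytes each works out to ~1.7GB) can be generated with a sketch along these lines; the exact contents are unimportant and this is only an illustration, not the file I measured:

import string

def make_mock_file(path, nlines=3 * 10**6, linelen=570):
    """write nlines lines of linelen characters each"""
    pad = (string.ascii_lowercase + ' ') * 30   # filler text, truncated below
    with open(path, 'w') as outf:
        for i in xrange(nlines):                # use range() on Python 3
            outf.write(('%09d %s' % (i, pad))[:linelen] + '\n')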
Just to check the performance of sample_lines, I used the timeit module like so:
import timeit
t = timeit.Timer('sample_lines(dataset, linepos, nsamp)', 
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
        elapsed, (elapsed/trials) * (10 ** 6))
I ran this for various values of nsamp: when nsamp was 100, a single sample_lines call completed in 460µs, and the cost scaled linearly with nsamp up to 47ms per call at 10k samples.
The natural next question is "Random is barely random at all?", and the answer is "sub-cryptographic, but certainly fine for bioinformatics".
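If stronger randomness were ever wanted, random.SystemRandom draws on os.urandom and exposes the same sample() method, so swapping it in is a one-line change (shown only as an illustration; the timings above used the plain random module):

import random

sysrand = random.SystemRandom()             # OS entropy via os.urandom
offsets = sysrand.sample(linepos, nsamp)    # drop-in replacement for random.sample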