Your complexity analysis is correct: in both cases, n operations to compute the sum plus n operations to build the result makes O(n).
But before we talk about speed, you have surely noticed that almost_all_b has a side effect whereas almost_all_a is side-effect free. Worse, almost_all_b is not idempotent: if you call almost_all_b repeatedly, the argument numbers will be modified each time. Unless you have a very good reason, you should prefer almost_all_a to almost_all_b, since it is easier to understand and less error prone.
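For reference, here is how I read the two functions from your question (a reconstruction based on the disassembly shown further down; your exact code may differ slightly):

def almost_all_a(numbers):
    sumall = sum(numbers)
    rangelen = range(len(numbers))
    return [sumall - numbers[i] for i in rangelen]

def almost_all_b(numbers):
    sumall = sum(numbers)
    for i in range(len(numbers)):
        numbers[i] = sumall - numbers[i]

A quick demonstration of the side effect and the lack of idempotence:

>>> ns = [1, 2, 3]
>>> almost_all_a(ns), ns  # pure: ns is unchanged
([5, 4, 3], [1, 2, 3])
>>> almost_all_b(ns)  # returns None and mutates ns in place
>>> ns
[5, 4, 3]
>>> almost_all_b(ns)  # a second call changes ns again
>>> ns
[7, 8, 9]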
Benchmark 1
I will try to confirm your assertion (almost_all_a [is] more efficient than almost_all_b) with timeit:
>>> from timeit import timeit
>>> ns=list(range(100))
>>> timeit(lambda: almost_all_a(ns), number=10000)
0.06381335399782984
>>> timeit(lambda: almost_all_b(ns), number=10000)
2.3228586789991823
Wow! almost_all_a is roughly 36 times faster than almost_all_b!!! No. That was a joke. You can see what happened: almost_all_b was applied 10000 times to the same list [0, ..., 99], with the side effect each time. Every call replaces numbers[i] with sumall - numbers[i], and sumall grows along with the elements, so the numbers grew insanely and big-integer arithmetic dominated the timing:
>>> len(str(ns[0])) # number of digits of the first element!
19959
Ok, that was just to convince you to avoid functions with side effects.
Benchmark 2
Now, the real test:
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_a})
5.720672591000039
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_b})
5.937547881
Note that the benchmark might give different results with another list or on another platform. (Think of what would happen if the list took 90% of the available RAM.) But let's assume that we can generalize.
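If you want to reduce the noise, the standard library's timeit.repeat is handy: run the measurement several times and keep the minimum (a sketch, reusing the statement from above; the repeat and number values are arbitrary choices):

from timeit import repeat

for f in (almost_all_a, almost_all_b):
    # Best of 5 runs; the minimum is the least noisy estimate.
    best = min(repeat('ns = list(range(100)); almost_all(ns)',
                      globals={'almost_all': f},
                      repeat=5, number=100000))
    print(f.__name__, best)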
Python bytecode
Let's look at the bytecode:
>>> import dis
>>> dis.dis(almost_all_a)
2 0 LOAD_GLOBAL 0 (sum)
2 LOAD_DEREF 0 (numbers)
4 CALL_FUNCTION 1
6 STORE_DEREF 1 (sumall)
3 8 LOAD_GLOBAL 1 (range)
10 LOAD_GLOBAL 2 (len)
12 LOAD_DEREF 0 (numbers)
14 CALL_FUNCTION 1
16 CALL_FUNCTION 1
18 STORE_FAST 1 (rangelen)
4 20 LOAD_CLOSURE 0 (numbers)
22 LOAD_CLOSURE 1 (sumall)
24 BUILD_TUPLE 2
26 LOAD_CONST 1 (<code object <listcomp> at 0x7fdc551dee40, file "<stdin>", line 4>)
28 LOAD_CONST 2 ('almost_all_a.<locals>.<listcomp>')
30 MAKE_FUNCTION 8
32 LOAD_FAST 1 (rangelen)
34 GET_ITER
36 CALL_FUNCTION 1
38 RETURN_VALUE
And:
>>> dis.dis(almost_all_b)
2 0 LOAD_GLOBAL 0 (sum)
2 LOAD_FAST 0 (numbers)
4 CALL_FUNCTION 1
6 STORE_FAST 1 (sumall)
3 8 SETUP_LOOP 36 (to 46)
10 LOAD_GLOBAL 1 (range)
12 LOAD_GLOBAL 2 (len)
14 LOAD_FAST 0 (numbers)
16 CALL_FUNCTION 1
18 CALL_FUNCTION 1
20 GET_ITER
>> 22 FOR_ITER 20 (to 44)
24 STORE_FAST 2 (i)
4 26 LOAD_FAST 1 (sumall)
28 LOAD_FAST 0 (numbers)
30 LOAD_FAST 2 (i)
32 BINARY_SUBSCR
34 BINARY_SUBTRACT
36 LOAD_FAST 0 (numbers)
38 LOAD_FAST 2 (i)
40 STORE_SUBSCR
42 JUMP_ABSOLUTE 22
>> 44 POP_BLOCK
>> 46 LOAD_CONST 0 (None)
48 RETURN_VALUE
The beginning is almost the same. Then, you have a list comprehension that is like a black box. If we open the box, we see:
>>> dis.dis(almost_all_a.__code__.co_consts[1])
4 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_DEREF 1 (sumall)
10 LOAD_DEREF 0 (numbers)
12 LOAD_FAST 1 (i)
14 BINARY_SUBSCR
16 BINARY_SUBTRACT
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
You have two differences:
- in the list comprehension, sumall and numbers are loaded with LOAD_DEREF and not LOAD_FAST (that's normal for a closure), which should be a little slower;
- in the list comprehension, LIST_APPEND replaces the assignment to numbers[i] (LOAD_FAST(numbers)/LOAD_FAST(i)/STORE_SUBSCR at offsets 36-40).
My guess is that the overhead comes from that assignment.
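If you prefer not to eyeball the disassembly, dis.get_instructions lets you tally the opcodes of the two loop bodies (a quick sketch, on the same Python version as the disassembly above, where the comprehension is a nested code object):

import dis
from collections import Counter

# The list comprehension of almost_all_a is a nested code object.
listcomp = almost_all_a.__code__.co_consts[1]
print(Counter(ins.opname for ins in dis.get_instructions(listcomp)))
print(Counter(ins.opname for ins in dis.get_instructions(almost_all_b)))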
Another benchmark
You can rewrite almost_all_a to be even neater, because you don't need the index:
def almost_all_c(numbers):
    sumall = sum(numbers)
    return [sumall - n for n in numbers]
This version is faster (on my example + platform) than almost_all_a:
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_a})
5.755438814000172
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_b})
5.93645353099987
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_c})
4.571863283000084
(Note that, as often in Python, the neater version is the fastest.) The difference between almost_all_a and almost_all_c is the access to the i-th item of numbers (you can disassemble almost_all_c if you want to check).
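For instance, here is a quick check that the loop body of almost_all_c never indexes numbers (again assuming a Python version where the comprehension is a nested code object, as in the disassembly above):

import dis

# The comprehension body of almost_all_c is the nested code object, as above.
listcomp_c = almost_all_c.__code__.co_consts[1]
opnames = {ins.opname for ins in dis.get_instructions(listcomp_c)}
assert 'BINARY_SUBSCR' not in opnames  # no numbers[...] lookup inside the loop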
Conclusion
It seems that the bottleneck here is the access to the i-th item of numbers:
- once in almost_all_a;
- twice in almost_all_b (one read, one write);
- never in almost_all_c.
That's why almost_all_a is faster than almost_all_b.
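To double-check that the indexing is indeed what matters, you can time the two loop bodies in isolation (a sketch; the exact numbers will vary by platform):

from timeit import timeit

ns = list(range(100))
s = sum(ns)
# Same subtraction, with and without the ns[i] lookup.
print(timeit('[s - ns[i] for i in range(len(ns))]',
             globals={'ns': ns, 's': s}, number=100000))
print(timeit('[s - n for n in ns]',
             globals={'ns': ns, 's': s}, number=100000))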