Your complexity analysis is correct: in both cases, n operations to compute the sum plus n operations to build the result makes O(n).
But before we talk about speed, you have surely noticed that almost_all_b has a side effect whereas almost_all_a is side-effect free. Worse, almost_all_b is not idempotent: if you call almost_all_b repeatedly, the argument numbers will be modified each time. Unless you have a very good reason, you should prefer almost_all_a to almost_all_b, since it is easier to understand and less error prone.
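For reference, here is how I read the two functions from your question (a reconstruction based on the disassembly shown further down; your exact code may differ slightly):

def almost_all_a(numbers):
    sumall = sum(numbers)
    rangelen = range(len(numbers))
    return [sumall - numbers[i] for i in rangelen]

def almost_all_b(numbers):
    sumall = sum(numbers)
    for i in range(len(numbers)):
        numbers[i] = sumall - numbers[i]

A quick demonstration of the side effect and the lack of idempotence:

>>> ns = [1, 2, 3]
>>> almost_all_a(ns), ns  # pure: ns is unchanged
([5, 4, 3], [1, 2, 3])
>>> almost_all_b(ns)  # returns None and mutates ns in place
>>> ns
[5, 4, 3]
>>> almost_all_b(ns)  # a second call changes ns again
>>> ns
[7, 8, 9]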
Benchmark 1
I will try to confirm your assertion (almost_all_a [is] more efficient than almost_all_b) with timeit:
>>> from timeit import timeit
>>> ns=list(range(100))
>>> timeit(lambda: almost_all_a(ns), number=10000)
0.06381335399782984
>>> timeit(lambda: almost_all_b(ns), number=10000)
2.3228586789991823
Wow! almost_all_a is roughly 36 times faster than almost_all_b!!! No. That was a joke. You can see what happened: almost_all_b was applied 10000 times to the same list [0, ..., 99], with the side effect each time. Every call replaces numbers[i] with sumall - numbers[i], and sumall grows along with the elements, so the numbers grew insanely and big-integer arithmetic dominated the timing:
>>> len(str(ns[0])) # number of digits of the first element!
19959
Ok, that was just to convince you to avoid functions with side effects.
Benchmark 2
Now, the real test:
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_a})
5.720672591000039
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_b})
5.937547881
Note that the benchmark might give different results with another list or on another platform. (Think of what would happen if the list took 90% of the available RAM.) But let's assume that we can generalize.
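If you want to reduce the noise, the standard library's timeit.repeat is handy: run the measurement several times and keep the minimum (a sketch, reusing the statement from above; the repeat and number values are arbitrary choices):

from timeit import repeat

for f in (almost_all_a, almost_all_b):
    # Best of 5 runs; the minimum is the least noisy estimate.
    best = min(repeat('ns = list(range(100)); almost_all(ns)',
                      globals={'almost_all': f},
                      repeat=5, number=100000))
    print(f.__name__, best)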
Python bytecode
Let's look at the bytecode:
>>> import dis
>>> dis.dis(almost_all_a)
2 0 LOAD_GLOBAL 0 (sum)
2 LOAD_DEREF 0 (numbers)
4 CALL_FUNCTION 1
6 STORE_DEREF 1 (sumall)
3 8 LOAD_GLOBAL 1 (range)
10 LOAD_GLOBAL 2 (len)
12 LOAD_DEREF 0 (numbers)
14 CALL_FUNCTION 1
16 CALL_FUNCTION 1
18 STORE_FAST 1 (rangelen)
4 20 LOAD_CLOSURE 0 (numbers)
22 LOAD_CLOSURE 1 (sumall)
24 BUILD_TUPLE 2
26 LOAD_CONST 1 (<code object <listcomp> at 0x7fdc551dee40, file "<stdin>", line 4>)
28 LOAD_CONST 2 ('almost_all_a.<locals>.<listcomp>')
30 MAKE_FUNCTION 8
32 LOAD_FAST 1 (rangelen)
34 GET_ITER
36 CALL_FUNCTION 1
38 RETURN_VALUE
And:
>>> dis.dis(almost_all_b)
2 0 LOAD_GLOBAL 0 (sum)
2 LOAD_FAST 0 (numbers)
4 CALL_FUNCTION 1
6 STORE_FAST 1 (sumall)
3 8 SETUP_LOOP 36 (to 46)
10 LOAD_GLOBAL 1 (range)
12 LOAD_GLOBAL 2 (len)
14 LOAD_FAST 0 (numbers)
16 CALL_FUNCTION 1
18 CALL_FUNCTION 1
20 GET_ITER
>> 22 FOR_ITER 20 (to 44)
24 STORE_FAST 2 (i)
4 26 LOAD_FAST 1 (sumall)
28 LOAD_FAST 0 (numbers)
30 LOAD_FAST 2 (i)
32 BINARY_SUBSCR
34 BINARY_SUBTRACT
36 LOAD_FAST 0 (numbers)
38 LOAD_FAST 2 (i)
40 STORE_SUBSCR
42 JUMP_ABSOLUTE 22
>> 44 POP_BLOCK
>> 46 LOAD_CONST 0 (None)
48 RETURN_VALUE
The beginning is almost the same. Then, you have a list comprehension that is like a black box. If we open the box, we see:
>>> dis.dis(almost_all_a.__code__.co_consts[1])
4 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_DEREF 1 (sumall)
10 LOAD_DEREF 0 (numbers)
12 LOAD_FAST 1 (i)
14 BINARY_SUBSCR
16 BINARY_SUBTRACT
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
You have two differences:
- in the list comprehension, sumall and numbers are loaded with LOAD_DEREF and not LOAD_FAST (that's normal for a closure), which should be a little slower;
- in the list comprehension, LIST_APPEND replaces the assignment to numbers[i] (LOAD_FAST(numbers)/LOAD_FAST(i)/STORE_SUBSCR at offsets 36-40).
My guess is that the overhead comes from that assignment.
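If you prefer not to eyeball the disassembly, dis.get_instructions lets you tally the opcodes of the two loop bodies (a quick sketch, on the same Python version as the disassembly above, where the comprehension is a nested code object):

import dis
from collections import Counter

# The list comprehension of almost_all_a is a nested code object.
listcomp = almost_all_a.__code__.co_consts[1]
print(Counter(ins.opname for ins in dis.get_instructions(listcomp)))
print(Counter(ins.opname for ins in dis.get_instructions(almost_all_b)))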
Another benchmark
You can rewrite almost_all_a to be even neater, because you don't need the index:
def almost_all_c(numbers):
    sumall = sum(numbers)
    return [sumall - n for n in numbers]
This version is faster (on my example + platform) than almost_all_a:
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_a})
5.755438814000172
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_b})
5.93645353099987
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_c})
4.571863283000084
(Note that, as often in Python, the neater version is the fastest.) The difference between almost_all_a and almost_all_c is the access to the i-th item of numbers (you can disassemble almost_all_c if you want to check).
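For instance, here is a quick check that the loop body of almost_all_c never indexes numbers (again assuming a Python version where the comprehension is a nested code object, as in the disassembly above):

import dis

# The comprehension body of almost_all_c is the nested code object, as above.
listcomp_c = almost_all_c.__code__.co_consts[1]
opnames = {ins.opname for ins in dis.get_instructions(listcomp_c)}
assert 'BINARY_SUBSCR' not in opnames  # no numbers[...] lookup inside the loop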
Conclusion
It seems that the bottleneck here is the access to the i-th item of numbers:
- once in almost_all_a;
- twice in almost_all_b (one read, one write);
- never in almost_all_c.
That's why almost_all_a is faster than almost_all_b.
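To double-check that the indexing is indeed what matters, you can time the two loop bodies in isolation (a sketch; the exact numbers will vary by platform):

from timeit import timeit

ns = list(range(100))
s = sum(ns)
# Same subtraction, with and without the ns[i] lookup.
print(timeit('[s - ns[i] for i in range(len(ns))]',
             globals={'ns': ns, 's': s}, number=100000))
print(timeit('[s - n for n in ns]',
             globals={'ns': ns, 's': s}, number=100000))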