@Nathan's suggestion is surprisingly not the solution, due to some subtle details of CPython's long-integer implementation. According to his explanation, the memory footprint for
...
lst[i] = (1<<30)+i
should still be 40.52, because sys.getsizeof(1<<30) is 32, but the measurements show it to be 48.56. On the other hand, for
...
lst[i] = (1<<60)+i
the footprint is still 48.56, despite the fact that sys.getsizeof(1<<60) is 36.
The reason: sys.getsizeof() doesn't tell us the real memory footprint of the result of an addition, i.e. of a+b, which is
- 32 bytes for 1000+i
- 36 bytes for (1<<30)+i
- 40 bytes for (1<<60)+i
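A quick sketch shows that sys.getsizeof cannot see this over-allocation: it reports the same size for the sum as for its larger operand, 4 bytes (one 30-bit digit on a 64-bit build) less than what was really allocated:

```python
import sys

# sys.getsizeof sees only the normalized size: the sum reports the
# same size as its larger operand, even though the addition allocated
# one spare 4-byte digit internally.
for n in (1000, 1 << 30, 1 << 60):
    print(n.bit_length(), sys.getsizeof(n + 1), sys.getsizeof(n))
```

The last two columns are equal in every row, while the real footprints listed above are 4 bytes bigger.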
That happens because when two integers are added in x_add, the resulting integer gets at first one "digit", i.e. 4 bytes, more than the larger of a and b:
static PyLongObject *
x_add(PyLongObject *a, PyLongObject *b)
{
    Py_ssize_t size_a = Py_ABS(Py_SIZE(a)), size_b = Py_ABS(Py_SIZE(b));
    PyLongObject *z;
    ...
    /* Ensure a is the larger of the two: */
    ...
    z = _PyLong_New(size_a+1);
    ...
After the addition the result is normalized:

    ...
    return long_normalize(z);
}
i.e. possible leading zeros are discarded, but the memory isn't released (4 bytes aren't worth it). The source of the function can be found here.
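The arithmetic behind the three footprints above can be sketched in Python. The constants 24 (object header up to ob_digit) and 4 (one 30-bit digit) are assumptions matching the author's 64-bit build:

```python
# Sketch of what x_add really allocates, assuming a 64-bit CPython
# with 30-bit digits. HEADER and DIGIT are assumed build constants.
HEADER = 24  # assumed offsetof(PyLongObject, ob_digit)
DIGIT = 4    # assumed sizeof(digit)

def ndigits(n):
    """Number of 30-bit digits needed to store a non-negative int."""
    d = 0
    while n:
        d += 1
        n >>= 30
    return max(d, 1)

def alloc_size(a, b):
    # x_add over-allocates one digit beyond the larger operand.
    return HEADER + (max(ndigits(a), ndigits(b)) + 1) * DIGIT

print(alloc_size(1000, 1))     # -> 32
print(alloc_size(1 << 30, 1))  # -> 36
print(alloc_size(1 << 60, 1))  # -> 40
```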
Now we can use @Nathan's insight to explain why the footprint of (1<<30)+i is 48.56 and not 44.xy: the pymalloc allocator uses memory blocks with an alignment of 8 bytes, which means the 36 bytes will be stored in a block of size 40, the same as for the result of (1<<60)+i (keep the additional 8 bytes for the pointers in mind).
To explain the remaining 0.5 bytes we need to dive deeper into the details of the pymalloc allocator. A good overview is the source code itself; my last attempt to describe it can be found in this SO post.
In a nutshell: the allocator manages memory in arenas of 256KB each. When an arena is allocated, the memory is reserved but not committed; we see memory as "used" only once a so-called pool is touched. A pool is 4KB big (POOL_SIZE) and is used only for memory blocks of the same size, in our case 32 bytes. That means the resolution of peak_used_memory is 4KB and cannot be responsible for those 0.5 bytes.
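On CPython, the arena/pool/size-class machinery described above can be inspected directly with the (private, implementation-specific) helper sys._debugmallocstats, which dumps the allocator state to stderr:

```python
import sys

# Dumps pymalloc arena/pool/size-class statistics to stderr.
# This is a private CPython helper, so guard against its absence
# on other implementations.
if hasattr(sys, "_debugmallocstats"):
    sys._debugmallocstats()
```

In the dump you can see one line per size class (e.g. the 32-byte class) with the number of pools and allocated blocks.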
However, these pools must be managed, which leads to additional overhead: pymalloc needs a pool_header per pool:
/* Pool for small blocks. */
struct pool_header {
    union { block *_padding;
            uint count; } ref;          /* number of allocated blocks    */
    block *freeblock;                   /* pool's free list head         */
    struct pool_header *nextpool;       /* next pool of this size class  */
    struct pool_header *prevpool;       /* previous pool       ""        */
    uint arenaindex;                    /* index into arenas of base adr */
    uint szidx;                         /* block size class index        */
    uint nextoffset;                    /* bytes to virgin block         */
    uint maxnextoffset;                 /* largest valid nextoffset      */
};
The size of this struct is 48 bytes (called POOL_OVERHEAD) on my Linux_64 machine. This pool_header is part of the pool itself (quite a smart way to avoid an additional allocation via the C-runtime memory allocator) and takes the place of two 32-byte blocks, which means a pool has room for 126 32-byte integers:
/* Return total number of blocks in pool of size index I, as a uint. */
#define NUMBLOCKS(I) ((uint)(POOL_SIZE - POOL_OVERHEAD) / INDEX2SIZE(I))
This leads to:

4KB/126 = 32.51 bytes footprint for 1000+i, plus an additional 8 bytes for the pointer.
(1<<30)+i needs 40 bytes, which means 4KB has room for 102 blocks, of which one is used for the pool_header (there are 16 bytes left over when the pool is divided into 40-byte blocks, and together with one block they can hold the pool_header), which leads to 4KB/101 = 40.55 bytes (plus the 8-byte pointer).
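The arithmetic of the last two paragraphs can be checked directly (POOL_SIZE and POOL_OVERHEAD as above; the extra 8 bytes are the list's pointer per element):

```python
# Per-integer footprint: a 4KB pool loses POOL_OVERHEAD bytes to its
# pool_header; the whole pool is charged to the remaining usable
# blocks, plus 8 bytes for the list's pointer to each integer.
POOL_SIZE = 4096      # 4KB
POOL_OVERHEAD = 48    # sizeof(struct pool_header)

def footprint(block_size):
    nblocks = (POOL_SIZE - POOL_OVERHEAD) // block_size  # NUMBLOCKS
    return POOL_SIZE / nblocks + 8

print(round(footprint(32), 2))  # 1000+i    -> 40.51
print(round(footprint(40), 2))  # (1<<30)+i -> 48.55
```

Both values are just below the measured 40.52 and 48.56, leaving the small residual overhead discussed next.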
We can also see that there is some additional overhead, responsible for ca. 0.01 bytes per integer - not big enough for me to care.