Caveat
This is NOT a duplicate of this question. I'm not interested in finding out my memory consumption, for that matter, as I'm already doing that below. The question is WHY the memory consumption is like this.
Also, even if I did need a way to profile my memory, do note that guppy (the suggested Python memory profiler in the aforementioned link) does not support Python 3, and the alternative guppy3 does not give accurate results whatsoever, yielding output such as (see the actual sizes below):
Partition of a set of 45968 objects. Total size = 5579934 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  13378  29  1225991  22   1225991  22 str
     1  11483  25   843360  15   2069351  37 tuple
     2   2974   6   429896   8   2499247  45 types.CodeType
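For reference, the table above came from a guppy3 heap snapshot along these lines:

from guppy import hpy

heap_snapshot = hpy().heap()  # snapshot of all reachable objects in the current process
print(heap_snapshot)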
Background
Right, so I have this simple script which I'm using to run some RAM consumption tests, reading a file in two different ways:
- reading the file one line at a time, processing each line, and discarding it (via generators), which is efficient and recommended for basically any file size (especially large files), and which works as expected (see the sketch right after this list);
- reading the whole file into memory (I know this is advised against; however, it was just for educational purposes).
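The line-by-line variant looks roughly like this (a minimal sketch of the generator approach; RAM usage stays essentially flat regardless of file size):

import os
import psutil

with open('errors.log') as file_handle:
    process = psutil.Process(os.getpid())
    for line in file_handle:  # the file object yields one line at a time, like a generator
        pass  # process the line here, then let it be garbage-collected
    print(f'RAM usage: {process.memory_info().rss / 1024 ** 2} MB')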
Test script
import os
import psutil
import time
with open('errors.log') as file_handle:
    statistics = os.stat('errors.log')  # See below for contents of this file
    file_size = statistics.st_size / 1024 ** 2
    process = psutil.Process(os.getpid())
    ram_usage_before = process.memory_info().rss / 1024 ** 2
    print(f'File size: {file_size} MB')
    print(f'RAM usage before opening the file: {ram_usage_before} MB')
    file_handle.read()  # load the whole file into memory; the result is not bound to a name
    ram_usage_after = process.memory_info().rss / 1024 ** 2
    print(f'Expected RAM usage after loading the file: {file_size + ram_usage_before} MB')
    print(f'Actual RAM usage after loading the file: {ram_usage_after} MB')
    # time.sleep(30)
Output
File size: 111.75 MB
RAM usage before opening the file: 8.67578125 MB
Expected RAM usage after loading the file: 120.42578125 MB
Actual RAM usage after loading the file: 343.2109375 MB
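To separate the size of the decoded str itself from the process RSS, sys.getsizeof can be pointed at the result of read() (a small sketch; the assumption here is that the file is pure ASCII, which CPython stores compactly at one byte per character plus a small fixed header, per PEP 393):

import sys

with open('errors.log') as file_handle:
    data = file_handle.read()
    print(f'str object size: {sys.getsizeof(data) / 1024 ** 2} MB')
    print(f'characters read: {len(data)}')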
I also enabled the 30-second sleep at the end of the script so I could check the usage at the OS level with ps and awk ($6 is the RSS in KiB, hence the division by 1024), using the following command:
ps aux | awk '{print $6/1024 " MB\t\t" $11}'  | sort -n
which yields:
...
343.176 MB      python  # my script
619.883 MB      /Applications/PyCharm.app/Contents/MacOS/pycharm
2277.09 MB      com.docker.hyperkit
The file contains about 800K copies of the following line:
[2019-09-22 16:50:17,236] ERROR in views, line 62: 404 Not Found: The
following URL: http://localhost:5000/favicon.ico was not found on the
server.
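For anyone wanting to reproduce this, the file can be regenerated with something like the following (the 800K count is approximate; the sample line above is one physical line, wrapped here for display):

line = ('[2019-09-22 16:50:17,236] ERROR in views, line 62: 404 Not Found: The '
        'following URL: http://localhost:5000/favicon.ico was not found on the '
        'server.\n')
with open('errors.log', 'w') as file_handle:
    for _ in range(800_000):  # approximate count; yields a file of roughly this size
        file_handle.write(line)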
Is it because of block sizes or dynamic allocation, whereby the contents would be loaded in blocks and a lot of that memory would actually be unused?
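One way I imagine this could be narrowed down (a sketch I have not benchmarked) is to read the same file in binary mode, which skips the text-decoding machinery entirely, and compare the RSS growth with the text-mode numbers above:

import os
import psutil

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss / 1024 ** 2
with open('errors.log', 'rb') as file_handle:  # binary mode: raw bytes, no decoding
    raw = file_handle.read()
rss_after = process.memory_info().rss / 1024 ** 2
print(f'RSS growth for {len(raw)} bytes: {rss_after - rss_before} MB')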