copy from un-initialized buffer is much faster than from initialized buffer

Question

I was tasked to develop a test software to generate 100Gbps of traffic over 1 TCP socket on Linux (X86-64, kernel 4.15) on a machine with 32GB of RAM.

I developed something like the following code (removed some sanity checks for simplicity) to run on a pair of veth interfaces (one of them is in a different netns).

It generate about 60Gbps on my PC according to bmon,an open source software. To my surprise, if I remove the statement memset(buff, 0, size);, I get about 94Gbps. That's very puzzling.

void test(int sock) {
    int size = 500 * 0x100000;
    char *buff = malloc(size);
    //optional
    memset(buff, 0, size);
    int offset = 0;
    int chunkSize = 0x200000;
    while (1) {
        offset = 0;
        while (offset < size) {
            chunkSize = size - offset;
            if (chunkSize > CHUNK_SIZE) chunkSize = CHUNK_SIZE;
            send(sock, &buff[offset], chunkSize, 0);
            offset += chunkSize;
        }
    }
}

I did some experiments by replacing memset(buff, 0, size); with the following (initialize a portion of the buff),

memset(buff, 0, size * ratio);

if ratio is 0, the throughput is highest at around 94Gbps, as the ratio goes up to 100 percent (1.0), the throughput goes down to around 60Gbps. If the ratio is 0.5 (50%), the throughput goes to about 72Gbps

Appreciate any light you can shed on this.

Edit 1. Here is a relatively complete code that shows the effect that copy on initialized buffer appearing to be slower.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

int size = 500 * 0x100000;
char buf[0x200000];

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

void test1(int init) {
    char *p = malloc(size);
    int offset = 0;
    if (init) memset(p, 0, size);
    double startTs = getTS();
    for (int i = 0; i < 100; i++) {
        offset = 0;
        while (offset < size) {
            memcpy(&buf[0], p+offset, 0x200000);
            offset += 0x200000;
        }
    }
    printf("took %f secs\n", getTS() - startTs);
}

int main(int argc, char *argv[]) {
    test1(argc > 1);
    return 0;
}

On my PC (Linux 18.04, Linux 4.15 with 32GB RAM), tried twice without initialization, it took 1.35 seconds. With initialization, it took 3.02 seconds.

Edit 2. Would love to get sendfile (Thanks @marco-bonelli) as fast as sending from the buffer with all 0 (by calloc). I think it's going to be an requirement for my task pretty soon.

Why is it puzzling that if you do fewer operations, you get a higher transfer rate? — Cheatah, Jun 11 '22 at 20:45
You should try using a smaller buffer and pushing its data repeatedly through the socket. — chqrlie, Jun 11 '22 at 21:08
@chqrlie Thanks for the idea. Allocated a 3MB buffer and read 3MB file data into it, get 80Gbps. If I don't read file data, I get close to 100Gbps. — packetie, Jun 11 '22 at 21:35
@packetie: you could improve this with asynchronous I/O, but it is somewhat tricky. — chqrlie, Jun 11 '22 at 21:38
That's right @chqrlie, async I/O is tricky. Hope there is a easy explanation on what's observed here. — packetie, Jun 11 '22 at 21:48
You are misusing the term, "bandwidth." You mean the throughput. The bandwidth is how fast an interface serializes bits onto the wire. The bandwidth using a 100 Gbps interface is 100 Gbps, regardless if the throughput is 60 or 94 Gbps, because the rate at which the interface serializes bits is 100 Gbps. — Ron Maupin, Jun 11 '22 at 22:00
Bandwidth is capacity, typically expressed in a unit of frequency, e.g.. Hertz. Your measurement of bits per second is not bandwidth but rather transfer rate or throughput. — sawdust, Jun 11 '22 at 22:50
FWIW **malloc()** only gives you *virtual memory*. The physical memory pages are only mapped on-demand. So the **memset()** would force physical pages to be assigned. Perhaps this results in some page swapping going on during the socket transfer??? But when memory has not been initialized (and not mapped to a physical page), then could there be an optimization of the *page fault* (from a read) to just access from a special page (sort of like reading from an unwritten part of a *sparse file*), and not perform the page assignment??? — sawdust, Jun 12 '22 at 08:36
Regardless of this weird quirk with initialization, you should be able to achieve much higher throughput using `vmsplice` or `splice`/`sendfile`. See the relative manual pages. FYI cannot reproduce this on my machines (Linux 4.19 and 5.10) so it could be a weird quirk fixed in later kernel versions than yours. PS: I use `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` and `taskset 1` to reduce noise. — Marco Bonelli, Jun 12 '22 at 16:24
@MarcoBonelli Thanks for your suggestion about using `vmsplice`, `splice` or `sendfile`. I t tried sendfile but I only get about 63Gbps. Haven't tried the others yet. I would really like to figure out how to make sendfile to be as fast as sending uninitialized data since I anticipate that's going to be my next requirement :-) — packetie, Jun 12 '22 at 18:28
@sawdust: Yes, read from a never-written anonymous page will get every page copy-on-write mapped to the same physical page of zeros, so you can get TLB misses but L1d cache hits when reading it. (Applies to the .bss, and memory from mmap(MAP_ANONYMOUS), like glibc calloc and malloc use for large allocations.) I wrote up some details with `perf` results for an experiment on [Why is iterating though \`std::vector\` faster than iterating though \`std::array\`?](https://stackoverflow.com/a/57130924) — Peter Cordes, Jun 12 '22 at 19:25
@PeterCordes: I think you nailed it! I amended my answer and credited your contribution. — chqrlie, Jun 12 '22 at 20:18
@PeterCordes knew it had to be because of the zero page, but I was missing a "why". That's when I thought tagging x86 since OP is dealing with x86 was a good idea, and indeed here you are :'). Good one as usual! Did not know L1D was physically tagged, which is basically what all of this boils down to I suppose. I also suppose you could wrap your comment + a link to [this](https://stackoverflow.com/questions/19039280/physical-or-virtual-addressing-is-used-in-processors-x86-x86-64-for-caching-in-t) in a nice concise answer. — Marco Bonelli, Jun 12 '22 at 21:20
Thanks Peter, your explanation makes perfect sense. Wish I could "star" your comment! — packetie, Jun 13 '22 at 00:12

chqrlie · Answer 1 · 2022-06-12T20:15:06.320

I have been running various tests to investigate this surprising result.

I wrote the test program below that combines various operations in the init phase and the loop:

#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

int alloc_size = 500 * 0x100000;  // 500 MB
char copy_buf[0x200000];    // 2MB

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

// set a word on each page of a memory area
void memtouch(void *buf, size_t size) {
    uint64_t *p = buf;
    size /= sizeof(*p);
    for (size_t i = 0; i < size; i += 4096 / sizeof(*p))
        p[i] = 0;
}

// compute the sum of words on a memory area
uint64_t sum64(const void *buf, size_t size) {
    uint64_t sum = 0;
    const uint64_t *p = buf;
    size /= sizeof(*p);
    for (size_t i = 0; i < size; i++)
        sum += p[i];
    return sum;
}

void test1(int n, int init, int run) {
    int size = alloc_size;
    char msg[80];
    int pos = 0;
    double times[n+1];
    uint64_t sum = 0;
    double initTS = getTS();
    char *p = malloc(size);

    pos = snprintf(msg + pos, sizeof msg - pos, "malloc");
    if (init > 0) {
        memset(p, init - 1, size);
        pos += snprintf(msg + pos, sizeof msg - pos, "+memset%.0d", init - 1);
    } else
    if (init == -1) {
        memtouch(p, size);
        pos += snprintf(msg + pos, sizeof msg - pos, "+memtouch");
    } else
    if (init == -2) {
        sum = sum64(p, size);
        pos += snprintf(msg + pos, sizeof msg - pos, "+sum64");
    } else {
        /* leave p uninitialized */
    }
    pos += snprintf(msg + pos, sizeof msg - pos, "+rep(%d, ", n);
    if (run > 0) {
        pos += snprintf(msg + pos, sizeof msg - pos, "memset%.0d)", run - 1);
    } else
    if (run < 0) {
        pos += snprintf(msg + pos, sizeof msg - pos, "sum64)");
    } else {
        pos += snprintf(msg + pos, sizeof msg - pos, "memcpy)");
    }
    double startTS = getTS();
    for (int i = 0; i < n; i++) {
        if (run > 0) {
            memset(p, run - 1, size);
        } else
        if (run < 0) {
            sum = sum64(p, size);
        } else {
            int offset = 0;
            while (offset < size) {
                memcpy(copy_buf, p + offset, 0x200000);
                offset += 0x200000;
            }
        }
        times[i] = getTS();
    }
    double firstTS = times[0] - startTS;
    printf("%f + %f", startTS - initTS, firstTS);
    if (n > 2) {
        double avgTS = (times[n - 2] - times[0]) / (n - 2);
        printf(" / %f", avgTS);
    }
    if (n > 1) {
        double lastTS = times[n - 1] - times[n - 2];
        printf(" / %f", lastTS);
    }
    printf(" secs  %s", msg);
    if (sum != 0) {
        printf("  sum=%016llx", (unsigned long long)sum);
    }
    printf("\n");
    free(p);
}

int main(int argc, char *argv[]) {
    int n = 4;
    if (argc < 2) {
        test1(n, 0, 0);
        test1(n, 0, 1);
        test1(n, 0, -1);
        test1(n, 1, 0);
        test1(n, 1, 1);
        test1(n, 1, -1);
        test1(n, 2, 0);
        test1(n, 2, 1);
        test1(n, 2, -1);
        test1(n, -1, 0);
        test1(n, -1, 1);
        test1(n, -1, -1);
        test1(n, -2, 0);
        test1(n, -2, 1);
        test1(n, -2, -1);
    } else {
        test1(argc > 1 ? strtol(argv[1], NULL, 0) : n,
              argc > 2 ? strtol(argv[2], NULL, 0) : 0,
              argc > 3 ? strtol(argv[2], NULL, 0) : 0);
    }
    return 0;
}

Running it on an old linux box running Debian Linux 3.16.0-11-amd64, I got these timings:

The columns are

init phase
first iteration of the loop
average of the second to penultimate iterations
last iteration of the loop
sequence of operations

0.000071 + 0.242601 / 0.113761 / 0.113711 secs  malloc+rep(4, memcpy)
0.000032 + 0.349896 / 0.125809 / 0.125681 secs  malloc+rep(4, memset)
0.000032 + 0.190461 / 0.049150 / 0.049210 secs  malloc+rep(4, sum64)
0.350089 + 0.186691 / 0.186705 / 0.186548 secs  malloc+memset+rep(4, memcpy)
0.350078 + 0.125603 / 0.125632 / 0.125531 secs  malloc+memset+rep(4, memset)
0.349931 + 0.105991 / 0.105859 / 0.105788 secs  malloc+memset+rep(4, sum64)
0.349860 + 0.186950 / 0.187031 / 0.186494 secs  malloc+memset1+rep(4, memcpy)
0.349584 + 0.125537 / 0.125525 / 0.125535 secs  malloc+memset1+rep(4, memset)
0.349620 + 0.106026 / 0.106114 / 0.105756 secs  malloc+memset1+rep(4, sum64)  sum=ebebebebebe80000
0.339846 + 0.186593 / 0.186686 / 0.186498 secs  malloc+memtouch+rep(4, memcpy)
0.340156 + 0.125663 / 0.125716 / 0.125751 secs  malloc+memtouch+rep(4, memset)
0.340141 + 0.105861 / 0.105806 / 0.105869 secs  malloc+memtouch+rep(4, sum64)
0.190330 + 0.113774 / 0.113730 / 0.113754 secs  malloc+sum64+rep(4, memcpy)
0.190149 + 0.400483 / 0.125638 / 0.125624 secs  malloc+sum64+rep(4, memset)
0.190214 + 0.049136 / 0.049170 / 0.049149 secs  malloc+sum64+rep(4, sum64)

The timings are consistent with the observations of the OP. I found an explanation that is consistent with the observed timings:

if the first access to a page is a read operation, the timings are substantially better than if the first operation is a write access.

Here are some observations consistent with this explanation:

malloc() for a large 500MB block just makes a system call to map the memory, it does not access this memory and calloc probably would do exactly the same thing.
if you do not initialize this memory, it still gets mapped in RAM as zero initialized pages for security reasons.
when you initialize the memory with memset, the first access to the whole block is a write access, then the timings for the loop are slower
initializing the memory to all bytes 1 produces exactly the same timings
if instead I use memtouch, only writing the first word to zero, I get the same timings in the loop
conversely, if instead of initializing the memory, I compute a checksum, it comes out as zero (which is not guarateed, but expected) and the timings in the loop are faster.
if no access is performed to the block, then the timings in the loop depend on the operation performed: memcpy and sum64 are faster because the first access is a read access, while memset is slower because the first access is a write access.

This seems specific to the linux kernel, I do not observe the same difference on macOS, but with a different processor. This behavior might be specific to older linux kernels and/or CPU and bridge architecture.

FINAL UPDATE: As commented by Peter Cordes, a read from a never-written anonymous page will get every page copy-on-write mapped to the same physical page of zeros, so you can get TLB misses but L1d cache hits when reading it. (Applies to the .bss, and memory from mmap(MAP_ANONYMOUS), like glibc calloc and malloc use for large allocations.) He wrote up some details with perf results for an experiment on Why is iterating though `std::vector` faster than iterating though `std::array`?

This explains why memcpy or just reading from memory that is only implicitly initialized to zero is faster than from memory explicitly written to. A good reason to use calloc() instead of malloc() + memset() for sparse data.

Not sure how good of a benchmark this is. You are running every test in one program doing `malloc()`/`free()` multiple times, thus getting the exact same chunk every time. Malloc'd memory is not reclaimed by the system when freed through `free()`. Your cache and TLB will also most probably always be hot except on the first test. I'd suggest doing one separate run at a time, doing `echo 3 | sudo tee /proc/sys/vm/drop_caches` before each run. — Marco Bonelli, Jun 12 '22 at 17:16
Removing the calls to `free()` has very little impact on the timings. Repeating the same test a second time gives consistent timings too. Same with dropping the caches. — chqrlie, Jun 12 '22 at 17:37
@MarcoBonelli: large `malloc` memory blocks are reclaimed by the system: `free` just unmaps the block. This can be verified by adding a `test1(n, 0, -1);` after the `test1(n, 2, -1);` — chqrlie, Jun 12 '22 at 17:42
The issue I mention is not solved just removing the calls to `free()`. Regardless of `free()`, I can clearly see the second test being 10 times faster than the first if I run everything at once. Running the two separately with two separate invocations yields the same timing (within reasonable error rate). You are right about `malloc()`/`free()` doing `mmap()`/`munmap()` though. PS: I think `argc > 2 ? strtol(argv[2], NULL, 0) : n` should be the first parameter. — Marco Bonelli, Jun 12 '22 at 17:43
First is `test1(n, 0, 0);`, second is `test1(n, 0, 1);`, this is what I mean. I see very different timings for the second test if I run them all at once (first branch of your `if` in `main`) or separately (command line arguments). — Marco Bonelli, Jun 12 '22 at 17:47
@MarcoBonelli: OK, I need to go and put some more work into this — chqrlie, Jun 12 '22 at 17:49
Anyway the call in your `else` branch should be `test1(argc > 1 ? strtol(argv[1], NULL, 0) : n, argc > 2 ? strtol(argv[2], NULL, 0) : 0, argc > 3 ? strtol(argv[3], NULL, 0) : 0);` — Marco Bonelli, Jun 12 '22 at 17:51
Thanks @chqrlie for your detailed investigation, I like your conclusions and I too have some hunch that some untouched pages give an opportunity to the kernel to optimize data copying operation, but can't prove it. I will accept your answer tonight after some more research on this. — packetie, Jun 12 '22 at 18:26
@packetie: the explanation for the faster operation on uninitialized (or `calloc` initialized) memory was nailed by Peter Cordes: until they are written to, all pages read from this area get mmapped with copy-on-write to the same zero filled page. Thus reading from the whole block only hits a single page that is well cached, hence much faster than reading from a much larger area with the same contents but that does not fit in the cache. — chqrlie, Jun 12 '22 at 20:58
@chqrlie yes, Peter's answer perfectly answered the question on why copying from the uninitialized area is faster. Could you add a link to Peter's answer in your answer? Thanks — packetie, Jun 13 '22 at 00:15
@packetie: I added a link to Peter's answer to another question. He only commented on this one. — chqrlie, Jun 13 '22 at 06:59

sawdust · Answer 2 · 2022-06-13T05:14:16.823

Linux uses virtual memory and backs-up that virtual memory with physical memory (allocated as pages) only on demand. When your two example programs request "memory" for a buffer using malloc(), only virtual memory is actually allocated. Only when your program uses that "memory" (e.g. write to that memory buffer) will Linux assign a physical memory page to map to the virtual page. This permits Linux to over-commit memory in a manner very similar to filesystem allocation using sparse files.

When either of your programs initializes the allocated buffer using memset(), that would force assignment of physical pages to correspond to the virtual memory. Perhaps this results in some page swapping during the socket transfer or buffer copying?
But when the memory buffer has not been initialized (and not yet mapped to a physical page), then could there be an optimization of the page fault (due to a read for I/O or copy operation) to just access from a special page (sort of like reading from an unwritten part of a sparse file), and not perform the page assignment?

Based on your question, there does seem to be somekind of page optimization for a page fault. So let's hypothesize that reading virtual memory that has not been written does not trigger an allocation and mapping of physical memory.

To test this hypothesis, we can use the top utility to obtain the virtual versus physical memory usage of your second program. The man page for top describes virtual memory usage is the total of "everything in-use and/or reserved (all quadrants)." Resident memory usage is "anything occupying physical memory which, beginning with Linux-4.5, is the sum of the following three fields:
RSan - quadrant 1 pages, which include any former quadrant 3 pages if modified
RSfd - quadrant 3 and quadrant 4 pages
RSsh - quadrant 2 pages"

When your 2nd program initializes the allocated buffer, this slow version uses 518.564 MB of virtual memory and 515.172 MB of resident memory. The similar memory numbers seems to indicate that the malloc()'d buffer of 500MB is backed-up with physical memory (as it should be).

When your 2nd program does not initialize the allocated buffer, this fast version uses the same 518.564 MB of virtual memory but only 3.192 MB of resident memory. The disparate memory numbers are a good indication that (most if not all of) the malloc()'d buffer of 500 MB is not backed-up with physical memory.

So the hypothesis seems valid.

Peter Cordes' comment confirms that there is such a page-fault optimization: that a "read from a never-written anonymous page will get every page copy-on-write mapped to the same physical page of zeros, so you can get TLB misses but L1d cache hits when reading it."

So your improved transfer rates and copy times seem to be due to reduced overhead from page swapping in the virtual memory subsystem and improved processor cache hits.

copy from un-initialized buffer is much faster than from initialized buffer

2 Answers2

Linked