Faster way to move memory page than mremap()?

Question

I've been experimenting with mremap(). I'd like to be able to move virtual memory pages around at high speeds. At least higher speeds than copying them. I have some ideas for algorithms which could make use of being able to move memory pages really fast. Problem is that the program below shows that mremap() is very slow -- at least on my i7 laptop -- compared to actually copying the same memory pages byte by byte.

How does the test source code work? mmap() 256 MB of RAM which is bigger than the on-CPU caches. Iterate for 200,000 times. On each iteration swap two random memory pages using a particular swap method. Run once and time using the mremap()-based page swap method. Run again and time using the byte-by-byte copy swap methed. Turns out that mremap() only manages 71,577 page swaps per second, whereas the byte-by-byte copy manages a whopping 287,879 page swaps per second. So mremap() is 4 times slower than a byte by byte copy!

Questions:

Why is mremap() so slow?

Is there another user-land or kernel-land callable page mapping manipulation API which might be faster?

Is there another user-land or kernel-land callable page mapping manipulation API allowing multiple, non-consecutive pages to be remapped in one call?

Are there any kernel extensions that support this sort of thing?

#include <stdio.h>
#include <string.h>
#define __USE_GNU
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/errno.h>
#include <asm/ldt.h>
#include <asm/unistd.h>    

// gcc mremap.c && perl -MTime::HiRes -e '$t1=Time::HiRes::time;system(q[TEST_MREMAP=1 ./a.out]);$t2=Time::HiRes::time;printf qq[%u per second\n],(1/($t2-$t1))*200_000;'
// page size = 4096
// allocating 256 MB
// before 0x7f8e060bd000=0
// before 0x7f8e060be000=1
// before 0x7f8e160bd000
// after  0x7f8e060bd000=41
// after  0x7f8e060be000=228
// 71577 per second

// gcc mremap.c && perl -MTime::HiRes -e '$t1=Time::HiRes::time;system(q[TEST_COPY=1 ./a.out]);$t2=Time::HiRes::time;printf qq[%u per second\n],(1/($t2-$t1))*200_000;'
// page size = 4096
// allocating 256 MB
// before 0x7f1a9efa5000=0
// before 0x7f1a9efa6000=1
// before 0x7f1aaefa5000
// sizeof(i)=8
// after  0x7f1a9efa5000=41
// after  0x7f1a9efa6000=228
// 287879 per second

// gcc mremap.c && perl -MTime::HiRes -e '$t1=Time::HiRes::time;system(q[TEST_MEMCPY=1 ./a.out]);$t2=Time::HiRes::time;printf qq[%u per second\n],(1/($t2-$t1))*200_000;'
// page size = 4096
// allocating 256 MB
// before 0x7faf7c979000=0
// before 0x7faf7c97a000=1
// before 0x7faf8c979000
// sizeof(i)=8
// after  0x7faf7c979000=41
// after  0x7faf7c97a000=228
// 441911 per second

/*
 * Algorithm:
 * - Allocate 256 MB of memory
 * - loop 200,000 times
 *   - swap a random 4k block for a random 4k block
 * Run the test twice; once for swapping using page table, once for swapping using CPU copying!
 */

#define PAGES (1024*64)

int main() {
    int PAGE_SIZE = getpagesize();
    char* m = NULL;
    unsigned char* p[PAGES];
    void* t;

    printf("page size = %d\n", PAGE_SIZE);

    printf("allocating %u MB\n", PAGE_SIZE*PAGES / 1024 / 1024);
    m = (char*)mmap(0, PAGE_SIZE*(1+PAGES), PROT_READ | PROT_WRITE, MAP_SHARED  | MAP_ANONYMOUS, -1, 0);
    t = &m[PAGES*PAGE_SIZE];
    {
        unsigned long i;
        for (i=0; i<PAGES; i++) {
            p[i] = &m[i*PAGE_SIZE];
            memset(p[i], i & 255, PAGE_SIZE);
        }
    }

    printf("before %p=%u\n", p[0], p[0][0]);
    printf("before %p=%u\n", p[1], p[1][0]);
    printf("before %p\n", t);

    if (getenv("TEST_MREMAP")) {
        unsigned i;
        for (i=0; i<200001; i++) {
            unsigned p1 = random() % PAGES;
            unsigned p2 = random() % PAGES;
    //      mremap(void *old_address, size_t old_size, size_t new_size,int flags, /* void *new_address */);
            mremap(p[p2], PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, t    );
            mremap(p[p1], PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, p[p2]);
            mremap(t    , PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, p[p1]); // p3 no longer exists after this!
        } /* for() */
    }
    else if (getenv("TEST_MEMCPY")) {
        unsigned long * pu[PAGES];
        unsigned long   i;
        for (i=0; i<PAGES; i++) {
            pu[i] = (unsigned long *)p[i];
        }
        printf("sizeof(i)=%lu\n", sizeof(i));
        for (i=0; i<200001; i++) {
            unsigned p1 = random() % PAGES;
            unsigned p2 = random() % PAGES;
            unsigned long * pa = pu[p1];
            unsigned long * pb = pu[p2];
            unsigned char t[PAGE_SIZE];
            //memcpy(void *dest, const void *src, size_t n);
            memcpy(t , pb, PAGE_SIZE);
            memcpy(pb, pa, PAGE_SIZE);
            memcpy(pa, t , PAGE_SIZE);
        } /* for() */
    }
    else if (getenv("TEST_MODIFY_LDT")) {
        unsigned long * pu[PAGES];
        unsigned long   i;
        for (i=0; i<PAGES; i++) {
            pu[i] = (unsigned long *)p[i];
        }
        printf("sizeof(i)=%lu\n", sizeof(i));
        // int modify_ldt(int func, void *ptr, unsigned long bytecount);
        // 
        // modify_ldt(int func, void *ptr, unsigned long bytecount);
        // modify_ldt() reads or writes the local descriptor table (ldt) for a process. The ldt is a per-process memory management table used by the i386 processor. For more information on this table, see an Intel 386 processor handbook.
        // 
        // When func is 0, modify_ldt() reads the ldt into the memory pointed to by ptr. The number of bytes read is the smaller of bytecount and the actual size of the ldt.
        // 
        // When func is 1, modify_ldt() modifies one ldt entry. ptr points to a user_desc structure and bytecount must equal the size of this structure.
        // 
        // The user_desc structure is defined in <asm/ldt.h> as:
        // 
        // struct user_desc {
        //     unsigned int  entry_number;
        //     unsigned long base_addr;
        //     unsigned int  limit;
        //     unsigned int  seg_32bit:1;
        //     unsigned int  contents:2;
        //     unsigned int  read_exec_only:1;
        //     unsigned int  limit_in_pages:1;
        //     unsigned int  seg_not_present:1;
        //     unsigned int  useable:1;
        // };
        //
        // On success, modify_ldt() returns either the actual number of bytes read (for reading) or 0 (for writing). On failure, modify_ldt() returns -1 and sets errno to indicate the error.
        unsigned char ptr[20000];
        int result;
        result = modify_ldt(0, &ptr[0], sizeof(ptr)); printf("result=%d, errno=%u\n", result, errno);
        result = syscall(__NR_modify_ldt, 0, &ptr[0], sizeof(ptr)); printf("result=%d, errno=%u\n", result, errno);
        // todo: how to get these calls returning a non-zero value?
    }
    else {
        unsigned long * pu[PAGES];
        unsigned long   i;
        for (i=0; i<PAGES; i++) {
            pu[i] = (unsigned long *)p[i];
        }
        printf("sizeof(i)=%lu\n", sizeof(i));
        for (i=0; i<200001; i++) {
            unsigned long j;
            unsigned p1 = random() % PAGES;
            unsigned p2 = random() % PAGES;
            unsigned long * pa = pu[p1];
            unsigned long * pb = pu[p2];
            unsigned long t;
            for (j=0; j<(4096/8/8); j++) {
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
                t = *pa; *pa ++ = *pb; *pb ++ = t;
            }
        } /* for() */
    }

    printf("after  %p=%u\n", p[0], p[0][0]);
    printf("after  %p=%u\n", p[1], p[1][0]);
    return 0;
}

Update: So that we don't need to question how fast 'round-trip to kernelspace' is, here's a further performance test program that shows that we can call getpid() 3 times in a row, 81,916,192 times per second on the same i7 laptop:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

// gcc getpid.c && perl -MTime::HiRes -e '$t1=Time::HiRes::time;system(q[TEST_COPY=1 ./a.out]);$t2=Time::HiRes::time;printf qq[%u per second\n],(1/($t2-$t1))*100_000_000;'
// running_total=8545800085458
// 81916192 per second

/*
 * Algorithm:
 * - Call getpid() 100 million times.
 */

int main() {
    unsigned i;
    unsigned long running_total = 0;
    for (i=0; i<100000001; i++) {
        /*      123123123 */
        running_total += getpid();
        running_total += getpid();
        running_total += getpid();
    } /* for() */
    printf("running_total=%lu\n", running_total);
}

Update 2: I added WIP code to call a function I discovered called modify_ldt(). The man page hints that page manipulation might be possible. However, no matter what I try then the function always returns zero when I'm expecting it to return the number of bytes read. 'man modify_ldt' says "On success, modify_ldt() returns either the actual number of bytes read (for reading) or 0 (for writing). On failure, modify_ldt() returns -1 and sets errno to indicate the error." Any ideas (a) whether modify_ldt() will be an alternative to mremap() ? and (b) how to get modify_ldt() working?

`man mremap` - mremap() expands (or shrinks) an existing memory mapping. Your use case is to *relocate* an existing mapping. You can do this, but it is not what it is intended for; basically as per [R](http://stackoverflow.com/users/379897/r). — artless noise, May 27 '13 at 17:48
man mremap also says "mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3)." So the 'very efficient realloc' is a somewhat inaccurate based upon the findings above, or? — simonhf, May 27 '13 at 21:45
The use case for `realloc()` is to append memory at the end of a current allocation. Your test code above is replacing allocations in the *middle*. You don't normally place all the allocations close together and then try to re-order them; they are guaranteed not to fit. Look closely at `t`; don't even bother allocating it, just use the address as a spare address space. Ie, `m = mmap(0, PAGE_SIZE*(PAGES)...` and the `memset()` is not needed? [Does mremap initialize...](http://stackoverflow.com/questions/7978831/does-mremap-initialize-memory-on-growth). — artless noise, May 27 '13 at 22:04
Check the return value of `mremap()`. I don't think it is placing the memory at your requested address. It is doing more work than you think. — artless noise, May 27 '13 at 22:45
The TLB invalidate is going to cost you at least 1000 cycles on a multi-core CPU. For a single page when you add in syscall, locking, page table changes, and TLB invalidation you're squarely in the "that sucks" territory. But mremap is fantastic for realloc of large memory areas. Also if you don't have to remap a page in the middle of an existing mapping less page table locks would probably needed (no need to split an existing VMA entry, changing the tree itself usually requires more locking than changing entries.) — Eloff, Jul 19 '13 at 19:28
@simonhf: FYI, `getpid` is not a good test case for system calls; some (if not most/all?) runtime libraries that implement the C API for it cache the result of the system call (it never changes after all) on first call, so all subsequent calls are just reading the cache, not making a system call. This is true on [glibc 2.3.4 and higher for instance](http://man7.org/linux/man-pages/man2/getpid.2.html), and it seems an obvious optimization for other runtime libraries to use. — ShadowRanger, Aug 31 '16 at 17:28

score 22 · Accepted Answer · edited Apr 26 '19 at 23:29

It appears that there is no faster user-land mechanism to re-order memory pages than memcpy(). mremap() is far slower and therefore only useful for re-sizing an area of memory previously assigned using mmap().

But page tables must be extremely fast I hear you saying! And it's possible for user-land to call kernel functions millions of times per second! The following references help explain why mremap() is so slow:

"An Introduction to Intel Memory Management" is a nice introduction to the theory of memory page mapping.

"Key concepts of Intel virtual memory" shows how it all works in more detail, in case you plan on writing your own OS :-)

"Sharing Page Tables in the Linux Kernel" shows some of the difficult Linux memory page mapping architectural decisions and their effect on performance.

Looking at all three references together then we can see that there has been little effort so far from kernel architects to expose memory page mapping to user-land in an efficient way. Even in the kernel, manipulation of the page table must be done by using up to three locks which will be slow.

Going forwards, since the page table itself is made up of 4k pages, it may be possible to change the kernel so that particular page table pages are unique to a particular thread and can be assumed to have lock-less access for the duration of the process. This would facilitate very efficient manipulation of that particular page table page via user-land. But this moves outside the scope of the original question.

You should be able to use the SSE streaming intrinsics which might be faster than memcpy. Your memory blocks are 4kb aligned so you can easily use SSE/AVX/etc to read/write memory and the streaming intrinsics will avoid polluting the cache (depending on the memory type, WC/WB/etc and the hardware you have). See _mm_stream_load_si128. You can also easily unroll, prefetch, and prime the TLB. — Nicholas Frechette, May 01 '16 at 19:23
Newer link for the third expired one: https://landley.net/kdocs/ols/2003/ols2003-pages-315-320.pdf — Tomáš Janoušek, Nov 29 '17 at 16:00

score 13 · Answer 2 · answered Jul 23 '12 at 23:07

13

What makes you think mremap could ever be efficient for swapping single 4k pages? At the very least, a round-trip to kernelspace even just to read a single value (like pid) and return it will cost more than moving 4k of data. And that's before we get to the cache invalidation/TLB costs of remapping memory, which I don't understand well enough to address in this answer, but which should have some serious cost.

mremap is useful for basically one thing: implementing realloc for large allocations that were serviced by mmap. And by large, I mean probably at least 100k.

answered Jul 23 '12 at 23:07

R.. GitHub STOP HELPING ICE

208,859
35
376
711

2

Thank you for the answer, but what makes you think that a 'round-trip to kernelspace' is so slow, and slower than moving 4k of data? I have added a further performance test program above illustrating that the 'round-trip to kernelspace' should not be the problem. We can make 3 calls to getpid() 82 million times per second, but only swap 0.2 million 4k blocks per second. – simonhf Jul 23 '12 at 23:48
Just do a timing on one of the fastest syscalls like `getuid` that just reads a value from kernelspace. I just measured and got 800 cycles on my system. You should be able to copy 4k in less than 800 cycles on x86[_64]. – R.. GitHub STOP HELPING ICE Jul 24 '12 at 00:08
4

Note that glibc caches the result of `getpid` in userspace. This is unfortunate and actually erroneous as it can give wrong results in certain setups with signal handlers and `fork`... – R.. GitHub STOP HELPING ICE Jul 24 '12 at 00:09
1

Yes, good point. When I re-run the getpid() performance test with getuid() instead then instead of 82 million per second then 3 calls to getuid() is 8,092,532 times per second. So 10 times slower but still 40 times faster than swapping two 4k blocks using byte by byte copying. – simonhf Jul 24 '12 at 00:16
Byte-by-byte copying is extremely slow. You should be using `memcpy`. – R.. GitHub STOP HELPING ICE Jul 24 '12 at 00:26
Yes, good point. I added memcpy() to the original source code above and using memcpy() I managed to swap 441,911 4k pages per second. But that's all still very slow compared to calling getuid(). The original question is how to shuffle 4k pages around very fast with the MMU? The only way I have discovered so far is mremap() which is very slow. Do you know a faster way? And preferably a way in which many 4k pages can be remapped using a single call to kernel-land? – simonhf Jul 24 '12 at 01:57
The time for `getuid` is just the overhead for a single call to kernelspace that *does nothing* once it gets there. Obtaining the necessary locks to modify the process's page tables takes nontrivial time. Then of course swapping 2 pages is going to take at least 3 calls to `mremap`, and it's impossible to do in a multi-threaded program without race conditions anyway because the page will be left unmapped at one step and thus another `mmap` call in another thread could take it. – R.. GitHub STOP HELPING ICE Jul 24 '12 at 02:06
2

If you really want to try making your idea work, I think you'll have to invent a new syscall (or new flags to `mremap`) to swap pages in a single syscall. But it's still going to have a lot of nontrivial work to do splitting and merging VMAs because the kernel does not account for pages individually; it accounts for them as part of VMA spans. Maybe there's a way, if it's anonymous memory, that you could just switch the underlying pages it's backed by without adjusting the VMAs, but for that to work you'd almost surely need to have all the memory locked (`mlock`) in advance. – R.. GitHub STOP HELPING ICE Jul 24 '12 at 02:09
I found another API called modify_ldt() which sounds like it allows manipulation of memory pages. However, I could not get the API to work as the man page describes it. I updated the source code above with my attempt. Do you think modify_ldt() would be useful and do you know why it doesn't appear to work? – simonhf Jul 24 '12 at 16:56
No, LDT (segment/selector descriptors) and page tables are completely different things. I don't think there's anything you can do with `modify_ldt` that would have an effect on straight C code (i.e. code not written in assembly and explicitly using x86 segment registers) without just horribly breaking the whole program. – R.. GitHub STOP HELPING ICE Jul 24 '12 at 17:51
3

I discovered the paper "Sharing Page Tables in the Linux Kernel" [1] which is very informative and tells of remap_file_pages() which I found to be even slower than mremap(). It goes a long way to explaining why mremap() is so slow (multiple locks) as well as dishing up some of the hard choices that kernel developers have had to make in terms of implementing virtual memory. The most interesting quote is "It is possible for a massively shared application to use over half of physical memory for its page tables." [1] http://www.linuxinsight.com/files/ols2003/mccracken-reprint.pdf – simonhf Jul 24 '12 at 19:34

Faster way to move memory page than mremap()?

2 Answers2

Linked