Two TLB-miss per mmap/access/munmap

Question

for (int i = 0; i < 100000; ++i) {
    int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    page[0] = 0;

    munmap(page, PAGE_SIZE);
}

I expect to get ~100000 dTLB-store-misses in userspace, one per iteration (and also ~100000 page-faults and dTLB-load-misses in the kernel). When I run the following command, the result is roughly 2x what I expect. I would appreciate it if someone could clarify why this is the case:

perf stat -e dTLB-store-misses:u ./test
Performance counter stats for './test':

           200,114      dTLB-store-misses

       0.213379649 seconds time elapsed

P.S. I have verified and am certain that the generated code doesn't introduce anything that would justify this result. Also, I do get ~100000 page-faults and dTLB-load-misses:k.

Brendan · Accepted Answer · 2018-02-01T07:20:46.287

I expect to get ~100000 dTLB-store-misses in userspace, one per iteration

I would expect that:

1. The CPU tries to do page[0] = 0;, tries to load the cache line containing page[0], can't find the TLB entry for it, increments dTLB-store-misses, fetches the translation, realises the page is "not present", then generates a page fault.
2. The page fault handler allocates a page and (because the page table was modified) ensures that the TLB entry is invalidated (possibly by relying on the fact that Intel CPUs don't cache "not present" pages anyway, not necessarily by explicitly doing an INVLPG). The page fault handler returns to the instruction that caused the fault so it can be retried.
3. The CPU tries to do page[0] = 0; a second time, tries to load the cache line containing page[0], can't find the TLB entry for it, increments dTLB-store-misses again, fetches the translation, then modifies the cache line.

For fun, you could use the MAP_POPULATE flag with mmap() to try to get the kernel to pre-allocate the pages (and avoid the page fault and the first TLB miss).
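A minimal sketch of that experiment, assuming Linux with glibc. The helper name touch_pages and its extra_flags parameter are made up for illustration and are not from the question. It runs the same mmap/store/munmap loop and lets the caller add MAP_POPULATE, so the two variants can be compared under perf stat:

```c
#define _GNU_SOURCE            /* exposes MAP_ANONYMOUS and MAP_POPULATE */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one anonymous page, store to it, unmap it,
 * and repeat `iters` times. `extra_flags` is OR-ed into the mmap flags
 * (pass 0 for the original behaviour, MAP_POPULATE to ask the kernel
 * to pre-fault the page). Returns the number of completed iterations. */
static int touch_pages(int iters, int extra_flags)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    int done = 0;

    for (int i = 0; i < iters; ++i) {
        int *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE | extra_flags, -1, 0);
        if (page == MAP_FAILED)
            break;

        /* volatile keeps the compiler from dropping the store */
        *(volatile int *)page = 0;

        if (munmap(page, page_size) != 0)
            break;
        ++done;
    }
    return done;
}
```

Calling touch_pages(100000, 0) in one build and touch_pages(100000, MAP_POPULATE) in another, each under perf stat -e dTLB-store-misses:u, should show whether the extra miss goes away together with the page fault. MAP_POPULATE is only a request to the kernel, so treat the result as an experiment rather than a guarantee.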

Ah yes, I bet this is right. I forgot about mmap *not* modifying the page tables without `MAP_POPULATE`, even though the OP mentioned page faults. derp. — Peter Cordes, Feb 01 '18 at 07:28

Peter Cordes · Answer 2 · 2018-02-01T07:30:29.057

Update 2: I think Brendan's answer is right. I should maybe delete this, but the ocperf.py suggestion is still useful for future readers, I think. And it might explain extra TLB misses on CPUs without Process-Context-Identifiers with kernels that mitigate Meltdown.

Update: the below guess was wrong. New guess: mmap has to modify your process's page table, so perhaps there's some TLB invalidation of something just from that. My recommendation to use ocperf.py record to try to figure out which asm instructions are causing TLB misses still stands. Even with optimization enabled, the code will store to the stack when pushing/popping a return address for the glibc wrapper function calls.

Perhaps your kernel has kernel / user page-table isolation enabled to mitigate Meltdown, so on return from kernel to user, all TLB entries have been invalidated (by modifying CR3 to point to page tables that don't include the kernel mappings at all).

Look for Kernel/User page tables isolation: enabled in your dmesg output. You can try booting with pti=off (or nopti) as a kernel option to disable it, if you don't mind being vulnerable to Meltdown while testing.

Because you're using C, you're using the mmap and munmap system calls through their glibc wrappers, not with inline syscall instructions directly. The ret instruction in that wrapper needs to load the return address from the stack, which TLB misses.

The extra store misses probably come from call instructions pushing a return address, although I'm not sure that's right because the current stack page should already be in the TLB from the ret from the previous system call.

You can profile with ocperf.py to get symbolic names for uarch-specific events. Assuming you're on a recent Intel CPU, use ocperf.py record -e mem_inst_retired.stlb_miss_stores,page-faults,dTLB-load-misses to find which instructions cause store misses, then use ocperf.py report -Mintel. If report doesn't make it easy to choose which event to see counts for, record with only a single event.

mem_inst_retired.stlb_miss_stores is a "precise" event, unlike most of the other store TLB events, so the counts should be for the real instruction, rather than maybe some later instructions like imprecise perf events. (See Andy Glew's trap vs. exception answer for some details about why some performance-counters can't easily be precise; many store events aren't.)
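If you'd rather narrow things down from inside the program, one option is to count the generic dTLB store-miss event around just the region of interest with perf_event_open(2). The sketch below is an assumption-laden illustration for Linux: the helper names are invented here, and it uses the generic cache event that perf stat exposes as dTLB-store-misses:u rather than the precise uarch-specific event. It gives a count for a region, not a per-instruction attribution.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a counter for user-space dTLB store misses
 * on the calling thread. Returns a file descriptor, or -1 if the kernel
 * or hardware refuses (e.g. restrictive perf_event_paranoid setting). */
static int open_dtlb_store_miss_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* cache id | (operation << 8) | (result << 16) */
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_WRITE << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;        /* start stopped; enabled in counter_start() */
    attr.exclude_kernel = 1;  /* user space only, like the :u modifier */
    attr.exclude_hv = 1;

    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Reset the counter to zero and start counting. */
static void counter_start(int fd)
{
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Stop counting and return the count, or -1 if the read fails. */
static int64_t counter_stop(int fd)
{
    uint64_t count = 0;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Wrapping only the store (counter_start, the store to page[0], counter_stop) and then only the mmap/munmap calls would show which part of each iteration accounts for the misses. The calls to the helpers themselves touch the stack, so expect a little noise in small regions.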

I am using 4.5 Linux kernel which doesn't have KPTI (I am running on Haswell anyway which has PCID and TLB will not be flushed even with KPTI). — Mohammad Hedayati, Feb 01 '18 at 04:09
@Hedy: Ah, I hadn't realized the KPTI patches were already using PCIDs to reduce CR3 modifications. I knew that was possible, but IIRC Linux didn't previously use PCIDs at all, so I thought that would be too big a change to get it tested so quickly. Thanks for the feedback that my guess was wrong :P Updated my answer to reflect that (but I don't have any other great ideas other than figuring out which instructions TLB-miss, and thus which pages were invalidated.) — Peter Cordes, Feb 01 '18 at 06:16
