Update 2: I think Brendan's answer is right. I should maybe delete this, but I think the ocperf.py suggestion is still useful for future readers, and the page-table-isolation idea might explain extra TLB misses on CPUs without process-context identifiers (PCID) running kernels that mitigate Meltdown.
Update: the below guess was wrong. New guess: mmap has to modify your process's page table, so perhaps that alone triggers some TLB invalidation. My recommendation to use ocperf.py record to figure out which asm instructions are causing TLB misses still stands. Even with optimization enabled, the code will still access the stack for the glibc wrapper function calls: call stores a return address when pushing it, and ret loads it back when popping it.
Perhaps your kernel has kernel / user page-table isolation enabled to mitigate Meltdown, so on return from kernel to user, all TLB entries have been invalidated (by modifying CR3 to point to page tables that don't include the kernel mappings at all).
Look for Kernel/User page tables isolation: enabled in your dmesg output. On x86 you can try booting with pti=off (or the equivalent nopti) as a kernel option to disable it, if you don't mind being vulnerable to Meltdown while testing.
Because you're using C, you're making the mmap and munmap system calls through their glibc wrapper functions, not with inline syscall instructions. The ret instruction in the wrapper has to load the return address from the stack, and that load misses in the TLB.
The extra store misses probably come from call instructions pushing a return address, although I'm not sure that's right because the current stack page should already be in the TLB from the ret from the previous system call.
You can profile with ocperf.py to get symbolic names for uarch-specific events. Assuming you're on a recent Intel CPU, use ocperf.py record -e mem_inst_retired.stlb_miss_stores,page-faults,dTLB-load-misses to find which instructions cause store misses, then view the results with ocperf.py report -Mintel. If report doesn't make it easy to choose which event to see counts for, record with only a single event.
mem_inst_retired.stlb_miss_stores is a "precise" event, unlike most of the other store TLB events, so the counts should be attributed to the real instruction rather than to some later instruction, as can happen with imprecise perf events. (See Andy Glew's trap vs. exception answer for some details about why some performance counters can't easily be precise; many store events aren't.)