If you compile code such as
#include <atomic>
int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC emits NOP padding (the npad 1 lines) after each load:
int load(std::atomic<int> *) PROC
        mov     edx, DWORD PTR [rcx]
        npad    1
        mov     eax, DWORD PTR [rcx]
        npad    1
        add     eax, edx
        ret     0
Why does MSVC do this? Is there any way to avoid the padding without relaxing the memory order (which would affect the correctness of the code)?
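
For reference, the relaxed variant mentioned above would look roughly like the sketch below. It is only there for comparison: the name load_relaxed is made up for illustration, and no claim is made here about what code MSVC emits for it.

```cpp
#include <atomic>

// Version from the question: two acquire loads.
int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}

// Hypothetical relaxed variant (illustrative name, not part of the original code).
// Each load is still atomic, but it no longer orders later memory operations
// after it, which is why it is not an acceptable substitute here.
int load_relaxed(std::atomic<int> *p) {
    return p->load(std::memory_order_relaxed) + p->load(std::memory_order_relaxed);
}
```

In a single-threaded test both functions return the same value; they differ only in the ordering guarantees they give to other threads.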