I have a 64-bit value that I need to read extremely quickly before an event, and then after the event perform a compare-and-exchange on.
I was thinking I could `load(std::memory_order_relaxed)` before the event for the fast read, and then use a regular compare-and-exchange after the event.
When I compared the assembly generated for a non-atomic 64-bit read, an atomic relaxed load, and an atomic acquire load, I could not see any difference. This was the C++ test:
#include <atomic>
#include <cstdint>

int main(){
    volatile uint64_t var2;
    std::atomic<uint64_t> var;  // The variable I wish to read quickly
    var = 10;
    var2 = var.load(std::memory_order_relaxed);
    //var2 = var;  // when var is not atomic
    //var2 = var.load(std::memory_order_acquire);  // to see if the x86 assembly changes
}
This gave the following assembly (the listing below happens to be from the acquire variant):
!int main(){
main()+0: sub    $0x48,%rsp
main()+4: callq  0x100401180 <__main>
!    volatile uint64_t var2;
!    volatile std::atomic<uint64_t> var;
!    var = 10;
!    
!    
!    var2 = var.load(std::memory_order_acquire);
main()()
main()+26: mov    %rax,0x38(%rsp)
!    
!    int x;
!    std::cin >> x;
main()+31: lea    0x2c(%rsp),%rdx
main()+36: mov    0x1f45(%rip),%rcx        # 0x100403050 <__fu0__ZSt3cin>
main()+43: callq  0x100401160 <_ZNSirsERi>
!}main()+48: mov    $0x0,%eax
main()+53: add    $0x48,%rsp
main()+57: retq  
Surely the assembly for a `std::memory_order_acquire` load should be different from a non-atomic variable read?
Is this because reading 64 bits is atomic anyway, as long as the data is aligned, so the assembly ends up the same? I would have thought that requesting a stronger memory ordering would insert a fence instruction or something.
The real question is: if I declare the 64 bits as `std::atomic` and read with relaxed memory ordering, will it have the same performance cost as reading a non-atomic 64-bit variable?