std::atomic exists because many ISAs have direct hardware support for it
What the C++ standard says about std::atomic has been analyzed in other answers.
So now let's see what std::atomic compiles to to get a different kind of insight.
The main takeaway from this experiment is that modern CPUs have direct support for atomic integer operations, for example the LOCK prefix in x86, and std::atomic basically exists as a portable interface to those intructions: What does the "lock" instruction mean in x86 assembly? In aarch64, LDADD would be used.
This support allows for faster alternatives to more general methods such as std::mutex, which can make more complex multi-instruction sections atomic, at the cost of being slower than std::atomic because std::mutex it makes futex system calls in Linux, which is way slower than the userland instructions emitted by std::atomic, see also: Does std::mutex create a fence?
Let's consider the following multi-threaded program which increments a global variable across multiple threads, with different synchronization mechanisms depending on which preprocessor define is used.
main.cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
size_t niters;
#if STD_ATOMIC
std::atomic_ulong global(0);
#else
uint64_t global = 0;
#endif
void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
#if LOCK
        __asm__ __volatile__ (
            "lock incq %0;"
            : "+m" (global),
              "+g" (i) // to prevent loop unrolling
            :
            :
        );
#else
        __asm__ __volatile__ (
            ""
            : "+g" (i) // to prevent he loop from being optimized to a single add
            : "g" (global)
            :
        );
        global++;
#endif
    }
}
int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    uint64_t expect = nthreads * niters;
    std::cout << "expect " << expect << std::endl;
    std::cout << "global " << global << std::endl;
}
GitHub upstream.
Compile, run and disassemble:
comon="-ggdb3 -O3 -std=c++11 -Wall -Wextra -pedantic main.cpp -pthread"
g++ -o main_fail.out                    $common
g++ -o main_std_atomic.out -DSTD_ATOMIC $common
g++ -o main_lock.out       -DLOCK       $common
./main_fail.out       4 100000
./main_std_atomic.out 4 100000
./main_lock.out       4 100000
gdb -batch -ex "disassemble threadMain" main_fail.out
gdb -batch -ex "disassemble threadMain" main_std_atomic.out
gdb -batch -ex "disassemble threadMain" main_lock.out
Extremely likely "wrong" race condition output for main_fail.out:
expect 400000
global 100000
and deterministic "correct" output of the others:
expect 400000
global 400000
Disassembly of main_fail.out:
   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     mov    0x29b5(%rip),%rcx        # 0x5140 <niters>
   0x000000000000278b <+11>:    test   %rcx,%rcx
   0x000000000000278e <+14>:    je     0x27b4 <threadMain()+52>
   0x0000000000002790 <+16>:    mov    0x29a1(%rip),%rdx        # 0x5138 <global>
   0x0000000000002797 <+23>:    xor    %eax,%eax
   0x0000000000002799 <+25>:    nopl   0x0(%rax)
   0x00000000000027a0 <+32>:    add    $0x1,%rax
   0x00000000000027a4 <+36>:    add    $0x1,%rdx
   0x00000000000027a8 <+40>:    cmp    %rcx,%rax
   0x00000000000027ab <+43>:    jb     0x27a0 <threadMain()+32>
   0x00000000000027ad <+45>:    mov    %rdx,0x2984(%rip)        # 0x5138 <global>
   0x00000000000027b4 <+52>:    retq
Disassembly of main_std_atomic.out:
   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     cmpq   $0x0,0x29b4(%rip)        # 0x5140 <niters>
   0x000000000000278c <+12>:    je     0x27a6 <threadMain()+38>
   0x000000000000278e <+14>:    xor    %eax,%eax
   0x0000000000002790 <+16>:    lock addq $0x1,0x299f(%rip)        # 0x5138 <global>
   0x0000000000002799 <+25>:    add    $0x1,%rax
   0x000000000000279d <+29>:    cmp    %rax,0x299c(%rip)        # 0x5140 <niters>
   0x00000000000027a4 <+36>:    ja     0x2790 <threadMain()+16>
   0x00000000000027a6 <+38>:    retq   
Disassembly of main_lock.out:
Dump of assembler code for function threadMain():
   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     cmpq   $0x0,0x29b4(%rip)        # 0x5140 <niters>
   0x000000000000278c <+12>:    je     0x27a5 <threadMain()+37>
   0x000000000000278e <+14>:    xor    %eax,%eax
   0x0000000000002790 <+16>:    lock incq 0x29a0(%rip)        # 0x5138 <global>
   0x0000000000002798 <+24>:    add    $0x1,%rax
   0x000000000000279c <+28>:    cmp    %rax,0x299d(%rip)        # 0x5140 <niters>
   0x00000000000027a3 <+35>:    ja     0x2790 <threadMain()+16>
   0x00000000000027a5 <+37>:    retq
Conclusions:
- the non-atomic version saves the global to a register, and increments the register. - Therefore, at the end, very likely four writes happen back to global with the same "wrong" value of - 100000.
 
- std::atomiccompiles to- lock addq. The LOCK prefix makes the following- incfetch, modify and update memory atomically.
 
- our explicit inline assembly LOCK prefix compiles to almost the same thing as - std::atomic, except that our- incis used instead of- add. Not sure why GCC chose- add, considering that our INC generated a decoding 1 byte smaller.
 
ARMv8 could use either LDAXR + STLXR or LDADD in newer CPUs: How do I start threads in plain C?
Tested in Ubuntu 19.10 AMD64, GCC 9.2.1, Lenovo ThinkPad P51.