std::atomic exists because many ISAs have direct hardware support for it
What the C++ standard says about std::atomic has been analyzed in other answers.
So now let's see what std::atomic compiles to, to get a different kind of insight.
The main takeaway from this experiment is that modern CPUs have direct support for atomic integer operations, for example the LOCK prefix in x86, and std::atomic basically exists as a portable interface to those instructions: What does the "lock" instruction mean in x86 assembly? In aarch64, LDADD would be used.
This support allows for faster alternatives to more general methods such as std::mutex, which can make more complex multi-instruction sections atomic, at the cost of being slower than std::atomic, because std::mutex makes futex system calls on Linux, which is way slower than the userland instructions emitted by std::atomic, see also: Does std::mutex create a fence?
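For comparison, a std::mutex based version of the same counter increment might look like the following (a minimal sketch for illustration, not part of the benchmark below; the variable and function names are mine):

#include <cstdint>
#include <mutex>

std::mutex global_mutex;
uint64_t global_guarded = 0;

void incrementWithMutex() {
    // The critical section can contain arbitrarily complex code,
    // but a contended lock may fall back to futex system calls on Linux.
    std::lock_guard<std::mutex> lock(global_mutex);
    ++global_guarded;
}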
Let's consider the following multi-threaded program which increments a global variable across multiple threads, with different synchronization mechanisms depending on which preprocessor define is used.
main.cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

size_t niters;

#if STD_ATOMIC
std::atomic_ulong global(0);
#else
uint64_t global = 0;
#endif

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
#if LOCK
        __asm__ __volatile__ (
            "lock incq %0;"
            : "+m" (global),
              "+g" (i) // to prevent loop unrolling
            :
            :
        );
#else
        __asm__ __volatile__ (
            ""
            : "+g" (i) // to prevent the loop from being optimized to a single add
            : "g" (global)
            :
        );
        global++;
#endif
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    uint64_t expect = nthreads * niters;
    std::cout << "expect " << expect << std::endl;
    std::cout << "global " << global << std::endl;
}
GitHub upstream.
Compile, run and disassemble:
common="-ggdb3 -O3 -std=c++11 -Wall -Wextra -pedantic main.cpp -pthread"
g++ -o main_fail.out $common
g++ -o main_std_atomic.out -DSTD_ATOMIC $common
g++ -o main_lock.out -DLOCK $common
./main_fail.out 4 100000
./main_std_atomic.out 4 100000
./main_lock.out 4 100000
gdb -batch -ex "disassemble threadMain" main_fail.out
gdb -batch -ex "disassemble threadMain" main_std_atomic.out
gdb -batch -ex "disassemble threadMain" main_lock.out
Extremely likely "wrong" race-condition output for main_fail.out:
expect 400000
global 100000
and deterministic "correct" output for the others:
expect 400000
global 400000
Disassembly of main_fail.out:
0x0000000000002780 <+0>: endbr64
0x0000000000002784 <+4>: mov 0x29b5(%rip),%rcx # 0x5140 <niters>
0x000000000000278b <+11>: test %rcx,%rcx
0x000000000000278e <+14>: je 0x27b4 <threadMain()+52>
0x0000000000002790 <+16>: mov 0x29a1(%rip),%rdx # 0x5138 <global>
0x0000000000002797 <+23>: xor %eax,%eax
0x0000000000002799 <+25>: nopl 0x0(%rax)
0x00000000000027a0 <+32>: add $0x1,%rax
0x00000000000027a4 <+36>: add $0x1,%rdx
0x00000000000027a8 <+40>: cmp %rcx,%rax
0x00000000000027ab <+43>: jb 0x27a0 <threadMain()+32>
0x00000000000027ad <+45>: mov %rdx,0x2984(%rip) # 0x5138 <global>
0x00000000000027b4 <+52>: retq
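In source terms, GCC's optimized non-atomic loop is roughly equivalent to the following (a hand-written sketch reconstructed from the disassembly above, not actual compiler output):

uint64_t tmp = global;               // single load into a register
for (size_t i = 0; i < niters; ++i)
    ++tmp;                           // register-only increments
global = tmp;                        // single store back: clobbers whatever
                                     // other threads stored in the meantime

which is why all threads very likely end up writing back the same value.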
Disassembly of main_std_atomic.out:
0x0000000000002780 <+0>: endbr64
0x0000000000002784 <+4>: cmpq $0x0,0x29b4(%rip) # 0x5140 <niters>
0x000000000000278c <+12>: je 0x27a6 <threadMain()+38>
0x000000000000278e <+14>: xor %eax,%eax
0x0000000000002790 <+16>: lock addq $0x1,0x299f(%rip) # 0x5138 <global>
0x0000000000002799 <+25>: add $0x1,%rax
0x000000000000279d <+29>: cmp %rax,0x299c(%rip) # 0x5140 <niters>
0x00000000000027a4 <+36>: ja 0x2790 <threadMain()+16>
0x00000000000027a6 <+38>: retq
Disassembly of main_lock.out:
0x0000000000002780 <+0>: endbr64
0x0000000000002784 <+4>: cmpq $0x0,0x29b4(%rip) # 0x5140 <niters>
0x000000000000278c <+12>: je 0x27a5 <threadMain()+37>
0x000000000000278e <+14>: xor %eax,%eax
0x0000000000002790 <+16>: lock incq 0x29a0(%rip) # 0x5138 <global>
0x0000000000002798 <+24>: add $0x1,%rax
0x000000000000279c <+28>: cmp %rax,0x299d(%rip) # 0x5140 <niters>
0x00000000000027a3 <+35>: ja 0x2790 <threadMain()+16>
0x00000000000027a5 <+37>: retq
Conclusions:
the non-atomic version keeps global in a register: it loads global once, increments the register niters times, and stores it back to memory a single time at the end.
Therefore, very likely all four threads write back the same "wrong" value of 100000, as sketched in source form after the main_fail.out disassembly above.
std::atomic compiles to lock addq. The LOCK prefix makes the following add fetch, modify and update memory atomically.
our explicit inline assembly LOCK prefix compiles to almost the same thing as std::atomic, except that our inc is used instead of add. Not sure why GCC chose add, considering that our inc has an encoding 1 byte smaller.
ARMv8 could use either LDAXR + STLXR or LDADD in newer CPUs: How do I start threads in plain C?
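Note that global++ on a std::atomic is shorthand for a sequentially consistent fetch_add(1). On x86, relaxing the memory order does not change the generated code, because lock-prefixed read-modify-write instructions are already full barriers, but on weaker ISAs like ARMv8 it can allow cheaper instructions. A minimal sketch (fetch_add and std::memory_order_relaxed are standard C++11; the function name is mine):

#include <atomic>

std::atomic_ulong counter(0);

void incrementRelaxed() {
    // Still compiles to lock addq on x86; on ARMv8 the relaxed
    // ordering may allow e.g. LDADD instead of LDADDAL.
    counter.fetch_add(1, std::memory_order_relaxed);
}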
Tested in Ubuntu 19.10 AMD64, GCC 9.2.1, Lenovo ThinkPad P51.