What exactly is std::atomic?

Question

I understand that std::atomic<> is an atomic object. But atomic to what extent? To my understanding an operation can be atomic. What exactly is meant by making an object atomic? For example if there are two threads concurrently executing the following code:

a = a + 12;

Then is the entire operation (say add_twelve_to(int)) atomic? Or are changes made to the variable atomic (so operator=())?

You need to use something like `a.fetch_add(12)` if you want an atomic RMW. — Kerrek SB, Aug 13 '15 at 02:01
Yep that's what I don't understand. What is meant by making an object atomic. If there was an interface it could simply have been made atomic with a mutex or a monitor. — , Aug 13 '15 at 02:39
@AaryamanSagar it solves an issue of efficiency. *Mutexes and monitors* carry computational overhead. Using `std::atomic` lets the standard library decide what's needed to achieve atomicity. — Drew Dormann, Aug 13 '15 at 03:02
@AaryamanSagar: `std::atomic` is a type that *allows for* atomic operations. It doesn't magically make your life better, you still have to know what you want to do with it. It's for a very specific use case, and uses of atomic operations (on the object) are generally very subtle and need to be thought of from a non-local perspective. So unless you already know that and why you want atomic operations, the type is probably not of much use for you. — Kerrek SB, Aug 13 '15 at 10:38

score 303 · Accepted Answer · edited Jun 02 '19 at 05:28

Each instantiation and full specialization of std::atomic<> represents a type whose instances different threads can operate on simultaneously without causing undefined behavior:

Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.

In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.

std::atomic<> wraps operations that, before C++11, had to be performed using (for example) interlocked functions with MSVC or atomic builtins in the case of GCC.

Also, std::atomic<> gives you more control by allowing various memory orders that specify synchronization and ordering constraints.

Note that, for typical use cases, you would probably use the overloaded increment/decrement operators or the overloaded compound-assignment operators:

std::atomic<long> value(0);
value++; //This is an atomic op
value += 5; //And so is this

Because operator syntax does not allow you to specify the memory order, these operations will be performed with std::memory_order_seq_cst, as this is the default order for all atomic operations in C++11. It guarantees sequential consistency (total global ordering) between all atomic operations.

In some cases, however, this may not be required (and nothing comes for free), so you may want to use a more explicit form:

std::atomic<long> value {0};
value.fetch_add(1, std::memory_order_relaxed); // Atomic, but there are no synchronization or ordering constraints
value.fetch_add(5, std::memory_order_release); // Atomic, performs 'release' operation
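To make the relaxed case concrete, here is a minimal sketch of a counter shared by several threads. The helper name count_events and its parameters are made up for illustration. It assumes that only the final total matters, so no ordering with other memory accesses is needed:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: each of `nthreads` threads bumps a shared counter `iters` times.
long count_events(std::size_t nthreads, long iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t) {
        workers.emplace_back([&counter, iters] {
            for (long i = 0; i < iters; ++i) {
                // Atomic increment; imposes no ordering on surrounding memory accesses.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join(); // the end of each thread synchronizes with its join()
    }
    return counter.load(std::memory_order_relaxed);
}
```

Even with the weakest memory order no increment is lost, because each fetch_add is still a single atomic read-modify-write. What relaxed gives up is any guarantee about how other reads and writes are ordered around it.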

Now, your example:

a = a + 12;

will not evaluate to a single atomic op: it will result in a.load() (which is atomic itself), then addition between this value and 12, and a.store() (also atomic) of the final result. Another thread can modify a between the load and the store, and that modification is then overwritten. As I noted earlier, std::memory_order_seq_cst will be used here.

However, if you write a += 12, it will be an atomic operation (as I noted before) and is roughly equivalent to a.fetch_add(12, std::memory_order_seq_cst).
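The difference between the two forms can be observed with a small experiment. This is only a sketch: add_twelve_concurrently is a hypothetical helper written for this illustration, and how many updates the load/store form loses depends on timing and hardware.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs `nthreads` threads that each add 12 to `a`, `iters` times.
// use_rmw == true  -> a += 12     (one atomic read-modify-write)
// use_rmw == false -> a = a + 12  (atomic load, then a separate atomic store)
long add_twelve_concurrently(bool use_rmw, std::size_t nthreads, long iters) {
    std::atomic<long> a{0};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t) {
        workers.emplace_back([&a, use_rmw, iters] {
            for (long i = 0; i < iters; ++i) {
                if (use_rmw) {
                    a += 12;
                } else {
                    a = a + 12; // another thread may store between the load and the store
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return a.load();
}
```

With use_rmw set to true the result is always nthreads * iters * 12. With use_rmw set to false the program is still free of data races (every access is atomic), but the result can come out smaller because some updates are overwritten.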

As for your comment:

A regular int has atomic loads and stores. Whats the point of wrapping it with atomic<>?

Your statement is only true for architectures that provide such a guarantee of atomicity for stores and/or loads. There are architectures that do not do this. Also, it is usually required that operations be performed on a word-/dword-aligned address to be atomic. std::atomic<> is something that is guaranteed to be atomic on every platform, without additional requirements. Moreover, it allows you to write code like this:

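#include <atomic>
#include <cassert>
#include <thread>

void* sharedData = nullptr;
std::atomic<int> ready_flag{0}; // brace-init: "= 0" is ill-formed before C++17 (atomic is not copyable)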

// Thread 1
void produce()
{
    sharedData = generateData();
    ready_flag.store(1, std::memory_order_release);
}

// Thread 2
void consume()
{
    while (ready_flag.load(std::memory_order_acquire) == 0)
    {
        std::this_thread::yield();
    }

    assert(sharedData != nullptr); // will never trigger
    processData(sharedData);
}

Note that the assertion condition will always be true (and thus, the assertion will never trigger), so you can always be sure that the data is ready after the while loop exits. That is because:

store() to the flag is performed after sharedData is set (we assume that generateData() always returns something useful, in particular, never returns NULL) and uses std::memory_order_release order:

memory_order_release

A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable

sharedData is used after the while loop exits, and thus after load() from the flag has returned a non-zero value. load() uses std::memory_order_acquire order:

std::memory_order_acquire

A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.

This gives you precise control over the synchronization and allows you to explicitly specify how your code may/may not/will/will not behave. This would not be possible if the only guarantee was atomicity itself, especially when it comes to very interesting sync models like the release-consume ordering.
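For experimenting with this, here is a self-contained sketch of the same release/acquire hand-off. It is not the code above: run_handoff, payload and ready are hypothetical names, and a plain int stands in for sharedData so that the example compiles without generateData() or processData().

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for the sharedData example: publishes a plain int through a flag.
int run_handoff() {
    int payload = 0;                 // non-atomic data being published
    std::atomic<bool> ready{false};  // flag that carries the synchronization
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publishes the write above
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // reads 42: the acquire load saw the release store
    });

    producer.join();
    consumer.join();
    return observed;
}
```

If both memory orders were replaced with std::memory_order_relaxed, the flag itself would still be atomic, but the read of payload would become a data race, because nothing would order it after the write in the producer.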

Are there actually architectures which do not have atomic loads and stores for primitives like `int`s? — , Aug 13 '15 at 03:02
It's not only about atomicity. it's also about ordering, behaviour in multi-core systems, etc. You may want to read [this article](http://preshing.com/20120930/weak-vs-strong-memory-models/). — Mateusz Grzejek, Aug 13 '15 at 03:09
@AaryamanSagar If I'm not mistaken, even on x86 reads and writes are atomic ONLY if aligned on word boundaries. — v.shashenko, Jan 05 '16 at 11:53
@MateuszGrzejek I have taken a reference to an atomic type. Could you kindly verify if the following would still guarantee atomic operation on object assignment https://ideone.com/HpSwqo — xAditya3393, Jul 12 '18 at 22:16
@v.shashenko Any operation smaller than a word isn't "aligned on word boundaries" yet writing a single byte is atomic. — curiousguy, Jun 02 '19 at 04:00
@curiousguy Doesn't it depend on packing type? Even smaller-than-a-word data can be aligned on word boundaries with gaps between them, unless they're packed tightly, which is controlled during compilation. — v.shashenko, Jun 03 '19 at 08:01
@v.shashenko If "word aligned" means address divisible by the word size then not all naturally aligned smaller-than-a-word objects will be word aligned: half words will be aligned on half word boundary, etc. — curiousguy, Jun 03 '19 at 20:55
"_no reads or writes in the current thread can be reordered after this store._" Visibly reordered. The impl can still reorder f.ex. operations on "register" local variables (whose address is never accessible by another thread). Some other non atomic writes that can't be legally observed by other threads can also be reordered. — curiousguy, Jun 14 '19 at 00:35
I'm not clear on why atomic is needed in the producer/consumer example. If `ready_flag` is written to after `sharedData` then how could the consumer end its loop early? Or can a CPU choose to delay writing the value of `sharedData` to the memory shared by the threads until after it has written to `ready_flag`? — Tim MB, Sep 28 '19 at 13:43
@TimMB Yes, normally, you would have (at least) two situations, where order of execution may be altered: (1) compiler can reorder the instructions (as much as standard allows that) in order to provide better performance of the output code (based on the usage of CPU registers, predictions, etc.) and (2) CPU can execute instructions in a different order to, for example, minimize the number of cache sync points. Ordering constraints provided for `std::atomic` (`std::memory_order`) serves exactly the purpose of limiting the reorders that are allowed to happen. — Mateusz Grzejek, Oct 01 '19 at 12:27
What am I missing with the lack of ready_flag being set to 0 in the beginning of the produce method provided that it may be called multiple times? — , Mar 01 '21 at 09:14
generateData() doesn't need release semantics. The fact that the update to flag has release semantics guarantees that other threads see the result before flag is set. — SeattleCplusplus, Apr 22 '23 at 20:18

Ciro Santilli OurBigBook.com · Answer 2 · 2021-05-24T11:48:00.550

std::atomic exists because many ISAs have direct hardware support for it

What the C++ standard says about std::atomic has been analyzed in other answers.

So now let's see what std::atomic compiles to, in order to get a different kind of insight.

The main takeaway from this experiment is that modern CPUs have direct support for atomic integer operations, for example the LOCK prefix in x86, and std::atomic basically exists as a portable interface to those instructions (see also: What does the "lock" instruction mean in x86 assembly?). In aarch64, LDADD would be used.

This support allows for faster alternatives to more general methods such as std::mutex, which can make more complex multi-instruction sections atomic, at the cost of being slower than std::atomic: std::mutex makes futex system calls in Linux (at least when the lock is contended), which are far slower than the userland instructions emitted by std::atomic. See also: Does std::mutex create a fence?
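As a rough illustration of that trade-off, the sketch below uses std::mutex for an update that touches two variables and std::atomic for a single counter. Accounts, run_transfers and transfers_done are hypothetical names invented for this example, not part of the experiment that follows.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical two-variable state: the sum of `from` and `to` must stay constant.
struct Accounts {
    std::mutex m;
    long from = 1000;
    long to = 0;
};

// Moves one unit per iteration and returns the final sum of both balances.
long run_transfers(std::size_t nthreads, long iters, std::atomic<long>& transfers_done) {
    Accounts acc;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t) {
        workers.emplace_back([&acc, &transfers_done, iters] {
            for (long i = 0; i < iters; ++i) {
                {
                    // Two-step update: the debit and the credit must appear together.
                    std::lock_guard<std::mutex> lock(acc.m);
                    acc.from -= 1;
                    acc.to += 1;
                }
                // Single-variable update: one atomic instruction is enough.
                transfers_done.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return acc.from + acc.to;
}
```

The mutex is what keeps the sum invariant intact across the two writes. A pair of separate atomics could not do that, since another thread could observe the state between the debit and the credit.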

Let's consider the following multi-threaded program which increments a global variable across multiple threads, with different synchronization mechanisms depending on which preprocessor define is used.

main.cpp

#include <atomic>
#include <cstdint>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

size_t niters;

#if STD_ATOMIC
std::atomic_ulong global(0);
#else
uint64_t global = 0;
#endif

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
#if LOCK
        __asm__ __volatile__ (
            "lock incq %0;"
            : "+m" (global),
              "+g" (i) // to prevent loop unrolling
            :
            :
        );
#else
        __asm__ __volatile__ (
            ""
            : "+g" (i) // to prevent the loop from being optimized to a single add
            : "g" (global)
            :
        );
        global++;
#endif
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    uint64_t expect = nthreads * niters;
    std::cout << "expect " << expect << std::endl;
    std::cout << "global " << global << std::endl;
}

GitHub upstream.

Compile, run and disassemble:

common="-ggdb3 -O3 -std=c++11 -Wall -Wextra -pedantic main.cpp -pthread"
g++ -o main_fail.out                    $common
g++ -o main_std_atomic.out -DSTD_ATOMIC $common
g++ -o main_lock.out       -DLOCK       $common

./main_fail.out       4 100000
./main_std_atomic.out 4 100000
./main_lock.out       4 100000

gdb -batch -ex "disassemble threadMain" main_fail.out
gdb -batch -ex "disassemble threadMain" main_std_atomic.out
gdb -batch -ex "disassemble threadMain" main_lock.out

Extremely likely "wrong" race condition output for main_fail.out:

expect 400000
global 100000

and deterministic "correct" output of the others:

expect 400000
global 400000

Disassembly of main_fail.out:

   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     mov    0x29b5(%rip),%rcx        # 0x5140 <niters>
   0x000000000000278b <+11>:    test   %rcx,%rcx
   0x000000000000278e <+14>:    je     0x27b4 <threadMain()+52>
   0x0000000000002790 <+16>:    mov    0x29a1(%rip),%rdx        # 0x5138 <global>
   0x0000000000002797 <+23>:    xor    %eax,%eax
   0x0000000000002799 <+25>:    nopl   0x0(%rax)
   0x00000000000027a0 <+32>:    add    $0x1,%rax
   0x00000000000027a4 <+36>:    add    $0x1,%rdx
   0x00000000000027a8 <+40>:    cmp    %rcx,%rax
   0x00000000000027ab <+43>:    jb     0x27a0 <threadMain()+32>
   0x00000000000027ad <+45>:    mov    %rdx,0x2984(%rip)        # 0x5138 <global>
   0x00000000000027b4 <+52>:    retq

Disassembly of main_std_atomic.out:

   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     cmpq   $0x0,0x29b4(%rip)        # 0x5140 <niters>
   0x000000000000278c <+12>:    je     0x27a6 <threadMain()+38>
   0x000000000000278e <+14>:    xor    %eax,%eax
   0x0000000000002790 <+16>:    lock addq $0x1,0x299f(%rip)        # 0x5138 <global>
   0x0000000000002799 <+25>:    add    $0x1,%rax
   0x000000000000279d <+29>:    cmp    %rax,0x299c(%rip)        # 0x5140 <niters>
   0x00000000000027a4 <+36>:    ja     0x2790 <threadMain()+16>
   0x00000000000027a6 <+38>:    retq

Disassembly of main_lock.out:

Dump of assembler code for function threadMain():
   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     cmpq   $0x0,0x29b4(%rip)        # 0x5140 <niters>
   0x000000000000278c <+12>:    je     0x27a5 <threadMain()+37>
   0x000000000000278e <+14>:    xor    %eax,%eax
   0x0000000000002790 <+16>:    lock incq 0x29a0(%rip)        # 0x5138 <global>
   0x0000000000002798 <+24>:    add    $0x1,%rax
   0x000000000000279c <+28>:    cmp    %rax,0x299d(%rip)        # 0x5140 <niters>
   0x00000000000027a3 <+35>:    ja     0x2790 <threadMain()+16>
   0x00000000000027a5 <+37>:    retq

Conclusions:

the non-atomic version saves the global to a register, and increments the register.

Therefore, at the end, very likely four writes happen back to global with the same "wrong" value of 100000.
std::atomic compiles to lock addq. The LOCK prefix makes the following add fetch, modify and update memory atomically.
our explicit inline assembly LOCK prefix compiles to almost the same thing as std::atomic, except that our inc is used instead of add. Not sure why GCC chose add, considering that our INC generated an encoding 1 byte smaller.

ARMv8 could use either LDAXR + STLXR or LDADD in newer CPUs: How do I start threads in plain C?

Tested in Ubuntu 19.10 AMD64, GCC 9.2.1, Lenovo ThinkPad P51.

score 22 · Answer 3 · answered Aug 13 '15 at 02:42

I understand that std::atomic<> makes an object atomic.

That's a matter of perspective... you can't apply it to arbitrary objects and have their operations become atomic, but the provided specialisations for (most) integral types and pointers can be used.

a = a + 12;

std::atomic<> does not (use template expressions to) simplify this to a single atomic operation. Instead, the operator T() const volatile noexcept member does an atomic load() of a, then twelve is added, and operator=(T t) noexcept does a store(t).
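When the update you need has no dedicated member function, the usual way to turn a load, compute, store sequence into one atomic step is a compare-exchange loop. The sketch below goes slightly beyond what this answer covers: atomic_multiply is a hypothetical helper, chosen because std::atomic has no operator*= or fetch_mul.

```cpp
#include <atomic>

// Hypothetical helper: atomically replaces `a` with `a * factor` and returns the old value.
long atomic_multiply(std::atomic<long>& a, long factor) {
    long expected = a.load();
    // If `a` no longer holds `expected`, compare_exchange_weak fails and
    // writes the current value into `expected`, so the loop retries with fresh data.
    while (!a.compare_exchange_weak(expected, expected * factor)) {
    }
    return expected;
}
```

Unlike a = a * factor, the store here only succeeds if no other thread changed a since it was read, so no concurrent update is silently overwritten.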

answered Aug 13 '15 at 02:42

Tony Delroy


That was what I wanted to ask. A regular int has atomic loads and stores. Whats the point of wrapping it with atomic<> – Aug 13 '15 at 02:50
@AaryamanSagar Simply modifying a normal `int` doesn't portably ensure the change is visible from other threads, nor does reading it ensure you see other threads' changes, and some things like `my_int += 3` aren't guaranteed to be done atomically unless you use `std::atomic<>` - they might involve a fetch, then add, then store sequence, wherein some other thread trying to update the same value might come in after the fetch and before the store, and clobber your thread's update. – Tony Delroy Aug 13 '15 at 02:56
"_Simply modifying a normal int doesn't portably ensure the change is visible from other threads_" It's worse than that: any attempt to measure that visibility would result in UB. – curiousguy Jun 14 '19 at 00:36
