What exact rules in the C++ memory model prevent reordering before acquire operations?

Question

I have a question regarding the order of operations in the following code:

std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;
void thread1() {
  y.exchange(1, std::memory_order_acq_rel);
  r1 = x.load(std::memory_order_relaxed);
}
void thread2() {
  x.exchange(1, std::memory_order_acq_rel);
  r2 = y.load(std::memory_order_relaxed);
}

Given the description of std::memory_order_acquire on the cppreference page (https://en.cppreference.com/w/cpp/atomic/memory_order), that

A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load.

it seems obvious that there can never be an outcome that r1 == 0 && r2 == 0 after running thread1 and thread2 concurrently.

However, I cannot find any wording in the C++ standard (looking at the C++14 draft right now), which establishes guarantees that two relaxed loads cannot be reordered with acquire-release exchanges. What am I missing?

EDIT: As has been suggested in the comments, it is actually possible to get both r1 and r2 equal to zero. I've updated the program to use load-acquire as follows:

std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;
void thread1() {
  y.exchange(1, std::memory_order_acq_rel);
  r1 = x.load(std::memory_order_acquire);
}
void thread2() {
  x.exchange(1, std::memory_order_acq_rel);
  r2 = y.load(std::memory_order_acquire);
}

Now is it possible to get both and r1 and r2 equal to 0 after concurrently executing thread1 and thread2? If not, which C++ rules prevent this?

Just to confirm: you want the [language-lawyer] answer why the two relaxed loads are still considered reads, and therefore bound by the preceding reordering barrier? — MSalters, Oct 02 '18 at 10:43
@MSalters I definitely want a language-lawyer answer, thanks, added the corresponding tag. I'm probably just confused how is the acquire barrier (load-load+load-store) formally defined in the standard.. The only thing I found was the release-acquire pairing which is not something relevant here. — Oleg Andreev, Oct 02 '18 at 10:49
@Caleth Well, I thought that if this has been sequenced-before, then replacing the `exchange(std::memory_order_acq_rel)` with `exchange(std::memory_order_relaxed)` would not change the program correctness, but if I understand correctly it changes the program: compiler or CPU can now reorder the exchange with load and, thus, make outcome `r1 == 0 && r2 == 0` possible. Am I wrong? — Oleg Andreev, Oct 02 '18 at 11:41
It's the acquire-release that enforces the order. In this snippet, *whatever* order you specify for the loads doesn't change that. — Caleth, Oct 02 '18 at 11:51
@Caleth Yup, that is exactly what I'm asking: what rules from C++ standard describe the acquire barrier on `exchange`. — Oleg Andreev, Oct 02 '18 at 11:54
in concrete code can be `r1==0 && r2 == 0` - no any synchronization point between 2 threads — RbMm, Oct 02 '18 at 12:00
Actually, I think I was confused by the discussion of replacing the exchange. It *is* possible to see `r1==0 && r2 == 0` here, because acquire-release (and consume-release) are *paired*. Only one thread acquires (and releases) each atomic, so the following relaxed load can see an "old" value — Caleth, Oct 02 '18 at 12:06
So, the only way to fix the code is replacing all `memory_orders` with `seq_cst`? — Oleg Andreev, Oct 02 '18 at 12:08
No, you can acquire on the load as well as release on the exchange — Caleth, Oct 02 '18 at 12:09
@OlegAndreev- task not in memory order, but you need the same atomic object in both threads for create synchronization point. here you use different — RbMm, Oct 02 '18 at 12:15
@Caleth I've updated the question with load-acquires. I still am somewhat confused: the standard establishes the dependency between store-then-load, but does not say anything about the other way round. What prevents each of the loads occuring before the corresponding store? — Oleg Andreev, Oct 02 '18 at 12:17
you can still get r1==0 and r2==0. where is synchronization point between 2 threads ? x will be read after write to y from thread1 view. y will be read after x write from thread2 view. but no synchronization between 2 threads. both threads can read data before view changes from another thread — RbMm, Oct 02 '18 at 12:20
@RbMm What do you mean by synchronization point? If you replace all `memory_orders` with `seq_cst` you can never get r1 == 0 and r2 == 0, because you have one global order of modifications. My question is whether you can replace `seq_cst` with a more relaxed memory ordering. — Oleg Andreev, Oct 02 '18 at 12:43
@OlegAndreev - i mean the **same** atomic variable. not exist it in your code — RbMm, Oct 02 '18 at 12:45
and so what ? The synchronization is established only between the threads releasing and acquiring the **same** atomic variable. no such variable in your code. — RbMm, Oct 02 '18 at 13:37
no, true. no such common variable. try prove that impossible r1==0 and r2==0. why this must be ? — RbMm, Oct 02 '18 at 13:44
if you say use `z.exchange(1, std::memory_order_acq_rel);` (`M`) in both threads - all what you write before `M` in thread1 visible after `M` in thread2 and visa versa. but no such common z. the `y.exchange(1, memory_order_acq_rel);` not synchronization point with `x.exchange(1, memory_order_acq_rel);` because x and y different memory locations — RbMm, Oct 02 '18 at 13:48
https://groups.google.com/forum/#!topic/lock-free/Nescdq-8qVM — RbMm, Oct 02 '18 at 22:02
@OlegAndreev "_acquire barrier_" there is no such thing, only observation of a given release operation. Note that a release as a first memory operation in a thread makes no sense (nothing to release here) and acquire as a non reading operation also makes no sense so `memory_order_acq_rel` is useless and so all operations should be relaxed as they are already relaxed anyway. — curiousguy, May 31 '19 at 19:31

score 14 · Accepted Answer · answered Oct 03 '18 at 06:02

14

The standard does not define the C++ memory model in terms of how operations are ordered around atomic operations with a specific ordering parameter. Instead, for the acquire/release ordering model, it defines formal relationships such as "synchronizes-with" and "happens-before" that specify how data is synchronized between threads.

N4762, §29.4.2 - [atomics.order]

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

In §6.8.2.1-9, the standard also states that if a store A synchronizes with a load B, anything sequenced before A inter-thread "happens-before" anything sequenced after B.

No "synchronizes-with" (and hence inter-thread happens-before) relationship is established in your second example (the first is even weaker) because the runtime relationships (that check the return values from the loads) are missing.
But even if you did check the return value, it would not be helpful since the exchange operations do not actually 'release' anything (i.e. no memory operations are sequenced before those operations). Neiter do the atomic load operations 'acquire' anything since no operations are sequenced after the loads.

Therefore, according to the standard, each of the four possible outcomes for the loads in both examples (including 0 0) is valid. In fact, the guarantees given by the standard are no stronger than memory_order_relaxed on all operations.

If you want to exclude the 0 0 result in your code, all 4 operations must use std::memory_order_seq_cst. That guarantees a single total order of the involved operations.

answered Oct 03 '18 at 06:02

LWimsey

6,189
2
25
53

3

After rereading the relevant parts from the standard multiple times I feel like this is the correct answer. This is unfortunate, because it differs from what many people believe (that acquire is basically LoadLoad+LoadStore, release is LoadStore+StoreStore and acq_rel on an RMW is equivalent to the seq_cst if only two threads are involved). The good news I guess is that for all relevant archs acq_rel RMW compiles to the same code as seq_cst RMW. – Oleg Andreev Oct 03 '18 at 09:40
BTW, do I understand correctly that another (much more stupid, but still) way to fix this is change the load operations to exchange/compare_exchange with acq_rel memory order, because that would establish one true modification order for each of x, y variables? – Oleg Andreev Oct 03 '18 at 09:41
1

@OlegAndreev - you can do next `x.store(1, memory_order_relaxed); atomic_thread_fence(memory_order_acq_rel); r1=y.exchange(1, memory_order_relaxed);` and `y.store(1, memory_order_relaxed); atomic_thread_fence(memory_order_acq_rel); r2 = x.exchange(1, memory_order_relaxed);` in this case can be prove that `r1+r2 != 0` or next `x.exchange(1, memory_order_acq_rel); r1 = y.exchange(1, memory_order_acq_rel)` and `y.exchange(1, memory_order_acq_rel); r2 = x.exchange(1, memory_order_acq_rel);` – RbMm Oct 03 '18 at 10:49
@RbMm Nice! So basically you cannot do pure loads, you have to replace them with load-and-stores. – Oleg Andreev Oct 03 '18 at 11:19
@OlegAndreev - i try [describe](https://stackoverflow.com/a/52627628/6401656) how i view situation from formal c++ rules – RbMm Oct 03 '18 at 13:01
@OlegAndreev: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html points out that there are 2 ways to map C++11 sequential-consistency onto ARM asm: either fence before/after stores, and after loads. Or fence before/after loads but only before stores. Since having cheap loads is usually better / more important than cheap stores, compilers choose the first way. But both satisfy the C++11 rules. (Atomic RMW operations are the same either way, though.) But according to that page, ARM acq_rel cmpxchg only needs `isb` at the end, while seq_cst needs `dmb ish`. I didn't check real gcc though – Peter Cordes Oct 03 '18 at 14:28
2

@PeterCordes Oh, I was interested mostly in ARMv8 (aarch64). At least gcc and clang generated the identical code for both acq_rel and seq_cst xchg: https://godbolt.org/z/uDcqRJ There is even some uncertainty whether this code is actually a correct barrier: (http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html or https://stackoverflow.com/questions/21535058/arm64-ldxr-stxr-vs-ldaxr-stlxr vs https://community.arm.com/processors/f/discussions/6558/spin-lock-implementation-for-aarch64----how-to-enforce-acquire-semantics) In the end the store(seq_cst) looks nice enough:) – Oleg Andreev Oct 03 '18 at 15:13
4

@OlegAndreev To your first point. Those people are not wrong.. The C++ committee did an excellent job by taking inter-thread behavior to a slightly higher abstraction level. By defining those formal relationships (sync-with, HB), they provide a framework in which atomic operations behave well-defined as long as they fit within that framework. For code that does not fit into it (such as your examples, no defined HB), they don't specify its ordering behavior by allowing any possible outcome..... – LWimsey Oct 03 '18 at 20:01
3

.... By following that approach, the C++ standard saves programmers from dealing with the messy world of memory reordering. But looking at this from a memory reordering approach (acquire = #LoadLoad/#LoadStore, etc.) is perfectly fine. Even though the standard allows the 0 0 outcome, it is virtually impossible based on those lower level properties. The standard does not forbid, but just does not want to encourage code like that. – LWimsey Oct 03 '18 at 20:02
@OlegAndreev To your second point.. By replacing the loads with RMW calls, the 0 0 outcome has become even less likely because RMW's are guaranteed to observe the latest in the modification order. As long as the compiler does not re-order the statements (which it won't do with those orderings) you're fine. But even then, the standard does not really specify the behavior which leaves the 0 0 result as a (highly theoretical) possibility – LWimsey Oct 03 '18 at 20:18
@OlegAndreev btw, that last comment only applies to the scenario whereby the loads are replaced with RMW's still using `mo_acquire`. If all 4 calls are RMW's with `mo_acq_rel`, then the standard guarantees that 0 0 is not possible – LWimsey Oct 03 '18 at 21:52
@OlegAndreev "_believe (that **acquire is basically LoadLoad+LoadStore**_" The semantic of atomic operation was written with the **explicit goal of never doing just that**: atomic operations are no substitute for barriers where they are needed. When an object isn't used by multiple threads (like a refcount only used in the current thread) the compiler can remove the atomic part and generate fast unthreaded code. This wouldn't always be possible if two threads doing ordered atomic operations on distinct object could somehow obtain guarantees regarding operations on shared unordered atomics. – curiousguy Nov 10 '18 at 01:09

Peter Cordes · Answer 2 · 2021-03-27T19:17:58.977

You already have an answer to the language-lawyer part of this. But I want to answer the related question of how to understand why this can be possible in asm on a possible CPU architecture that uses LL/SC for RMW atomics.

It doesn't make sense for C++11 to forbid this reordering: it would require a store-load barrier in this case where some CPU architectures could avoid one.

It might actually be possible with real compilers on PowerPC, given the way they map C++11 memory-orders to asm instructions.

On PowerPC64, a function with an acq_rel exchange and an acquire load (using pointer args instead of static variables) compiles as follows with gcc6.3 -O3 -mregnames. This is from a C11 version because I wanted to look at clang output for MIPS and SPARC, and Godbolt's clang setup works for C11 <atomic.h> but fails for C++11 <atomic> when you use -target sparc64.

#include <stdatomic.h>   // This is C11, not C++11, for Godbolt reasons

long foo(_Atomic long *a, _Atomic int *b) {
  atomic_exchange_explicit(b, 1, memory_order_acq_rel);
  //++*a;
  return atomic_load_explicit(a, memory_order_acquire);
}

(source + asm on Godbolt for MIPS32R6, SPARC64, ARM 32, and PowerPC64.)

foo:
    lwsync            # with seq_cst exchange this is full sync, not just lwsync
                      # gone if we use exchage with mo_acquire or relaxed
                      # so this barrier is providing release-store ordering
    li %r9,1
.L2:
    lwarx %r10,0,%r4    # load-linked from 0(%r4)
    stwcx. %r9,0,%r4    # store-conditional 0(%r4)
    bne %cr0,.L2        # retry if SC failed
    isync             # missing if we use exchange(1, mo_release) or relaxed

    ld %r3,0(%r3)       # 64-bit load double-word of *a
    cmpw %cr7,%r3,%r3
    bne- %cr7,$+4       # skip over the isync if something about the load? PowerPC is weird
    isync             # make the *a load a load-acquire
    blr

isync is not a store-load barrier; it only requires the preceding instructions to complete locally (retire from the out-of-order part of the core). It doesn't wait for the store buffer to be flushed so other threads can see the earlier stores.

Thus the SC (stwcx.) store that's part of the exchange can sit in the store buffer and become globally visible after the pure acquire-load that follows it. In fact, another Q&A already asked this, and the answer is that we think this reordering is possible. Does `isync` prevent Store-Load reordering on CPU PowerPC?

If the pure load is seq_cst, PowerPC64 gcc puts a sync before the ld. Making the exchange seq_cst does not prevent the reordering. Remember that C++11 only guarantees a single total order for SC operations, so the exchange and the load both need to be SC for C++11 to guarantee it.

So PowerPC has a bit of an unusual mapping from C++11 to asm for atomics. Most systems put the heavier barriers on stores, allowing seq-cst loads to be cheaper or only have a barrier on one side. I'm not sure if this was required for PowerPC's famously-weak memory ordering, or if another choice was possible.

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows some possible implementations on various architectures. It mentions multiple alternatives for ARM.

On AArch64, we get this for the question's original C++ version of thread1:

thread1():
    adrp    x0, .LANCHOR0
    mov     w1, 1
    add     x0, x0, :lo12:.LANCHOR0
.L2:
    ldaxr   w2, [x0]            @ load-linked with acquire semantics
    stlxr   w3, w1, [x0]        @ store-conditional with sc-release semantics
    cbnz    w3, .L2             @ retry until exchange succeeds

    add     x1, x0, 8           @ the compiler noticed the variables were next to each other
    ldar    w1, [x1]            @ load-acquire

    str     w1, [x0, 12]        @ r1 = load result
    ret

The reordering can't happen there because AArch64 acquire loads interact with release stores to give sequential consistency, not just plain acq/rel. Release stores can't reorder with later acquire loads.

(They can reorder with later plain loads, on paper and probably in some real hardware. AArch64 seq_cst can be cheaper than on other ISAs, if you avoid acquire loads right after release stores. But unfortunately it makes acq/rel worse than x86. This is fixed with ARMv8.3-A LDAPR, a load that's just acquire not sequential-acquire. It allows earlier stores, even STLR, to reorder with it. So you get just acq_rel, allowing StoreLoad reordering but not other reordering. (It's also an optional feature in ARMv8.2-A).)

On a machine that also or instead had plain-release LL/SC atomics, it's easy to see that an acq_rel doesn't stop later loads to different cache lines from becoming globally visible after the LL but before the SC of the exchange.

If exchange is implemented with a single transaction like on x86, so the load and store are adjacent in the global order of memory operations, then certainly no later operations can be reordered with an acq_rel exchange and it's basically equivalent to seq_cst.

But LL/SC doesn't have to be a true atomic transaction to give RMW atomicity for that location.

In fact, a single asm swap instruction could have relaxed or acq_rel semantics. SPARC64 needs membar instructions around its swap instruction, so unlike x86's xchg it's not seq-cst on its own. (SPARC has really nice / human readable instruction mnemonics, especially compared to PowerPC. Well basically anything is more readable that PowerPC.)

Thus it doesn't make sense for C++11 to require that it did: it would hurt an implementation on a CPU that didn't otherwise need a store-load barrier.

Thanks! Didn't know about AArch64 release-stores being sequential release (the seeming absense of dmb before ldaxr and after stlxr to prevent reordering was puzzling me). — Oleg Andreev, Oct 03 '18 at 22:02
Nice addition.. It shows the standards committee really understood what they were doing — LWimsey, Oct 03 '18 at 22:35
So if I understand correctly, even in I use seq_cst RWM on such a hypothetical architecture, it still would not be a barrier between two relaxed stores, right? I mean the following reordering seems possible: `st r0, [r1]; ll r2, [r3]; sc r4, [r3]; st r5, [r6]` -> `ll r2, [r3]; st r5, [r6]; st r0, [r1]; sc r4, [r3]`. Do I understand correctly that, the `std::atomic_thread_fence(seq_cst)` would prevent such a reordering? — Oleg Andreev, Oct 04 '18 at 08:04
@OlegAndreev: On a hypothetical ARM which also had plain-release store and store-conditional, the store part of a `seq_cst` RMW would be `stlxr` (sequential-release), not a relaxed store. If the architecture only had acq_rel and relaxed LL/SC, a seq-cst exchange would maybe use relaxed LL/SC and then a full barrier, because on every normal ISA any barrier that include StoreLoad also includes all the others. I don't know which 2 "relaxed stores" you're talking about, because the example has two release stores. — Peter Cordes, Oct 04 '18 at 08:09
I was talking about an imaginary ISA and the following code: `a.store(relaxed); b.exchange(acq_rel); c.store(relaxed);`, whether the `a` and `c` stores can get reordered. IIUC the seq_cst accesses must have the total ordering only around other seq_cst accesses, otherwise they are the same rules as acq/rel/acq_rel. — Oleg Andreev, Oct 04 '18 at 08:40
@OlegAndreev: a/c reordering is easily possible with an acq_rel exchange. With an seq_cst exchange I'm not 100% sure if another thread syncing with the seq-release store would be allowed to see the new value of `c` *before* the new value of `b`. But I think so, because when you're only seeing the old value of `b`, you haven't synchronized with that writer so C++11 doesn't guarantee it. So in asm on an ISA with multi-copy atomic cache coherency, a sequential-release store can reorder with later relaxed stores, I think. And you can still recover seq-cst with release stores, so it's plausbile — Peter Cordes, Oct 04 '18 at 09:04
@PeterCordes your remark about AArch64 "_But unfortunately it makes acq/rel worse than x86 because it doesn't have weaker instructions to give just acq_rel and allow StoreLoad reordering but not other reordering_" is outdated, LDAPR and friends were later introduced and (along with STLR) provide acquire/release semantics exactly. — Daniel Nitzan, Mar 27 '21 at 19:01
@DanielNitzan: Thanks, I became aware of that instruction a while ago, but didn't go searching out my old answers lamenting AArch64 only having seq_cst. (And I earlier had the misconception that `stlr` would be like a sequential-release store on x86, draining the store buffer on the spot before any later loads. But it's actually nowhere near that bad unless you run an `ldar` right after. So I've probably hinted or implied that misconception in some other answer. If you spot one, feel free to edit and/or let me know.) — Peter Cordes, Mar 27 '21 at 19:20

RbMm · Answer 3 · 2018-10-03T13:05:51.443

in Release-Acquire ordering for create synchronization point between 2 threads we need some atomic object M which will be the same in both operations

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

or in more details:

If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

The synchronization is established only between the threads releasing and acquiring the same atomic variable.

     N = u                |  if (M.load(acquire) == v)    :[B]
[A]: M.store(v, release)  |  assert(N == u)

here synchronization point on M store-release and load-acquire(which take value from store-release !). as result store N = u in thread A (before store-release on M) visible in B (N == u) after load-acquire on same M

if take example:

atomic<int> x, y;
int r1, r2;

void thread_A() {
  y.exchange(1, memory_order_acq_rel);
  r1 = x.load(memory_order_acquire);
}
void thread_B() {
  x.exchange(1, memory_order_acq_rel);
  r2 = y.load(memory_order_acquire);
}

what we can select for common atomic object M ? say x ? x.load(memory_order_acquire); will be synchronization point with x.exchange(1, memory_order_acq_rel) ( memory_order_acq_rel include memory_order_release (more strong) and exchange include store) if x.load load value from x.exchange and main will be synchronized loads after acquire (be in code after acquire nothing exist) with stores before release (but again before exchange nothing in code).

correct solution (look for almost exactly question ) can be next:

atomic<int> x, y;
int r1, r2;

void thread_A()
{
    x.exchange(1, memory_order_acq_rel); // [Ax]
    r1 = y.exchange(1, memory_order_acq_rel); // [Ay]
}

void thread_B()
{
    y.exchange(1, memory_order_acq_rel); // [By]
    r2 = x.exchange(1, memory_order_acq_rel); // [Bx]
}

assume that r1 == 0.

All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable.

we have 2 modification of y : [Ay] and [By]. because r1 == 0 this mean that [Ay] happens before [By] in total modification order of y. from this - [By] read value stored by [Ay]. so we have next:

A is write to x - [Ax]
A do store-release [Ay] to y after this ( acq_rel include release, exchange include store)
B load-acquire from y ([By] value stored by [Ay]
once the atomic load-acquire (on y) is completed, thread B is guaranteed to see everything thread A wrote to memory before store-release (on y). so it view side-effect of [Ax] - and r2 == 1

another possible solution use atomic_thread_fence

atomic<int> x, y;
int r1, r2;

void thread_A()
{
    x.store(1, memory_order_relaxed); // [A1]
    atomic_thread_fence(memory_order_acq_rel); // [A2]
    r1 = y.exchange(1, memory_order_relaxed); // [A3]
}

void thread_B()
{
    y.store(1, memory_order_relaxed); // [B1]
    atomic_thread_fence(memory_order_acq_rel); // [B2]
    r2 = x.exchange(1, memory_order_relaxed); // [B3]
}

again because all modifications of atomic variable y occur in a total order. [A3] will be before [B1] or visa versa.

if [B1] before [A3] - [A3] read value stored by [B1] => r1 == 1.
if [A3] before [B1] - the [B1] is read value stored by [A3] and from Fence-fence synchronization:

A release fence [A2] in thread A synchronizes-with an acquire fence [B2] in thread B, if:

There exists an atomic object y,
There exists an atomic write [A3] (with any memory order) that modifies y in thread A
[A2] is sequenced-before [A3] in thread A
There exists an atomic read [B1] (with any memory order) in thread B
[B1] reads the value written by [A3]
[B1] is sequenced-before [B2] in thread B

In this case, all stores ([A1]) that are sequenced-before [A2] in thread A will happen-before all loads ([B3]) from the same locations (x) made in thread B after [B2]

so [A1] (store 1 to x) will be before and have visible effect for [B3] (load form x and save result to r2 ). so will be loaded 1 from x and r2==1

[A1]: x = 1               |  if (y.load(relaxed) == 1) :[B1]
[A2]: ### release ###     |  ### acquire ###           :[B2]
[A3]: y.store(1, relaxed) |  assert(x == 1)            :[B3]

score 3 · Answer 4 · answered Aug 29 '19 at 16:23

3

As language lawyer reasonings are hard to follow, I thought I'd add how a programmer who understands atomics would reason about the second snippet in your question:

Since this is symmetrical code, it is enough to look at just one side. Since the question is about the value of r1 (r2), we start with looking at

r1 = x.load(std::memory_order_acquire);

Depending on what the value of r1 is, we can say something about the visibility of other values. However, since the value of r1 isn't tested - the acquire is irrelevant. In either case, the value of r1 can be any value that was ever written to it (in the past or future *)). Therefore it can be zero. Nevertheless, we can assume it to BE zero because we're interested in whether or not the outcome of the whole program can be 0 0, which is a sort of testing the value of r1.

Hence, assuming we read zero THEN we can say that if that zero was written by another thread with memory_order_release then every other write to memory done by that thread before the store release will also be visible to this thread. However, value of zero that we read is the initialization value of x, and initialization values are non-atomic - let alone a 'release' - and certainly there wasn't anything "ordered" in front of them in terms of writing that value to memory; so there is nothing we can say about the visibility of other memory locations. In other words, again, the 'acquire' is irrelevant.

So, we can get r1 = 0 and the fact that we used acquire is irrelevant. The same reasoning then holds of r2. So the result can be r1 = r2 = 0.

In fact, if you assume the value of r1 is 1 after the load acquire, and that that 1 was written by thread2 with memory order release (which MUST be the case, since is the only place where a value of 1 is ever written to x) then all we know is that everything written to memory by thread2 before that store release will also be visible to thread1 (provided thread1 read x == 1 thus!). But thread2 doesn't write ANYTHING before writing to x, so again the whole release-acquire relationship is irrelevant, even in the case of loading a value of 1.

*) However, it is possible with further reasoning to show that certain value can never occur because of inconsistency with the memory model - but that doesn't happen here.

answered Aug 29 '19 at 16:23

Carlo Wood

5,648
2
35
47

"_However, since the value of r1 isn't tested - the acquire is irrelevant_" Throw away acquire of X (acquire load then ignore value as in `(void)X.load(ack);`) is only irrelevant if there is no preceding load of X (in program order) whose value was not examined (and will not be examined ever). You seem to suggest that `(void)X.load(ack);` can be ignored, in general, by the compiler. It cannot w/o considering context. – curiousguy Sep 03 '19 at 21:42
No, I said "the acquire is irrelevant" - meaning it has no effect because it isn't used. With this I meant that you could replace the load with `r1 = x.load(std::memory_model_relaxed);`, (not leave out the whole load!). I'm sorry if that wasn't clear. – Carlo Wood Sep 04 '19 at 15:14
Yes but it isn't clear whether you say that as a general truth, that in any similar program *fragment* "the acquire is irrelevant" "since the value of r1 isn't tested" or in that particular program. You seem to suggest it's a rule that a compiler can ignore acquire when there is no conditional (or indexed access) in a program fragment. – curiousguy Sep 04 '19 at 20:56
What I'm trying to say is that if you read something with acquire but don't use the result, then you might as well have used relaxed. I doubt a compiler would use that as an optimization: not using a value is strange to begin with. At most it could print a warning I guess. I just added how I'd reason about THIS code snippet, it surely isn't enough to learn how to reason about atomics in general. – Carlo Wood Sep 06 '19 at 14:50
"_if you read_ (...)" What I'm trying to say is that you have not established that theorem, which does not seem universal. – curiousguy Sep 06 '19 at 15:02
Well, among others I wrote a tool that analyses the possible behavior of a C++ code snippet with respect to the C++ memory model; so I know something ;). – Carlo Wood Sep 06 '19 at 19:52
To be clear, your claim here is that "throw-away acquire is inherently useless". That is `(void)X.load(acq);` is a NOP by definition thus can be optimized away. Right? – curiousguy Sep 07 '19 at 03:25
"_and initialization values are non-atomic - let alone a 'release' - and certainly there wasn't anything "ordered" in front of them in terms of writing that value to memory_" Init is not atomic? What is that supposed to even mean? What if it was? Do you not see the init of a var you are using? Is not the init by definition **in our past**? – curiousguy Sep 07 '19 at 05:13

score 1 · Answer 5 · answered Oct 02 '18 at 15:05

1

In the original version, it is possible to see r1 == 0 && r2 == 0 because there is no requirement that the stores propogate to the other thread before it reads it. This is not a re-ordering of either thread's operations, but e.g. a read of stale cache.

Thread 1's cache   |   Thread 2's cache
  x == 0;          |     x == 0;
  y == 0;          |     y == 0;

y.exchange(1, std::memory_order_acq_rel); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2

The release on Thread 1 is ignored by Thread 2, and vice-versa. In the abstract machine there is not consistency with the values of x and y on the threads

Thread 1's cache   |   Thread 2's cache
  x == 0; // stale |     x == 1;
  y == 1;          |     y == 0; // stale

r1 = x.load(std::memory_order_relaxed); // Thread 1
r2 = y.load(std::memory_order_relaxed); // Thread 2

You need more threads to get "violations of causality" with acquire / release pairs, as the normal ordering rules, combined with the "becomes visible side effect in" rules force at least one of the loads to see 1.

Without loss of generality, let's assume that Thread 1 executes first.

Thread 1's cache   |   Thread 2's cache
  x == 0;          |     x == 0;
  y == 0;          |     y == 0;

y.exchange(1, std::memory_order_acq_rel); // Thread 1

Thread 1's cache   |   Thread 2's cache
  x == 0;          |     x == 0;
  y == 1;          |     y == 1; // sync

The release on Thread 1 forms a pair with the acquire on Thread 2, and the abstract machine describes a consistent y on both threads

r1 = x.load(std::memory_order_relaxed); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
r2 = y.load(std::memory_order_relaxed); // Thread 2

answered Oct 02 '18 at 15:05

Caleth

52,200
2
44
75

C++ doesn't define ordering in terms of caches, except for http://eel.is/c++draft/intro.races#19 which says it "This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations." So if you want to talk about caches, you're talking about what real CPUs do. **Real hardware has coherent caches**. Typically maintaining coherency via MESI or equivalent. A store (or atomic RMW) can't update Thread 1's cache until after *invalidating* Thread 2's cache. – Peter Cordes Aug 29 '20 at 05:46
The effect we see *is* caused by local StoreLoad reordering in each core (which `seq_cst` would block with a full barrier), not incoherent caches. This is a widespread misconception, but please stop spreading it. See also https://preshing.com/20120515/memory-reordering-caught-in-the-act/, and https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ – Peter Cordes Aug 29 '20 at 05:47
(The "more threads" / violation of causality you're talking about is the IRIW litmus test: [Will two atomic writes to different locations in different threads always be seen in the same order by other threads?](https://stackoverflow.com/a/50679223) - most real hardware can't create this effect, but some POWER CPUs can/do.) – Peter Cordes Aug 29 '20 at 05:49
@PeterCordes Are you sure there is no hardware that doesn't have cache coherence guarantees? Yes particular (very popular) implementations have stronger guarantees for these cases, but is that universal? – Caleth Aug 29 '20 at 10:09
I can't be 100% certain, but I've never heard of a C++ compiler that has to actually emit explicit flush instructions for `mo_relaxed` loads to see stores by other threads, only barriers to locally order operations on different locations wrt. each other. ARM hardware with heterogeneous microcontroller + DSP that aren't cache coherent exist, but C++ implementations don't have std::thread run threads across those cores. (ARM assumes that threads of a single program will be in the same "inner shareable" coherency domain, which those non-coherent cores aren't, even though they do share memory.) – Peter Cordes Aug 29 '20 at 10:15
In any case, I'd still downvote this answer for not stating that it was talking about a very obscure possibility that doesn't apply to any mainstream ISA. Cache coherency is necessary for roll-your-own atomics with `volatile` to work at all, so every multi-core system that can run the Linux kernel is cache-coherent, across all ISAs. I don't think there are any Linux ISAs or config options to make the `WRITE_ONCE` macro use inline asm for explicit coherence asm instructions, instead of just casting to volatile. I've also looked at GCC `atomic` asm for ARM, MIPS, POWER, and some other ISAs. – Peter Cordes Aug 29 '20 at 10:21
Or to put it another way, non-cache-coherent shared memory is used as a cluster interconnect *between* systems for efficient message-passing, but each coherency domain is normally treated as a separate system, with its own kernel. You don't boot a single SMP kernel or run C++ `std::thread`s across cores without coherent caches, because that would make *every* atomic operation ridiculously expensive. – Peter Cordes Aug 29 '20 at 10:37

vrqq · Answer 6 · 2020-08-29T05:35:39.343

I try to explain it in other word.

Imaginate that each thread running in the different CPU Core simultaneously, thread1 running in Core A, and thread2 running in Core B.

The core B cannot know the REAL running order in core A. The meaning of memory order is just, the running result to show to core B, from core A.

std::atomic<int> x, y;
int r1, r2, var1, var2;
void thread1() { //Core A
  var1 = 99;                                  //(0)
  y.exchange(1, std::memory_order_acq_rel);   //(1)
  r1 = x.load(std::memory_order_acquire);     //(2)
}
void thread2() { //Core B
  var2 = 999;                                 //(2.5)
  x.exchange(1, std::memory_order_acq_rel);   //(3)
  r2 = y.load(std::memory_order_acquire);     //(4)
}

For example, the (4) is just a REQUEST for (1). (which have code like 'variable y with memory_order_release') And (4) in core B apply A for a specific order: (0)->(1)->(4).

For different REQUEST, they may see different sequence in other thread. ( If now we have core C and some atomic variable interactive with core A, the core C may seen different result with core B.)

OK now there's a detail explaination step by step: (for code above)

We start in core B : (2.5)

(2.5)var2 = 999;
(3)acq: find variable 'x' with 'memory_order_release', nothing. Now the order in Core A we can guess [(0),(1),(2)] or [(0),(2),(1)] are all legal, so there's no limit to us (B) to reorder (3) and (4).
(3)rel: find var 'x' with 'memory_order_acquire', found (2), so make a ordered show list to core A : [var2=999, x.exchange(1)]
(4) find var y with 'memory_order_release', ok found it at (1). So now we stand at core B, we can see the source code which Core displayed to me: 'There's must have var1=99 before y.exchange(1)'.
The idea is: we can see the source code which have var1=99 before y.exchange(1), because we make a REQUEST to other cores, and core A response that result to me. (the REQUEST is y.load(std::acquire)) If there have some other core who also want to observe the source code of A, they can not find that conclusion.
We can never know the real running order for (0) (1) (2).
- The order for A itself can ensure the right result (seems like singal thread)
- The request from B also don't have any affect on the real running order for A.
This is also applied for B (2.5) (3) (4)

That is, the operation for specific core really do, but didn't tell other cores, so the 'local cache in other cores' might be wrong.

So there have a chance for (0, 0) with the code in question.

What exact rules in the C++ memory model prevent reordering before acquire operations?

6 Answers6

Linked