C++ standard: can relaxed atomic stores be lifted above a mutex lock?

Question

Is there any wording in the standard that guarantees that relaxed stores to atomics won't be lifted above the locking of a mutex? If not, is there any wording that explicitly says that it's kosher for the compiler or CPU to do so?

For example, take the following program (which could potentially use acq/rel for foo_has_been_set and avoid the lock, and/or make foo itself atomic. It's written this way to illustrate this question.)

std::mutex mu;
int foo = 0;  // Guarded by mu
std::atomic<bool> foo_has_been_set{false};

void SetFoo() {
  mu.lock();
  foo = 1;
  foo_has_been_set.store(true, std::memory_order_relaxed);
  mu.unlock();
}

void CheckFoo() {
  if (foo_has_been_set.load(std::memory_order_relaxed)) {
    mu.lock();
    assert(foo == 1);
    mu.unlock();
  }
}

Is it possible for CheckFoo to crash in the above program if another thread is calling SetFoo concurrently, or is there some guarantee that the store to foo_has_been_set can't be lifted above the call to mu.lock by the compiler and CPU?

This is related to an older question, but it's not 100% clear to me that the answer there applies to this. In particular, the counter-example in that question's answer may apply to two concurrent calls to SetFoo, but I'm interested in the case where the compiler knows that there is one call to SetFoo and one call to CheckFoo. Is that guaranteed to be safe?

I'm looking for specific citations in the standard.

jacobsa · Accepted Answer · 2017-08-06T10:41:10.547

9

I think I've figured out the particular partial order edges that guarantee the program can't crash. In the answer below I'm referencing version N4659 of the draft standard.

The code involved for the writer thread A and reader thread B is:

A1: mu.lock()
A2: foo = 1
A3: foo_has_been_set.store(relaxed)
A4: mu.unlock()

B1: foo_has_been_set.load(relaxed) <-- (stop if false)
B2: mu.lock()
B3: assert(foo == 1)
B4: mu.unlock()

We seek a proof that if B3 executes, then A2 happens before B3, as defined in [intro.races]/10. By [intro.races]/10.2, it's sufficient to prove that A2 inter-thread happens before B3.

Because lock and unlock operations on a given mutex happen in a single total order ([thread.mutex.requirements.mutex]/5), we must have either A1 or B2 coming first. The two cases:

Assume that A1 happens before B2. Then by [thread.mutex.class]/1 and [thread.mutex.requirements.mutex]/25, we know that A4 will synchronize with B2. Therefore by [intro.races]/9.1, A4 inter-thread happens before B2. Since B2 is sequenced before B3, by [intro.races]/9.3.1 we know that A4 inter-thread happens before B3. Since A2 is sequenced before A4, by [intro.races]/9.3.2, A2 inter-thread happens before B3.
Assume that B2 happens before A1. Then by the same logic as above, we know that B4 synchronizes with A1. So since A1 is sequenced before A3, by [intro.races]/9.3.1, B4 inter-thread happens before A3. Therefore since B1 is sequenced before B4, by [intro.races]/9.3.2, B1 inter-thread happens before A3. Therefore by [intro.races]/10.2, B1 happens before A3. But then according to [intro.races]/16, B1 must take its value from the pre-A3 state. Therefore the load will return false, and B2 will never run in the first place. In other words, this case can't happen.

So if B3 executes at all (case 1), A2 happens before B3 and the assert will pass. ∎

edited Aug 06 '17 at 10:41

answered Aug 04 '17 at 04:54

jacobsa

5,719
1
28
60

As indicated in my answer, I believe B1 can happen before A3 is visible because of relaxed memory order and no barrier has taken place. However if B1 happens after A3 then B2 must happen after A4 and by then (and only then) B3 must be happening after A2.So IF the `assert()` is evaluated it will succeed. – Persixty Aug 04 '17 at 09:28
1

Totally agreed, and in my proof I make the assumption that A3 sees a true value so the assert is evaluated (grep "moot"); the other case is uninteresting. – jacobsa Aug 05 '17 at 08:06
_"Since B2 is sequenced before A3"_ – Shouldn't it be "B2 is sequenced before B3"? – Anton Aug 05 '17 at 22:46
Indeed, fixed. Thanks. – jacobsa Aug 05 '17 at 23:13
1

As for B1 not happening before A3, I believe the relevant clause is [\[intro.races\]/16](https://timsong-cpp.github.io/cppwp/n4659/intro.races#16). If the evaluation B1 of `foo_has_been_set` _happened before_ the modification A3, the computed value would have to be taken from a different modification that precedes A3. But there is no other modification that sets `foo_has_been_set` to `true`. – Anton Aug 06 '17 at 10:18
Of course! Thanks so much; that's exactly what I was looking for. – jacobsa Aug 06 '17 at 10:23
Yes, this is correct. The reverse happens-before case (if the assert would fail then B1 must happen-before A3, so the assert is not run) is the important aspect. – Anthony Williams Aug 08 '17 at 22:19

LWimsey · Answer 2 · 2017-08-03T07:20:23.450

3

No memory operation inside a mutex protected region can 'escape' from that area. That applies to all memory operations, atomic and non-atomic.

In section 1.10.1:

a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex Correspondingly, a call that releases the same mutex will perform a release operation on those same locations

Furthermore, in section 1.10.1.6:

All operations on a given mutex occur in a single total order. Each mutex acquisition “reads the value written” by the last mutex release.

And in 30.4.3.1

A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents

This means, acquiring (locking) a mutex sets a one-way barrier that prevents operations that are sequenced after the acquire (inside the protected area) from moving up across the mutex lock.

Releasing (unlocking) a mutex sets a one-way barrier that prevents operations that are sequenced before the release (inside the protected area) from moving down across the mutex unlock.

In addition, memory operations that are released by a mutex are synchronized (visible) with another thread that acquires the same mutex.

In your example, foo_has_been_set is checked in CheckFoo.. If it reads true you know that the value 1 has been assigned to foo by SetFoo, but it is not synchronized yet. The mutex lock that follows will acquire foo, synchronization is complete and the assert cannot fire.

edited Aug 03 '17 at 07:20

answered Aug 03 '17 at 07:15

LWimsey

6,189
2
25
53

Thanks. What I'm looking for is the part of the standard that guarantees that your sentence "acquiring (locking) a mutex sets a one-way barrier that prevents operations that are sequenced after the acquire (inside the protected area) from moving up across the mutex lock" is true. Do you have a citation for that specific part? – jacobsa Aug 03 '17 at 21:33
@jacobsa The standard mentions 'acquire opereration` in a few different contexts; mutex acquire, acquire operation on an atomic variable and with standalone fences. They all behave equivalently in terms of memory ordering, but I cannot find the exact wording in the standard that defines what an 'acquire operation' is. However, what the standard does say is that a mutex release synchronizes with a mutex acquire (30.4.3.2-25) and it puts it in a 'happens-before' context (1.10.1-9.1). That implies memory ordering as described or it would constitute a data race. – LWimsey Aug 03 '17 at 23:05
I totally agree that `mutex::lock` is an acquire operation, and I agree that the unlock in `SetFoo` synchronizes with the lock in `CheckFoo`, assuming that the total order from the mutex puts the former before the latter. But if the compiler were free to lift the write to `foo_has_been_set` to above the lock in `SetFoo`, then they would no longer necessarily synchronize because they could happen in the opposite order. So my question stands: what guarantees the compiler can't lift the atomic write to above the mutex lock? – jacobsa Aug 04 '17 at 00:06
3

I believe the answer lies in your statement: "_if the compiler were free to lift the write to foo_has_been_set to above the lock in SetFoo, then they would no longer necessarily synchronize because they could happen in the opposite order_" - That is exactly why the compiler is not allowed to do that because it would violate the synchronization requirement that applies to a mutex unlock/lock sequence – LWimsey Aug 04 '17 at 01:59
No, if it lifted it above the lock that would simply allow the critical sections to happen in the opposite order. They would continue to synchronize with each other (in the technical sense of [intro.races]), but possibly in the opposite order. – jacobsa Aug 04 '17 at 02:43
3

Yes, I understand, but my point is that that cannot happen based on the mutex ordering rules. If you are concerned about the relaxed store being able to be lifted above the lock, why aren't you concerned about `foo = 1` following that same pattern ? A relaxed store is not a magical thing that can be placed anywhere because the standard says it is 'relaxed'. It is just an atomic operation with no ordering constraints imposed by itself, just like a non-atomic store has no ordering constraints. An atomic operation being relaxed does not mean it can ignore ordering rules imposed by a mutex. – LWimsey Aug 04 '17 at 03:18
The difference with hoisting `foo = 1` is that it would cause a data race, whereas hoisting the atomic store wouldn't. – jacobsa Aug 04 '17 at 04:41
It won't happen in both cases.. period – LWimsey Aug 04 '17 at 04:45
Agreed, but I'm looking for a proof based on the relations defined in [intro.races]. – jacobsa Aug 04 '17 at 04:47

score 1 · Answer 3 · answered Dec 15 '19 at 15:45

The standard does not directly guarantee that, but you can read it between the lines of [thread.mutex.requirements.mutex].:

For purposes of determining the existence of a data race, these behave as atomic operations ([intro.multithread]).
The lock and unlock operations on a single mutex shall appear to occur in a single total order.

Now the second sentence looks like a hard guarantee, but it really isn't. Single total order is very nice, but it only means that there is a well-defined single total order of of acquiring and releasing one particular mutex. Alone by itself, that doesn't mean that the effects of any atomic operations, or related non-atomic operations should or must be globally visible at some particular point related to the mutex. Or, whatever. The only thing that is guaranteed is about the order of code execution (specifically, the execution of a single pair of functions, lock and unlock), nothing is being said about what may or may not happen with data, or otherwise.
One can, however, read between the lines that this is nevertheless the very intention from the "behave as atomic operations" part.

From other places, it is also pretty clear that this is the exact idea and that an implementation is expected to work that way, without explicitly saying that it must. For example, [intro.races] reads:

[ Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations.

Note the unlucky little, harmless word "Note:". Notes are not normative. So, while it's clear that this is how it's intended to be understood (mutex lock = acquire; unlock = release), this is not actually a guarantee.

I think the best, although non-straightforward guarantee comes from this sentence in [thread.mutex.requirements.general]:

A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents.

So that's what a mutex does (without saying how exactly). It protects against data races. Fullstop.

Thus, no matter what subtleties one comes up with and no matter what else is written or isn't explicitly said, using a mutex protects against data races (... of any kind, since no specific type is given). That's what is written. So, in conclusion, as long as you use a mutex, you are good to go even with relaxed ordering or no atomic ops at all. Loads and stores (of any kind) cannot be moved around because then you couldn't be sure no data races occur. Which, however, is exactly what a mutex protects against.
Thus, without saying so, this says that a mutex must be a full barrier.

The OP's self answer points out that [`mutex.unlock()` *synchronizes-with* subsequent lock operations that obtain ownership on the same object.](https://timsong-cpp.github.io/cppwp/n4659/thread#mutex.requirements.mutex-25). That's the normative language that the note about acq / rel is describing, I think. Operations after the next lock can't happen too soon (acquire) and operations before this unlock can't happen later (release). — Peter Cordes, Dec 15 '19 at 16:17
@PeterCordes: Does that, however, provide _any_ guarantee about data integrity or visibility? I only understand that the execution of `lock` and `unlock` (the very function calls!) has a well-defined total order, if on the same mutex object. So, I think in the strictest, most pedantic way, this doesn't guarantee anything data-wise (deliberately disregarding the rather obvious intent, which is obviously that this guarantee _is provided_). — Damon, Dec 15 '19 at 16:26
Hmm, I forgot the question details while writing my last comment. It guarantees that it would be fine to read the relaxed atomic *after* taking the lock: the non-atomic and relaxed would either have both happened or both not happened. I don't see any plausible mechanism for it to create synchronization between the lock/unlock pair without the `unlock` acting like a release operation, but yes there may be a lack of normative language to that effect. — Peter Cordes, Dec 15 '19 at 16:56
Note that a mutex unlock doesn't have to be a *full* barrier, just a release barrier. (e.g. it doesn't necessarily have to drain the store buffer on a real CPU, so later operations after an unlock can effectively become part of the critical section. Implementation that use OS-assisted sleep/wake as a fallback instead of just spinning do tend to use an atomic RMW as part of the unlock as well, though. Unlike a simple spinlock where in asm unlock really can be just a release store, with only acquire requiring an atomic RMW.) — Peter Cordes, Dec 15 '19 at 16:58

score 0 · Answer 4 · answered Aug 03 '17 at 06:51

The answer seem to lie in http://eel.is/c++draft/intro.multithread#intro.races-3

The two pertinent parts are

[...] In addition, there are relaxed atomic operations, which are not synchronization operations [...]

and

[...] performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. [...]

While relaxed orders atomics are not considered synchronization operations, that's all the standard has to say about them in this context. Since they are still memory locations, the general rule of them being governed by other synchronization operations still applies.

So in conclusion, the standard does not seem to have anything specifically in there to prevent the reordering you described, but the wording as it stands would prevent it naturally.

Edit: Woops, I linked to the draft. The C++11 paragraph covering this is 1.10-5, using the same language.

I agree that wording guarantees that the write can't be sunk below the call to `mutex::unlock`, which will involve a release operation. But my question was about whether the write can be lifted above the call to `mutex::lock`, which is not covered by that wording. — jacobsa, Aug 03 '17 at 21:34

score 0 · Answer 5 · edited Dec 15 '19 at 14:45

CheckFoo() cannot cause the program to crash (i.e. trigger the assert()) but there is also no guarantee the assert() will ever be executed.

If the condition at the start of CheckFoo() triggers (see below) the visible value of foo will be 1 because of the memory barriers and synchronization between mu.unlock() in SetFoo() and mu.lock() in CheckFoo().

I believe that is covered by the description of mutex cited in other answers.

However there is no guarantee that the if condition (foo_has_been_set.load(std::memory_order_relaxed))) will ever be true. Relaxed memory order makes no guarantees and only the atomicity of the operation is assured. Consequently in the absence of some other barrier there's no guarantee when the relaxed store in SetFoo() will be visible in CheckFoo() but if it is visible it will only be because the store was executed and then following the mu.lock() must be ordered after mu.unlock() and the writes before it visible.

Please note this argument relies on the fact that foo_has_been_set is only ever set from false to true. If there were another function called UnsetFoo() that set it back to false:

void UnsetFoo() {
  mu.lock();
  foo = 0;
  foo_has_been_set.store(false, std::memory_order_relaxed);
  mu.unlock();
}

That was called from the other (or yet a third) thread then there's no guarantee that checking foo_has_been_set without synchronization will guarantee that foo is set.

To be clear (and assuming foo_has_been_set is never unset):

void CheckFoo() {
  if (foo_has_been_set.load(std::memory_order_relaxed)) {
    assert(foo == 1); //<- All bets are off.  data-race UB
    mu.lock();
    assert(foo == 1); //Guaranteed to succeed.
    mu.unlock();
  }
}

In practice on any real platform on any long running application it is probably inevitable that the relax store will eventually become visible to the other thread. But there is no formal guarantee regarding if or when that will happen unless other barriers exist to assure it.

Formal References:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf

Refer to the notes at the end of p.13 and start of p.14 particularly notes 17 - 20. They are essentially assuring coherence of 'relaxed' operations. Their visibility is relaxed but the visibility that occurs will be coherent and use of the phrase 'happens before' is within the overall principle of program ordering and particularly acquire and release barriers of mutexes. Note 19 is particularly relevant:

The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations.

The 'relaxed' store not becoming visible is not realistic on a real platform (which you mention as well).. Indeed, it is not guaranteed by the standard (which says 'it should become visible'), but that guarantee does not exist for any memory ordering model, including seq/cst. The relaxed load is unordered with respect to the mutex and therefore the value of `foo_has_been_set` may be missed, but that is the logical equivalent of the `CheckFoo` thread running a few clock cycles earlier than `SetFoo` in which case it would miss it too. — LWimsey, Aug 04 '17 at 18:32

score 0 · Answer 6 · edited Dec 15 '19 at 07:34

Reordering within the critical section is of course possible:

void SetFoo() {
  mu.lock();
  // REORDERED:
  foo_has_been_set.store(true, std::memory_order_relaxed);
  PAUSE(); //imagine scheduler pause here 
  foo = 1;
  mu.unlock();
}

Now, the question is CheckFoo - can the read of foo_has_been_set fall into the lock? Normally a read like that can (things can fall into locks, just not out), but the lock should never be taken if the if is false, so it would be a strange ordering. Does anything say "speculative locks" are not allowed? Or can the CPU speculate that the if is true before reading foo_has_been_set?

void CheckFoo() {
    // REORDER???
    mu.lock();
    if (foo_has_been_set.load(std::memory_order_relaxed)) {
        assert(foo == 1);
    }
    mu.unlock();
}

That ordering is probably not OK, but only because of "logic order" not memory order. If the mu.lock() was inlined (and became some atomic ops) what stops them from being reordered?

I'm not too worried about your current code, but I worry about any real code that uses something like this. It is too close to wrong.

ie if the OP code was the real code, you would just change foo to atomic, and get rid of the rest. So the real code must be different. More complicated? ...

CPUs can't make speculative stores visible to other threads. That includes speculatively taking a lock. (Once mis-speculation has "infected" other cores, they'd all have to roll back on detection of mis-speculation). ISO C++ even forbids it indirectly, by saying out-of-thin-air values for relaxed atomics *should* be impossible. [What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can?](//stackoverflow.com/a/58049046) — Peter Cordes, Dec 15 '19 at 07:41

C++ standard: can relaxed atomic stores be lifted above a mutex lock?

6 Answers6

Linked