Let's say we have a processor with two cores (C0 and C1) and a cache line starting at address k that is initially owned by C0. If C1 issues a store instruction to an 8-byte slot in line k, will that store affect the throughput of the instructions that follow it on C1?
The Intel optimization manual has the following paragraph:
When an instruction writes data to a memory location [...], the processor ensures that the line containing this memory location is in its L1d cache [...]. If the cache line is not there, it is fetched from the next levels using an RFO request [...]. The RFO and storing of the data happen after instruction retirement. Therefore, the store latency usually does not affect the store instruction itself.
With reference to the following code,
// core c1 (the core issuing the store; line k is owned by c0)
foo();
line(k)->at(i)->store(kConstant, std::memory_order_release);
bar();
baz();
The quote from the Intel manual makes me assume that, in the code above, execution would proceed as if the store were essentially a no-op, and the store would not add latency between the end of foo() and the start of bar(). In contrast, for the following code,
// core c1 (the core issuing the load; line k is owned by c0)
foo();
bar(line(k)->at(i)->load(std::memory_order_acquire));
baz();
The latency between the end of foo() and the start of bar() would be impacted by the load, since bar() takes the result of the load as an argument and therefore depends on it.
This question is mostly concerned with how Intel processors (Broadwell family or newer) behave in the cases above, and in particular with how C++ code like the above gets compiled down to assembly for those processors.