I'm using C (specifically C11 with gcc) to develop low-latency software for x64 (specifically Intel CPUs only). I don't care about portability or about any other architecture.
I know that `volatile` is generally not the first choice for data synchronization. However, these three facts seem to be true:
- `volatile` forces every write to go to memory and every read to come from memory (the value may not be cached in a register, which also prevents the compiler from doing some optimizations)
- the compiler must not reorder `volatile` accesses relative to each other
- naturally aligned 4-byte (or even 8-byte) values are always written atomically on x64 (the same is true for reading)
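The third fact depends on alignment: Intel's Software Developer's Manual ("Guaranteed Atomic Operations") guarantees atomicity for aligned 2-, 4- and 8-byte loads and stores, but not for accesses split across cache lines. Below is a minimal sketch of checking that assumption at compile time and at run time. It assumes gcc on x86-64, and `is_naturally_aligned` is a hypothetical helper, not part of any library:

```c
#include <stdalign.h>   /* alignof */
#include <stdatomic.h>  /* ATOMIC_INT_LOCK_FREE */
#include <stdint.h>     /* uintptr_t */

/* Assumption: gcc on x86-64, where int is 4 bytes. */
_Static_assert(sizeof(int) == 4, "expected a 4-byte int");
/* 2 means the compiler treats int-sized atomics as always lock-free. */
_Static_assert(ATOMIC_INT_LOCK_FREE == 2, "expected lock-free atomic int");

static volatile int data_ready = 0;

/* Hypothetical helper: returns 1 if p sits on an int-sized (4-byte) boundary. */
static int is_naturally_aligned(const volatile int *p)
{
    return ((uintptr_t)p % alignof(int)) == 0;
}
```

Because `data_ready` is a plain static `int`, the compiler already places it on a 4-byte boundary. A check like this matters mainly for values inside packed structs or hand-computed buffers.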
Now I have this code:
```c
#include <stdbool.h>  // provides true (not a keyword in C11)

typedef struct {
    double some_data;
    double more_data;
    char even_more_data[123];
} Data;

static volatile Data data;
static volatile int data_ready = 0;

void thread1(void)
{
    while (true) {
        while (data_ready) ;   // wait until the consumer has copied the previous data
        const Data x = f(...); // prepare some data
        data         = x;      // write it
        data_ready   = 1;      // signal that the data is ready
    }
}

void thread2(void)
{
    while (true) {
        while (!data_ready) ;  // wait until new data is ready
        const Data x = data;   // copy data
        data_ready   = 0;      // signal that data is copied
        g(x);                  // process data
    }
}
```
`thread1` produces `Data` and `thread2` consumes it. Note that I relied on these facts:
- `data` is written before `data_ready`. So when `thread2` reads `data_ready` and sees 1, we know that `data` is also available (guaranteed by the ordering of `volatile` accesses).
- `thread2` first reads and copies `data` and only then sets `data_ready` to 0, so `thread1` can then produce new data and store it.
- `data_ready` cannot be observed in a torn state, because reading and writing a 4-byte `int` is atomic on x64.
This implementation was the fastest option I found. Both threads are pinned to cores (which are isolated) and busy-poll on `data_ready`, because it's important for me to process the data as fast as possible.
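The question doesn't say how the threads are pinned. On Linux, one common way is `sched_setaffinity`, which requires `_GNU_SOURCE`, so treat this as an assumed setup only. `pin_current_thread` and `first_allowed_cpu` are hypothetical helper names:

```c
#define _GNU_SOURCE
#include <sched.h>  /* cpu_set_t, CPU_* macros, sched_setaffinity, sched_getaffinity */

/* Hypothetical helper: pin the calling thread to one CPU (Linux-specific).
   Returns 0 on success, -1 on failure (errno set by sched_setaffinity). */
static int pin_current_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Each thread function would call `pin_current_thread` at its start, with a different core for the producer and the consumer. Isolating those cores (for example with the `isolcpus` kernel parameter) is configured outside the program.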
Atomics and mutexes were slower, so I used this implementation.
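For comparison, here is a sketch of the same handoff written with C11 `<stdatomic.h>`. It uses explicit acquire/release ordering instead of the default sequentially consistent operations. With gcc on x86-64, a release store and an acquire load normally compile to a plain `mov`. A default (`memory_order_seq_cst`) store compiles to `xchg`, or to `mov` plus `mfence`, which might explain part of the slowdown seen with atomics. That is an assumption about the measured code, not something the question states. `publish` and `consume` are hypothetical names standing in for the loop bodies of `thread1` and `thread2`:

```c
#include <stdatomic.h>

typedef struct {
    double some_data;
    double more_data;
    char even_more_data[123];
} Data;

static Data data;                  /* payload: no longer volatile */
static atomic_int data_ready = 0;  /* handoff flag */

/* Producer side (hypothetical name). */
static void publish(const Data *x)
{
    /* Acquire pairs with the consumer's release store of 0, so the
       consumer's copy of `data` is complete before we overwrite it. */
    while (atomic_load_explicit(&data_ready, memory_order_acquire))
        ;  /* busy-poll */
    data = *x;
    /* Release: the payload write becomes visible before the flag. */
    atomic_store_explicit(&data_ready, 1, memory_order_release);
}

/* Consumer side (hypothetical name). */
static Data consume(void)
{
    /* Acquire pairs with the producer's release store of 1. */
    while (!atomic_load_explicit(&data_ready, memory_order_acquire))
        ;  /* busy-poll */
    Data x = data;
    /* Release: our read of the payload completes before the flag is cleared. */
    atomic_store_explicit(&data_ready, 0, memory_order_release);
    return x;
}
```

In this variant the payload itself does not need to be `volatile`. The acquire/release pairs order the payload accesses around the flag for both the compiler and the CPU.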
So my question is: is it possible that this does not behave as I expect? I cannot find anything wrong with the logic shown, but I know that `volatile` is a tricky beast.
Thanks a lot
