I have read a bunch of spinlock implementations for the x86/amd64 architecture, including
glibc's pthread_spin_lock/pthread_spin_unlock. Roughly speaking, they use a cmpxchg
instruction to acquire the lock and just a regular MOV instruction to release
it. How come there is no need to flush the store buffer before the lock is released?
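To make the pattern concrete, here is a minimal sketch of the kind of spinlock I mean, written with C11 atomics rather than copied from glibc. The names my_spinlock_t, my_spin_init, my_spin_trylock, my_spin_lock and my_spin_unlock are made up for illustration, and I assume 1 = free and 0 = held so that it matches the "MOV mylock, 1" release mentioned below. On x86 I would expect the compare-exchange to compile to a LOCK CMPXCHG and the release store to a plain MOV, but that depends on the compiler.

```c
#include <stdatomic.h>

/* Hypothetical lock type (not glibc's): 1 = free, 0 = held. */
typedef struct { atomic_int state; } my_spinlock_t;

static void my_spin_init(my_spinlock_t *l) {
    atomic_init(&l->state, 1);
}

/* One acquire attempt; returns nonzero on success. */
static int my_spin_trylock(my_spinlock_t *l) {
    int expected = 1;
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 0,
        memory_order_acquire, memory_order_relaxed);
}

static void my_spin_lock(my_spinlock_t *l) {
    while (!my_spin_trylock(l)) {
        /* Spin on a plain load until the lock looks free, then retry. */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) != 1) {
        }
    }
}

/* Release with an ordinary store; no fence or LOCK-prefixed instruction. */
static void my_spin_unlock(my_spinlock_t *l) {
    atomic_store_explicit(&l->state, 1, memory_order_release);
}
```

The release store in my_spin_unlock is the part my question is about.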
Consider the following two threads running on different cores, where statement s100 runs immediately after s3.
thread 1:
s1: pthread_spin_lock(&mylock)
s2: x = 100
s3: pthread_spin_unlock(&mylock) // calls a function which contains "MOV mylock, 1"
thread 2:
s100: pthread_spin_lock(&mylock)
s200: assert(x == 100)
s300: pthread_spin_unlock(&mylock)
Is the assertion at s200 guaranteed to hold? Is it possible that by the time s100 acquires the lock, x has still not been flushed from the store buffer to cache?
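For anyone who wants to experiment, the scenario can be sketched as a small pthreads program. The helper run_scenario and the released flag are my own additions, not part of the pseudocode above: the flag is written and read only while holding the lock, so thread 2 can tell that its acquisition came after s3 without using any other synchronization. A passing run proves nothing about the guarantee, of course; it only gives a concrete program to reason about. I am assuming a Linux/glibc build with -pthread.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_spinlock_t mylock;
static int x = 0;
static int released = 0; /* hypothetical flag, only touched under mylock */

static void *thread1(void *arg) {
    (void)arg;
    pthread_spin_lock(&mylock);   /* s1 */
    x = 100;                      /* s2 */
    released = 1;                 /* marks that s3 is about to happen */
    pthread_spin_unlock(&mylock); /* s3 */
    return NULL;
}

static void *thread2(void *arg) {
    int *seen = arg;
    for (;;) {
        pthread_spin_lock(&mylock);       /* s100 */
        if (released) {                   /* acquired after thread 1's s3 */
            assert(x == 100);             /* s200 */
            *seen = x;
            pthread_spin_unlock(&mylock); /* s300 */
            return NULL;
        }
        pthread_spin_unlock(&mylock);     /* too early, try again */
    }
}

/* Runs both threads once and returns the value of x that thread 2 saw. */
int run_scenario(void) {
    pthread_t t1, t2;
    int seen = -1;
    x = 0;
    released = 0;
    pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
    pthread_create(&t2, NULL, thread2, &seen);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_spin_destroy(&mylock);
    return seen;
}
```

Calling run_scenario() from main in a loop exercises the s3-then-s100 ordering many times.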
I'm wondering:
- Is the call overhead (of pthread_spin_unlock()) sufficient to cover the time needed to flush the store buffer to cache?
- Does cmpxchg, or any instruction with an implicit or explicit LOCK prefix, magically flush store buffers on other cores?
If s200 is not guaranteed to hold, what is the least expensive way to fix it?
- insert an mfence instruction prior to the MOV instruction,
- replace the MOV instruction with an atomic fetch-and-and/or instruction,
- or others?
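The candidate release paths listed above can be sketched with C11 atomics as follows. All names here (lock_word, hold, is_free and the unlock_* functions) are hypothetical, and the instruction mapping in the comments is what I would expect from common x86 compilers, not something I have verified for every toolchain.

```c
#include <stdatomic.h>

/* Hypothetical lock word: 1 = free, 0 = held. */
static atomic_int lock_word = 0;

/* Baseline: release store, typically a plain MOV on x86. */
static void unlock_plain(void) {
    atomic_store_explicit(&lock_word, 1, memory_order_release);
}

/* Option 1: full fence before the store
   (typically MFENCE or a LOCK-prefixed instruction, then MOV). */
static void unlock_fenced(void) {
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&lock_word, 1, memory_order_release);
}

/* Option 2: atomic read-modify-write instead of a store
   (typically LOCK OR on x86). */
static void unlock_rmw(void) {
    atomic_fetch_or_explicit(&lock_word, 1, memory_order_seq_cst);
}

/* Helpers for trying the variants from a single thread. */
static void hold(void) {
    atomic_store_explicit(&lock_word, 0, memory_order_relaxed);
}

static int is_free(void) {
    return atomic_load_explicit(&lock_word, memory_order_relaxed) == 1;
}
```

All three variants leave the lock word at 1; they differ only in which instructions the compiler emits around the store, which is where the cost difference would come from.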
Profuse thanks in advance!