Take this simple function that increments an integer under a lock implemented by std::mutex:
#include <mutex>
std::mutex m;
void inc(int& i) {
    std::unique_lock<std::mutex> lock(m);
    i++;
}
I would expect this (after inlining) to compile in a straightforward way to a call to m.lock(), an increment of i, and then a call to m.unlock().
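To make that expectation concrete, here is a minimal sketch of the hand-written equivalent I had in mind. The names m_expected and inc_expected are hypothetical, chosen only to avoid clashing with the function above, and the sketch ignores exception safety, which the unique_lock version provides.

```cpp
#include <mutex>

// Hypothetical names, used only for this illustration.
std::mutex m_expected;

// What I expected the compiler to produce, written out by hand:
// lock, increment, unlock, with no other branches.
void inc_expected(int& i) {
    m_expected.lock();
    ++i;
    m_expected.unlock();
}
```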
Checking the generated assembly for recent versions of gcc and clang, however, we see an extra complication. Taking the gcc version first:
inc(int&):
  mov eax, OFFSET FLAT:__gthrw___pthread_key_create(unsigned int*, void (*)(void*))
  test rax, rax
  je .L2
  push rbx
  mov rbx, rdi
  mov edi, OFFSET FLAT:m
  call __gthrw_pthread_mutex_lock(pthread_mutex_t*)
  test eax, eax
  jne .L10
  add DWORD PTR [rbx], 1
  mov edi, OFFSET FLAT:m
  pop rbx
  jmp __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
.L2:
  add DWORD PTR [rdi], 1
  ret
.L10:
  mov edi, eax
  call std::__throw_system_error(int)
It's the first few instructions that are interesting. The generated code tests the address of __gthrw___pthread_key_create (which appears to refer to pthread_key_create, a function that creates a thread-local storage key), and if that address is zero, it branches to .L2, which performs the increment in a single instruction without any locking at all.
If it is non-zero it proceeds as expected: locking the mutex, doing the increment, and unlocking.
clang does even more: it checks the address of the function twice, once before the lock and once before the unlock:
inc(int&): # @inc(int&)
  push rbx
  mov rbx, rdi
  mov eax, __pthread_key_create
  test rax, rax
  je .LBB0_4
  mov edi, m
  call pthread_mutex_lock
  test eax, eax
  jne .LBB0_6
  inc dword ptr [rbx]
  mov eax, __pthread_key_create
  test rax, rax
  je .LBB0_5
  mov edi, m
  pop rbx
  jmp pthread_mutex_unlock # TAILCALL
.LBB0_4:
  inc dword ptr [rbx]
.LBB0_5:
  pop rbx
  ret
.LBB0_6:
  mov edi, eax
  call std::__throw_system_error(int)
What's the purpose of this check?
Perhaps it is there to support the case where the object file is ultimately linked into a binary without pthreads support, so that the code can fall back to a version without locking? I couldn't find any documentation on this behavior.
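For illustration, the guess above can be written as source code. This is only a sketch of the general pattern, assuming the tested symbol is a weak reference whose address is null when nothing in the final link defines it. It is not taken from the standard library's source. The names hypothetical_thread_support, threads_available, m_checked, and inc_checked are made up for the example, and `__attribute__((weak))` is a gcc/clang extension.

```cpp
#include <mutex>

// Hypothetical stand-in for the symbol whose address the generated code tests.
// Because it is declared weak, leaving it undefined is not a link error;
// its address simply evaluates to null.
extern "C" void hypothetical_thread_support() __attribute__((weak));

static std::mutex m_checked;

// True only if something in the final binary defines the weak symbol.
static bool threads_available() {
    return &hypothetical_thread_support != nullptr;
}

void inc_checked(int& i) {
    if (!threads_available()) {
        // Assumed single-threaded program: skip the lock entirely.
        ++i;
        return;
    }
    // Threads may exist: take the lock around the increment.
    std::lock_guard<std::mutex> lock(m_checked);
    ++i;
}
```

If the symbol is defined nowhere, inc_checked takes the unlocked path, which matches the shape of the .L2 branch in the gcc output. Whether this is actually why the compilers emit the check is exactly what I'm asking.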
 
    