Cost of thread-safe local static variable initialization in C++11?

Question

We know that local static variable initialization is thread-safe in C++11, and modern compilers fully support this. (Is local static variable initialization thread-safe in C++11?)

What is the cost of making it thread-safe? I understand that this could very well be compiler implementation dependent.

Context: I have a multi-threaded application (10 threads) accessing a singleton object pool instance via the following function at very high rates, and I'm concerned about its performance implications.

template <class T>
ObjectPool<T>* ObjectPool<T>::GetInst()
{
    static ObjectPool<T> instance;
    return &instance;
}

Just a warning: When your application exits, constructors for static variables will be called. Including for singleton objects that are still in use by another thread. — gnasher729, Jul 15 '16 at 08:12
How about measuring it with another known technique called "Double checked locking" with atomics ? Then you would have some benchmark and make an educated guess on the cost. — Arunmu, Jul 15 '16 at 08:13
@Arunmu That's a good idea, I'll try it and see. I was hoping if someone could shed some light on how compilers actually implement it. — Sunanda, Jul 15 '16 at 08:20
Pass it to argument of the thread, so you don't have any overhead from this part. — Jarod42, Jul 15 '16 at 08:21
@gnasher729 Thanks (you mean destructors?). I've made sure that all threads exit before exiting the application. — Sunanda, Jul 15 '16 at 08:21
@Sunanda How it's implemented is basically up to the compiler and platform. But, from what I know, most of them use some form of DCLP to achieve the thread safety. So, it's kind of safe to assume that it would be more efficient than whatever DCLP version you can come up with. — Arunmu, Jul 15 '16 at 08:30
@Arunmu Thanks. Since there's some synchronization overhead, I might have to manually ensure to initialize the object first and then access it without any synchronization. — Sunanda, Jul 15 '16 at 09:35
@Arunmu I haven't used it before, I'll have a look at that too — Sunanda, Jul 15 '16 at 09:45
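
Jarod42's suggestion from the comments could look roughly like the sketch below: look the pool up once, then pass the pointer to every thread so the hot loop never calls GetInst(). The ObjectPool shown here is only a hypothetical stand-in for the class in the question, and Capacity, Worker and RunWorkers are made-up names for illustration.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the ObjectPool<T> from the question.
template <class T>
class ObjectPool {
public:
    static ObjectPool* GetInst() {
        static ObjectPool instance;  // guarded, thread-safe initialization
        return &instance;
    }
    std::size_t Capacity() const { return 64; }  // placeholder member
};

// Each worker receives the pool pointer as an argument, so its loop
// never goes through GetInst() and never touches the guard variable.
void Worker(ObjectPool<int>* pool, int iterations, std::size_t* out) {
    std::size_t sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += pool->Capacity();
    }
    *out = sum;
}

// Looks the pool up once, then hands the same pointer to every thread.
std::vector<std::size_t> RunWorkers(int num_threads, int iterations) {
    ObjectPool<int>* pool = ObjectPool<int>::GetInst();  // single guarded call
    std::vector<std::size_t> results(static_cast<std::size_t>(num_threads));
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(Worker, pool, iterations, &results[t]);
    }
    for (auto& th : threads) {
        th.join();
    }
    return results;
}
```

Calling RunWorkers(10, n) from main mirrors the ten threads in the question. Whether this is measurably faster than calling GetInst() in the loop still has to be benchmarked.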

score 4 · Accepted Answer · answered Jul 15 '16 at 08:34

A look at the generated assembler code helps.

Source

#include <vector>

std::vector<int> &get(){
  static std::vector<int> v;
  return v;
}
int main(){
  return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movzbl  guard variable for get()::v(%rip), %eax
        testb   %al, %al
        je      .L15
        movl    get()::v, %eax
        ret
.L15:
        subq    $8, %rsp
        movl    guard variable for get()::v, %edi
        call    __cxa_guard_acquire
        testl   %eax, %eax
        je      .L6
        movl    guard variable for get()::v, %edi
        movq    $0, get()::v(%rip)
        movq    $0, get()::v+8(%rip)
        movq    $0, get()::v+16(%rip)
        call    __cxa_guard_release
        movl    $__dso_handle, %edx
        movl    get()::v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        call    __cxa_atexit
.L6:
        movl    get()::v, %eax
        addq    $8, %rsp
        ret
main:
        subq    $8, %rsp
        call    get()
        movq    8(%rax), %rdx
        subq    (%rax), %rdx
        addq    $8, %rsp
        movq    %rdx, %rax
        sarq    $2, %rax
        ret

Compared to

Source

#include <vector>

static std::vector<int> v;
std::vector<int> &get(){
  return v;
}
int main(){
  return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movl    v, %eax
        ret
main:
        movq    v+8(%rip), %rax
        subq    v(%rip), %rax
        sarq    $2, %rax
        ret
        movl    $__dso_handle, %edx
        movl    v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        movq    $0, v(%rip)
        movq    $0, v+8(%rip)
        movq    $0, v+16(%rip)
        jmp     __cxa_atexit

I'm not that great with assembler, but I can see that in the first version every call to get checks a guard variable, and the first call initializes v under a lock taken via __cxa_guard_acquire and __cxa_guard_release. get is also not inlined, whereas in the second version get is essentially gone.
You can play around with various compilers and optimization flags, but it seems no compiler is able to inline get or optimize out the guard, even though the program is obviously single-threaded.
Declaring get as static makes gcc inline it, but the guard is preserved.

To know how much the guard and the additional instructions cost for your compiler, flags, platform and surrounding code, you would need to write a proper benchmark.
I would expect the guard to have some overhead and to be noticeably slower than the inlined code. I would also expect that difference to become insignificant once you actually do work with the vector, but you can never be sure without measuring.
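
As a starting point for such a measurement, here is a minimal single-threaded sketch that times both variants with std::chrono. The names Timing, TimeCalls and CompareGetters are invented for this example, and the numbers it produces depend heavily on compiler, flags and hardware. It only covers the steady state after initialization and does not measure contention between threads.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Function-local ("magic") static: the guard is checked on every call.
std::vector<int>& GetLocalStatic() {
    static std::vector<int> v;
    return v;
}

// Namespace-scope static: no guard at the call site.
static std::vector<int> g_v;
std::vector<int>& GetGlobalStatic() {
    return g_v;
}

struct Timing {
    long long nanoseconds;  // wall-clock time for the whole loop
    std::size_t checksum;   // sum of the values returned by the callable
};

// Calls `f` `iterations` times and reports elapsed time plus a checksum.
template <class F>
Timing TimeCalls(F f, int iterations) {
    volatile std::size_t sink = 0;  // volatile so the loop is not optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = sink + f();
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return Timing{static_cast<long long>(ns), sink};
}

struct Comparison {
    Timing local_static;
    Timing global_static;
};

Comparison CompareGetters(int iterations) {
    // Touch both vectors first so initialization is not part of the timing.
    GetLocalStatic().assign(3u, 0);
    GetGlobalStatic().assign(3u, 0);
    Comparison c;
    c.local_static = TimeCalls([] { return GetLocalStatic().size(); }, iterations);
    c.global_static = TimeCalls([] { return GetGlobalStatic().size(); }, iterations);
    return c;
}
```

Call CompareGetters with a large iteration count from main and print the two nanoseconds fields. For anything beyond a rough impression, a dedicated benchmarking library and repeated runs are the safer choice.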

The meaning of `static` is different here than a static variable in a function! Here it's just a local scoped variable and is something that you shouldn't use anymore in C++. So this doesn't answer the question at all! — jaques-sam, Nov 15 '21 at 11:32
@DrumM We're not comparing `static` meanings here, we're comparing magic static vs not magic static. Making `v` global is a reasonable way to disable the magic static while leaving the rest more or less the same. An alternative would be to just remove `static` from the original code, but I feel like that changes the meaning even more. Do you have a better idea for disabling the magic static without changing anything else? — nwp, Nov 16 '21 at 11:16
But you’re comparing s with s here… As said the meaning of a global static is different from a static variable in a function, so its magic is also different. The question was with `static`, there is no other way no. — jaques-sam, Nov 18 '21 at 10:19
"As said the meaning of a global static is different from a static variable in a function, so its magic is also different." Yes. That's the whole point. Like I said before, we want to compare a magic static to a not magic static. The magic being different is required to make this comparison. The question was not "What does a magic static do?", the question was "What is the cost of a magic static?" and answering by comparing a magic static with a non-magic static seems completely reasonable to me. If the global static had the same magic properties there would be no point in comparing them. — nwp, Nov 18 '21 at 14:01

score 2 · Answer 2 · answered Jul 15 '16 at 08:40

From my experience, this is exactly as costly as a regular mutex (critical section). If the code is called very frequently, consider using a normal global variable instead.
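
A minimal sketch of that idea, using a private static data member instead of a free global variable. This ObjectPool is a hypothetical reduction of the class in the question, and Capacity and capacity_ are placeholder names. It assumes that T's and ObjectPool's constructors do not depend on other globals and that nothing calls GetInst() before main() starts, because the order of static initialization across translation units is not defined.

```cpp
#include <cstddef>

// Hypothetical minimal ObjectPool; the real class from the question is not shown.
template <class T>
class ObjectPool {
public:
    // No function-local static here, so there is no guard check on this path.
    static ObjectPool* GetInst() { return &instance_; }

    std::size_t Capacity() const { return capacity_; }  // placeholder member

private:
    ObjectPool() : capacity_(64) {}
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    std::size_t capacity_;
    static ObjectPool instance_;  // initialized during static initialization
};

// Definition of the static data member; for a class template this can
// live in the header next to the class.
template <class T>
ObjectPool<T> ObjectPool<T>::instance_;
```

With this layout GetInst() compiles down to returning an address, at the price of losing lazy initialization: the pool is constructed even if it is never used.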

Sven Nilsson

The code looks like `GetInst` is part of a class `ObjectPool`, which means the suggested global variable could instead be a `private` `static` variable in the class. As long as `T` doesn't access other global variables in its constructor this should be fine. – nwp Jul 15 '16 at 08:45
Thanks, seems like this is the way to avoid any synchronization overhead – Sunanda Jul 15 '16 at 09:31

Validus Oculus · Answer 3 · 2020-08-21T19:50:05.857

Jason Turner explains this extensively here: https://www.youtube.com/watch?v=B3WWsKFePiM

I put together some sample code to illustrate the video. Since thread-safety is the main issue, I call the method from multiple threads to see its effect.

You can think of the compiler as implementing double-checked locking for you, although it is free to do whatever it wants to ensure thread-safety. At a minimum it has to add a branch to detect first-time initialization, unless the optimizer performs the initialization eagerly at global scope.

https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C++11
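
For comparison, a hand-written version of that pattern might look like the sketch below. It is not what gcc emits: the compiler uses a guard variable with __cxa_guard_acquire and __cxa_guard_release, constructs the object in place and registers its destructor, whereas this sketch heap-allocates the string and never frees it. The names g_name, g_name_mutex and Name are made up for the example.

```cpp
#include <atomic>
#include <mutex>
#include <string>

namespace {
// Plays the role of the guard variable: null means "not initialized yet".
std::atomic<const std::string*> g_name{nullptr};
// Taken only on the slow path, while the object is being created.
std::mutex g_name_mutex;
}  // namespace

const std::string& Name() {
    // Fast path: one acquire load and a branch, comparable to the guard check.
    const std::string* p = g_name.load(std::memory_order_acquire);
    if (p == nullptr) {
        // Slow path: serialize all threads that saw "not initialized".
        std::lock_guard<std::mutex> lock(g_name_mutex);
        // Second check under the lock; another thread may have won the race.
        p = g_name.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new std::string("name");  // intentionally leaked in this sketch
            g_name.store(p, std::memory_order_release);
        }
    }
    return *p;
}
```

After the first call every caller takes only the fast path, which is why the steady-state cost is an atomic load plus a branch rather than a mutex acquisition.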

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

struct Temp
{
  // Every time this method is called, the compiler has to check whether `name`
  // has been constructed, because of the init-at-first-use idiom. This involves
  // at least an atomic load and possibly a lock acquisition.
  static const std::string& name() {
    static const std::string name = "name";
    return name;
  }

  // The following does not create contention. The profiler showed a small
  // performance improvement.
  const std::string& ref_name = name();
  const std::string& get_name_ref() const {
    return ref_name;
  }
};

int main(int, char**)
{
  Temp tmp;

  constexpr int num_worker = 8;
  std::vector<std::thread> threads;
  for (int i = 0; i < num_worker; ++i) {
    threads.emplace_back([&](){
      for (int j = 0; j < 10000000; ++j) {
        // Calling tmp.name() here instead is almost 5s slower.
        std::printf("%zu\n", tmp.get_name_ref().size());
      }
    });
  }

  for (int i = 0; i < num_worker; ++i) {
    threads[i].join();
  }

  return 0;
}

The name() version is 5s slower than get_name_ref() on my machine.

$ time ./test > /dev/null

I also used Compiler Explorer to see what gcc generates. The following dump shows the double-checked locking pattern; pay attention to the atomic load and the guard acquisition.

name ()
{
  bool retval.0;
  bool retval.1;
  bool D.25443;
  struct allocator D.25437;
  const struct string & D.29013;
  static const struct string name;

  _1 = __atomic_load_1 (&_ZGVZL4namevE4name, 2);
  retval.0 = _1 == 0;
  if (retval.0 != 0) goto <D.29003>; else goto <D.29004>;
  <D.29003>:
  _2 = __cxa_guard_acquire (&_ZGVZL4namevE4name);
  retval.1 = _2 != 0;
  if (retval.1 != 0) goto <D.29006>; else goto <D.29007>;
  <D.29006>:
  D.25443 = 0;
  try
    {
      std::allocator<char>::allocator (&D.25437);
      try
        {
          try
            {
              std::__cxx11::basic_string<char>::basic_string (&name, "name", &D.25437);
              D.25443 = 1;
              __cxa_guard_release (&_ZGVZL4namevE4name);
              __cxa_atexit (__dt_comp , &name, &__dso_handle);
            }
          finally
            {
              std::allocator<char>::~allocator (&D.25437);
            }
        }
      finally
        {
          D.25437 = {CLOBBER};
        }
    }
  catch
    {
      if (D.25443 != 0) goto <D.29008>; else goto <D.29009>;
      <D.29008>:
      goto <D.29010>;
      <D.29009>:
      __cxa_guard_abort (&_ZGVZL4namevE4name);
      <D.29010>:
    }
  goto <D.29011>;
  <D.29007>:
  <D.29011>:
  goto <D.29012>;
  <D.29004>:
  <D.29012>:
  D.29013 = &name;
  return D.29013;
}
