I am trying to find the fastest way to call through a function pointer, in order to turn a runtime argument with a finite number of possible values into a template argument. I wrote this benchmark: https://gcc.godbolt.org/z/T1qzTd
I am noticing that pointers to class member functions carry a lot of added overhead that I am having trouble understanding. What I mean is the following:
With a struct bar and a member function foo defined as follows:
template<uint64_t r>
struct bar {
    template<uint64_t n>
    uint64_t __attribute__((noinline))
    foo() {
        return r * n;
    }
    
    // ... function pointers with pointers to versions of foo below
The first option (#define DO_DIRECT in the godbolt code) calls the templated function by indexing into an array of pointers to member functions, defined as:
   /* all of this inside of struct bar */
   typedef uint64_t (bar::*foo_wrapper_direct)();
   const foo_wrapper_direct call_foo_direct[NUM_FUNCS] = {
      &bar::foo<0>,
      // a bunch more function pointers to templated foo...
   };
   // to call templated foo for non compile time input
   uint64_t __attribute__((noinline)) foo_direct(uint64_t v) {
      return (this->*call_foo_direct[v])();
   }
   
The assembly for this, however, appears to have a TON of fluff:
bar<9ul>::foo_direct(unsigned long):
        salq    $4, %rsi
        movq    264(%rsi,%rdi), %r8
        movq    256(%rsi,%rdi), %rax
        addq    %rdi, %r8
        testb   $1, %al
        je      .L96
        movq    (%r8), %rdx
        movq    -1(%rdx,%rax), %rax
.L96:
        movq    %r8, %rdi
        jmp     *%rax
I am having trouble understanding what all of these instructions are for.
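One thing I can read off the listing: `salq $4` scales the index by 16, and two separate 8-byte loads follow, so each table entry appears to be two words wide rather than one. A minimal sketch to compare the two pointer sizes, where `S`, `member_fn` and `free_fn` are hypothetical stand-ins rather than names from the benchmark:

```cpp
#include <cstddef>

// Hypothetical stand-in for bar<r>; not part of the benchmark.
struct S {
    int f() { return 1; }
};

using member_fn = int (S::*)();  // same shape as foo_wrapper_direct
using free_fn   = int (*)(S*);   // same shape as foo_wrapper_indirect

// Sizes of one table entry for each kind of pointer.
constexpr std::size_t member_fn_size = sizeof(member_fn);
constexpr std::size_t free_fn_size   = sizeof(free_fn);
```

Under the Itanium C++ ABI that GCC and Clang use on x86-64 Linux, I would expect `member_fn_size` to be 16 and `free_fn_size` to be 8, which would match the scaling by 16 above. Other ABIs may differ.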
In contrast, the #define DO_INDIRECT method, defined as:
// forward declare bar and call_foo_wrapper
template<uint64_t r>
struct bar;
template<uint64_t r, uint64_t n>
uint64_t call_foo_wrapper(bar<r> * b);
/* inside of struct bar */
typedef uint64_t (*foo_wrapper_indirect)(bar<r> *);
const foo_wrapper_indirect call_foo_indirect[NUM_FUNCS] = {
    &call_foo_wrapper<r, 0>
    // a lot more templated versions of foo ...
};
uint64_t __attribute__((noinline)) foo_indirect(uint64_t v) {
    return call_foo_indirect[v](this);
}
/* no longer inside struct bar */
template<uint64_t r, uint64_t n>
uint64_t
call_foo_wrapper(bar<r> * b) {
    return b->template foo<n>();
}
has some very simple assembly:
bar<9ul>::foo_indirect(unsigned long):
        jmp     *(%rdi,%rsi,8)
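To make the comparison easier to reproduce without the full benchmark, the two variants can be reduced to something like the following. This is a minimal sketch and not the godbolt code itself: `NUM_FUNCS` is fixed at 4 as an assumption, the `noinline` attributes are dropped, and the member pointers are spelled `&bar::template foo<n>` to be explicit.

```cpp
#include <cstddef>
#include <cstdint>

using std::uint64_t;

constexpr std::size_t NUM_FUNCS = 4;  // assumption: small table for illustration

// Forward declarations, as in the DO_INDIRECT snippet.
template<uint64_t r>
struct bar;
template<uint64_t r, uint64_t n>
uint64_t call_foo_wrapper(bar<r>* b);

template<uint64_t r>
struct bar {
    template<uint64_t n>
    uint64_t foo() { return r * n; }

    // DO_DIRECT: table of pointers to member functions.
    typedef uint64_t (bar::*foo_wrapper_direct)();
    const foo_wrapper_direct call_foo_direct[NUM_FUNCS] = {
        &bar::template foo<0>, &bar::template foo<1>,
        &bar::template foo<2>, &bar::template foo<3>,
    };
    uint64_t foo_direct(uint64_t v) {
        return (this->*call_foo_direct[v])();
    }

    // DO_INDIRECT: table of plain function pointers to free wrappers.
    typedef uint64_t (*foo_wrapper_indirect)(bar<r>*);
    const foo_wrapper_indirect call_foo_indirect[NUM_FUNCS] = {
        &call_foo_wrapper<r, 0>, &call_foo_wrapper<r, 1>,
        &call_foo_wrapper<r, 2>, &call_foo_wrapper<r, 3>,
    };
    uint64_t foo_indirect(uint64_t v) {
        return call_foo_indirect[v](this);
    }
};

// Free wrapper that forwards to the templated member function.
template<uint64_t r, uint64_t n>
uint64_t call_foo_wrapper(bar<r>* b) {
    return b->template foo<n>();
}
```

With `bar<9> b`, both `b.foo_direct(3)` and `b.foo_indirect(3)` return 27; the only difference is the kind of pointer stored in the table.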
I am trying to understand why the DO_DIRECT method, which uses pointers directly to the class member functions, has so much fluff, and how, if possible, I can change it to remove the fluff.
Note: I have the __attribute__((noinline)) just to make it easier to examine the assembly.
Thank you.
P.S. If there is a better way of converting runtime parameters to template parameters, I would appreciate a link to an example / manpage.
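To show the kind of thing I mean: one option I am aware of is generating the table with `std::index_sequence` (C++14) instead of listing every entry by hand. The sketch below assumes the free-wrapper approach; `make_table` and `dispatch_foo` are hypothetical names that do not appear in the benchmark, and the caller is assumed to guarantee `v < N`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

template <std::uint64_t r>
struct bar {
    template <std::uint64_t n>
    std::uint64_t foo() { return r * n; }
};

// Free-function wrapper, as in the DO_INDIRECT variant.
template <std::uint64_t r, std::uint64_t n>
std::uint64_t call_foo_wrapper(bar<r>* b) {
    return b->template foo<n>();
}

// Hypothetical helper: expands the indices 0..N-1 into one wrapper each.
template <std::uint64_t r, std::size_t... Is>
constexpr std::array<std::uint64_t (*)(bar<r>*), sizeof...(Is)>
make_table(std::index_sequence<Is...>) {
    return {{ &call_foo_wrapper<r, Is>... }};
}

// Hypothetical helper: maps the runtime value v to foo<v>().
// No bounds check is done; v must be less than N.
template <std::size_t N, std::uint64_t r>
std::uint64_t dispatch_foo(bar<r>& b, std::uint64_t v) {
    static constexpr auto table =
        make_table<r>(std::make_index_sequence<N>{});
    return table[v](&b);
}
```

For example, with `bar<9> b`, the call `dispatch_foo<4>(b, 3)` goes through `call_foo_wrapper<9, 3>` and returns 27. I have not benchmarked this against the hand-written tables.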