Sadly, shrd is horribly slow (3-clock latency on a range of devices).
Taking sn_a as the shrd version:
   lea    0x1(%rdi),%rax       # sn_a
   imul   %rdi
   shrd   $0x1,%rdx,%rax
# if you want the full %rdx:%rax result, a shr $1,%rdx is needed here
   retq   
and sn_b as my suggested alternative:
   lea    0x1(%rdi),%rax       # sn_b
   or     $0x1,%rdi
   shr    %rax
   imul   %rdi                 # %rdx:%rax is 128-bit result
   retq   
And the (largely) empty sn_e:
   mov    %rdi,%rax            # sn_e
   retq   
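For reference, here is my reconstruction of the three functions in C (using GCC/Clang's unsigned __int128 for the 128-bit product -- the exact original source may differ):

  #include <stdint.h>

  uint64_t sn_a(uint64_t n)   /* multiply, then halve the 128-bit product */
  {
    return (uint64_t)(((unsigned __int128)n * (n + 1)) >> 1) ;
  }

  /* exactly one of n and n+1 is even, so the halving can be done
   * before the multiply: (n | 1) * ((n + 1) >> 1) == n(n+1)/2 (mod 2^64)
   */
  uint64_t sn_b(uint64_t n)
  {
    return (n | 1) * ((n + 1) >> 1) ;
  }

  uint64_t sn_e(uint64_t n)   /* baseline: measures the loop overhead */
  {
    return n ;
  }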
I got the following clock counts, per iteration of the timing loop (see below):
        Ryzen 7   i7 (Coffee-Lake)
  sn_a:  11.00     11.00
  sn_b:   8.05      8.27      -- yay :-)
  sn_e:   5.00      5.00
I believe that:
             Ryzen 7             i7 (Coffee-Lake)
         latency throughput   latency throughput
  shrd      3       1/3          3       1/3
  imul      3       1/2          3       1/1   -- 128-bit result
  imul      2       1/2          3       1/1   --  64-bit result
where throughput is in instructions per clock.  I believe the 128-bit imul delivers the least significant 64 bits one clock earlier, or (equivalently) the most significant 64 bits one clock later.
I think what we see in the timings is: -3 clocks from removing the shrd, +1 clock for the shr $1 and or $1 (which run in parallel), and -1 clock from not using %rdx -- 11 - 3 + 1 - 1 = 8, which matches what I measured for sn_b.
Incidentally, both sn_a and sn_b return 0 for UINT64_MAX, since n + 1 wraps to 0 !  Mind you, the result overflows uint64_t way earlier than that ! 
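The result fits in uint64_t only for n up to about 2^32.5 (a shade over 6.07e9).  A checked sketch of mine (not something I timed), which does the arithmetic in 128 bits so that n + 1 cannot wrap:

  #include <stdint.h>
  #include <stdbool.h>

  /* compute n(n+1)/2 in 128 bits -- n + 1 cannot wrap -- and report
   * overflow of the 64-bit result instead of silently truncating    */
  static bool sigma_n_checked(uint64_t n, uint64_t *result)
  {
    unsigned __int128 p = ((unsigned __int128)n * n + n) >> 1 ;
    if (p > UINT64_MAX)
      return false ;            /* n(n+1)/2 does not fit in uint64_t */
    *result = (uint64_t)p ;
    return true ;
  }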
FWIW, my timing loop looks like this:
  uint64_t  n ;
  uint64_t  r ;
  uint64_t  m ;
  m = zz ;                      // static volatile uint64_t zz = 0
  r = 0 ;
  n = 0 ;
  qpmc_read_start(...) ;        // magic to read rdpmc 
  do
    {
      n += 1 ;
      r += sigma_n(n + (r & m)) ;
    }
  while (n < 1000000000) ;
  qpmc_read_stop(....) ;        // magic to read rdpmc 
The + (r & m) sets up a dependency, so that the input to the function being timed depends on the result of the previous call -- zz is volatile, so the compiler cannot assume m is 0, but at run-time r & m always is, so the argument passed is in fact just n.  The r += collects a result which is later printed, which helps persuade the compiler not to optimize away the loop.
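For anyone without rdpmc to hand, here is a rough self-contained approximation of the harness (my sketch, with clock_gettime standing in for the performance counters -- expect more noise):

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static volatile uint64_t zz = 0 ;

  static double time_sigma(uint64_t (*sigma_n)(uint64_t))
  {
    struct timespec t0, t1 ;
    uint64_t n = 0 ;
    uint64_t r = 0 ;
    uint64_t m = zz ;
    clock_gettime(CLOCK_MONOTONIC, &t0) ;
    do
      {
        n += 1 ;
        r += sigma_n(n + (r & m)) ;    /* (r & m) carries the dependency */
      }
    while (n < 1000000000) ;
    clock_gettime(CLOCK_MONOTONIC, &t1) ;
    printf("r=%llu\n", (unsigned long long)r) ;  /* keep the loop alive */
    return (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9 ;
  }

Divide the returned seconds by the 1e9 iterations and multiply by the core clock to get a rough clocks-per-iteration figure comparable with the rdpmc numbers above.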
The rdpmc-timed loop compiles to:
<sigma_timing_run+64>:          // 64 byte aligned
   mov    %r12,%rdi
   inc    %rbx
   and    %r13,%rdi
   add    %rbx,%rdi
   callq  *%rbp
   add    %rax,%r12
   cmp    $0x3b9aca00,%rbx
   jne    <sigma_timing_run+64>
Replacing the + (r & m) by + (n & m) removes the dependency, and the loop becomes:
<sigma_timing_run+64>:          // 64 byte aligned
   inc    %rbx
   mov    %r13,%rdi
   and    %rbx,%rdi
   add    %rbx,%rdi
   callq  *%rbp
   add    %rax,%r12
   cmp    $0x3b9aca00,%rbx
   jne    <sigma_timing_run+64>
which is essentially the same code as the loop with the dependency (its and takes n in %rbx rather than r in %r12), but the timings are: 
        Ryzen 7   i7 (Coffee-Lake)
  sn_a:   5.56      5.00
  sn_b:   5.00      5.00
  sn_e:   5.00      5.00
Without the dependency, the out-of-order machinery overlaps successive calls, so everything collapses to the ~5 clocks of loop overhead.  Are these devices wonderful, or what ?