Suppose we have several repetitions of the same asm statement containing RDTSC, such as:
    volatile size_t tick1;
    asm ( "rdtsc\n"           // Returns the time in EDX:EAX.
          "shl $32, %%rdx\n"  // Shift the upper bits left.
          "or %%rdx, %q0"     // 'Or' in the lower bits.
          : "=a" (tick1)
          : 
          : "rdx");
    
    this_thread::sleep_for(1s);
    volatile size_t tick2;    
    asm ( "rdtsc\n"          // Clang's optimizer assumes this asm yields
          "shl $32, %%rdx\n" // the same bits as above, so it just stores
          "or %%rdx, %q0"    // the earlier result to qword ptr [rsp + 8]:
          : "=a" (tick2)     // 
          :                  //   mov     qword ptr [rsp + 8], rbx
          : "rdx");
    printf("tick2 - tick1 diff : %zu cycles\n", tick2 - tick1);
    printf("CPU Clock Speed    : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);
Clang++'s optimizer (even at `-O1`) assumes the two asm blocks yield the same value:
tick2 - tick1 diff : 0 cycles
CPU Clock Speed    : 0.00 GHz
tick1              : bd806adf8b2
this_thread::sleep_for(1s)
tick2              : bd806adf8b2
With Clang's optimizer turned off, the second block yields advancing ticks as expected:
tick2 - tick1 diff : 2900160778 cycles
CPU Clock Speed    : 2.90 GHz
tick1              : 14ab6ab3391c
this_thread::sleep_for(1s)
tick2              : 14ac17902a26
At first, GCC's g++ "seems" not to be affected by this:
tick2 - tick1 diff : 2900226898 cycles
CPU Clock Speed    : 2.90 GHz
tick1              : 20e40010d8a8
this_thread::sleep_for(1s)
tick2              : 20e4aceecbfa
However, let's add tick3 with the exact same asm right after tick2:
    volatile size_t tick1;
    asm ( "rdtsc\n"           // Returns the time in EDX:EAX.
          "shl $32, %%rdx\n"  // Shift the upper bits left.
          "or %%rdx, %q0"     // 'Or' in the lower bits.
          : "=a" (tick1)
          : 
          : "rdx");
    
    this_thread::sleep_for(1s);
    volatile size_t tick2;    
    asm ( "rdtsc\n"          // Clang's optimizer assumes this asm yields
          "shl $32, %%rdx\n" // the same bits as above, so it just stores
          "or %%rdx, %q0"    // the earlier result to qword ptr [rsp + 8]:
          : "=a" (tick2)     // 
          :                  //   mov     qword ptr [rsp + 8], rbx
          : "rdx");
    volatile size_t tick3;
    asm ( "rdtsc\n"          
          "shl $32, %%rdx\n"   
          "or %%rdx, %q0"    
          : "=a" (tick3)
          : 
          : "rdx");
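It turns out that GCC assumes tick3's asm must produce the same value as tick2's because there are "obviously" no external side effects, so it just reuses the value of tick2. That is wrong here, but the compiler has a strong point: a non-volatile asm with no inputs looks like a computation whose result cannot change between two identical statements.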
tick2 - tick1 diff : 2900209182 cycles
CPU Clock Speed    : 2.90 GHz
tick1              : 5670bd15088e
this_thread::sleep_for(1s)
tick2              : 567169f2b6ac
tick3              : 567169f2b6ac
In C mode, the optimizers of both GCC and Clang are affected by this.
In other words, even at `-O1`, both optimize away the repeated asm blocks containing rdtsc:
tick2 - tick1 diff : 0 cycles
CPU Clock Speed    : 0.00 GHz
tick1              : 324ab8f5dd2a
thrd_sleep(&(struct timespec){.tv_sec=1}, nullptr)
tick2              : 324ab8f5dd2a
tick3_rdx          : 324b65d3368c
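It turns out that optimizers are allowed to perform common-subexpression elimination on identical non-volatile asm statements, so an asm statement for RDTSC needs to be `volatile`. Declaring only the output variable `volatile`, as in the snippets above, does not help, because the asm statement itself can still be merged.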
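As a minimal sketch of that fix, assuming x86-64 and GCC or Clang: the helper names `read_tsc` and `ticks_during` are hypothetical and not from the snippets above. The only essential change is the `volatile` qualifier on the asm statement; the extra "cc" clobber is a harmless addition because `shl` and `or` modify the flags.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: reads the time-stamp counter.
// `asm volatile` marks the statement as having side effects, so the
// optimizer should not merge identical copies or delete "unused" ones.
static inline std::uint64_t read_tsc() {
    std::uint64_t ticks;
    asm volatile("rdtsc\n\t"           // Returns the counter in EDX:EAX.
                 "shl $32, %%rdx\n\t"  // Move the upper half into bits 63..32.
                 "or %%rdx, %q0"       // Merge it with the lower half in RAX.
                 : "=a"(ticks)
                 :
                 : "rdx", "cc");
    return ticks;
}

// Hypothetical helper: TSC ticks that elapse while the thread sleeps.
static inline std::uint64_t ticks_during(std::chrono::milliseconds pause) {
    const std::uint64_t start = read_tsc();
    std::this_thread::sleep_for(pause);
    const std::uint64_t stop = read_tsc();  // A real second read, not a reuse of `start`.
    return stop - start;
}
```

Calling `ticks_during(std::chrono::milliseconds(1000))` and dividing by 1'000'000'000. should give roughly the same GHz estimate as the unoptimized build above. Note that `volatile` prevents removal and merging, but the compiler may still move the statement relative to surrounding code. The `__rdtsc()` intrinsic from `<x86intrin.h>` is an alternative that avoids hand-written asm.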
 
     
    