I'm confused about whether rdtscp increments monotonically in a multi-core environment. According to the documentation for __rdtscp, rdtscp seems to be a processor-based instruction that can prevent instructions from being reordered around the call:
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset.
rdtscp definitely increments monotonically on the same CPU core, but is this rdtscp timestamp guaranteed monotonic across different CPU cores? I believe there is no such absolute guarantee. For example,
Thread on CPU core#0                   Thread on CPU core#1
unsigned int ui;
uint64_t t11 = __rdtscp(&ui); 
uint64_t t12 = __rdtscp(&ui);  
uint64_t t13 = __rdtscp(&ui);         
                                       unsigned int ui;
                                       uint64_t t21 = __rdtscp(&ui);
                                       uint64_t t22 = __rdtscp(&ui);
                                       uint64_t t23 = __rdtscp(&ui);
As I understand it, we can conclude definitively that t13 > t12 > t11, but we cannot guarantee that t21 > t13, even though t21 is read later.
I want to write a program to test whether my understanding is correct, but I don't know how to construct an example that validates my hypothesis. This is what I have so far:
// file name: rdtscptest.cpp
// g++ rdtscptest.cpp -g -lpthread -Wall -O0 -o run
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>
#include <pthread.h>
#include <sched.h>
#include <x86intrin.h>
// Pin the calling thread to the given CPU core (Linux-specific).
static void pin_to_core(int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
        std::cout << "Error calling pthread_setaffinity_np, code: " << rc << "\n";
    }
}
void test(int tid) {
    // Pin before reading the counter. Setting the affinity from main() after the
    // thread has started races with the first __rdtscp call (tid 0 does not sleep).
    pin_to_core(tid);
    std::this_thread::sleep_for(std::chrono::seconds(tid));
    unsigned int ui;  // receives the value of IA32_TSC_AUX
    uint64_t tid_unique_ = __rdtscp(&ui);
    std::cout << "tid: " << tid << ", counter: " << tid_unique_ << ", ui: " << ui << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(1));
}
int main() {
    size_t trd_cnt = 3;
    std::vector<std::thread> threads(trd_cnt);
    for (size_t i = 0; i < trd_cnt; i++) {
        // three threads with tid: 0, 1, 2
        // each thread pins itself to the CPU core matching its tid
        threads[i] = std::thread(test, static_cast<int>(i));
    }
    for (size_t i = 0; i < trd_cnt; i++) {
        threads[i].join();
    }
    return 0;
}
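To confirm that each sample really comes from the intended core, the `ui` value can be decoded. This is a minimal sketch that assumes Linux, where the kernel programs IA32_TSC_AUX with the CPU number in the low 12 bits and the NUMA node above them; other operating systems may store something else there. `TscSample` and `read_tsc_sample` are names made up for illustration.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Hypothetical helper type: one timestamp plus where it was taken.
struct TscSample {
    uint64_t tsc;   // time-stamp counter value
    unsigned cpu;   // logical CPU number (assumes the Linux encoding)
    unsigned node;  // NUMA node (assumes the Linux encoding)
};

// Read the TSC together with IA32_TSC_AUX in one instruction, so the
// CPU number belongs to the same core that produced the timestamp.
TscSample read_tsc_sample() {
    unsigned aux = 0;
    uint64_t tsc = __rdtscp(&aux);
    return TscSample{tsc, aux & 0xFFFu, aux >> 12};
}
```

Printing `cpu` next to the counter makes it easy to see whether two timestamps being compared were taken on the same core or on different ones.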
So, two questions here:
- Is my understanding correct or not?
- How to construct an example to validate it?
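One possible way to approach the second question is a ping-pong test: two threads pinned to different cores take turns, and each one reads the TSC only after it has observed the other thread's previous timestamp. If a later read ever returns a smaller value than the earlier one from the other core, the counters are not monotonic across those cores. This is only a sketch under assumptions (Linux, GCC or Clang, x86-64), and `pin_self` and `count_cross_core_regressions` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <pthread.h>
#include <sched.h>
#include <x86intrin.h>

// Pin the calling thread to one core (Linux-specific). Returns true on success.
static bool pin_self(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Hypothetical test: alternate TSC reads between two cores and count how
// often a read returns less than the previous read from the other core.
uint64_t count_cross_core_regressions(int cpu_a, int cpu_b, uint64_t rounds) {
    std::atomic<uint64_t> slot{0};         // most recently published timestamp
    std::atomic<uint64_t> turn{0};         // even: first thread, odd: second thread
    std::atomic<uint64_t> regressions{0};

    auto worker = [&](int cpu, uint64_t parity) {
        pin_self(cpu);  // if pinning fails, the test still runs, just unpinned
        unsigned aux = 0;
        for (uint64_t i = parity; i < 2 * rounds; i += 2) {
            // Wait for our turn; yield now and then so the test also
            // makes progress when both threads share a single core.
            for (unsigned spins = 1;
                 turn.load(std::memory_order_acquire) != i; ++spins) {
                if (spins % 4096 == 0) std::this_thread::yield();
                else _mm_pause();
            }
            uint64_t prev = slot.load(std::memory_order_relaxed);
            uint64_t now = __rdtscp(&aux);  // executes after the loads above
            if (now < prev) {
                regressions.fetch_add(1, std::memory_order_relaxed);
            }
            slot.store(now, std::memory_order_relaxed);
            turn.store(i + 1, std::memory_order_release);  // hand over
        }
    };

    std::thread a(worker, cpu_a, uint64_t{0});
    std::thread b(worker, cpu_b, uint64_t{1});
    a.join();
    b.join();
    return regressions.load();
}
```

For example, `count_cross_core_regressions(0, 1, 1000000)` would bounce between cores 0 and 1. A nonzero result shows the counters disagree. A result of zero does not prove a guarantee: it only means no offset larger than the handoff latency between the two cores was observed during the run.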
========== Update, according to the comments
On modern CPUs, __rdtscp values will (always?) keep increasing across cores as well.
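What the comments describe probably corresponds to the "invariant TSC" feature, which CPUID reports in leaf 0x80000007, EDX bit 8. A small sketch to query it, assuming GCC or Clang with `<cpuid.h>` (the function name `has_invariant_tsc` is made up here). Note that this bit only says the counter ticks at a constant rate regardless of power states; by itself it is not a statement that all cores, and especially all sockets, are synchronized.

```cpp
#include <cpuid.h>

// Returns true if the CPU reports an invariant TSC
// (CPUID leaf 0x80000007, EDX bit 8).
bool has_invariant_tsc() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 when the requested leaf is not supported.
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return ((edx >> 8) & 1u) != 0;
}
```

On Linux the same information shows up as the `constant_tsc` and `nonstop_tsc` flags in /proc/cpuinfo.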