The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.
The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.
If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf. e.g. perf stat ./a.out
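If you do go the program-it-yourself route, the values to write can be sketched like this. This only computes the bit patterns; actually writing the MSRs needs ring 0 (a kernel module, or something like the msr driver), which isn't shown. The helper names are made up for illustration, and the layout (a 4-bit field per fixed counter in IA32_FIXED_CTR_CTRL, enable bits starting at bit 32 of IA32_PERF_GLOBAL_CTRL) is what I remember from Intel's SDM, so check it against the manual for your CPU.

```c
#include <stdint.h>

/* MSR addresses as listed in Intel's SDM (architectural perfmon v2 and later). */
#define MSR_IA32_FIXED_CTR_CTRL   0x38Du
#define MSR_IA32_PERF_GLOBAL_CTRL 0x38Fu

/* Hypothetical helper: each fixed counter owns a 4-bit field in
   IA32_FIXED_CTR_CTRL. Bit 0 of the field = count in ring 0 (kernel),
   bit 1 = count in rings above 0 (user). Bits 2 and 3 are left clear here. */
static inline uint64_t fixed_ctr_ctrl_field(unsigned idx, int kernel, int user)
{
    uint64_t field = (kernel ? 1u : 0u) | (user ? 2u : 0u);
    return field << (4 * idx);
}

/* Hypothetical helper: fixed counter idx is globally enabled by
   bit 32+idx of IA32_PERF_GLOBAL_CTRL. */
static inline uint64_t global_ctrl_fixed_bit(unsigned idx)
{
    return 1ull << (32 + idx);
}
```

For fixed counter 0 (instructions) counting user-space only, that works out to OR-ing 0x2 into IA32_FIXED_CTR_CTRL and setting bit 32 in IA32_PERF_GLOBAL_CTRL. A field of 0 is the "count neither" state.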
If you use perf stat -e instructions:u ./perf ; echo $?, the fixed counter will actually be zeroed before entering your code so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u) you don't know the initial value of the counter. You can fix that by taking a delta, reading the counter once at start, then once after your loop.
The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.
It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
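To make the truncation argument concrete, here is a small C sketch of the same arithmetic. The function names are invented for this example, and the 64-bit parameters just stand in for full EDX:EAX reads; nothing here executes rdpmc.

```c
#include <stdint.h>

/* Hypothetical helper: subtract using only the low 32 bits of each read
   (what you get by keeping EAX and ignoring EDX). Unsigned arithmetic
   wraps mod 2^32, so the delta is right as long as the true count
   is below 2^32. */
static inline uint32_t delta_low32(uint64_t start, uint64_t end)
{
    return (uint32_t)end - (uint32_t)start;
}

/* Hypothetical helper: what the shell sees in $? if the count is used
   as the exit status. Only the low 8 bits survive. */
static inline uint8_t as_exit_status(uint32_t count)
{
    return (uint8_t)count;
}
```

For example, with a start value of 0x0000FFFFFFFFFFF0 and an end value of 0x7 (the counter wrapped in between), delta_low32 still returns 23. And as_exit_status(300) is 44, which is why the exit-status trick only works for counts below 256.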
Simplified even more than in your question. Also note indenting the operands so they can stay at a consistent column even for mnemonics longer than 3 letters.
    segment .text
    global _start
    _start:
        mov     ecx, 1<<30      ; fixed counter: instructions
        rdpmc
        mov     edi, eax        ; start
        mov     edx, 10
    .loop:
        dec     edx
        jnz     .loop
        rdpmc                   ; ecx = same counter as before
        sub     eax, edi        ; end - start
        mov     edi, eax
        mov     eax, 231
        syscall                 ; sys_exit_group(rdpmc).  sys_exit isn't wrong, but glibc uses exit_group.
Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall)
23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.
If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.
But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.
But note that RDPMC is not a serializing instruction, and can execute out of order. In general you may need lfence, or (as John McCalpin suggests in the thread you linked) giving ECX a false dependency on the results of instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.
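If you're calling this from C rather than hand-written asm, the lfence variant might look like the following sketch (GNU C inline asm, x86 only). Both function names are made up here. It returns only the EAX half, per the reasoning earlier, and it will fault unless user-space RDPMC is enabled and the counter is actually programmed.

```c
#include <stdint.h>

/* Hypothetical helper: ECX selector for fixed-function counter idx.
   Bit 30 selects the fixed counters; idx 0 is instructions. */
static inline uint32_t fixed_counter_selector(uint32_t idx)
{
    return (1u << 30) | idx;
}

#if defined(__x86_64__) || defined(__i386__)
/* Hypothetical helper: read the low 32 bits of a counter with lfence on
   both sides. The first lfence waits for earlier instructions to finish
   before rdpmc executes; the second keeps later instructions from
   starting before the read is done. */
static inline uint32_t rdpmc_low32_fenced(uint32_t selector)
{
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\t"
                     "rdpmc\n\t"
                     "lfence"
                     : "=a"(lo), "=d"(hi)
                     : "c"(selector)
                     : "memory");
    (void)hi;   /* high half ignored on purpose */
    return lo;
}
#endif
```

You'd call rdpmc_low32_fenced(fixed_counter_selector(0)) before and after the region of interest and subtract. The fences add their own overhead to the measured interval, so for very short regions the false-dependency trick may perturb things less.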
Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.
PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):
    echo 2 | sudo tee /sys/devices/cpu/rdpmc    # enable RDPMC always, not just when a perf event is open