Perf output is less than the actual number of instructions

Question

I tried to count the instructions executed by a simple add-loop application on a RISC-V FPGA, using a very simple RV32IM core running Linux 5.4.0 built with Buildroot.

add.c:

#include <stdio.h>

int main()
{
    int a = 0;

    for (int i = 0; i < 1024*1024; i++)
        a++;
    printf("RESULT:                 %d\n", a);
    return a;
}

I compiled with -O0 so that the loop is not optimized away. The resulting disassembly is:

000103c8 <main>:
   103c8:   fe010113            addi    sp,sp,-32
   103cc:   00812e23            sw  s0,28(sp)
   103d0:   02010413            addi    s0,sp,32
   103d4:   fe042623            sw  zero,-20(s0)
   103d8:   fe042423            sw  zero,-24(s0)
   103dc:   01c0006f            j   103f8 <main+0x30>
   103e0:   fec42783            lw  a5,-20(s0)
   103e4:   00178793            addi    a5,a5,1 # 12001 <__TMC_END__+0x1>
   103e8:   fef42623            sw  a5,-20(s0)
   103ec:   fe842783            lw  a5,-24(s0)
   103f0:   00178793            addi    a5,a5,1
   103f4:   fef42423            sw  a5,-24(s0)
   103f8:   fe842703            lw  a4,-24(s0)
   103fc:   001007b7            lui a5,0x100
   10400:   fef740e3            blt a4,a5,103e0 <main+0x18>
   10404:   fec42783            lw  a5,-20(s0)
   10408:   00078513            mv  a0,a5
   1040c:   01c12403            lw  s0,28(sp)
   10410:   02010113            addi    sp,sp,32
   10414:   00008067            ret

As you can see, the loop spans 103e0 to 10400, which is 9 instructions per iteration, so the total instruction count must be at least 9 * 1024^2 = 9437184. But the result of perf stat is pretty weird:

RESULT:                 1048576
    
     Performance counter stats for './add.out':

           3170.45 msec task-clock                #    0.841 CPUs utilized          
                20      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                38      page-faults               #    0.012 K/sec                  
         156192046      cycles                    #    0.049 GHz                      (11.17%)
           8482441      instructions              #    0.05  insn per cycle           (11.12%)
           1145775      branches                  #    0.361 M/sec                    (11.25%)

       3.771031341 seconds time elapsed

       0.075933000 seconds user
       3.559385000 seconds sys

The total number of instructions perf counted is lower than 9 * 1024^2; the difference is about 10%. How can this happen? I would expect the perf count to be larger than that lower bound, not smaller, because perf measures not only add.out itself but also the overhead of perf and of context switching.

Did you check for bug reports in the HW perf counters of the softcore you're using? On Intel hardware, `perf stat --all-user` results are pretty much exact, off by 1 instruction for hand-written asm static executables where you can easily count how many total instructions will execute in user-space before an exit_group `syscall` ([3 instead of 2 for this](https://stackoverflow.com/a/54356617/224132)). — Peter Cordes, May 15 '22 at 14:09
Undercounting seems weird to me, too, especially if you're not trying to limit it to user-space counts only. Perhaps RISC-V perf counters put something into a buffer or queue for the kernel to collect (when a counter wraps around), but the kernel isn't collecting soon enough and some get lost? Or something about save/restore of counters on context switches doesn't work perfectly? — Peter Cordes, May 15 '22 at 14:10
@PeterCordes The core's HW counter implementation is correct; I checked it with a Verilator simulation. And, I don't know why, but the perf I built has no --all-user option. If it is a buffering or context-switch related problem, is there any way to solve it? Or even any way to check whether that is really the cause? The whole perf tool is too vast to analyze. Do you know of any documentation on how perf operates internally, as opposed to a tutorial? — lemoncake, May 15 '22 at 14:45
It uses the kernel's perf-events system calls, like `perf_event_open`. If there's a software bug anywhere, it's almost certainly in Linux's "driver" for the perf counters in RV32, or for that core in particular. I don't know anything about those details, only guessing that the hardware might have similar functionality to what I know about Intel CPUs, with counters that can be programmed to count a certain event, and raise an interrupt when the counter wraps. (Or the count value can be read from them.) — Peter Cordes, May 15 '22 at 14:51
If your perf is too old for the `--all-user` option, perhaps try `-e task-clock,cycles:u,instructions:u` to count user-space cycles and instructions, if RV32 supports user vs. kernel counting. And like I said, you're already undercounting, and if this works it'll count even less. — Peter Cordes, May 15 '22 at 14:51
