I tried to count the instructions executed by a simple add-loop application on a RISC-V FPGA, using a very simple RV32IM core running Linux 5.4.0 built with Buildroot.
add.c:
#include <stdio.h>

int main(void)
{
    int a = 0;
    for (int i = 0; i < 1024*1024; i++)
        a++;
    printf("RESULT:                 %d\n", a);
    return a;
}
I compiled with -O0 so that the loop is not optimized away. The resulting disassembly is as follows:
000103c8 <main>:
   103c8:   fe010113            addi    sp,sp,-32
   103cc:   00812e23            sw  s0,28(sp)
   103d0:   02010413            addi    s0,sp,32
   103d4:   fe042623            sw  zero,-20(s0)
   103d8:   fe042423            sw  zero,-24(s0)
   103dc:   01c0006f            j   103f8 <main+0x30>
   103e0:   fec42783            lw  a5,-20(s0)
   103e4:   00178793            addi    a5,a5,1 # 12001 <__TMC_END__+0x1>
   103e8:   fef42623            sw  a5,-20(s0)
   103ec:   fe842783            lw  a5,-24(s0)
   103f0:   00178793            addi    a5,a5,1
   103f4:   fef42423            sw  a5,-24(s0)
   103f8:   fe842703            lw  a4,-24(s0)
   103fc:   001007b7            lui a5,0x100
   10400:   fef740e3            blt a4,a5,103e0 <main+0x18>
   10404:   fec42783            lw  a5,-20(s0)
   10408:   00078513            mv  a0,a5
   1040c:   01c12403            lw  s0,28(sp)
   10410:   02010113            addi    sp,sp,32
   10414:   00008067            ret
As you can see, the application loops from 103e0 to 10400, which is 9 instructions, so the total number of instructions must be at least 9 * 1024^2 = 9437184. But the result of perf stat is pretty weird:
RESULT:                 1048576
    
     Performance counter stats for './add.out':
           3170.45 msec task-clock                #    0.841 CPUs utilized          
                20      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                38      page-faults               #    0.012 K/sec                  
         156192046      cycles                    #    0.049 GHz                      (11.17%)
           8482441      instructions              #    0.05  insn per cycle           (11.12%)
           1145775      branches                  #    0.361 M/sec                    (11.25%)
       3.771031341 seconds time elapsed
       0.075933000 seconds user
       3.559385000 seconds sys
The total number of instructions perf counted (8482441) is lower than 9 * 1024^2 = 9437184; the difference is about 10%. How can this happen? I would expect the perf count to be larger than that, because perf measures not only add.out itself but also the overhead of perf and of context switching.
