Load and store bandwidth here means the amount of data that can be transferred from one cache level to another per cycle. For example, WikiChip states that the L2 cache has a 64 B/cycle bandwidth to L1$.
This is a good answer for calculating the L3 cache's overall bandwidth, but it assumes that each request transfers 64 bytes. I know from WikiChip that the figure is 64 B/cycle or 32 B/cycle, but I want to verify it by measurement.
My first, naive attempt is listed below: flush the cache lines, then load them again. Unsurprisingly, it failed, because it measures the time to bring data from memory into the cache rather than the transfer between cache levels.
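for (int page = 0; page < length/512; page++)
{
    /* Flush all 64 cache lines of this page (a stride of 8 elements is one
       64-byte line, assuming shm holds 8-byte elements). */
    asm volatile("mfence");
    for (int i = 0; i < 64; i++){
        flush(&shm[page*512+i*8]);
    }
    asm volatile("mfence");
    /* Load one element from each flushed line; volatile keeps the compiler
       from discarding the otherwise unused loads. */
    volatile int temp;
    int l1 = page*512 + 8*0;
    for (int i = 0; i < 64; i++){
        temp = shm[l1];
        l1 += 8;
    }
}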
To fix this problem, I could use eviction sets so that the data resides only in the L3 cache. However, the load-to-use latency far outweighs the time spent transferring on the bus. For example, the fastest load-to-use latency for the L3 cache is 42 cycles, while the L3 cache has a 32 B/cycle bandwidth to the L2 cache, so a 64-byte line occupies the bus for only 2 cycles and the bus never becomes the bottleneck for a single load. This method seems impractical.
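To put numbers on the latency-versus-bandwidth argument above, here is a minimal sketch of the arithmetic. The function names are hypothetical, the 42-cycle and 32 B/cycle figures are the ones quoted above, and the 64-byte line size and the idea that misses can overlap perfectly are assumptions, so treat the result as a rough estimate only.

```c
#include <stdint.h>

/* Cycles the bus is busy moving one cache line (rounded up). */
uint32_t transfer_cycles(uint32_t line_bytes, uint32_t bytes_per_cycle)
{
    return (line_bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}

/* Little's law: bytes in flight = latency * bandwidth.
   Returns how many cache lines would have to be outstanding at once
   to keep the bus busy every cycle (rounded up).
   Assumption: all outstanding misses overlap perfectly. */
uint32_t lines_in_flight(uint32_t latency_cycles,
                         uint32_t bytes_per_cycle,
                         uint32_t line_bytes)
{
    uint32_t bytes = latency_cycles * bytes_per_cycle;
    return (bytes + line_bytes - 1) / line_bytes;
}
```

With the numbers above, `transfer_cycles(64, 32)` gives 2 and `lines_in_flight(42, 32, 64)` gives 21, which suggests that a single dependent load can never expose the bus width; many independent misses would have to be outstanding at the same time.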
Then I tried AVX2, as listed below. vmovntdq uses a non-temporal hint to prevent caching of the data when it is written to memory. Each instruction stores 256 bits (32 bytes). I also assume that the store uses the buses from L1 to L2, L2 to L3, and L3 to memory; I don't know whether this assumption is reasonable. If it is, the bandwidth can be measured approximately: the smallest of the three bandwidths equals IPC × 32 bytes.
However, the IPC is between 0.09 and 0.10, which means the CPU retires roughly one instruction every ten cycles. That is nowhere near the bus limit, so this attempt fails as well.
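AvxLoops:                           ; rdi = destination, rsi = source (both 32-byte aligned)
    push    rbp
    mov     rbp,rsp
    mov rax,2000                    ; loop counter
    vmovaps ymm0, [rsi]             ; load the 32 bytes that will be stored repeatedly
.loop:
    vmovntdq [rdi], ymm0            ; eight 32-byte non-temporal stores = 256 bytes per iteration
    vmovntdq [rdi+32], ymm0
    vmovntdq [rdi+64], ymm0
    vmovntdq [rdi+96], ymm0
    vmovntdq [rdi+128], ymm0
    vmovntdq [rdi+160], ymm0
    vmovntdq [rdi+192], ymm0
    vmovntdq [rdi+224], ymm0
    add rdi,32                      ; advances by only 32 bytes, so consecutive iterations overlap
    dec rax
    cmp rax,0
    jge .loop                       ; repeats while rax >= 0, i.e. 2001 iterations in total
    mov     rsp,rbp
    pop     rbp
    ret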
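For reference, the IPC figures above can be converted into an approximate store bandwidth. This is only a sketch: the function name is hypothetical, and it assumes that the measured IPC counts all 12 instructions in the loop body (8 stores plus add, dec, cmp, and jge) and that each 256-bit store writes 32 bytes.

```c
/* Approximate bytes stored per cycle, given the measured IPC and the
   instruction mix of one loop iteration.
   Assumption: IPC is measured over the whole loop body, so only the
   fraction stores_per_iter / instrs_per_iter of retired instructions
   actually moves data. */
double store_bytes_per_cycle(double ipc,
                             unsigned stores_per_iter,
                             unsigned instrs_per_iter,
                             unsigned bytes_per_store)
{
    double stores_per_cycle = ipc * stores_per_iter / instrs_per_iter;
    return stores_per_cycle * bytes_per_store;
}
```

Under these assumptions, `store_bytes_per_cycle(0.10, 8, 12, 32)` comes out at a little over 2 B/cycle, far below the 32 B/cycle or 64 B/cycle figures quoted from WikiChip.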
Any good ideas? How can I measure the load and store bandwidth of the cache, counting only the transfer over the bus?
 
    