I am doing a detailed code analysis for which I want to measure the total number of bank conflicts per warp.
The nvvp documentation lists this metric, which was the only one I could find related to bank conflicts:
shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed
When I profile the metric using nvprof (or nvvp) I get a result like this:
Invocations            Metric Name                        Metric Description                Min         Max         Avg
Device "Tesla K20m (0)"
Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
301                    shared_replay_overhead             Shared Memory Replay Overhead    0.089730    0.089730    0.089730
I need to use this value (0.089730), or devise some other method, to arrive at a measurement of the number of bank conflicts.
I understand that this value is an average taken across all of the warps that are executing. If I had to measure the total number of bank conflicts per warp, is there a way to do it using the nvprof results?
Possible approaches that came to my mind:
- By taking the shared_replay_overhead results and using them in a formula to calculate the number of bank conflicts. I am guessing I have to apply some formula like shared_replay_overhead * total number of warps launched, where I know the total number of warps launched in advance, but I can't figure out the exact relationship.
- By first detecting whether it's a four-way bank conflict, an eight-way bank conflict, etc., and then multiplying 4 (or 8) by the number of times the shared memory operation takes place (how do I measure that?).
The second approach probably requires fairly deep knowledge of the GPU architecture, in addition to the nvprof results, which I don't think I have yet. For the record, my GPU is of the Kepler architecture, SM 3.5.
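To make the first approach concrete, here is a sketch of what I have in mind. Note that the formula itself is only my guess, and the launch configuration (block count, 32x32 block size matching the matrixMulCUDA<32> template parameter) is hypothetical:

```python
# Sketch of the first approach: scale the per-instruction replay overhead
# up to a total conflict count. The formula is a guess, not something I
# have confirmed against the documentation.

shared_replay_overhead = 0.089730   # Avg value reported by nvprof above

# Hypothetical launch configuration: a 32x32 block (1024 threads), so
# 1024 / 32 = 32 warps per block; the grid size here is made up.
warps_per_block = (32 * 32) // 32
blocks_launched = 100               # placeholder
total_warps = warps_per_block * blocks_launched

# Guessed formula: total conflicts ~ overhead * (number of warps launched)?
estimated_conflicts = shared_replay_overhead * total_warps
print(estimated_conflicts)
```

Is something along these lines valid, or is the overhead normalized per instruction rather than per warp, in which case I would need an instruction count instead?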
Even if I can only measure the number of bank conflicts per block instead of per warp, that will suffice; I can then do the necessary arithmetic to get a per-warp value.
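That last conversion is just a division by the number of warps per block, e.g. (again assuming a hypothetical 32x32 block and a made-up per-block count):

```python
# Converting a per-block conflict count to a per-warp figure once the
# block size is known. The 32x32 block size mirrors the kernel's template
# parameter above; the per-block count is a placeholder, not a measurement.

threads_per_block = 32 * 32
warp_size = 32                       # warp size on Kepler
warps_per_block = threads_per_block // warp_size

conflicts_per_block = 288.0          # hypothetical measured value
conflicts_per_warp = conflicts_per_block / warps_per_block
print(conflicts_per_warp)            # 9.0
```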