I have trouble believing that profile result. In this code
16      for (int x = 1; x < w + 1; x++, pg++, ps += n_bins, psu += n_bins) {
17          s += *pg;
18          *ps = *psu + s;
19      }
it says the lion's share of time is on line 18, very little on 17, and next to nothing on line 16.
Yet the loop header on line 16 is also doing a comparison, two increments, and three adds on every iteration.
Cache misses might explain it, but there's no harm in double-checking, which I do with this technique.
Regardless, the loop could be unrolled, for example:
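int x = w;                 // number of elements still to process
// main loop: four elements per iteration, one loop test per four elements
while (x >= 4) {
  s += pg[0];
  ps[n_bins*0] = psu[n_bins*0] + s;
  s += pg[1];
  ps[n_bins*1] = psu[n_bins*1] + s;
  s += pg[2];
  ps[n_bins*2] = psu[n_bins*2] + s;
  s += pg[3];
  ps[n_bins*3] = psu[n_bins*3] + s;
  x -= 4;
  pg += 4;
  ps += n_bins*4;
  psu += n_bins*4;
}
// cleanup loop: the remaining 0 to 3 elements, one at a time
for (; --x >= 0;) {
  s += *pg;
  *ps = *psu + s;
  pg++;
  ps += n_bins;
  psu += n_bins;
}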
If n_bins happens to be a compile-time constant, the offsets n_bins*1, n_bins*2 and n_bins*3 become fixed displacements, which could enable the compiler to do some more optimizing of the code in the while loop.
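
To check that the unrolled version computes the same thing as the original loop, here is a minimal self-contained sketch. Several things in it are assumptions and not from the question: the element type is taken to be int, s is taken to start at 0, the bin count is a made-up constant N_BINS = 8, and the function names row_simple and row_unrolled are only for illustration.

```c
/* Hypothetical compile-time bin count (assumption, not from the question). */
enum { N_BINS = 8 };

/* Reference version: same body as the original loop, one element per pass.
   Element type int and s starting at 0 are assumptions. */
static void row_simple(const int *pg, const int *psu, int *ps, int w)
{
    int s = 0;
    for (int x = 0; x < w; x++, pg++, ps += N_BINS, psu += N_BINS) {
        s += *pg;
        *ps = *psu + s;
    }
}

/* Unrolled version: four elements per pass, then a cleanup loop. */
static void row_unrolled(const int *pg, const int *psu, int *ps, int w)
{
    int s = 0;
    int x = w;
    while (x >= 4) {
        /* With N_BINS constant, these strides are fixed offsets. */
        s += pg[0];
        ps[N_BINS * 0] = psu[N_BINS * 0] + s;
        s += pg[1];
        ps[N_BINS * 1] = psu[N_BINS * 1] + s;
        s += pg[2];
        ps[N_BINS * 2] = psu[N_BINS * 2] + s;
        s += pg[3];
        ps[N_BINS * 3] = psu[N_BINS * 3] + s;
        x -= 4;
        pg += 4;
        ps += N_BINS * 4;
        psu += N_BINS * 4;
    }
    /* Remaining 0 to 3 elements. */
    for (; --x >= 0;) {
        s += *pg;
        *ps = *psu + s;
        pg++;
        ps += N_BINS;
        psu += N_BINS;
    }
}
```

Running both on the same input with a width that is not a multiple of 4 exercises the cleanup loop as well. Whether the unrolled form is actually faster still has to be measured, since the strided stores may be dominated by cache misses either way.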