CUDA multiple dynamically allocated shared arrays in single kernel

Question

I have the following problem. I am trying to divide a shared array into smaller arrays and then use these arrays in other device functions. In my kernel function I do,

for (int block_x = 0; block_x < blockDim.x; block_x++) {
  for (int block_y = 0; block_y < blockDim.y; block_y++) {
    //set up shared memory block
    extern __shared__ vec3f share[];
    vec3f *sh_pos = share;
    vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
    vec3f *sh_density = &sh_velocity[blockDim.x*blockDim.y];
    vec3f *sh_pressure = &sh_density[blockDim.x*blockDim.y];
    //index by 2d threadidx's
    unsigned int index = (block_x * blockDim.x + threadIdx.x) + blockDim.x * gridDim.x * (block_y * blockDim.y + threadIdx.y);
    sh_pos[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].position();
    sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].velocity();
    sh_pressure[blockDim.x * threadIdx.x + threadIdx.y].x = oldParticles[index].pressure();
    sh_density[blockDim.x * threadIdx.x + threadIdx.y].x = oldParticles[index].density();
    __syncthreads();
    d_force_pressure(oldParticles[arr_pos],c_kernel_support);
    __syncthreads();
  }
}

As far as I can tell, all the sh_ arrays get filled with zeros and not the values that I want. I can't tell what I am doing wrong. Note that vec3f is vector of floats just like the float3 datatype. Also, I didn't think I could mix in floats for density and pressure so I just made them vectors and am using a single component. Then, for example my d_force_pressure function is,

__device__ void d_force_pressure(particle& d_particle, float h) {
  extern __shared__ vec3f share[];
  vec3f *sh_pos = share;
  vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
  vec3f *sh_density = &sh_velocity[blockDim.x*blockDim.y];
  vec3f *sh_pressure = &sh_density[blockDim.x*blockDim.y];
  for (int i = 0; i < blockDim.x * blockDim.y; i++) {
    vec3f diffPos = d_particle.position() - sh_pos[i];
    d_particle.force() += GradFuncion(diffPos,h) * -1.0 * c_particle_mass *  (d_particle.pressure()+sh_pressure[i].x)/(2.0*sh_density[i].x);
  }  
}

After calls to this function I get NaNs since I am dividing by zero (sh_density[i].x is as far as I can tell, 0). Also is this in general, the correct way to load shared memory?

Kernel is called by

dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), (int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), 1);
int sharedMemSize = block.x*block.y*4*sizeof(vec3f);
force_kernel<<< grid,block,sharedMemSize  >>>(particle_ptrs[1],particle_ptrs[0],time_step);

Are you sure that there are no CUDA runtime errors occurring when you try running your code? Do you have *comprehensive* API and kernel launch error checking as per [this](http://stackoverflow.com/q/14038589/681865)? — talonmies, May 25 '13 at 14:33

score 1 · Answer 1 · edited May 23 '17 at 11:43

This is kind of a follow-up answer.

As per the comments made by @RobertCrovella, I went on to run cuda-memcheck. Believe it or not, this actually showed no errors. However, when I changed a constant in my code (that controls the sizing of some arrays) cuda-memcheck showed errors related to the question posted here write error. This made me re-check the way I was filling the shared arrays. Basically what needed changing was the

for (int block_x = 0; block_x < blockDim.x; block_x++) {
  for (int block_y = 0; block_y < blockDim.y; block_y++) {

to

for (int block_x = 0; block_x < gridDim.x; block_x++) {
  for (int block_y = 0; block_y < gridDim.y; block_y++) {

I believe that this then gives the right position for the index variable. I basically learned that whenever you are using shared memory and notice things running slow, its a good idea to use cuda-memcheck.

score 0 · Accepted Answer · answered May 25 '13 at 14:30

I indicated in your previous question that you don't want to do this:

dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), (int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), 1);

you want to do this:

dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x))), (int)ceil(sqrt(float(max_particles)) / (float(block.y))), 1);

The x grid direction should be scaled by the threadblock x dimension, not the threadblock x dimension * threadblock y dimension. However the code I posted in my previous answer also had this error, even though I pointed it out in the comments, I forgot to fix it.

Furthermore, this indexing doesn't look right to me:

sh_velocity[blockDim.x * threadIdx.x + threadIdx.y]

I think it should be:

sh_velocity[blockDim.x * threadIdx.y + threadIdx.x]

You have several examples of that.

You haven't posted a complete executable. There certainly may be more issues than the ones I've pointed out above. If I have to go through all the vec3f -> float3 conversion work I did in your last question, well, someone else can help you then. If you write a simple reproducer that doesn't depend on a bunch of code I don't have, I can try to help further. More than likely, if you do that, you'll discover the problem yourself.

Have you put cuda error checking into your code like I suggested in my last answer?

You might also want to run your code through cuda-memcheck:

cuda-memcheck ./mycode

I think I figured it out. It primarily had to do with the grid setup (I used the fix that you posted above). I also made the changes you suggested for indexing the `sh_` arrays. I feel like the two indexing methods are equivalent for a square block (which is what I am using) but differ if the block is rectangular. I did implement error-checking. It seems like things are working right now. I'll re-post if I run into any more issues. The simulation runs slow but I think that's because of my gpu which is a 6ish year old 8600M GT and the plethora of N^2 calculations. — kmarsh, May 25 '13 at 23:29

CUDA multiple dynamically allocated shared arrays in single kernel

2 Answers2