Using CUDA atomicAdd to port this piece of code

Question

This is my sequential code:

float foo(float* in1, float* in2, float in3, unsigned int size) {
    float tmp = 0.f;
    for (int i = 0; i < size; i++)
        if (in2[i] > 0)
            tmp += (in1[i] / in3 - (in2[i] / in3) * (in2[i] / in3));
    return tmp;
}

This is my effort to port it to CUDA:

__global__ void kernel_foo(float* tmp, const float* in1, const float* in2,
                           float in3, unsigned int size) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        if (in2[i] > 0) {
            atomicAdd(tmp, in1[i] / in3 - (in2[i] / in3) * (in2[i] / in3));
        }
    }
}

void launch_kernel_foo(float* tmp, const float* in1, const float* in2,
                       float in3,  unsigned int size) {
  kernel_foo<<<(size+255)/256,256>>>(tmp, in1, in2, in3, size);
}

but it doesn't produce correct results. Could anyone tell me where the mistake is?

What doesn't work? What is the expected result and what is the current result? Have you initialized the device variable `tmp` to zero before launching the kernel? — sgarizvi, Feb 16 '16 at 08:55
Of course it does work, but its result is wrong. No, I have not initialized the device variable tmp. How could I initialize it to zero? — R.GH, Feb 16 '16 at 08:58
But you have written in the question that **it doesn't work**. Initialize `tmp` to `0` before launching the kernel using `cudaMemset(tmp, 0, sizeof(float));`. — sgarizvi, Feb 16 '16 at 09:00
Excuse me, I had forgotten that; I have edited it now. Yes, it must be cudaMemset. May I ask you to write an example of using cudaMemset? — R.GH, Feb 16 '16 at 09:04

score 3 · Answer 1 · edited May 23 '17 at 12:06

The result is incorrect because the output variable tmp is not initialized before the kernel is launched. Every thread accumulates into tmp with atomicAdd, so it must be set to 0 first; otherwise the sum starts from whatever value that device memory happened to hold. This can be done as follows.

void launch_kernel_foo(float* tmp, const float* in1, const float* in2,
                       float in3,  unsigned int size) {
  cudaMemset(tmp, 0, sizeof(float));
  kernel_foo<<<(size+255)/256,256>>>(tmp, in1, in2, in3, size);
}
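
Since an example of cudaMemset was requested in the comments, here is a minimal end-to-end sketch of how the pieces could fit together on the host side. The wrapper name `foo_gpu` and the buffer names `d_in1`, `d_in2`, `d_tmp` are assumptions made for illustration and are not part of the original code; error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>

// Kernel from the question, with the two nested conditions merged.
__global__ void kernel_foo(float* tmp, const float* in1, const float* in2,
                           float in3, unsigned int size) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size && in2[i] > 0) {
        atomicAdd(tmp, in1[i] / in3 - (in2[i] / in3) * (in2[i] / in3));
    }
}

// Hypothetical host wrapper (name is an assumption): copies the inputs to
// the device, zeroes the accumulator, launches the kernel, returns the sum.
float foo_gpu(const float* h_in1, const float* h_in2, float in3,
              unsigned int size) {
    float *d_in1 = nullptr, *d_in2 = nullptr, *d_tmp = nullptr;
    size_t bytes = size * sizeof(float);

    cudaMalloc(&d_in1, bytes);
    cudaMalloc(&d_in2, bytes);
    cudaMalloc(&d_tmp, sizeof(float));
    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    // cudaMemset fills bytes; an all-zero bit pattern is 0.0f for a float.
    cudaMemset(d_tmp, 0, sizeof(float));

    // A grid with zero blocks is an invalid launch, so skip empty inputs.
    if (size > 0) {
        kernel_foo<<<(size + 255) / 256, 256>>>(d_tmp, d_in1, d_in2, in3, size);
    }

    // This blocking copy runs after the kernel has finished.
    float result = 0.f;
    cudaMemcpy(&result, d_tmp, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in1);
    cudaFree(d_in2);
    cudaFree(d_tmp);
    return result;
}
```

Note that the order in which the atomic additions are applied is not fixed, so the floating-point result may differ from the sequential version in the last few bits.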

It is highly recommended to add CUDA error checking to your code, as described in this post.
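
One common way to do such checking is sketched below. The macro name `CUDA_CHECK` and the function name `launch_kernel_foo_checked` are hypothetical, chosen for this illustration; the runtime calls `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are the standard ones. This is only a sketch of the idea, not necessarily the exact approach the linked post describes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name is an assumption): aborts with a
// readable message if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Same kernel as in the question.
__global__ void kernel_foo(float* tmp, const float* in1, const float* in2,
                           float in3, unsigned int size) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size && in2[i] > 0) {
        atomicAdd(tmp, in1[i] / in3 - (in2[i] / in3) * (in2[i] / in3));
    }
}

// All pointer arguments are device pointers, as in the answer above.
void launch_kernel_foo_checked(float* tmp, const float* in1, const float* in2,
                               float in3, unsigned int size) {
    CUDA_CHECK(cudaMemset(tmp, 0, sizeof(float)));
    kernel_foo<<<(size + 255) / 256, 256>>>(tmp, in1, in2, in3, size);
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors during execution
}
```

The synchronize call makes the launch blocking, which is convenient while debugging but can be dropped once the code is known to work.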

answered Feb 16 '16 at 09:11

sgarizvi

@R.GH: It is a simple proposition. Provide a complete MCVE and someone might be able to give you a definite answer to your question. If you don't, the amount of help you can realistically expect to receive here is limited. It is completely your choice. – talonmies Feb 16 '16 at 16:02
