Concurrently initializing many arrays with random numbers using Curand and CUDA kernel

Question

I am trying to initialize 100 elements of each of these parallel arrays with randomly generated numbers concurrently on the GPU. However, my routine is not producing a variety of random numbers. When I debug the code in Visual Studio, I see the same number in every element of each array. The goal of this code is to optimize the CImg FilledTriangles routine to use the GPU where it can.

What am I doing wrong and how can I fix it? Here is my code:

__global__ void initCurand(curandState* state, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &state[idx]);
    __syncthreads();
}

/*
 * CUDA kernel that will execute 100 threads in parallel
*/

__global__ void initializeArrays(float* posx, float* posy,float* rayon, float* veloc, float* opacity
                                ,float * angle, unsigned char** color, int height, int width, curandState* state){

    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    curandState localState = state[idx];
    __syncthreads();

    posx[idx] = (float)(curand_uniform(&localState)*width);
    posy[idx] = (float)(curand_uniform(&localState)*height);
    rayon[idx] = (float)(10 + curand_uniform(&localState)*50);
    angle[idx] = (float)(curand_uniform(&localState)*360);
    veloc[idx] = (float)(curand_uniform(&localState)*20 - 10);
    color[idx][0] = (unsigned char)(curand_uniform(&localState)*255);
    color[idx][1] = (unsigned char)(curand_uniform(&localState)*255);
    color[idx][2] = (unsigned char)(curand_uniform(&localState)*255);
    opacity[idx] = (float)(0.3 + 1.5*curand_uniform(&localState));
}

Here is the host code that prepares and calls these kernels: I am trying to create 100 threads (for each element) on one block in a grid.

    // launch grid of threads
    dim3 dimBlock(100);
    dim3 dimGrid(1);

    initCurand<<<dimBlock,dimGrid>>>(devState, unsigned(time(nullptr)));
    // synchronize the device and the host
    cudaDeviceSynchronize();
    initializeArrays<<<dimBlock, dimGrid>>>(d_posx, d_posy, d_rayon, d_veloc, d_opacity, d_angle,d_color, img0.height(), img0.width(), devState);

Preliminaries:

  // Define random properties (pos, size, colors, ..) for all triangles that will be displayed.
    float posx[100], posy[100], rayon[100], angle[100], veloc[100], opacity[100];
    // Define the same properties but for the device
    float* d_posx;
    float* d_posy;
    float* d_rayon;
    float* d_angle;
    float* d_veloc;
    float* d_opacity;
    //unsigned char d_color[100][3];
    unsigned char** d_color;
    curandState* devState;
    cudaError_t err;

    // allocate memory on the device for the device arrays
    err = cudaMalloc((void**)&d_posx, 100 * sizeof(float));
    err = cudaMalloc((void**)&d_posy, 100 * sizeof(float));
    err = cudaMalloc((void**)&d_rayon, 100 * sizeof(float));
    err = cudaMalloc((void**)&d_angle, 100 * sizeof(float));
    err = cudaMalloc((void**)&d_veloc, 100 * sizeof(float));
    err = cudaMalloc((void**)&d_opacity, 100 * sizeof(float));
    err = cudaMalloc((void**)&devState, 100*sizeof(curandState));
    errCheck(err);
    size_t pitch;
    //allocated the device memory for source array  
    err = cudaMallocPitch(&d_color, &pitch, 3 * sizeof(unsigned char),100);

Getting the results:

    // get the populated arrays back to the host for use
    err = cudaMemcpy(posx,d_posx, 100 * sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy(posy,d_posy, 100 * sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy(rayon,d_rayon, 100 * sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy(veloc,d_veloc, 100 * sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy(opacity,d_opacity, 100 * sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy(angle,d_angle, 100 * sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy2D(color,pitch,d_color,100, 100 *sizeof(unsigned char),3, cudaMemcpyDeviceToHost);

score 1 · Accepted Answer · edited May 23 '17 at 11:51

You will definitely need to change this:

err = cudaMalloc((void**)&devState, 100*sizeof(float));

to this:

err = cudaMalloc((void**)&devState, 100*sizeof(curandState));

If you ran your code through cuda-memcheck, you would have discovered this. Your initCurand kernel had plenty of out-of-bounds accesses due to this.

You should also be doing error checking on all CUDA API calls and all kernel launches. I believe your second kernel call is failing due to a messed-up operation on your color[][] array.
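To make the error-checking advice concrete, here is a minimal sketch of one common pattern. The macro name CUDA_CHECK and the kernel fillIndex are hypothetical names chosen for illustration; they are not part of the CUDA toolkit or of the code in the question. The runtime calls used (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CUDA_CHECK is a hypothetical helper name, not something the toolkit provides.
// It wraps any call that returns cudaError_t and aborts with a readable message.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Illustrative kernel: each thread stores its own global index.
__global__ void fillIndex(float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) out[idx] = (float)idx;
}

int main() {
    const int n = 100;
    float* d_out = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d_out, n * sizeof(float)));

    fillIndex<<<1, n>>>(d_out, n);        // <<<number of blocks, threads per block>>>
    CUDA_CHECK(cudaGetLastError());       // reports an invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    float h_out[n];
    CUDA_CHECK(cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost));
    std::printf("h_out[%d] = %f\n", n - 1, h_out[n - 1]);

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

A kernel launch returns nothing, so the launch itself is checked with cudaGetLastError() and the execution with the return value of the next synchronizing call. In the question, only the result of the last cudaMalloc reaches errCheck, so earlier failures would go unnoticed.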

Normally when we create an array with cudaMallocPitch, we need to access it using the pitch parameter. C doubly-subscripted arrays by themselves won't work, because C has no inherent knowledge of the actual array width.
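As a sketch of what "access it using the pitch parameter" means in practice: the pitch returned by cudaMallocPitch is a byte count, so the start of row r is found by offsetting the base pointer in bytes and then casting back to the element type. The kernel name fillPitched and the 3-by-100 float layout below are assumptions made for illustration only.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: one thread per row, writing r * 10 + c into element (r, c).
__global__ void fillPitched(float* base, size_t pitch, int width, int height) {
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    if (r >= height) return;
    // pitch is in bytes, so step through a char* before casting back to float*.
    float* row = (float*)((char*)base + r * pitch);
    for (int c = 0; c < width; ++c) row[c] = r * 10.0f + c;
}

int main() {
    const int width = 3, height = 100;
    float* d_data = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&d_data, &pitch, width * sizeof(float), height) != cudaSuccess) return 1;

    fillPitched<<<1, height>>>(d_data, pitch, width, height);
    if (cudaGetLastError() != cudaSuccess) return 1;
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    // Copy the raw padded buffer and apply the same byte arithmetic on the host.
    std::vector<char> raw(pitch * height);
    if (cudaMemcpy(raw.data(), d_data, raw.size(), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    float value = 0.0f;
    std::memcpy(&value, raw.data() + 42 * pitch + 1 * sizeof(float), sizeof(float));
    std::printf("pitch = %zu bytes, element (42, 1) = %.1f\n", pitch, value);

    cudaFree(d_data);
    return 0;
}
```

For an unsigned char array the element size is one byte, so the byte offset and the element offset coincide and plain indexing of the form base[r * pitch + c] is equivalent.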

I was able to fix it by making the following changes:

__global__ void initializeArrays(float* posx, float* posy,float* rayon, float* veloc, float* opacity,float * angle, unsigned char* color, int height, int width, curandState* state, size_t pitch){

    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    curandState localState = state[idx];
    __syncthreads();

    posx[idx] = (float)(curand_uniform(&localState)*width);
    posy[idx] = (float)(curand_uniform(&localState)*height);
    rayon[idx] = (float)(10 + curand_uniform(&localState)*50);
    angle[idx] = (float)(curand_uniform(&localState)*360);
    veloc[idx] = (float)(curand_uniform(&localState)*20 - 10);
    color[idx*pitch] = (unsigned char)(curand_uniform(&localState)*255);
    color[(idx*pitch)+1] = (unsigned char)(curand_uniform(&localState)*255);
    color[(idx*pitch)+2] = (unsigned char)(curand_uniform(&localState)*255);
    opacity[idx] = (float)(0.3 + 1.5*curand_uniform(&localState));
}

and

 initializeArrays<<<dimBlock, dimGrid>>>(d_posx, d_posy, d_rayon, d_veloc, d_opacity, d_angle,d_color, img0.height(), img0.width(), devState, pitch);

and

unsigned char* d_color;

With those changes, I was able to eliminate the errors I found, and the code produced a variety of random values. I haven't inspected all the values, but that should get you started.
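One thing the changes above do not cover is copying the pitched color array back to the host. The following is a sketch only, assuming the host array is declared as unsigned char color[100][3] (the question does not show its declaration) and using placeholder values in place of the cuRAND output. The argument order of cudaMemcpy2D is destination, destination pitch, source, source pitch, row width in bytes, number of rows, direction.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: writes three placeholder bytes per row instead of random colors.
__global__ void fillColors(unsigned char* color, size_t pitch, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    color[idx * pitch]     = (unsigned char)idx;
    color[idx * pitch + 1] = (unsigned char)(idx + 1);
    color[idx * pitch + 2] = (unsigned char)(idx + 2);
}

int main() {
    const int n = 100;
    unsigned char color[n][3];  // assumed host layout: tightly packed, 3 bytes per row
    unsigned char* d_color = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&d_color, &pitch, 3 * sizeof(unsigned char), n) != cudaSuccess) return 1;

    fillColors<<<1, n>>>(d_color, pitch, n);  // one block of n threads
    if (cudaGetLastError() != cudaSuccess) return 1;

    // Host rows are 3 bytes apart, device rows are pitch bytes apart;
    // each of the n rows carries 3 bytes of payload.
    if (cudaMemcpy2D(color, 3 * sizeof(unsigned char), d_color, pitch,
                     3 * sizeof(unsigned char), n, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    std::printf("color[7] = (%d, %d, %d)\n", color[7][0], color[7][1], color[7][2]);
    cudaFree(d_color);
    return 0;
}
```

Two details differ from the question's host code. The cudaMemcpy2D call there passes pitch as the destination pitch and 100 as the source pitch, which does not match the layout described above. Also, the launch syntax is <<<grid, block>>>, so <<<dimBlock, dimGrid>>> with dimBlock(100) and dimGrid(1) launches 100 blocks of one thread rather than one block of 100 threads. The index calculation still yields 0 to 99 either way.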

Thanks! I can't believe I missed that! I'm fairly new to CUDA. I'll update my question to include that change. - Shayan Zafar, Mar 06 '13 at 05:05
Actually, with my additional edits, I think you should be able to get it working. - Robert Crovella, Mar 06 '13 at 05:24
I'm still getting errors, but those are probably due to the color array's structure being changed and how it is called. I'll make those changes and see what happens. Thank you very much for your help; I still have a lot to learn in CUDA. - Shayan Zafar, Mar 06 '13 at 05:35
For future readers: There is also an index calculation mistake: `int idx = threadIdx.x + blockIdx.x * blockDim.x;` should be `int idx = threadIdx.x * blockDim.x + blockIdx.x;` or the random numbers will be highly correlated. - Flamefire, Sep 24 '15 at 08:17
I don't agree with the above comment. The given indexing is the standard method of creating a globally unique thread index in CUDA. The suggested modification will create a situation that is broken in a variety of ways. If there is a concern about correlation between the random numbers generated by different threads, it can be addressed with appropriate manipulation of the seed passed to each thread at initialization time, or by other methods. - Robert Crovella, Sep 24 '15 at 14:37
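For readers wondering what the seeding options in the last comment could look like, here is a hedged sketch of two ways to call curand_init(seed, sequence, offset, state). The kernel names and the seed + idx scheme are illustrative assumptions, not something the commenters specified. As I understand the cuRAND documentation, the same seed with a different sequence number per thread (which is what the question's initCurand already does) is the variant described as giving statistically uncorrelated streams, while a different seed per thread with sequence 0 initializes faster but carries no such guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Option A: shared seed, per-thread sequence number (as in the question's initCurand).
__global__ void initSameSeed(curandState* state, unsigned long long seed, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) curand_init(seed, idx, 0, &state[idx]);
}

// Option B (assumed scheme): per-thread seed, sequence 0 for every thread.
__global__ void initPerThreadSeed(curandState* state, unsigned long long seed, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) curand_init(seed + idx, 0, 0, &state[idx]);
}

// Draws one uniform value per thread and saves the advanced generator state.
__global__ void draw(curandState* state, float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    curandState local = state[idx];
    out[idx] = curand_uniform(&local);
    state[idx] = local;  // without this, a later kernel would repeat the same values
}

int main() {
    const int n = 100;
    curandState* d_state = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc((void**)&d_state, n * sizeof(curandState)) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_out, n * sizeof(float)) != cudaSuccess) return 1;

    initSameSeed<<<1, n>>>(d_state, 1234ULL, n);  // swap in initPerThreadSeed to compare
    draw<<<1, n>>>(d_state, d_out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    float h_out[n];
    if (cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    for (int i = 0; i < 4; ++i) std::printf("thread %d drew %f\n", i, h_out[i]);

    cudaFree(d_state);
    cudaFree(d_out);
    return 0;
}
```

Either way, the standard idx = threadIdx.x + blockIdx.x * blockDim.x calculation stays unchanged; only the arguments to curand_init differ.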
