CUDA: sum-reduction --- data lost in call to device function

Question

I am writing a CUDA sum reduction code taking the sum of the absolute values of an array starting on element begin_index through end_index (I am using one block with a variable number of threads). However, when I pass the array A to reduce_fabs(), everything from A[4] and higher indexes end up somehow being set to zero. Here is the code and the function calls and more explanation of what I have tried following.

This the kernel that calls the sum-reduction kernel:

__device__ void tridiag(float *A,int *dim, float *diag,float *offdiag) {

    A[0] = 1.0f; A[1] = 2.0f; A[2] = 3.0f;
    A[3] = 4.0f; A[4] = 5.0f; A[5] = 6.0f;
    diag[0] = reduce_fabs(A,0,3);
    __syncthreads();
    diag[1] = reduce_fabs(A,0,4);

    return;
}

This is the sum-reduction kernel:

__device__ float reduce_fabs(float *v, int begin_index, int end_index) {
    extern __shared__ float sum_array[];
    int tid = threadIdx.x;
    if(tid >= begin_index && tid <= end_index) {
        sum_array[tid-begin_index] = fabs(v[tid]);
        sum_array[tid+end_index-begin_index+1] = 0;
    }
    __syncthreads();
    for(int j=1;j<=(end_index-begin_index);j*=2) {
        if((tie-begin_index)%(2*j) == 0 && tie >= begin_index && tid <= end_index) {
            sum_array[(tie-begin_index)] += sum_array[(tie-begin_index)+j];
        }
        __syncthreads();
    }

    return sum_array[0];
}

Having seen the code, a more specific description of the problem is that when I pass the array A to reduce_fabs(), the value of element 4 becomes A[4] = 0 which is wrong. A[0],A[1],A[2], and A[3] are fine for any end_index, but every end_index larger than 3 results in the elements of A beyond 3 being set to zero upon passage to the reduction kernel.

Here is what I tried already:

I tried diag[0] = A[4] to make sure the initial assignment was working. It was.
I eliminated the summation portion of reduce_fabs(), stopping after the first __syncthreads() and the problem still persisted.
I eliminated the zeroing of elements beyond those of interest for the sum; i.e. I commented //sum_array[tid+end_index-begin_index+1] = 0 (the summation was commented out for this part too). No success.
I eliminated if(tie >= begin_index && tid <= end_index) so that sum_array was assigned for all threads (summation also commented out) to no avail.

Here is a main function to run it:

#include<stdio.h>
#include<stdlib.h>
#include<cuda.h>

int main() {
    int n = 10;
    float *A = (float *)malloc(n*sizeof(*A));
    float *diag = (float *)malloc(n*sizeof(*A));
    float *offdiag = (float *)malloc(n*sizeof(*A));

    int *p_n;
    float *p_A, *p_diag, *p_offdiag;

    cudaMalloc((void**) &p_A,n*sizeof(float));
    cudaMalloc((void**) &p_diag,n*sizeof(float));
    cudaMalloc((void**) &p_offdiag,n*sizeof(float));
    cudaMalloc((void**) &p_n,sizeof(int));

    cudaMemcpy(p_n,&n,n*sizeof(int),cudaMemcpyHostToDevice);

    tridiag<<<1,n>>>(p_A,p_n,p_diag,p_offdiag);

    cudaMemcpy(A,p_A,n*sizeof(float),cudaMemcpyDeviceToHost);
    cudaMemcpy(diag,p_diag,n*sizeof(float),cudaMemcpyDeviceToHost);

    printf("A[0] = %f  A[1] = %f  A[2] = %f\n",A[0],A[1],A[2]);
    printf("A[3] = %f  A[4] = %f  A[5] = %f\n",A[3],A[4],A[4]);
    printf("diag[0] = %f  diag[1] = %f\m",diag[0],diag[1]);

    cudaFree(p_A);
    cudaFree(p_diag);
    cudaFree(p_offdiag);
    free(A);
    free(diag);
    free(offdiag);

    return 0;
}

Can you provide a simple, complete reproducer? Are you doing [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on all cuda api calls and kernel calls? Have you tried running your code with `cuda-memcheck` to see if there are any access problems in the device code? — Robert Crovella, Jun 13 '13 at 04:27
You are launching the kernel with a suitably sized dynamic shared memory allocation, I presume? — talonmies, Jun 13 '13 at 06:03
@RobertCrovella I included a main function (I assume that is what you mean by reproducer?). I will try those other things tomorrow since I cannot access the computer with the gpu remotely. — HSK, Jun 13 '13 at 06:41
@talonmies Possibly not, all relevant memory allocations are included in the code I posted (which I edited to include a main function with calls to cudaMalloc) -- I just looked it up and it appears I need to change my function call to tridiag<<<1,n,num_bytes_shared>>>(...). I think you are right, I will report back tomorrow. — HSK, Jun 13 '13 at 06:44
@talonmies Thanks, that was the problem. Now I have other memory overwrite issues but I will leave that to a new thread if I don't figure it out. — HSK, Jun 13 '13 at 21:25

CUDA: sum-reduction --- data lost in call to device function

0 Answers0