can't enter into __global__ function using cuda

Question

I have written code in Nsight that compiles and runs, but the first kernel launch never completes.

The strange thing is that when I run it in debug mode, it works perfectly, but it is too slow.

Here is the host code that runs before the kernel is launched on the GPU (this is where I think there is an error that I can't find):

void parallelAction (int * dataReturned, char * data, unsigned char * descBase, int range, int cardBase, int streamIdx)
{
    size_t inputBytes = range*128*sizeof(unsigned char);
    size_t baseBytes = cardBase*128*sizeof(unsigned char);
    size_t outputBytes = range*sizeof(int);

    unsigned char * data_d;
    unsigned char * descBase_d;
    int * cardBase_d;
    int * dataReturned_d;

    cudaMalloc((void **) &data_d, inputBytes);  
    cudaMalloc((void **) &descBase_d, baseBytes);
    cudaMalloc((void **) &cardBase_d, sizeof(int));
    cudaMalloc((void **) &dataReturned_d, outputBytes);

    int blockSize = 196;
    int nBlocks = range/blockSize + (range%blockSize == 0?0:1);

    cudaMemcpy(data_d, data, inputBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(descBase_d, descBase, baseBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(cardBase_d, &cardBase, sizeof(int), cudaMemcpyHostToDevice);

    FindClosestDescriptor<<< nBlocks, blockSize >>>(dataReturned_d, data_d, descBase_d, cardBase_d);

    cudaMemcpy(dataReturned, dataReturned_d, outputBytes, cudaMemcpyDeviceToHost);

    cudaFree(data_d);
    cudaFree(descBase_d);
    cudaFree(cardBase_d);
    cudaFree(dataReturned_d);
}

And here is the kernel that runs on the GPU (I don't think the error is here):

__global__ void FindClosestDescriptor(int * dataReturned, unsigned char * data, unsigned char * base, int *cardBase)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned char descriptor1[128], descriptor2[128];
    int part = 0;
    int result = 0;
    int winner = 0;
    int minDistance = 0;
    int itelimit = *cardBase;
    for (int k = 0; k < 128; k++)
    {
        descriptor1[k] = data[idx*128+k];

    }
    // initialize minDistance
    for (int k = 0; k < 128; k++)
    {
        descriptor2[k] = base[k];
    }

    for (int k = 0; k < 128; k++)
    {
        part = (descriptor1[k]-descriptor2[k]);
        part *= part;
        minDistance += part;
    }

    // test all descriptors in the base :
    for (int i = 1; i < itelimit; i++)
    {
        result = 0;
        for (int k = 0; k < 128; k++)
        {
            descriptor2[k] = base[i*128+k];
            // Calculate squared l2 distance :
            part = (descriptor1[k]-descriptor2[k]);
            part *= part;
            result += part;
        }

        // Compare to minDistance
        if (result < minDistance)
        {
            minDistance = result;
            winner = i;
        }
    }

    // Write the result in dataReturned
    dataReturned[idx] = winner;
}

Thank you in advance if you can help me.

EDIT: The last cudaMemcpy returns the error "the launch timed out and was terminated".

@UchiaItachi it does take a void ** so you send the address of the pointer. — doctorlove, Aug 21 '13 at 12:32
No, it takes `void *` only; it is not `cardBase_d`, which is a pointer, but `cardBase`, which is an integer in the function argument. Misread it. — Uchia Itachi, Aug 21 '13 at 12:33
Yes, it is reaching it; it goes through the function without entering it, then terminates. — user2682877, Aug 21 '13 at 12:36
Probably your kernel is not executing at all, or not completing. Please add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code. After you have sorted out any reported errors, run your code with `cuda-memcheck`. — Robert Crovella, Aug 21 '13 at 12:37
Try to use [error handling routines](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html#group__CUDART__ERROR) after every CUDA API call to see what is really happening. Your block size may be too large because of the large number of registers used in the kernel. — kangshiyin, Aug 21 '13 at 12:38
One thing you can do is check the return values of `cudaMalloc` and `cudaMemcpy` and see whether they return `cudaSuccess`. — Uchia Itachi, Aug 21 '13 at 12:39
Yes, there is no statement, but in debug mode it works; I don't get why. — user2682877, Aug 21 '13 at 12:40
The last cudaMemcpy returns the error "the launch timed out and was terminated"; I'm trying to find out why. — user2682877, Aug 21 '13 at 12:58
You may be hitting a Windows TDR event, i.e. your kernel is taking too long. Under Windows, kernel execution is limited to a few seconds by default. — Robert Crovella, Aug 21 '13 at 13:17

score 0 · Accepted Answer · answered Aug 21 '13 at 13:52

Linux has a watchdog mechanism. If your kernel runs for a long time (you say it is slow in debug mode), you can hit the Linux watchdog and receive the "launch timed out and was terminated" error.

In this case you have several things you might try. The options are covered here.
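
One commonly used way to stay under a watchdog limit is to split the work into several shorter launches, so that no single kernel runs for long. Below is a minimal host-side sketch in plain C++ (it can live in a `.cu` file) of how the `range` query descriptors from the question might be partitioned. The names `Chunk`, `planChunks` and `maxPerLaunch` are hypothetical and do not come from the question or from the CUDA API; only `range` and `blockSize` mirror the posted code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one slice of the query descriptors,
// intended to be processed by a single kernel launch.
struct Chunk {
    std::size_t offset;   // index of the first descriptor in this slice
    std::size_t count;    // number of descriptors in this slice
    std::size_t nBlocks;  // thread blocks needed to cover `count` threads
};

// Split `range` descriptors into slices of at most `maxPerLaunch`
// descriptors each. The grid size of every slice is computed with a
// ceiling division, the same rounding the question does by hand.
std::vector<Chunk> planChunks(std::size_t range,
                              std::size_t maxPerLaunch,
                              std::size_t blockSize)
{
    std::vector<Chunk> chunks;
    if (maxPerLaunch == 0 || blockSize == 0) {
        return chunks;  // nothing sensible to plan
    }
    for (std::size_t offset = 0; offset < range; offset += maxPerLaunch) {
        const std::size_t count = std::min(maxPerLaunch, range - offset);
        const std::size_t nBlocks = (count + blockSize - 1) / blockSize;
        chunks.push_back({offset, count, nBlocks});
    }
    return chunks;
}
```

Each chunk `c` would then get its own launch, for example `FindClosestDescriptor<<< c.nBlocks, blockSize >>>(dataReturned_d + c.offset, data_d + c.offset*128, descBase_d, cardBase_d)`, followed by a synchronization and an error check before the next one. Because the grid is rounded up, the kernel as posted would also need a bounds check on `idx` (for instance through an extra `count` argument) so that surplus threads do not read or write past the end of the slice. A suitable `maxPerLaunch` depends on the GPU and on `cardBase`, so it has to be found by experiment.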
