Unspecified launch failure on Memcpy

Question

I'm encountering an "unspecified launch failure" when running my program in Cuda . I've checked the errors .

The program is a solver of a differential equation . It iterates TOTAL_ITER times . ROOM_X ans ROOM_Y are the width and height of the matrices .

Here is the header, its name is "sole :

#define ITER_BETWEEN_SAVES 10000
#define TOTAL_ITER 10000
#define ROOM_X 2048
#define ROOM_Y 2048
#define SOURCE_DIM_X 200
#define SOURCE_DIM_Y 1000
#define ALPHA 1.11e-4
#define DELTA_T 10
#define H 0.1
#include <stdio.h>

void Matrix(float* M);
void SolverCPU(float* M1, float* M2);
__global__ void SolverGPU(float* M1, float* M2);

Here is the kernel and a function that fill a matrix :

#include "solver.h"
#include<cuda.h>

void Matrix(float* M)
{
  for (int j = 0; j < SOURCE_DIM_Y; ++j) {
    for (int i = 0; i <  SOURCE_DIM_X; ++i) {
    M[(i+(ROOM_X/2 - SOURCE_DIM_X/2)) + ROOM_X * (j+(ROOM_Y/2 - SOURCE_DIM_Y/2))] = 100;
    }
  }
}

    __global__ void SolverGPU(float* M1,float *M2)  {
   int i =threadIdx.x + blockIdx.x * blockDim.x;
       int j = threadIdx.y + blockIdx.y * blockDim.y;

        float M1_Index = M1[i + ROOM_X * j];
        float M1_IndexUp = M1[i+1 + ROOM_X * j];
        float M1_IndexDown =M1[i-1 + ROOM_X * j];
        float M1_IndexLeft = M1[i + ROOM_X * (j+1)];
        float M1_IndexRight = M1[i + ROOM_X *(j-1)];


        M2[i + ROOM_X * j] = M1_Index + (ALPHA * DELTA_T / (H*H)) * (M1_IndexUp + M1_IndexDown + M1_IndexLeft +M1_IndexRight - 4*M1_Index);     

}

And here is the main

int main(int argc, char* argv[] ){

    float *M1_h, *M1_d,*M2_h, *M2_d;
    int size = ROOM_X * ROOM_Y * sizeof(float);
    cudaError_t err = cudaSuccess;  

    //Allocating Memories on Host
    M1_h = (float *)malloc(size);
    M2_h = (float *)malloc(size);

    //Allocating Memories on Host
    err=cudaMalloc((void**)&M1_d, size);
    if (err != cudaSuccess) { 
        fprintf(stderr, "Failed to allocate array_d ... %s .\n", cudaGetErrorString(err)); 
        exit(EXIT_FAILURE); 
    }

    err=cudaMalloc((void**)&M2_d, size);    
    if (err != cudaSuccess) { 
        fprintf(stderr, "Failed to allocate array_d ... %s .\n", cudaGetErrorString(err)); 
        exit(EXIT_FAILURE); 
    }

    //Filling the Matrix
    Matrix(M1_h);


    //Copy on Device

    err = cudaMemcpy(M1_d, M1_h, size, cudaMemcpyHostToDevice);
    if(err !=0){
        printf("%s-%d\n",cudaGetErrorString(err),1);
        getchar();  
    }

    err=cudaMemcpy(M2_d, M2_h, size, cudaMemcpyHostToDevice);
    if(err !=0){
        printf("%s-%d",cudaGetErrorString(err),2);
        getchar();  
    }

    dim3 dimGrid(64,64);
    dim3 dimBlock(32,32);


    //SolverGPU<< <threadsPerBlock, numBlocks >> >(M1_d,M2_d);
    for(int i=0;i<TOTAL_ITER;i++) { 
    if (i%2==0) 
    SolverGPU<< <dimGrid,dimBlock >> >(M1_d,M2_d);
    else
    SolverGPU<< <dimGrid,dimBlock >> >(M2_d,M1_d);
    }   

    err=cudaMemcpy(M1_h, M1_d, size, cudaMemcpyDeviceToHost);
    if(err !=0){
        printf("%s-%d",cudaGetErrorString(err),3);
        getchar();  
    }   

    cudaFree(M1_d);
    cudaFree(M2_d);

    free(M1_h);
    free(M2_h);
    return 0;   

}

There's no problem at compilation .

Whne I check my errors, the "unspecified launch failure" appears on the memcpy AFTER the kernel .

Ok, so I've read that it's usually due to the kernel which doesn't run properly . But I can't find the error (s) in the kernel ... I guess that's the error is quite simple , but can't figure to find it .

A "why doesn't this work" question with uncompilable code is a complete waste of everyone's time. Vote to close. — talonmies, Dec 03 '14 at 17:00
thanks for your kind response ! :) Indeed I forgot to put a function to remove some lines of code . Really sorry for that . I've edited my post . I hope that it will compile now . — Ptit Sualty, Dec 03 '14 at 17:11
How would anyone be able to compile it without knowing what `ROOM_X` is, for example? [SO expects](http://stackoverflow.com/help/on-topic) an [MCVE](http://stackoverflow.com/help/mcve). It should be a *complete code*. If you want to test if you've written the question correctly, start a new empty project, and copy the code out of the question you have posted, without adding anything or changing anything (that is, after all, what others are going to do). Then see if you can compile it. If you cannot, your question/MCVE is incomplete. — Robert Crovella, Dec 03 '14 at 17:16
I did it . Thanks for the explanation . I guess I could have found it by myself, I suppose that I'm a bit tired .x) Now this code should compile . — Ptit Sualty, Dec 03 '14 at 17:29

Robert Crovella · Accepted Answer · 2022-03-11T15:06:02.547

When I compile and run your code, I get:

an illegal memory access was encountered-3

printed out.

You may indeed be getting "unspecified launch failure" instead. The exact error reporting will depend on CUDA version, GPU, and platform. But we can proceed forward regardless.

Either message indicates that the kernel launched but encountered an error, and so failed to complete successfully. You can debug kernel execution problems using a debugger, such as cuda-gdb on linux, or Nsight VSE on windows. But we don't need to pull out the debugger just yet.

A useful tool is cuda-memcheck. (On newer GPUs, e.g. cc7.0 or newer, you should use compute-sanitizer instead of cuda-memcheck, but otherwise the process here is identical.) If we run your program with cuda-memcheck, we get some additional output that indicates that the kernel is doing invalid global reads of size 4. This means that you are making an out-of-bounds memory access. We can get additional clarity if we recompile your code adding the -lineinfo switch (or alternatively with -G), and then re-run your code with cuda-memcheck. Now we get output that looks like this:

$ nvcc -arch=sm_20 -lineinfo -o t615 t615.cu
$ cuda-memcheck ./t615 |more
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000070 in /home/bob/misc/t615.cu:34:SolverGPU(float*, float*)
=========     by thread (31,0,0) in block (3,0,0)
=========     Address 0x4024fe1fc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x150a7d]
=========     Host Frame:./t615 [0x11ef8]
=========     Host Frame:./t615 [0x3b143]
=========     Host Frame:./t615 [0x297d]
=========     Host Frame:./t615 (__gxx_personality_v0 + 0x378) [0x26a0]
=========     Host Frame:./t615 (__gxx_personality_v0 + 0x397) [0x26bf]
=========     Host Frame:./t615 [0x2889]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf4) [0x1d994]
=========     Host Frame:./t615 (__gxx_personality_v0 + 0x111) [0x2439]
=========
--More--

(and there is much more error output)

This means that the very first error encountered by your kernel was an invalid global read of size 4 (i.e. an out of bounds access trying to read an int or float quantity, for example). With the lineinfo information, we can see that this occurred:

=========     at 0x00000070 in /home/bob/misc/t615.cu:34:SolverGPU(float*, float*)

i.e. at line 34 in the file. This line happens to be this line of kernel code:

    float M1_IndexRight = M1[i + ROOM_X *(j-1)];

we could debug further, perhaps using in-kernel printf statements to discover where the problem is. But we already have a clue that we are indexing out-of-bounds, so let's inspect the indexing:

  i + ROOM_X *(j-1)

what does this evaluate to when i=0 and j=0 (ie. for thread (0,0) in your 2D thread array)? It evaluates to -2048 (i.e. -ROOM_X) which is an illegal index. Trying to read from M1[-2048] will create a fault.

You've got lots of complicated indexing going on in your kernel, so I'm pretty sure there are other errors as well. You can use a similar method to track those down (perhaps using printf to spit out the computed indexes, or else testing the indexes for validity).

Although the above description uses cuda-memcheck, the compute-sanitizer tool works similarly, and is the recommended one at the time of this edit.

For another example of how to use this method to narrow down the source of a problem, see here.

Unspecified launch failure on Memcpy

1 Answers1

Linked