Why is my CUDA code not working properly for zero filling a large matrix?

Question

This is a simple CUDA program that initializes a big matrix by filling it with zeros.

I print the first 1*3 block of the matrix; if the code works, it should be all zeros.

If I set the matrix size to be small, the program works properly. But when I make the size larger (> 43200 * 2400), the matrix contains nothing but garbage.

I have already added cudaDeviceSynchronize() after each CUDA call.

I am using an NVIDIA Quadro K4200 and a Xeon E5-2630 with Ubuntu 14.04.

Thanks to anyone who can help me here.

Attached below is my full code.

#include <stdio.h>
#include <math.h>
#include <iostream>
#include <cuComplex.h>

#define BLOCK_SIZE 16 // change it to 16 to get maximum performance


// populate the matrix using first row
__global__ void RepmatKernel (cuComplex *Mat, const unsigned int N, const unsigned int Cols) 
{
    unsigned int i = (unsigned int)blockIdx.x * (unsigned int)blockDim.x + (unsigned int)threadIdx.x;
    if (i < N) 
    {
        Mat[i].x = 0;
        Mat[i].y = 0;
    }
}

// main routine
int main ()
{

  const unsigned int Rows = 43200;
  const unsigned int Cols = 2400;

  const unsigned int Num_thrd = 256; // max threads per block 

  unsigned int Mat_size = Rows * Cols; // size of array

  cuComplex *vec; // supposedly the input

  cuComplex *mat_debug; // for debug

  vec = new cuComplex [Cols];

  mat_debug = new cuComplex [Rows*Cols];

  cuComplex *mat_in_d;  // device array

  //input in host array
  for(unsigned int i = 0; i < Cols; i++)
  {
      vec[i].x = 3*i+4;
      vec[i].y = 0.2*i+1;
  }

  const unsigned int size_mat_d =    Rows * Cols * sizeof(cuComplex); 

  //create device array cudaMalloc ( (void **)&array_name, sizeofmatrixinbytes) ;
  if (cudaMalloc((void **) &mat_in_d ,  size_mat_d) != cudaSuccess) std::cout<<"Error allocating GPU";
  cudaDeviceSynchronize() ;

  //copy host array to device array; cudaMemcpy ( dest , source , WIDTH , direction )
  cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice ) ;
  cudaDeviceSynchronize() ;


// ========================================================================
  cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
  cudaDeviceSynchronize() ;

  std::cout<<"before repmat="<<std::endl;
  std::cout<<"[";
  for(unsigned int i = 0; i < 3; i++)
  {
    std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i,  ";
    std::cout<<";"<<std::endl;
  }
  std::cout<<"]"<<std::endl;
// ==========================================================================

  RepmatKernel<<<(unsigned int)ceil((float)(Mat_size)/(float)(Num_thrd)),
               (Num_thrd)>>>(mat_in_d,
                     Mat_size,
                     Cols);
  cudaDeviceSynchronize();

// ========================================================================
  cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
  cudaDeviceSynchronize() ;

  std::cout<<"after repmat="<<std::endl;
  std::cout<<"[";
  for(unsigned int i = 0; i < 3; i++)
  {

    std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i,  ";
    std::cout<<";"<<std::endl;
  }
  std::cout<<"]"<<std::endl;
// ==========================================================================



  cudaFree(mat_in_d);


  delete [] vec; 

  delete [] mat_debug;

  return 0;
}

Do all of memcpy's and kernel launches return successful status? — void_ptr, Jan 08 '15 at 03:02
How can I explicitly check this? Something like cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess ? I will try and post the results. Thanks! — Tony Mao, Jan 08 '15 at 03:05
You can wrap them all with `checkCudaErrors()` if you include `helper_cuda.h`. — mty, Jan 08 '15 at 03:10
Yes, you may want to familiarize yourself with a concept of runtime error checking. Doing this is always a good first step in answering the question "why does not this code work as expected?" — void_ptr, Jan 08 '15 at 03:11
Take a look at [proper cuda error checking](http://stackoverflow.com/questions/14038589). You can also run your code with `cuda-memcheck` to get a quick read on any errors. Also, be sure you are compiling with a proper arch switch for your GPU, such as `nvcc -arch=sm_30 ...` If you don't, `nvcc` will compile for some lower architecture, and your kernel will not launch because the first launch config parameter (`ceil((float)(Mat_size)/(float)(Num_thrd))`) is large enough to require a cc3.0 compilation target. It will fail to launch if you compile for a lower target, with your 43200x2400 size. – Robert Crovella, Jan 08 '15 at 03:25
@RobertCrovella: Is there any danger in mixing `new` with `cudaMalloc`? Gives me the willies for some reason. . . – user14717, Jan 08 '15 at 03:55
@RobertCrovella I added `-m64 -gencode arch=compute_30,code=sm_30`, but now I get a runtime error saying that GPU memory cannot be allocated for a matrix of that size. – Tony Mao, Jan 08 '15 at 05:21
Not sure it can be sorted out in comments. First you say it worked when you add `-arch=sm_30` now you say it doesn't. Perhaps you should edit your question to show the code you are actually running now that produces the error, and show the exact error output text that the program produces, and exact compile command. You can edit your question with all this. I took your original code and was able to compile and run successfully just by adding `-arch=sm_30`. There is still the error that @NickThompson pointed out in your `cudaMemcpy` operation, but that doesn't prevent the kernel from running. — Robert Crovella, Jan 08 '15 at 06:18
I got my code working now. The problem was actually not in my code but in my environment. I had a badly installed driver, which prevented me from allocating memory in CUDA. I assume the code worked for small matrices because there was still enough free space on the GPU for them. – Tony Mao, Jan 08 '15 at 20:08
I basically purged CUDA using `sudo apt-get --purge remove nvidia-*`, then reinstalled everything, including the driver. – Tony Mao, Jan 08 '15 at 20:09

score 1 · Answer 1 · edited May 23 '17 at 10:26

Your check on cudaMalloc reports that there is a problem, but doesn't actually terminate the computation. You should write something like

if (cudaMalloc((void **) &mat_in_d ,  size_mat_d) != cudaSuccess) 
{
    std::cout<<"Error allocating GPU\n";
    return 1;
}

so that the computation actually stops when you overflow the memory, rather than attempt to work anyway with only a warning to std::cout. Even better would be to use an error handling macro.
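
One way such an error-handling macro could look is sketched here. The names `CUDA_OK`, `cudaOkImpl`, `zeroKernel`, and `checkedZeroFill` are made up for illustration and are not part of the CUDA toolkit; only the `cuda*` calls are real runtime API functions. The sketch returns an error code instead of continuing, in the same spirit as the `return 1` above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA toolkit): prints a diagnostic
// and returns false when a runtime call did not return cudaSuccess.
inline bool cudaOkImpl(cudaError_t err, const char *expr,
                       const char *file, int line)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                 file, line, expr, cudaGetErrorString(err));
    return false;
}

// Hypothetical macro name; wraps any expression of type cudaError_t.
#define CUDA_OK(call) cudaOkImpl((call), #call, __FILE__, __LINE__)

// Sets the first n floats of data to zero, one element per thread.
__global__ void zeroKernel(float *data, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.0f;
}

// Allocates n floats (n > 0), zeroes them with a kernel, and frees them.
// Returns 0 on success and 1 if any CUDA call reports an error.
int checkedZeroFill(unsigned int n)
{
    float *d = 0;
    if (!CUDA_OK(cudaMalloc((void **)&d, n * sizeof(float)))) return 1;

    zeroKernel<<<(n + 255) / 256, 256>>>(d, n);
    bool ok = CUDA_OK(cudaGetLastError())        // e.g. invalid launch configuration
           && CUDA_OK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    cudaFree(d);
    return ok ? 0 : 1;
}
```

A kernel launch returns no status itself, which is why the sketch checks `cudaGetLastError()` right after the launch and then the result of `cudaDeviceSynchronize()`.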

Another problem is here:

cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice );

First, mat_in_d is size Rows * Cols * sizeof(cuComplex), but you are only copying Cols bytes into it. Even if you only wanted to copy vec into the first part of the mat_in_d vector, you'd need to change this to

cudaMemcpy ( mat_in_d , vec , Cols*sizeof(cuComplex) , cudaMemcpyHostToDevice );

At this point, you'd expect the first Cols entries of your matrix to be reasonable, and the rest to be garbage. (Making the suggested change shows that this is indeed the case; why you would want to do this is a better question.)

Next comes your kernel call, whose entire goal is to set the entries of Mat to zero. This should be done with cudaMemset, i.e., just use

cudaMemset(mat_in_d, 0, Mat_size*sizeof(cuComplex));

We could look more carefully at the execution configuration to see what went wrong with your kernel call, but for now this fixes your problem.
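
As for the execution configuration itself, the launch requests one block per 256 elements in a one-dimensional grid. Devices of compute capability 2.x and earlier accept at most 65,535 blocks in the grid's x-dimension, while compute capability 3.0 and later accept up to 2^31 - 1, which is consistent with the comment that this launch needs a cc 3.0 compilation target. The host-side sketch below only reproduces that arithmetic; the function and constant names are illustrative, and the limits are hard-coded rather than queried from the device.

```cuda
#include <cstdio>

// Illustrative constants (hard-coded, not queried from the device):
// the largest gridDim.x a launch may request.
const unsigned long long kMaxGridXBeforeCc30 = 65535ULL;       // compute capability 2.x and earlier
const unsigned long long kMaxGridXFromCc30   = 2147483647ULL;  // compute capability 3.0 and later (2^31 - 1)

// Blocks needed to cover n elements: integer ceiling division.
unsigned long long blocksNeeded(unsigned long long n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// True if a one-dimensional grid of `blocks` blocks respects `limit`.
bool fitsInGridX(unsigned long long blocks, unsigned long long limit)
{
    return blocks <= limit;
}

// Prints the launch arithmetic for a rows x cols matrix.
void reportLaunch(unsigned int rows, unsigned int cols, unsigned int threadsPerBlock)
{
    const unsigned long long n = (unsigned long long)rows * cols;
    const unsigned long long blocks = blocksNeeded(n, threadsPerBlock);
    std::printf("%llu elements / %u threads = %llu blocks\n",
                n, threadsPerBlock, blocks);
    std::printf("fits before cc 3.0: %s, fits from cc 3.0: %s\n",
                fitsInGridX(blocks, kMaxGridXBeforeCc30) ? "yes" : "no",
                fitsInGridX(blocks, kMaxGridXFromCc30) ? "yes" : "no");
}
```

For the question's sizes, `reportLaunch(43200, 2400, 256)` works out to 103,680,000 elements and 405,000 blocks, which exceeds 65,535 but is well within 2^31 - 1. Integer ceiling division is used here instead of `ceil` on `float` values because `float` cannot represent every integer above 2^24 exactly.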

Thanks for your help! I fixed these problems. Currently the program can not allocate space for that big matrix. cudaMalloc throws error. I used -arch=sm_30 but that doesn't help. — Tony Mao, Jan 08 '15 at 05:41
I recommend querying how much RAM is on your card using `cudaGetDeviceProperties` before calling `cudaMalloc`. Your card probably just doesn't have enough RAM. — user14717, Jan 08 '15 at 13:06
To check errors on the malloc you should get the error string for more information. Try this instead of your malloc... cudaError_t err = cudaMalloc((void **) &mat_in_d , size_mat_d); if(err != cudaSuccess) std::cout<<"Error allocating GPU ::"< — Christian Sarofeen, Jan 08 '15 at 18:47

score 0 · Answer 2 · mty · 2015-01-08T03:14:48.960

For debugging CUDA errors, I find a header from the samples, helper_cuda.h, quite convenient. I almost always include this header, which is located in the common directory of the samples, in my projects.

Then, wrapping all CUDA calls with checkCudaErrors(), like checkCudaErrors(cudaMalloc((void **) &mat_in_d , size_mat_d)); gives explicit error messages.

In my case, since just mat_in_d is close to 1 GB and my GPU's memory is only 512 MB, it failed for sure and threw cudaErrorMemoryAllocation. However, an NVIDIA Quadro K4200 should not fail that easily!

Did you check the actual available memory information using cudaMemGetInfo ?
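
A minimal sketch of such a check, assuming the question's dimensions (43200 x 2400 elements of 8 bytes each, roughly 791 MiB): it asks the runtime how much device memory is free at that moment and compares it with the request. `requiredBytes` and `fitsInFreeMemory` are illustrative names; `cudaMemGetInfo` and `cudaGetErrorString` are the real runtime calls.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Bytes needed for a rows x cols matrix of cuComplex. The product is
// computed in size_t, which is 64 bits wide on a 64-bit build.
size_t requiredBytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(cuComplex);
}

// Returns true if the request is no larger than the device memory that is
// free right now; returns false if it is larger or if the query fails.
bool fitsInFreeMemory(size_t bytes)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }
    std::printf("free: %zu MiB, total: %zu MiB, requested: %zu MiB\n",
                freeBytes >> 20, totalBytes >> 20, bytes >> 20);
    return bytes <= freeBytes;
}
```

Calling `fitsInFreeMemory(requiredBytes(43200, 2400))` just before `cudaMalloc` gives a quick sanity check. A positive answer is not a guarantee, since the allocation can still fail (for example because of fragmentation), so the return value of `cudaMalloc` should be checked as well.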

edited Jan 08 '15 at 03:14

answered Jan 08 '15 at 03:09


Thanks a lot for the answers! I am trying to include helper_cuda.h, but it looks like this file cannot be found. How should I install it? Thanks again! – Tony Mao Jan 08 '15 at 03:58
it should come with your cuda toolkit installation (samples), what is your operating system? – mty Jan 08 '15 at 04:39
It is Ubuntu 14.04. I am following http://www.r-tutor.com/gpu-computing/cuda-installation/cuda6.5-ubuntu with `export CUDA_HOME=/usr/local/cuda-6.5 export LD_LIBRARY_PATH=${CUDA_HOME}/lib64 PATH=${CUDA_HOME}/bin:${PATH} export PATH` and it is being installed now. Hopefully I can include the header directly. I will post here if I have further problems. Thanks! – Tony Mao Jan 08 '15 at 04:45
Alright, in your compilation step, include the flag `-I/usr/local/cuda/samples/common/inc` or `-I/usr/local/cuda-6.5/samples/common/inc` and in your `.cu` file, add the line `#include `. – mty Jan 08 '15 at 04:52
So, if your source file is named `test.cu`, you can compile it with: `/usr/local/cuda/bin/nvcc -m64 -gencode arch=compute_30,code=sm_30 -I/usr/local/cuda/samples/common/inc test.cu` – mty Jan 08 '15 at 05:00
I compiled it your way, with `/usr/local/cuda/bin/nvcc -m64 -gencode arch=compute_30,code=sm_30 ..`, but I still cannot allocate that big matrix. – Tony Mao Jan 08 '15 at 05:26
Yes, it probably still won't. But if you wrap all your CUDA calls with `checkCudaErrors()` like explained above, you will get meaningful error messages about the location and the nature of your problem. – mty Jan 08 '15 at 05:31
But I got many errors like cudaError_t err = cudalastError(), cudalastError was not declared in this scope. – Tony Mao Jan 08 '15 at 05:38
I updated my driver and the error becomes: could not insert 'nvidia_311_uvm': Invalid argument – Tony Mao Jan 08 '15 at 06:18
The above error was due to an improper installation of my GPU driver. I purged the CUDA packages using `sudo apt-get --purge remove nvidia-*`, reinstalled everything including the driver, and it works fine now. – Tony Mao Jan 08 '15 at 20:11
