Talonmies has already satisfactorily answered this question. Here is some further explanation that could be useful to the community.
When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of floating point (single precision) elements, the syntax for cudaMallocPitch is the following:
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
where
devPtr is an output pointer to float (float *devPtr).
devPitch is a size_t output variable denoting the length, in bytes, of the padded row.
Nrows and Ncols are size_t input variables representing the matrix size.
Recalling that C/C++ and CUDA store 2D matrices in row-major order, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this cost is repaid by more efficient memory accesses.
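To see the padding explicitly, here is a minimal sketch (the literals mirror the Nrows = 3, Ncols = 5 case used throughout this answer) that allocates a pitched array and prints the returned pitch; the exact pitch is device dependent, often 512 bytes on recent GPUs even though each row only holds 20 bytes of data:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    float *devPtr;
    size_t pitch;

    // Request 3 rows of 5 floats each; cudaMallocPitch pads each row to the pitch
    cudaMallocPitch((void**)&devPtr, &pitch, 5 * sizeof(float), 3);

    // Only the first 5 * sizeof(float) = 20 bytes of each padded row hold matrix data
    printf("Row data: %zu bytes; pitch: %zu bytes; total allocation: %zu bytes\n",
        5 * sizeof(float), pitch, 3 * pitch);

    cudaFree(devPtr);
    return 0;
}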
CUDA also provides the cudaMemcpy2D function to copy data between the host memory space and a device memory space allocated with cudaMallocPitch. Under the above hypotheses (single precision 2D matrix), the syntax is the following:
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);
where
devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
Nrows and Ncols are size_t input variables representing the matrix size.
Note that cudaMemcpy2D also allows for pitched memory allocation on the host side. If the host memory has no pitch, then hostPitch = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional. In the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost);
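As a further illustration, here is a minimal fragment sketching the round trip through a flat (unpitched) host buffer; devPtr and devPitch are assumed to come from cudaMallocPitch as above, and h_data is a hypothetical linear array of Nrows * Ncols floats, so its pitch is simply the row size in bytes:

// Flat host buffer holding Nrows * Ncols contiguous floats (no padding)
float *h_data = (float *)malloc(Nrows * Ncols * sizeof(float));

// Host -> device: the source pitch is the unpadded row size Ncols * sizeof(float)
cudaMemcpy2D(devPtr, devPitch, h_data, Ncols * sizeof(float), Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);

// Device -> host: the destination pitch is the unpadded row size
cudaMemcpy2D(h_data, Ncols * sizeof(float), devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost);

free(h_data);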
Elements of a 2D matrix allocated by cudaMallocPitch can be accessed as in the following example:
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;

if ((tidx < Ncols) && (tidy < Nrows))
{
    float *row_a = (float *)((char *)devPtr + tidy * pitch);
    row_a[tidx] = row_a[tidx] * tidx * tidy;
}
In this example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor coalescing). The pointer to the first element of a row is computed by offsetting the initial pointer devPtr by tidy * pitch bytes (char * is a pointer to bytes and sizeof(char) is 1 byte), where pitch is the padded row length in bytes returned by cudaMallocPitch.
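If this pointer arithmetic is needed in several kernels, it can be wrapped in a small __device__ helper; the name pitched_element below is an illustrative choice of mine, not part of the CUDA API:

// Returns a reference to element (row, col) of a pitched 2D array of floats
__device__ __forceinline__ float& pitched_element(float *base, size_t pitch, int row, int col)
{
    // Offset base by row * pitch bytes, then index the column within the row
    return ((float *)((char *)base + row * pitch))[col];
}

With this helper, the assignment in the snippet above would read pitched_element(devPtr, pitch, tidy, tidx) *= tidx * tidy;.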
Below, I provide a fully worked example to illustrate these concepts.
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#define BLOCKSIZE_x 16
#define BLOCKSIZE_y 16
#define Nrows 3
#define Ncols 5
/*****************/
/* CUDA MEMCHECK */
/*****************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
/*******************/
/* iDivUp FUNCTION */
/*******************/
// Round-up integer division, used to compute the grid size
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/******************/
/* TEST KERNEL 2D */
/******************/
__global__ void test_kernel_2D(float *devPtr, size_t pitch)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;

    if ((tidx < Ncols) && (tidy < Nrows))
    {
        // Pointer to the first element of row tidy, obtained via a byte offset
        float *row_a = (float *)((char *)devPtr + tidy * pitch);
        row_a[tidx] = row_a[tidx] * tidx * tidy;
    }
}
/********/
/* MAIN */
/********/
int main()
{
    float hostPtr[Nrows][Ncols];
    float *devPtr;
    size_t pitch;

    // --- Initialize the host matrix
    for (int i = 0; i < Nrows; i++)
        for (int j = 0; j < Ncols; j++) {
            hostPtr[i][j] = 1.f;
            //printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
        }
    // --- 2D pitched allocation and host->device memcopy
    gpuErrchk(cudaMallocPitch((void**)&devPtr, &pitch, Ncols * sizeof(float), Nrows));
    gpuErrchk(cudaMemcpy2D(devPtr, pitch, hostPtr, Ncols * sizeof(float), Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice));

    dim3 gridSize(iDivUp(Ncols, BLOCKSIZE_x), iDivUp(Nrows, BLOCKSIZE_y));
    dim3 blockSize(BLOCKSIZE_x, BLOCKSIZE_y);

    test_kernel_2D<<<gridSize, blockSize>>>(devPtr, pitch);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    // --- Device->host memcopy of the result
    gpuErrchk(cudaMemcpy2D(hostPtr, Ncols * sizeof(float), devPtr, pitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost));
    for (int i = 0; i < Nrows; i++)
        for (int j = 0; j < Ncols; j++)
            printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);

    gpuErrchk(cudaFree(devPtr));

    return 0;
}
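The example can be compiled with, e.g.,

nvcc -o pitched_memory_example pitched_memory_example.cu

(the file name is my own choice). Since the matrix is initialized with ones and each element is multiplied by tidx * tidy, the final printout should show hostPtr[i][j] equal to i * j, so the first row and the first column remain zero.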