As Robert pointed out in the comments, this is a common question which comes up regularly, and my rather old answer covers most of the important points, although it probably isn't the canonical example we should have.
The really short answer is that you need to build the array of device pointers in host memory first, and then copy that array to the device. Turning your code into a trivial example which allocates memory on the device gives you something like this:
class Image{
    public:
    int rows;
    int cols;
    int** pixels; // array of row pointers
    __host__ __device__
    Image() {};
    __host__ __device__
    Image(int r, int c, int** p) : rows(r), cols(c), pixels(p) {};
};
__global__ void intialiseImage(Image image, const int p_val)
{
    int i = blockIdx.y*blockDim.y + threadIdx.y;
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < image.rows && j < image.cols) {
        image.pixels[i][j] = p_val;
    }
}
int** makeDeviceImage(Image& dev_image, const int rows, const int cols)
{
    // Allocate each row separately on the device, keeping the resulting
    // device pointers in an array in host memory
    int** h_pixels = new int*[rows];
    for(int i=0; i<rows; i++) {
        cudaMalloc((void **)&h_pixels[i], sizeof(int) * size_t(cols));
    }

    // Allocate the array of row pointers on the device and copy the
    // host copy of those pointers into it
    int** d_pixels;
    cudaMalloc((void**)&d_pixels, sizeof(int*) * size_t(rows));
    cudaMemcpy(d_pixels, &h_pixels[0], sizeof(int*) * size_t(rows), cudaMemcpyHostToDevice);

    dev_image = Image(rows, cols, d_pixels);

    // The host copy of the row pointers is needed for later copies
    // and frees, so return it to the caller
    return h_pixels;
}
int main(void)
{
    int rows = 16, cols = 32;
    Image dev_image;
    int** dev_pixels = makeDeviceImage(dev_image, rows, cols);
    // Launch a 2D grid which covers the whole image, matching the
    // 2D indexing used inside the kernel
    dim3 block(32, 32);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    intialiseImage<<<grid, block>>>(dev_image, 128);
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
I will leave the copy code as an exercise for the reader (hint: the array of pointers the function returns is extremely useful there; a sketch is included at the end of this answer), but there is one comment worth making. Have a look at the profiler output for that code:
>nvprof a.exe
==5148== NVPROF is profiling process 5148, command: a.exe
==5148== Profiling application: a.exe
==5148== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 75.82%  2.2070us         1  2.2070us  2.2070us  2.2070us  intialiseImage(Image, int)
 24.18%     704ns         1     704ns     704ns     704ns  [CUDA memcpy HtoD]
==5148== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.33%  309.01ms        17  18.177ms  20.099us  308.62ms  cudaMalloc
  0.50%  1.5438ms        83  18.599us     427ns  732.97us  cuDeviceGetAttribute
  0.07%  202.70us         1  202.70us  202.70us  202.70us  cuDeviceGetName
  0.04%  136.84us         1  136.84us  136.84us  136.84us  cudaDeviceSynchronize
  0.03%  92.370us         1  92.370us  92.370us  92.370us  cudaMemcpy
  0.02%  76.974us         1  76.974us  76.974us  76.974us  cudaLaunch
  0.01%  24.375us         1  24.375us  24.375us  24.375us  cuDeviceTotalMem
  0.00%  5.5580us         2  2.7790us  2.5650us  2.9930us  cuDeviceGetCount
  0.00%  4.2760us         1  4.2760us  4.2760us  4.2760us  cudaConfigureCall
  0.00%  3.4220us         2  1.7110us     856ns  2.5660us  cudaSetupArgument
  0.00%  2.5660us         2  1.2830us  1.2830us  1.2830us  cuDeviceGet
On the platform I tested (Windows 8, a mobile-class Fermi GPU), the kernel which writes a value into the image takes about 2us. A cudaMalloc call takes at least 20us, and it takes 17 of them to allocate even this trivially small image. The overhead of working with arrays of pointers in CUDA is significant, and I wouldn't recommend it if performance is your first priority.
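For completeness, the device-to-host copy mentioned above might look something like this minimal sketch, assuming the pointer array returned by makeDeviceImage (dev_pixels in main) is still in scope; the h_image buffer is purely illustrative and not part of the code above:
// Copy the image back row by row. dev_pixels[i] is a device pointer to
// row i which was kept in host memory, so it can be passed to cudaMemcpy.
// h_image is an illustrative host-side destination buffer.
int* h_image = new int[rows * cols];
for(int i=0; i<rows; i++) {
    cudaMemcpy(&h_image[i * cols], dev_pixels[i], sizeof(int) * size_t(cols), cudaMemcpyDeviceToHost);
}
Note that each row needs its own cudaMemcpy call because each row is a separate device allocation, which only adds to the API overhead shown in the profile above.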