This question may be useful for background.
Perhaps you don't know what a pitched allocation is.  A pitched allocation looks like this:
X  X  X  P  P  P
X  X  X  P  P  P
X  X  X  P  P  P
The above could represent storage for a 3x3 array (elements represented by X) that is pitched (pitched value of 6 elements, pitch "elements" represented by P).
You'll have no luck accessing such a storage arrangement if you don't follow the guidelines given in the reference manual for cudaMallocPitch.  In-kernel access to such a pitched allocation should be done as follows:
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
You'll note that the above formula depends on the pitch value that was provided at the point of cudaMallocPitch.  If you don't pass that value to your kernel, you won't have any luck with this.
Because you are not doing that, the proximal reason for your observation:
the code will print (1 2 3 0 0 0 0 0 0)
is because your indexing is reading just the first "row" of that pitched allocation,  and the P elements are showing up as zero (although that's not guaranteed.)
We can fix your code simply by implementing the suggestions given in the reference manual:
$ cat t2153.cu
#include <cstdio>
const size_t N = 3;
__global__ void kernal_print(double* d_A, size_t my_N, size_t pitch){
//   int xIdx = threadIdx.x + blockDim.x * blockIdx.x;
//   int yIdx = threadIdx.y + blockDim.y * blockIdx.y;
   printf("\n");
   for(int row = 0; row < my_N; row++)
     for (int col = 0; col < my_N; col++){
       double* pElement = (double *)((char*)d_A + row * pitch) + col;
       printf("%f, ",*pElement);
     }
   printf("\n");
}
void function(){
   double A[N][N];
   for (size_t row = 0; row < N; row++)
     for (size_t col = 0; col < N; col++)
       A[row][col] = row*N+col+1;
   double* d_A;
   size_t pitch;
   cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
   cudaMemcpy2D(d_A, pitch, A, N * sizeof(double) , N * sizeof(double), N, cudaMemcpyHostToDevice);
   int threadnum = 1;
   int blocknum = 1;
   kernal_print<<<blocknum, threadnum>>>(d_A, N, pitch);
   cudaDeviceSynchronize();
}
int main(){
  function();
}
$ nvcc -o t2153 t2153.cu
$ compute-sanitizer ./t2153
========= COMPUTE-SANITIZER
1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000,
========= ERROR SUMMARY: 0 errors
$
A few comments:
- The usage of the term 2D can have varied interpretations.
- Using a pitched allocation is not necessary for 2D work, and it may also have no practical value (not making your code simpler or more performant).
- For further discussion of the varied ways of doing "2D work", please read the answer I linked.
- This sort of allocation: double A[N][N];may give you trouble for largeN, because it is a stack-based allocation.  Instead, use a dynamic allocation (which may affect a number of the methods you use to handle it.)  There are various questions covering this, such as this one.