I'm running into (what I believe are) shared-memory bank conflicts in a CUDA kernel. The actual code is fairly complex, but I have reproduced the behavior in the simple example below.
The example is reduced to a copy from global -> shared -> global memory of a 2D array of size 16x16, using a shared-memory array that may be padded on the right side (variable ng).
If I compile the code with ng=0 and examine the shared-memory access pattern with NVVP, it tells me that there are "no issues". With e.g. ng=2 I get "Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1" at the lines marked "NVVP warning". I don't understand why (or, more specifically, why the padding causes the warnings).
EDIT: as mentioned by Greg Smith below, Kepler has 32 banks that are 8 bytes wide (http://gpgpu.org/wp/wp-content/uploads/2013/09/08-opti-smem-instr.pdf, slide 18), but I don't see how that changes the problem.
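For completeness, the bank mode on Kepler can be set and queried explicitly; below is a minimal sketch (my addition, assuming the standard runtime calls cudaDeviceSetSharedMemConfig / cudaDeviceGetSharedMemConfig) that could be run before the kernel launch so the profiler numbers are interpreted against a known bank width:
#include <cstdio>
#include <cuda_runtime.h>
// Request the 8-byte shared-memory bank mode on Kepler and read the setting
// back, so there is no ambiguity about which bank width applies to the kernel.
int report_bank_mode()
{
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    std::printf("shared memory bank size: %s\n",
                cfg == cudaSharedMemBankSizeEightByte ? "8 byte" :
                cfg == cudaSharedMemBankSizeFourByte  ? "4 byte" : "default");
    return 0;
}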
If I understand things correctly, with 32 banks (B1, B2, ..) of 4 bytes, doubles (D01, D02, ..) are stored as:
B1   B2   B3   B4   B5    ..   B31  B32
---------------------------------------
D01       D02       D03   ..   D16
D17       D18       D19   ..   D32
D33       D34       D35   ..   D48
Without the padding, each half warp writes (as[ijs] = in[ij]) to shared memory D01 .. D16, D17 .. D32, etc. With padding (of size 2), the first half warp writes to D01 .. D16 and the second, after the padding, to D19 .. D34, which still shouldn't cause a bank conflict?
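To check that reasoning, here is a small host-side sketch (plain C++, my addition; it just mimics the kernel's index arithmetic under the assumed 32 x 4-byte bank model) that counts how often each 4-byte bank is touched per half warp. It reports a maximum of 1 access per bank for both ng=0 and ng=2, which is why the NVVP result surprises me:
#include <cstdio>
// For the first two half warps of the 16x16 block, compute which 4-byte banks
// each thread's double falls into (a double covers two consecutive 4-byte banks)
// and report the worst-case number of accesses per bank.
int main()
{
    const int blockx = 16;
    for (int ng = 0; ng <= 2; ng += 2)
    {
        for (int half = 0; half < 2; ++half)          // half warp = one row of the tile
        {
            int hits[32] = {0};
            for (int lane = 0; lane < 16; ++lane)
            {
                const int tx   = lane;                 // threadIdx.x
                const int ty   = half;                 // threadIdx.y
                const int ijs  = tx + ty*(blockx+ng);  // same indexing as the kernel
                const int bank = (ijs * 2) % 32;       // first 4-byte bank of this double
                ++hits[bank];
                ++hits[(bank + 1) % 32];               // second 4-byte word of the double
            }
            int worst = 0;
            for (int b = 0; b < 32; ++b)
                if (hits[b] > worst) worst = hits[b];
            std::printf("ng=%d, half warp %d: max accesses per bank = %d\n",
                        ng, half, worst);
        }
    }
    return 0;
}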
Any idea what might be going wrong here?
Simplified example (tested with CUDA 6.5.14):
// Compiled with nvcc -O3 -arch=sm_35 -lineinfo
#include <cstdlib>
#include <cuda_runtime.h>
__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];
    const int ij  = threadIdx.x + threadIdx.y*blockDim.x;      // index into the global arrays
    const int ijs = threadIdx.x + threadIdx.y*(blockDim.x+ng); // index into shared memory, each row padded by ng doubles
    as[ijs] = in[ij]; // NVVP warning
    __syncthreads();
    out[ij] = as[ijs]; // NVVP warning
}
int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng = 2;
    const int ncells = itot * jtot;
    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];
    for(int n=0; n<ncells; ++n)
        in[n]  = 0.001 * (std::rand() % 1000) - 0.5;
    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));
    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);
    dim3 gridGPU (1, 1 , 1);
    dim3 blockGPU(16, 16, 1);
    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);
    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);
    return 0;
}