I have two versions of a kernel that perform the same task: filling a linked cell list. The difference between the two kernels is the datatype used to store the particle positions. The first uses a plain float array (4 floats per particle, padded so each particle spans 128 bits for 128-bit reads/writes), and the second uses an array of vec3f structures (a structure that holds 3 floats).
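To make the two layouts concrete, this is what each version stores per particle (host-side sketch; the pointer names are mine):

float *pos_padded;   /* version 1: x0, y0, z0, pad, x1, y1, z1, pad, ...  -> 16 bytes per particle */
vec3f *pos_struct;   /* version 2: {x0,y0,z0}, {x1,y1,z1}, ...            -> 12 bytes per particle */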
Running some tests with nvprof, I found that the second kernel (the one using vec3f) ran faster than the first:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   42.88    37.26s       2    18.63s   23.97us    37.26s  adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
   11.00     3.93s       2     1.97s   25.00us     3.93s  adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)
The tests fill the linked cell list with 256 and then 512,000 particles (hence the two calls per kernel in the profile).
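Both kernels are launched with one thread per particle; the launch looks something like this (my block size, anything works because of the idx >= nAtoms guard):

int threads = 256;
int blocks  = (nAtoms + threads - 1) / threads;
adentu_grid_cuda_filling_kernel<<<blocks, threads>>> (head, linked, cellNAtoms,
                                                      pos, nAtoms, origin, h, nCell);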
My question is: what happened here? I expected the float array to give better memory access due to coalescing, versus the vec3f structure array, which is not aligned for vectorized access. Did I misunderstand something?
These are the kernels. The first kernel:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos, 
                                                 int nAtoms, 
                                                 vec3f origin, 
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;
    vec3i cell;
    /* Load the position: three separate 32-bit reads; the 4th
       (padding) float is skipped. */
    vec3f _pos = (vec3f){pos[idx*4+0], pos[idx*4+1], pos[idx*4+2]};
    /* Cell coordinates of the particle, then the linear cell index. */
    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);
    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;
    int i;
    /* Lock-free insert into cell c's list: claim the head slot if
       empty, otherwise walk the chain until a free link is claimed. */
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}
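Note that this first kernel still issues three separate 32-bit loads, so the fourth padding float is read around rather than exploited. As I understand it, a true single 128-bit load would need the built-in float4 type, something like this sketch (the helper name is mine; it assumes pos keeps the same padded layout and is 16-byte aligned, which cudaMalloc guarantees):

__device__ vec3f load_pos_float4 (const float *pos, int idx)
{
    /* Reinterpret the padded array as float4: one 128-bit load. */
    float4 p = ((const float4 *)pos)[idx];
    return (vec3f){p.x, p.y, p.z};   /* p.w is the unused padding */
}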
And this is the second kernel:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;
    vec3i cell;
    vec3f _pos = pos[idx];   /* one struct copy: three consecutive floats */
    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);
    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;
    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}
This is the vec3f structure:
typedef struct _vec3f {float x, y, z;} vec3f;
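For reference, sizeof(vec3f) is 12 with 4-byte alignment, so even the second kernel's pos[idx] cannot become a single vector load. An explicitly aligned variant would look something like this (sketch; the vec4f name and padding member are mine):

typedef struct __align__(16) _vec4f {
    float x, y, z;
    float pad;   /* pads each element out to a full 128 bits */
} vec4f;

With __align__(16) the compiler is allowed to fetch each element in one 128-bit transaction, the same as the float4 sketch above.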