I got this class,
Approach 1:
typedef float v4sf __attribute__ (vector_size(16))
class Unit
{
    public:
    Unit(int num)
    {
        u = new float[num];
        v = new float[num];
    }
    void update()
    {
        for(int i =0 ; i < num; i+=4)
        {
            *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
            //many other equations
        }
    }
    float*u,*v,*t; //and many other variables
}
Approach 2:
Same as approach 1. Except that in approach 2, v,u, and all other variables are allocated on a big chunk pre-allocated on heap, using placement new.
typedef float v4sf __attribute__ (vector_size(16))
class Unit
{
    public:
    Unit(int num)
    {
        buffer = new char[num*sizeof(*u) + sizeof(*v)  /*..and so on for other variables..*/]
        u = new(buffer) float[num];
        v = new(buffer+sizeof(float)*num) float[num];
        //And so on for other variables
    }
    void update()
    {
        for(int i =0 ; i < num; i+=4)
        {
            *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
            //many other equations
        }
    }
    char* buffer;
    float*u,*v,*t; //and many other variables
}
However, approach 2 is 2x faster. Why is that?
There are around 12 float variables and num is 500K. update() is called 1k times. The speed doesnt factor in the memory allocation. I measure the speed like this:
double start = getTime();
for( int i = 0; i < 1000; i++)
{
   unit->update();
}
double end = getTime();
cout<<end - start;
And this is around 2x faster in approach 2.
Compiler options: gcc -msse4 -o3 -ftree-vectorize.
L1 cache is 256K, Ram is 8GB, pagesize is 4K.
Edit: Corrected the mistake in allocating the variables in approach 2. All variables are allocated in different sections, correctly. Processor is Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Edit: added the source here - Source. Approach 1) gives 69.58s , Approach 2) gives 46.74s. Though not 2x faster, it is still fast.
 
     
     
     
    