Currently I have this function to swap the bytes of a value in order to change its endianness:
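// Half = number of byte pairs to exchange, End = index of the last byte.
template<typename Type, unsigned int Half = sizeof(Type)/2, unsigned int End = sizeof(Type)-1>
inline void swapBytes(Type& x)
{
    // View the object as raw bytes and reverse them in place.
    char* c = reinterpret_cast<char*>(&x);
    char tmp;
    for (unsigned int i = 0; i < Half; ++i) {
        // Exchange byte i with its mirror image at End-i.
        tmp = c[i];
        c[i] = c[End-i];
        c[End-i] = tmp;
    }
}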
This function will be called several million times by some of my algorithms, so every instruction that can be avoided counts.
My question is: how can this function be optimized?
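To make the expected behaviour concrete, here is a minimal sketch of the result that any optimized version has to preserve. The helper names `reversed32` and `checkSwapBytes` and the test values are hypothetical, chosen only for illustration. `swapBytes` is repeated unchanged so that the snippet compiles on its own (C++11 or later, because of the default template arguments).

```cpp
#include <cassert>
#include <cstdint>

// Same function as above, repeated so this snippet is self-contained.
template<typename Type, unsigned int Half = sizeof(Type)/2, unsigned int End = sizeof(Type)-1>
inline void swapBytes(Type& x)
{
    char* c = reinterpret_cast<char*>(&x);
    char tmp;
    for (unsigned int i = 0; i < Half; ++i) {
        tmp = c[i];
        c[i] = c[End-i];
        c[End-i] = tmp;
    }
}

// Hypothetical helper (illustration only): returns a byte-reversed copy
// of a 32-bit value, leaving the argument untouched.
inline std::uint32_t reversed32(std::uint32_t value)
{
    swapBytes(value);
    return value;
}

// Hypothetical self-check (illustration only): the properties an
// optimized replacement would have to keep.
inline void checkSwapBytes()
{
    // Reversing the bytes of 0x11223344 gives 0x44332211 on any host,
    // because the byte order is mirrored regardless of endianness.
    assert(reversed32(0x11223344u) == 0x44332211u);

    // Swapping twice restores the original value.
    assert(reversed32(reversed32(0x11223344u)) == 0x11223344u);

    // The template also handles other sizes, e.g. 16-bit values.
    std::uint16_t half = 0xABCDu;
    swapBytes(half);
    assert(half == 0xCDABu);
}
```

Calling `checkSwapBytes()` once at start-up (in a build where `assert` is enabled) should be enough to confirm that a faster variant still produces the same bytes.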
 
    