On PowerPC (PS3), moving a value between the vector registers and the floating-point registers had to go through memory, which could lead to expensive cache misses, so unnecessary conversions had to be kept to a minimum.
Is this also true of other modern architectures? I'm especially curious about mobile, where, as I understand it, memory latency is by far the limiting factor.
NOTE: This is for a low-level 3D math library that uses SSE intrinsics (and eventually other instruction sets), and I'm trying to optimize for memory latency.
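To make the question concrete, here is a minimal sketch of the kinds of vector-to-scalar transfers I have in mind. The helper names are hypothetical and exist only for illustration; they are not taken from any real library. The first helper spills the vector to memory explicitly, the second leaves the transfer to the compiler through `_mm_cvtss_f32`, and the third keeps all arithmetic in vector registers and converts only once at the end.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical helper: read lane 0 by storing the whole vector to memory
   and loading one float back (the explicit round trip through memory). */
static float first_lane_via_memory(__m128 v)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v); /* vector register -> memory */
    return tmp[0];         /* memory -> scalar */
}

/* Hypothetical helper: read lane 0 with an intrinsic, with no explicit
   store in the source; how it is lowered is up to the compiler and target. */
static float first_lane_via_intrinsic(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Hypothetical helper: horizontal sum that stays in vector registers and
   performs a single vector-to-scalar conversion at the very end. */
static float horizontal_sum(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);  /* lanes: (v2, v3, v2, v3) */
    __m128 sum2  = _mm_add_ps(v, hi);    /* lanes 0,1: (v0+v2, v1+v3) */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 total = _mm_add_ss(sum2, lane1); /* lane 0: v0+v1+v2+v3 */
    return _mm_cvtss_f32(total);
}
```

What I'd like to know is whether, on current targets, the second and third forms avoid the memory round trip that the first one spells out, or whether the hardware ends up doing the equivalent of the first form anyway.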