Question today is fairly short. Consider the following toy C program shuffle.c for reversing two packed double in register xmm0:
#include <stdio.h>
void main () {
double x[2] = {0.0, 1.0};
asm volatile (
"movupd (%[x]), %%xmm0\n\t"
"shufpd $1, %%xmm0, %%xmm0\n\t" /* method 1 */
//"pshufd $78, %%xmm0, %%xmm0\n\t" /* method 2 */
"movupd %%xmm0, (%[x])\n\t"
:
: [x] "r" (x)
: "xmm0", "memory");
printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
}
After a dry run: gcc -msse3 -o shuffle shuffle.c | ./test, both methods/instructions will return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the intel intrinsic guide says that pshufd only has a latency of 1 cycles. This sounds like great preference to pshufd. However, This instruction is truly for packed integers. When using it for packed doubles, will there be any penalty associated with "wrong type"?
As a similar question, I also heard that instruction movaps is 1-byte smaller than movapd, and they do the same thing by reading 128bits from a 16-bit aligned address. So can we always use the former for move (between XMMs) / load (from memory) / store (to memory)? This seems crazy. I think there must be some reason to reject this. Can someone give me an explanation? Thank you.