I wrote a simple program in C# that copies an array of structs to another array. Environment: .NET Core 2.0, console application, 64-bit executable, Release mode, Windows 10, Intel i7 7700K. The assembly was obtained by breaking in Visual Studio and inspecting the Disassembly window.
struct MyStruct
{
    public float F1;
    public float F2;
    public float F3;
    public float F4;
}
class Program
{
    private static MyStruct[] arr1 = new MyStruct[1024];
    private static MyStruct[] arr2 = new MyStruct[1024];
    static void Main(string[] args)
    {
        for (int i = 0; i < arr1.Length; i++)
            arr1[i] = arr2[i];
    }
}
I expected the generated assembly to load each struct from the source array into a register and then store it directly to the destination array.
Instead, I saw the following (loop boilerplate omitted):
00007FFB33C704DC  vmovdqu     xmm0,xmmword ptr [rdx]  
00007FFB33C704E1  vmovdqu     xmmword ptr [rsp+30h],xmm0  
00007FFB33C704E8  cmp         esi,dword ptr [rax+8]  
00007FFB33C704EB  jae         00007FFB33C7051E  
00007FFB33C704ED  lea         rax,[rax+rcx+10h]  
00007FFB33C704F2  vmovdqu     xmm0,xmmword ptr [rsp+30h]  
00007FFB33C704F9  vmovdqu     xmmword ptr [rax],xmm0  
It copies every struct to the stack first, and only then from the stack to the destination array.
If I reduce the struct size from 128 bits to 64 bits, the copy goes through registers only:
00007FFB33C804D8  vmovss      xmm0,dword ptr [rdx]  
00007FFB33C804DD  vmovss      xmm1,dword ptr [rdx+4]  
00007FFB33C804E3  cmp         esi,dword ptr [rax+8]  
00007FFB33C804E6  jae         00007FFB33C80518  
00007FFB33C804E8  lea         rax,[rax+rcx*8+10h]  
00007FFB33C804ED  vmovss      dword ptr [rax],xmm0  
00007FFB33C804F2  vmovss      dword ptr [rax+4],xmm1  
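For reference, a minimal sketch of the 64-bit variant, assuming it is simply MyStruct with F3 and F4 removed (this matches the two vmovss loads at [rdx] and [rdx+4] above). The names MyStruct64 and CopyDemo.Copy are hypothetical and only illustrate the same element-by-element copy as the loop in Main:

```csharp
using System;

// Assumed 64-bit variant: two floats = 8 bytes (hypothetical name).
struct MyStruct64
{
    public float F1;
    public float F2;
}

static class CopyDemo
{
    // Same element-by-element struct copy as the loop in Main.
    // Assumes src has at least as many elements as dst.
    public static void Copy(MyStruct64[] dst, MyStruct64[] src)
    {
        for (int i = 0; i < dst.Length; i++)
            dst[i] = src[i];
    }
}
```

Calling CopyDemo.Copy(arr1, arr2) with two MyStruct64[1024] arrays should reproduce the loop whose disassembly is shown above.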
Why can't it copy a 128-bit struct without going through the stack?
