I believe it has to do with how the compiler translates this to CIL.
Not really. Performance doesn't directly depend on the CIL code, because that's not what's actually executed. What's executed is the JITed native code, so you should look at that when you're interested in performance.
So, let's look at the code generated for the DoSomething(int[]) loop:
mov         eax,dword ptr [ebx+4] ; get the length of the array
cmp         eax,0       ; if it's 0
jbe         0000018C    ; jump to code that throws IndexOutOfRangeException
cmp         eax,1       ; if it's 1, etc.
jbe         0000018C 
cmp         eax,2 
jbe         0000018C 
cmp         eax,3 
jbe         0000018C 
cmp         eax,4 
jbe         0000018C 
inc         esi         ; i++
cmp         esi,0F4240h ; if i < 1000000
jl          000000B7    ; loop again
What's interesting about this code is that no useful work is done at all: most of it is array bounds checking (I have no idea why the code hasn't been optimized to perform this check only once, before the loop).
Also notice that the code is inlined, so you're not paying the cost of a function call.
This code takes around 1.7 ms on my computer.
So, what does the loop for DoSomething() look like?
mov         ecx,dword ptr [ebp-10h]  ; access this
call        dword ptr ds:[001637F4h] ; call DoSomething()
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          00000120                 ; loop again
Okay, so this actually calls the method; there's no inlining this time. What does the method itself look like?
mov         eax,dword ptr [ecx+4] ; access this._arg1
cmp         dword ptr [eax+4],0   ; if its length is 0
jbe         00000022 ; jump to code that throws IndexOutOfRangeException
cmp         dword ptr [eax+4],1   ; etc.
jbe         00000022 
cmp         dword ptr [eax+4],2 
jbe         00000022 
cmp         dword ptr [eax+4],3 
jbe         00000022 
cmp         dword ptr [eax+4],4 
jbe         00000022 
ret                               ; bounds checks successful, return
Compared with the previous version (and ignoring the overhead of the function call for now), this performs three different memory accesses instead of just one, which could explain some of the performance difference. (I think the five accesses to [eax+4] should be counted as only one, because otherwise the compiler would have optimized them.)
This code runs in about 3.0 ms for me.
How much overhead does the method call take? We can check that by adding [MethodImpl(MethodImplOptions.NoInlining)] to the previously inlined DoSomething(int[]). The assembly now looks like this:
mov         ecx,dword ptr [ebp-10h]  ; access this
mov         edx,dword ptr [ebp-14h]  ; access r
call        dword ptr ds:[002937E8h] ; call DoSomething(int[])
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          000000A0                 ; loop again
Notice that r is no longer kept in a register; it's on the stack instead, which adds another slowdown.
Now DoSomething(int[]):
push        ebp                   ; save ebp from caller to stack
mov         ebp,esp               ; write our own ebp
mov         eax,dword ptr [edx+4] ; read the length of the array
cmp         eax,0    ; if it's 0
jbe         00000021 ; jump to code that throws IndexOutOfRangeException
cmp         eax,1    ; etc.
jbe         00000021 
cmp         eax,2 
jbe         00000021 
cmp         eax,3 
jbe         00000021 
cmp         eax,4 
jbe         00000021 
pop         ebp      ; restore ebp
ret                  ; return
This code runs in about 3.2 ms for me. That's even slower than DoSomething(). What's going on?
Turns out, [MethodImpl(MethodImplOptions.NoInlining)] seems to cause those unnecessary ebp instructions. If I add that attribute to DoSomething(), it runs in 3.3 ms.
This means the difference between stack access and heap access is pretty small (but still measurable). The fact that the array pointer could be kept in a register when the method was inlined was probably more significant.
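For reference, this is roughly how the attribute used in the last two measurements is applied. It's only a minimal sketch: the class name Worker, the field _arg1, the int return type and the method bodies are my assumptions about the shape of the code in the question, not a copy of it.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the class in the question.
class Worker
{
    private readonly int[] _arg1;

    public Worker(int[] arg1) { _arg1 = arg1; }

    // Reads the array through the field (this._arg1).
    [MethodImpl(MethodImplOptions.NoInlining)]
    public int DoSomething()
    {
        return _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
    }

    // Reads the array through the parameter.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public int DoSomething(int[] arg1)
    {
        return arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
    }
}

static class WorkerDemo
{
    static void Main()
    {
        int[] r = { 1, 2, 3, 4, 5 };
        var w = new Worker(r);
        Console.WriteLine(w.DoSomething());   // field version
        Console.WriteLine(w.DoSomething(r));  // parameter version
    }
}
```

With both overloads marked NoInlining, neither one benefits from inlining, so what remains is mostly the cost of reading the array through the field versus through the parameter.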
So, the conclusion is that the big difference you're seeing is because of inlining. The JIT compiler decided to inline the code for DoSomething(int[]), but not for DoSomething(), which allowed the code for DoSomething(int[]) to be very efficient. The most likely reason is that the IL for DoSomething() is much longer (46 bytes vs. 21 bytes for DoSomething(int[])).
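If you want to nudge the JIT towards inlining the field-based version too, .NET 4.5 and later have MethodImplOptions.AggressiveInlining. It is only a hint that the method should be inlined if possible, not a guarantee, so you'd have to look at the JITed code again to see whether it had any effect. A minimal sketch follows; the class and member names are hypothetical and the method body is my guess at what the question's code does.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical class; the names are illustrative only.
class InlineHintWorker
{
    private readonly int[] _arg1;

    public InlineHintWorker(int[] arg1) { _arg1 = arg1; }

    // Asks the JIT to inline this method if possible (a hint, not a guarantee).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int DoSomething()
    {
        return _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
    }
}

static class InlineHintDemo
{
    static void Main()
    {
        var w = new InlineHintWorker(new[] { 1, 2, 3, 4, 5 });
        int sum = 0;
        for (int i = 0; i < 1000000; i++)
        {
            sum += w.DoSomething();   // the call the hint applies to
        }
        Console.WriteLine(sum);       // use the result so the loop has an observable effect
    }
}
```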
Also, you're not really measuring what you wrote (array accesses and multiplications), because the JIT optimized that away: as the listings above show, only the bounds checks remain. So be careful when devising microbenchmarks, so that the compiler can't eliminate the code you actually want to measure.
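One common way to do that is to make every result feed into something observable. The following is a sketch under my own assumptions, not the code from the question: Bench, Product and Run are hypothetical names, and the workload (five array reads, four multiplications) is a guess at what was meant to be measured.

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Hypothetical workload: five array reads and four multiplications.
    static int Product(int[] a)
    {
        return a[0] * a[1] * a[2] * a[3] * a[4];
    }

    // Runs the workload `iterations` times and folds every result into a
    // checksum, so the results are actually used.
    public static int Run(int[] data, int iterations)
    {
        int checksum = 0;
        for (int i = 0; i < iterations; i++)
        {
            data[0] = i;                  // vary the input on each iteration
            checksum ^= Product(data);
        }
        return checksum;
    }

    static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5 };

        Run(data, 1000);                  // warm-up, so JIT compilation is not timed

        Stopwatch sw = Stopwatch.StartNew();
        int checksum = Run(data, 1000000);
        sw.Stop();

        // Printing the checksum makes the computed values observable.
        Console.WriteLine("checksum = {0}, elapsed = {1} ms",
                          checksum, sw.Elapsed.TotalMilliseconds);
    }
}
```

Even then the JIT is free to transform the loop, so it's still worth checking the generated native code to confirm that the multiplications are really there.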