I have a matrix struct in C# with the multiplication operations implemented without SSE intrinsics. As I don't have access to the code at the moment, I'll describe the details as precisely as I can rather than copy/pasting the definitions. I can edit the post in the morning to include the relevant definitions if need be.
The struct has 16 floats defined as M11, M12, M13, ..., M43, M44, with sequential layout specified: [StructLayout(LayoutKind.Sequential)]
The C++ function is imported on the C# side with the attribute
[DllImport("cppCode.dll", EntryPoint = "MatrixMultiply", CallingConvention = CallingConvention.Cdecl)]
I'm trying to call a C++ function through P/Invoke to optimize the multiplications. My question is about passing the parameters. As mentioned on MSDN, each P/Invoke call has a fixed overhead of roughly 10 to 30 x86 instructions, plus the cost of marshalling if the type passed is not blittable.
The call in C# looks like
MatrixMultiply(ref matrix1, ref matrix2, out matrix_out);
and the C++ counterpart receives them as mat*, where mat is the matching C++ struct made of four vec4s.
extern "C" __declspec(dllexport) void MatrixMultiply(mat* m1, mat* m2, mat* out) { *out = *m1 * *m2; }
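For reference, here is a minimal sketch of what the C++ side roughly amounts to, since I can't paste the real code right now. The layout of mat (a plain 4x4 float array here instead of four vec4s), the EXPORT macro, and the body of operator* are assumptions for illustration, not my actual definitions. It uses unaligned loads and stores because I don't think the C# struct is guaranteed to be 16-byte aligned.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_loadu_ps, _mm_mul_ps, ...)

// Hypothetical export macro: dllexport on Windows, plain C linkage elsewhere.
#if defined(_WIN32)
#define EXPORT extern "C" __declspec(dllexport)
#else
#define EXPORT extern "C"
#endif

// Assumed layout: 16 floats, row-major, matching M11..M44 on the C# side.
struct mat {
    float m[4][4];
};
static_assert(sizeof(mat) == 16 * sizeof(float),
              "mat must be 64 bytes with no padding to match the C# struct");

// Row-major product: row i of the result is sum over k of a[i][k] * (row k of b).
inline mat operator*(const mat& a, const mat& b) {
    // Unaligned loads, since the caller's memory may not be 16-byte aligned.
    const __m128 b0 = _mm_loadu_ps(b.m[0]);
    const __m128 b1 = _mm_loadu_ps(b.m[1]);
    const __m128 b2 = _mm_loadu_ps(b.m[2]);
    const __m128 b3 = _mm_loadu_ps(b.m[3]);

    mat r;
    for (int i = 0; i < 4; ++i) {
        // Broadcast each element of row i of a and accumulate the scaled rows of b.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_storeu_ps(r.m[i], row);
    }
    return r;
}

// Exported entry point with cdecl-compatible C linkage, as named in the DllImport attribute.
EXPORT void MatrixMultiply(const mat* m1, const mat* m2, mat* out) {
    *out = *m1 * *m2;
}
```

With this shape, each call does only a handful of SSE instructions of actual work, so the fixed per-call cost of crossing the managed/native boundary is comparable to the work itself.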
When the calculations are profiled, the gain is minimal in the average case: a microsecond or two. However, the worst case got worse, from 150us with the C# multiplication to 400us with the C++ multiplication, which leads me to think that the overhead of calling a function exported from the DLL almost cancels out the gain from the SSE instructions.
As I have limited familiarity with C#, I can't tell for sure what's going on. Am I doing something wrong? Is there a faster approach for C#/C++ communication in this particular case?