Mono C#/C++ Interop, optimizing matrix multiplications - minimal gain due to overhead?

Question

I have a matrix struct in C# whose multiplication operations are implemented without SSE intrinsics. I don't have access to the code at the moment, so I'll describe the details as precisely as I can instead of pasting the definition. I can edit the post in the morning to include the relevant definitions if need be.

The struct has 16 floats defined as M11, M12, M13, ..., M43, M44, with the sequential layout specified: [StructLayout(LayoutKind.Sequential)]

The C++ function is declared on the C# side with the attribute [DllImport("cppCode.dll", EntryPoint = "MatrixMultiply", CallingConvention = CallingConvention.Cdecl)]

I'm trying to call a C++ function through P/Invoke to optimize the multiplications. My question is about passing the parameters. As mentioned on MSDN, a call costs 10 to 30 CPU cycles, plus marshalling if the type passed is not blittable.

The function call in C# looks like

MatrixMultiply(ref matrix1, ref matrix2, out matrix_out);

and the C++ counterpart receives them as mat*, where mat is the matching C++ struct made of four vec4s.

extern "C" void MatrixMultiply(mat* m1, mat* m2, mat* out) { *out = *m1 * *m2; }
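For reference, here is a minimal sketch of what such a native entry point might look like. It is not the code from the question: the `mat` layout (16 row-major floats, matching M11..M44), the `MAT_EXPORT` and `MAT_USE_SSE` macros, and the SSE loop are all assumptions made for illustration. It uses unaligned loads and stores because nothing here guarantees that the pointers coming from the managed side are 16-byte aligned, and it falls back to scalar code when SSE is not available.

```cpp
// Use SSE when the compiler targets it; otherwise fall back to scalar code.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MAT_USE_SSE 1
#endif

// Hypothetical stand-in for the question's `mat`: 16 floats, row-major,
// mirroring a sequential C# struct with fields M11..M44.
struct mat {
    float m[4][4];
};
static_assert(sizeof(mat) == 64, "mat must stay 64 bytes to match the C# struct");

// extern "C" keeps the symbol name unmangled so EntryPoint = "MatrixMultiply" resolves.
#if defined(_WIN32)
#define MAT_EXPORT extern "C" __declspec(dllexport)
#else
#define MAT_EXPORT extern "C"
#endif

// out = a * b. `out` may alias `a` or `b`: every input is read before any store.
MAT_EXPORT void MatrixMultiply(const mat* a, const mat* b, mat* out) {
#ifdef MAT_USE_SSE
    __m128 brow[4];
    for (int k = 0; k < 4; ++k)
        brow[k] = _mm_loadu_ps(b->m[k]);   // unaligned load of row k of b

    __m128 res[4];
    for (int i = 0; i < 4; ++i) {
        // Row i of the result is a linear combination of the rows of b.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a->m[i][0]), brow[0]);
        for (int k = 1; k < 4; ++k)
            r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a->m[i][k]), brow[k]));
        res[i] = r;
    }
    for (int i = 0; i < 4; ++i)
        _mm_storeu_ps(out->m[i], res[i]);  // unaligned store of row i
#else
    mat tmp;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = a->m[i][0] * b->m[0][j];
            for (int k = 1; k < 4; ++k)
                sum += a->m[i][k] * b->m[k][j];
            tmp.m[i][j] = sum;
        }
    }
    *out = tmp;
#endif
}
```

A struct of 16 floats like this is blittable, so passing it by `ref`/`out` should not require a marshalling copy; whatever overhead remains is the fixed cost of the managed-to-native transition itself.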

When the calculations are profiled, the gain in the average case is minimal: a microsecond or two. The worst case, however, got worse, going from 150 µs with the C# multiplication to 400 µs with the C++ multiplication. This leads me to think that the overhead of calling a function exported from the DLL almost cancels out the gain from the SSE instructions.

As I have limited familiarity with C#, I can't tell for sure what's going on. Am I doing something wrong? Is there a faster approach for C#/C++ communication in this particular case?

You could use the types from [`System.Numerics.Vectors`](https://msdn.microsoft.com/en-us/library/dn858218(v=vs.111).aspx) which utilize SIMD. There's even a `Matrix4x4` class. — cbr, Jun 07 '17 at 07:56
[Struct, I mean](https://msdn.microsoft.com/en-us/library/system.numerics.matrix4x4(v=vs.111).aspx) — cbr, Jun 07 '17 at 08:50
@cubrr The [Matrix4x4 class doesn't support SIMD](https://stackoverflow.com/a/31907264/2034041) - I've checked the disassembly to make sure - it's only on Vectors for now. So I've tried storing the C# matrix as 4 SIMD Vec4s and implemented matrix multiplication the naive way. I haven't had time to profile yet (so I'll know for sure tomorrow whether it's faster or not), but I expect better performance with the same naive matrix multiplication the original implementation uses, because we will be using SIMD registers. — Varaquilex, Jun 08 '17 at 06:29

score 0 · Accepted Answer · answered Jun 13 '17 at 01:17

Your best bet is to minimize the number of P/Invoke calls if System.Numerics doesn't provide a good enough solution. Instead of calling Multiply(m1, m2, m_out) for every multiplication, try to concatenate several matrices in a single call on the C++ side where possible, like this:

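void MatrixConcat3(const mat* m1, const mat* m2, const mat* m3, mat* m_out);
void MatrixConcat4(const mat* m1, const mat* m2, const mat* m3, const mat* m4, mat* m_out);
void MatrixConcat5(const mat* m1, const mat* m2, const mat* m3, const mat* m4, const mat* m5, mat* m_out);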
...

That way the fixed per-call overhead is paid once per chain of multiplications instead of once per multiplication.
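As a rough sketch of the same idea generalized to any chain length, the native side could take an array of matrices and a count. The name `MatrixConcatN`, the `mat` layout, and the scalar `Multiply` helper below are hypothetical and only illustrate the batching pattern; a SIMD multiply could be swapped in without changing the interface.

```cpp
// Hypothetical native matrix type: 16 floats, row-major.
struct mat {
    float m[4][4];
};

// Plain scalar 4x4 multiply, kept simple on purpose.
static mat Multiply(const mat& a, const mat& b) {
    mat r;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[i][k] * b.m[k][j];
            r.m[i][j] = sum;
        }
    }
    return r;
}

// out = ms[0] * ms[1] * ... * ms[count - 1], evaluated left to right.
// One managed-to-native transition covers the whole chain.
// An empty chain yields the identity matrix.
extern "C" void MatrixConcatN(const mat* ms, int count, mat* out) {
    if (count <= 0) {
        const mat identity = {{{1, 0, 0, 0},
                               {0, 1, 0, 0},
                               {0, 0, 1, 0},
                               {0, 0, 0, 1}}};
        *out = identity;
        return;
    }
    mat acc = ms[0];
    for (int i = 1; i < count; ++i)
        acc = Multiply(acc, ms[i]);
    *out = acc;  // written last, so `out` may point into `ms`
}
```

On the C# side this would presumably be declared with an array of the matrix struct plus an int count. Arrays of blittable structs are normally pinned rather than copied during the call, but that is worth confirming in a profiler on your Mono version.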
