GCC optimization flags for matrix/vector operations

Question

I am performing matrix operations in C. I would like to know which compiler optimization flags improve the execution speed of these operations (multiplication, inverse, etc.) on double and int64 data. I am not looking for hand-optimized code; I just want to make the native code faster using compiler flags and learn more about these flags.

The flags I have found so far that improve matrix code:

-O3/O4
-funroll-loops
-ffast-math

score 20 · Accepted Answer · edited Jun 20 '20 at 09:12

First of all, I don't recommend using -ffast-math for the following reasons:

1. It has been proven that performance actually degrades with this option in most (if not all) cases. So "fast math" is not actually that fast.
2. This option breaks strict IEEE compliance for floating-point operations, which ultimately results in an accumulation of computational errors of an unpredictable nature.
3. You may well get different results in different environments, and the difference may be substantial. Here "environment" means the combination of hardware, OS, and compiler, so the number of situations in which you can get unexpected results grows combinatorially.
4. Another sad consequence is that programs linking against a library built with this option may expect correct (IEEE-compliant) floating-point math. That expectation breaks, and it will be very tough to figure out why.
5. Finally, have a look at this article.
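To make the IEEE-compliance point concrete, here is a minimal sketch (not from the original answer) showing that floating-point addition is not associative. -ffast-math permits the compiler to regroup such sums, which is one way results can change between builds. The function names `sum_left`, `sum_right`, and `show_grouping` are made up for this illustration.

```c
#include <stdio.h>

/* Sums three doubles left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;  /* volatile pins this grouping in place */
    return t + c;
}

/* Sums the same values right to left: a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;  /* c is absorbed when b is huge */
    return a + t;
}

/* Prints both groupings for one ill-conditioned input. */
void show_grouping(void)
{
    double a = 1e20, b = -1e20, c = 1.0;
    printf("(a + b) + c = %g\n", sum_left(a, b, c));
    printf("a + (b + c) = %g\n", sum_right(a, b, c));
}
```

The two groupings are mathematically identical, yet with these inputs one yields 1 and the other 0. Without -ffast-math the compiler must keep the grouping you wrote; with it, either result is allowed.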

For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Extract:

-Ofast

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.

There is no such flag as -O4. At least I'm not aware of one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3, and you should definitely be using it, not only to optimize math but in release builds in general.

-funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).
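As a rough sketch of the kind of loop meant here, the routine below multiplies a small matrix by a vector with both trip counts fixed at compile time. The name `matvec4` and the constant `MAT_DIM` are assumptions for this example, not anything from the question.

```c
#include <stddef.h>

#define MAT_DIM 4  /* hypothetical fixed dimension, known at compile time */

/* Computes y = A * x for a MAT_DIM x MAT_DIM matrix stored row-major. */
void matvec4(const double A[MAT_DIM * MAT_DIM],
             const double x[MAT_DIM],
             double y[MAT_DIM])
{
    for (size_t i = 0; i < MAT_DIM; ++i) {
        double acc = 0.0;
        /* Inner trip count is a constant, so the compiler may unroll it. */
        for (size_t j = 0; j < MAT_DIM; ++j)
            acc += A[i * MAT_DIM + j] * x[j];
        y[i] = acc;
    }
}
```

Whether the flag changes anything for a given loop depends on the GCC version and the other options in use, so it is worth comparing the assembly from `gcc -O3 -S` with and without -funroll-loops rather than assuming a speedup.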

I can recommend two more flags: -march=native and -mfpmath=sse. Like -O3, -march=native is good in general for release builds of any software, not only math-intensive code. -mfpmath=sse enables the use of XMM registers for floating-point instructions (instead of the x87 register stack).
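One way to see what these two flags actually changed is to inspect the macros GCC predefines for the selected target. The sketch below assumes GCC (or a compatible compiler) on x86; the function names are invented for illustration.

```c
#include <stdio.h>

/* Names the widest x86 SIMD extension the compiler was told to target. */
const char *simd_target(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected (or not an x86 target)";
#endif
}

/* Returns 1 when double math is compiled to SSE2 instructions, else 0. */
int uses_sse_math(void)
{
#if defined(__SSE2_MATH__)
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the build's floating-point target. */
void print_simd_report(void)
{
    printf("SIMD target: %s, SSE math: %s\n",
           simd_target(), uses_sse_math() ? "yes" : "no");
}
```

Building the same file with and without -march=native should show the difference on a machine that supports more than the baseline. `gcc -march=native -Q --help=target` gives a fuller listing of what the flag enables.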

Furthermore, I'd like to say that it's a pity you don't want to modify your code to get better performance, as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE intrinsics, and vectorization, heavy linear algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time and effort to modify (actually rewrite) the code.
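For a sense of what hand-written SIMD looks like, here is a small sketch that adds two arrays of doubles with SSE2 intrinsics, two elements per instruction. It is illustrative only: the function name `add_vectors` is made up, and the code falls back to a plain scalar loop when SSE2 is not available.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#endif

/* Computes out[i] = a[i] + b[i] for n doubles. */
void add_vectors(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Process two doubles per iteration in 128-bit XMM registers. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  /* unaligned load of a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail, or the whole loop on targets without SSE2. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Even this trivial case needs a separate tail loop and a choice between aligned and unaligned loads, which hints at why rewriting real matrix code this way takes considerable effort.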

Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization which can be enabled by -ftree-vectorize, but it is unnecessary since you are using -O3 (because it includes -ftree-vectorize already). The point is that you should still help GCC a little bit to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to make yourself familiar with them. So see the Vectorizable Loops section in the link above.
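A minimal sketch of a loop written to be easy for the auto-vectorizer: a simple counted loop, unit-stride accesses, and `restrict` to promise that the arrays do not overlap. The function name `axpy` is borrowed from BLAS naming as an assumption; nothing here comes from the question's code.

```c
#include <stddef.h>

/* Computes y[i] += alpha * x[i] for n elements.
   restrict tells the compiler that x and y never alias. */
void axpy(size_t n, double alpha,
          const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

Compiling with something like `gcc -O3 -march=native -fopt-info-vec -c axpy.c` makes GCC report which loops it vectorized, so you can check whether a small change to the source helped or hurt.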

Finally, I recommend you look into Eigen, a C++ template-based library with highly efficient implementations of the most common linear algebra routines. It uses all the techniques mentioned so far in a very clever way. The interface is purely object-oriented, neat, and pleasing to use. The object-oriented approach suits linear algebra well, since it usually manipulates pure objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with low-level concepts such as SSE or vectorization yourself, and can just enjoy solving your specific problem.

I definitely recommend you use Eigen. It's very easy to learn. And while this post addresses many of the usual suspects to consider while optimizing, Eigen goes a lot further and does a lot of very advanced stuff. — Nicu Stiurca, Apr 17 '13 at 20:48
Thank you! This helped out a lot. I am actually trying to compare the best speedup the compiler optimized code gives(which is my question) with my hand-optimized code for matrix operations. — laxy, Apr 17 '13 at 22:35
@Haroogan: thanks for that tip. I accept your answer totally :). I am sorry I cannot upvote yet. — laxy, Apr 17 '13 at 22:52
`ffast-math` leading to slower code can imho be considered a compiler bug and a very surprising one at that (maybe when compiled without `march=native`?). Auto vectorization and parallelization in my experience isn't especially useful.. even the newest ICC (which is generally the best) has problems with simple code and small, harmless changes can have surprising effects. If you rely on the performance use compiler intrinsics and write it explicitly imo. +1 for Eigen though ;) — Voo, Apr 18 '13 at 07:53
I found [an ICC help thread](https://software.intel.com/en-us/forums/intel-c-compiler/topic/300622) about `-ffast-math` performance inversion but it's a bit over my head. From [another thread](https://stackoverflow.com/a/40181792/1043529), I see "That means no vectorization [is] available unless you have very efficient horizontal vector adds"; in point #5, maybe David's code gets vectorized in both cases or neither, but the values fall into some edge case where IEEE floats outperform the implementation's (generally) fast float representation? (Interested academically - sounds like a headache.) — John P, Oct 27 '17 at 13:15
