bcc64 optimizations -O1 vs -O2 still slower than bcc32 by 40% and more

Question

I have a product consisting of a VCL executable plus a Standard C++ DLL, all built with C++ Builder XE4. I publish in 32-bit and 64-bit versions.

When doing performance testing with release builds, the 64-bit version runs about 40% more slowly than the 32-bit version.

I understand that I need to have optimizations turned on for the performance testing to be meaningful. XE4 allows me to set (mutually exclusively):

-O1 = smallest possible code
-O2 = fastest possible code

I have built using each of these, but the results are unchanged.

I see from postings here that Linux/g++ programmers use -O3 (smallest AND fastest?); see "64-bit executable runs slower than 32-bit version". But -O3 is not an option for my environment.

Are there other compiler settings I should be looking at?

Thanks for your help.

Peter Cordes · Answer 1 · 2015-07-31T21:44:55.617

The main downside of 64-bit mode is that pointers double in size. Alignment rules can also make classes/structs bigger. Maybe your working set just barely fit into cache in 32-bit mode but no longer does in 64-bit mode. This is especially likely if your code uses a lot of pointers.
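To make the pointer-size effect concrete, here is a minimal sketch. Node and CompactNode are hypothetical types made up for illustration, not anything from the asker's code, and the sketch assumes a toolchain that provides <cstdint>. Compiling it for both targets shows how much a pointer-heavy struct grows, and one common workaround (storing 32-bit indices into a pool instead of pointers).

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical doubly linked node; not taken from the asker's code.
struct Node {
    Node* next;   // typically 4 bytes in a 32-bit build, 8 bytes in a 64-bit build
    Node* prev;
    int   value;
};

// Same payload, but the links are 32-bit indices into a pool instead of
// pointers, so the size does not depend on the pointer width of the target.
struct CompactNode {
    std::uint32_t next;
    std::uint32_t prev;
    int           value;
};

// Prints the sizes for whichever target this translation unit is built for.
void report_sizes() {
    std::cout << "sizeof(void*)       = " << sizeof(void*) << '\n'
              << "sizeof(Node)        = " << sizeof(Node) << '\n'
              << "sizeof(CompactNode) = " << sizeof(CompactNode) << '\n';
}
```

Calling report_sizes() from main in each build should show the difference. On typical Windows targets Node is 12 bytes in a 32-bit build and 24 bytes in a 64-bit build (two 8-byte pointers, the int, and 4 bytes of tail padding), so the same number of nodes takes twice the cache space.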

Another possibility is that you call some external library, and your 32bit version of it has some asm speedups, but the 64bit version doesn't.

Use a profiler to see what's actually slow in your 64-bit version. For Windows, Intel's VTune may be a good choice. It can show where your code is taking a lot of cache misses. Comparing total cache misses between the 32-bit and 64-bit builds should shed some light.
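If no profiler is available for one of the builds, a crude timing harness around the suspect call can at least confirm where the time goes. The following is a rough sketch, not a substitute for a profiler. average_ms is a hypothetical helper name, and it assumes both toolchains ship <chrono> and support lambdas; if the 32-bit compiler does not, a platform timer such as QueryPerformanceCounter could be swapped in.

```cpp
#include <cassert>
#include <chrono>

// Runs `call` `iterations` times and returns the mean wall-clock time per
// call in milliseconds. `call` stands in for whatever is being measured,
// for example a wrapper around the external library's entry point.
template <typename F>
double average_ms(F call, int iterations) {
    assert(iterations > 0);
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        call();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Building the same harness as 32-bit and 64-bit and timing only the external call, then only the surrounding code, separates "the library is slower in 64-bit" from "my code is slower in 64-bit".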

Re: -O1 vs. -O2: Different compilers have different meanings for options. gcc and clang have:

-Os: optimize for code size
-O0: minimal / no optimization (most variables get stored to and reloaded from memory after every statement)
-O1: some optimization without taking a lot of extra compile time
-O2: more optimizations
-O3: even more optimizations, including auto-vectorizing

Clang doesn't seem to document its optimization options, so I assume it mirrors gcc. (There are options to report on optimizations it did, and to use profile-guided optimization.) See the latest version of the gcc manual (online) for more descriptions of optimization options: e.g.

-Ofast: -O3 -ffast-math (and maybe "unsafe" optimizations.)
-Og: optimize without breaking debugging. Recommended for the edit/compile/debug cycle.
-funroll-loops: can help in some tight loops, but isn't enabled even at -O3. Don't use for everything, because larger code size can lead to I-cache misses which hurt more. -fprofile-use does enable this, so ideally just use PGO.
-fblah-blah: there are a ton more specific options. Usually just use -O3 to pick the recommended set.
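As an illustration of what the auto-vectorizing part of -O3 is aimed at, here is a small sketch of a loop with independent iterations over contiguous floats. The function name saxpy and the file name in the comments are made up for the example. The reporting flags shown are the ones I know gcc and clang use to tell you which loops were vectorized; whether bcc64 accepts the clang one is an assumption to check against its documentation.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = a * x[i] + y[i]: each iteration is independent and the data is
// contiguous, which is the kind of loop an auto-vectorizer can turn into
// SIMD instructions.
// To ask the compiler what it vectorized (file name is hypothetical):
//   g++     -O3 -c -fopt-info-vec saxpy.cpp
//   clang++ -O3 -c -Rpass=loop-vectorize saxpy.cpp
void saxpy(float a, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Code dominated by pointer chasing or calls into another DLL gains little from this kind of optimization, which fits the observation that switching between -O1 and -O2 made no measurable difference.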

Thank you, Peter, for this information. My DLL does call another external DLL (in fact, that is the condition that causes the performance issue; as long as that call is never made, the performance between 32-bit and 64-bit are the same). I use AQTime for profiling, but it can only profile 32-bit apps. I'll look into VTUNE to see if I can make it work. Again, thank you so much! — Kathleen, Jul 30 '15 at 17:56
re: AQTime, I should have said "the version I'm using (8.10) can profile only 32-bit builds from XE4". I'm purchasing the upgrade today, which I am told will be able to profile my 64-bit XE4 builds, as well. — Kathleen, Jul 30 '15 at 20:06
