This question asks about possible causes of the program's behaviour as a function of icc 2019's compilation flags, given the two observations and the notes below.
A program can run three types of simulations; let's name them S1, S2 and S3.
Compiled (and run) on Intel Xeon Gold 6126 nodes, the program shows the following behaviour, expressed as
A ± B
where A is the mean time, B is the standard deviation, and the units are microseconds.
When compiled with -O3:
S1: 104.7612 ± 108.7875 (EDIT: it's 198.4268 ± 3.5362)
S2: 3.8355 ± 1.3025 (EDIT: it's 3.7734 ± 0.1851)
S3: 11.8315 ± 3.5765 (EDIT: it's 11.4969 ± 1.313)
When compiled with -O3 -march=native:
S1: 102.0844 ± 105.1637 (EDIT: it's 193.8428 ± 3.0464)
S2: 3.7368 ± 1.1518 (EDIT: it's 3.6966 ± 0.1821)
S3: 12.6182 ± 3.2796 (EDIT: it's 12.2893 ± 0.2156)
When compiled with -O3 -xCORE-AVX512:
S1: 101.4781 ± 104.0695 (EDIT: it's 192.977 ± 3.0254)
S2: 3.722 ± 1.1538 (EDIT: it's 3.6816 ± 0.162)
S3: 12.3629 ± 3.3131 (EDIT: it's 12.0307 ± 0.2232)
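For reference, the statistics above could be gathered with a harness along these lines. This is only a sketch, since there is no MWE; run_simulation is a hypothetical stand-in for S1/S2/S3.

```cpp
// Minimal timing-harness sketch (NOT the actual program).
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for one of the simulations (S1/S2/S3);
// it just burns some cycles.
static double run_simulation() {
    volatile double acc = 0.0;
    for (int i = 0; i < 100000; ++i) acc = acc + std::sin(i * 1e-3);
    return acc;
}

int main() {
    const int runs = 1000;
    std::vector<double> samples;
    samples.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        run_simulation();
        const auto t1 = std::chrono::steady_clock::now();
        // Store the elapsed time in microseconds.
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    // Mean (A) and standard deviation (B) of the samples.
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    const double stddev = std::sqrt(var / samples.size());
    std::printf("%.4f +/- %.4f us\n", mean, stddev);
    return 0;
}
```

Each variant would then be built with the flags under test, e.g. icc -O3, icc -O3 -march=native, or icc -O3 -xCORE-AVX512.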
Two conclusions:
- -xCORE-AVX512 produces code that is more performant than -march=native.
- The program's simulation S3 DECREASES its performance when the target architecture is specified at compile time (-march=native or -xCORE-AVX512).
Note1: the standard deviation is huge, but repeated tests always yield similar values for the mean, leaving the overall ranking unchanged.
Note2: the code runs on 24 processors, and the Xeon Gold 6126 has 12 physical cores. Hyper-threading is enabled, but the two threads on each core DO NOT share memory.
Note3: the functions of S3 are "very sequential", i.e. they cannot be vectorized (see the sketch after these notes).
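To make Note3 concrete, here is a hypothetical illustration (NOT the actual S3 code, which is unavailable) of the kind of loop that resists vectorization: a loop-carried dependency forces the iterations to run one after another, so the compiler cannot spread them across AVX/AVX-512 lanes.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical "very sequential" function: each iteration consumes the
// result of the previous one (a loop-carried dependency).
static double sequential_recurrence(const std::vector<double>& a) {
    double x = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        x = 0.5 * x + a[i];  // x at step i depends on x at step i-1
    }
    return x;
}

int main() {
    std::vector<double> a(1000, 1.0);
    std::printf("%f\n", sequential_recurrence(a));
    return 0;
}
```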
There is no MWE. Sorry, the code is huge and cannot be posted here.
EDIT: print-related outliers were to blame for the large deviation. The means changed slightly, but the trend and the hierarchy remain.
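For readers hitting the same problem, here is a sketch of the kind of mistake the EDIT refers to (an assumption about what "print-related outliers" means): if output lands inside the timed region, an occasional slow flush inflates the standard deviation while barely moving the mean.

```cpp
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < 100; ++i) {
        const auto t0 = clock::now();
        // ... simulation work would run here ...
        std::printf("step %d done\n", i);  // I/O inside the timed region: bad
        const auto t1 = clock::now();
        const double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count();
        // This sample includes the print; an occasional slow flush
        // shows up as a huge outlier. Fix: take t1 before any printing.
        std::fprintf(stderr, "sample %d: %.3f us\n", i, us);
    }
    return 0;
}
```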