I would like to know under which conditions invariant parts of nested loops can be optimized.
For doing so, I wrote two functions one of which implements the factorization of three nested loops while the other doesn't.
The non-factorized function looks like:
template<int k>
double __attribute__ ((noinline)) evaluate(const double u[], const double phi[])
{
  double f = 0.;
  for (int i3 = 0;i3<k;++i3)
    for (int i2 = 0;i2<k;++i2)
      for (int i1 = 0;i1<k;++i1)
        f += u[i1+k*(i2+k*i3)] * phi[i1] * phi[i2] * phi[i3];
  return f;
}
While the factorized function is:
template<int k>
double __attribute__ ((noinline)) evaluate_fact(const double u[], const double phi[])
{
  double f3 = 0.;
  for (int i3 = 0;i3<k;++i3)
    {
      double f2 = 0.;
      for (int i2 = 0;i2<k;++i2)
        {
          double f1 = 0.;
          for (int i1 = 0;i1<k;++i1)
            {
              f1 += u[i1+k*(i2+k*i3)] * phi[i1];
            }
          f2 += f1  * phi[i2];
        }
      f3 += f2 * phi[i3];
    }
  return f3;
}
That I call with the following main:
int main()
{
  const static unsigned int k=20;
  double u[k*k*k];
  double phi[k];
  phi[0] = 1.;
  for (unsigned int i=1;i<k;++i)
    phi[i] = phi[i-1]*.333;
  double e = 0.;
  for (unsigned int i=0;i<1000;++i)
    {
      e += evaluate<k>(u, phi);
      //e += evaluate_fact<k>(u, phi);
    }
  std::cout << "Evaluate " << e << std::endl;
}
For a small k both functions generate the same assembly code but after a certain size k~=10 the assembly does not look the same anymore and callgrind shows more operations being performed in the non-factorized version.
How should I write my code (if at all possible), or what should I tell GCC such that evaluate() is optimized to evaluate_fact() ???
I am using GCC 7.1.0. with flags -Ofast -fmove-loop-invariants
Using -funroll-loops does not help unless I add --param max-completely-peeled-insns=10000 --param max-completely-peel-times=10000 but that is a completely different thing because it is basically unrolling everything, the assembly is extensive.
Using -fassociative-math doesn't help either.
This paper claims that: "Traditional loop-invariant code motion, which is commonly applied by general-purpose compilers, only checks invariance with respect to the innermost loop." Does that apply to my code?
Thanks!
