Sometimes associativity can be used to loose data dependencies and I was curious how much it can help. I was rather surprised to find out that I can nearly get a speed-up factor of 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).
So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with
int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];
which computes something close to String.hashCode() (set M1=31 and use a char[]). The computation is pretty trivial and for t.length=1000 takes about 1.2 microsecond on my i5-2400 @ 3.10GHz (both in Java and C).
Observe that each two steps a gets multiplied by M2 = M1*M1 and added something. This leads to this piece of code
int a = 0;
for (int i=0; i<N; i+=2) {
    a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < len) a = M1 * a + t[i]; // Handle odd length.
This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.
Unlike java, gcc -O3 chooses not to unroll the loop. It's wise choice since it wouldn't help anyway (as -funroll-all-loops shows).
So my question1 is: What prevents such an optimization?
Googling didn't work, I got "associative arrays" and "associative operators" only.
Update
I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.
Update 2
As Java already unrolls the loop, all the hard work is done. What we get is something like
...pre-loop
for (int i=0; i<N; i+=2) {
    a2 = M1 * a + t[i];
    a = M1 * a2 + t[i+1];
}
...post-loop
where the interesting part can be rewritten like
    a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add
This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary school math (working for ints even in case of overflow or whatever, but not applicable to floating point).
    a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add
So far we gained nothing, but it allows us to fold the constants
    a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add
and gain even more by regrouping the sum
    a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add