In problem 5.5, each calculation of xpwr depends on the previous calculation of xpwr (in the absence of aggressive optimization by the compiler, such as loop unrolling), so each one must wait out the full latency of a multiplication.
Each update of result depends on the previous addition to result and on the product a[i] * xpwr, which can be computed in parallel with the xpwr updates. The addition has a lower latency than the multiplication, so it does not limit the computation.
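For reference, the Problem 5.5 loop has roughly this shape (my reconstruction, not a quotation from the book; the identifiers xpwr and result are the ones the problem uses):

```c
/* Sketch of the Problem 5.5 loop (reconstruction, not quoted from the book). */
double poly(double a[], double x, long degree)
{
    long i;
    double result = a[0];
    double xpwr = x;                /* equals x^i at the start of iteration i  */
    for (i = 1; i <= degree; i++) {
        result += a[i] * xpwr;      /* multiply feeding a lower-latency add    */
        xpwr = x * xpwr;            /* each xpwr depends on the previous xpwr  */
    }
    return result;
}
```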
In more detail, what will typically happen for the floating-point operations in the CPU (ignoring loads, integer operations, and so on) is something like:
- Cycle 0: a[i] * xpwr is started (for i = 0).
- Cycle 0: result += a[i] * xpwr cannot be started because it has to wait for a[i] * xpwr.
- Cycle 0: x * xpwr is started. (I assume both multiplications can start in cycle 0: the text gives a capacity of 2 multiplications per cycle, and the stated 1-cycle issue time is presumably per functional unit, so two independent multiplications can begin in the same cycle on different units.)
- Cycle 5: a[i] * xpwr and x * xpwr finish.
- Cycle 5: result += a[i] * xpwr is started (for i = 0).
- Cycle 5: a[i] * xpwr is started (for i = 1).
- Cycle 5: x * xpwr is started.
- Cycle 8: result += a[i] * xpwr finishes (for i = 0).
- Cycle 8: result += a[i] * xpwr (for i = 1) cannot start because its product a[i] * xpwr is not yet available.
- Cycle 10: a[i] * xpwr and x * xpwr finish.
- Cycle 10: result += a[i] * xpwr is started (for i = 1).
- Cycle 10: a[i] * xpwr is started (for i = 2).
- Cycle 10: x * xpwr is started.
- Cycle 13: result += a[i] * xpwr finishes (for i = 1).
Then everything repeats on a five-cycle period.
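As a sanity check on this timeline, here is a small sketch (my own code, not from the book) that computes, from the dependence constraints alone, when each operation of an iteration can start, assuming the 5-cycle multiply and 3-cycle add latencies used above:

```c
#include <stdio.h>

/* Assumed latencies, matching the walkthrough above. */
#define MUL_LAT 5
#define ADD_LAT 3

int main(void)
{
    long xpwr_ready   = 0;  /* cycle at which xpwr for the current iteration is available  */
    long result_ready = 0;  /* cycle at which result from the previous iteration is ready  */

    for (int i = 0; i < 5; i++) {
        /* a[i] * xpwr and x * xpwr both start as soon as xpwr is ready. */
        long prod_done = xpwr_ready + MUL_LAT;   /* a[i] * xpwr finishes */
        long xpwr_done = xpwr_ready + MUL_LAT;   /* x * xpwr finishes    */
        /* result += a[i]*xpwr starts when both the product and the old result are ready. */
        long add_start = prod_done > result_ready ? prod_done : result_ready;
        long add_done  = add_start + ADD_LAT;
        printf("i=%d: mults start at cycle %ld, finish at %ld; add finishes at %ld\n",
               i, xpwr_ready, prod_done, add_done);
        xpwr_ready   = xpwr_done;
        result_ready = add_done;
    }
    return 0;
}
```

This prints multiplications starting at cycles 0, 5, 10, ... and additions finishing at cycles 8, 13, 18, ..., matching the five-cycle period above.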
In problem 5.6, result = a[i] + x*result means that, once we have the previous value of result, it takes five cycles to compute x*result, and the addition cannot start until that multiplication finishes. It then takes three more cycles to compute a[i] + x*result, so eight cycles pass from getting one value of result to getting the next.
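Again for reference, the Problem 5.6 loop has roughly this shape (reconstruction, not quoted; each iteration's critical path is the multiply followed by the add):

```c
/* Sketch of the Problem 5.6 (Horner's method) loop -- reconstruction, not quoted. */
double polyh(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    for (i = degree - 1; i >= 0; i--)
        result = a[i] + x * result;   /* 5-cycle multiply feeds a 3-cycle add: 8 cycles per iteration */
    return result;
}
```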