I have two versions of an iterative procedure: one with rolled-up loops and one with the loops unrolled. The computation involves three nested loops. By my calculations, the rolled-up version requires 90,460 cycles, whereas the unrolled version requires only 56,400 cycles.
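For illustration only, the difference between the two variants can be sketched in C as follows. This is not the actual procedure: the function names, the summation kernel, and the unroll factor of 4 are all assumptions, and only a single innermost loop is shown rather than all three nested loops.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled: one short loop body, re-executed n times. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 (hypothetical factor): fewer branches and counter
   updates per element, but the loop body is roughly four times larger. */
long sum_unrolled4(const int *a, size_t n)
{
    assert(n % 4 == 0);  /* this sketch assumes n is a multiple of 4 */
    long s = 0;
    for (size_t i = 0; i < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    return s;
}
```

Both functions compute the same result; the only difference is how much code is executed per loop iteration and how many times the loop branch is taken.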
However, in real execution, the rolled version runs 2 to 2.5 times faster. My question is: why is the rolled version, with the higher calculated cycle count, executing faster? Am I getting 'beaten' by instruction caching?
For example, the rolled version has much shorter code, and each time through the loops the instructions are the same; only the register contents change. In the unrolled version, the actual memory references change from one instruction to the next, so I suspect the much longer code cannot be cached.
By the way, this is running on a 1200 MHz Athlon (Thunderbird) with an ABIT KT7E mainboard.
Any comments regarding this optimization issue are welcome. If it is a caching issue, how much unrolling can be done before the caching penalty outweighs the benefit of unrolling?
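If the instruction cache really is the limit, a rough upper bound on the unroll factor could be estimated along these lines. The helper names `max_unroll_factor` and `fits_in_icache` and the 40-byte example body are hypothetical. The 64 KB figure is the L1 instruction cache size of the Athlon and is treated here as an assumption to be checked against the processor documentation. The practical limit is likely lower, since other code also occupies the cache.

```c
#include <stddef.h>

/* Assumed L1 instruction cache size for the Athlon: 64 KB. */
#define ICACHE_BYTES (64u * 1024u)

/* Largest number of copies of a loop body of body_bytes bytes that
   fit in a cache of cache_bytes bytes; returns 0 if body_bytes is 0. */
size_t max_unroll_factor(size_t body_bytes, size_t cache_bytes)
{
    if (body_bytes == 0)
        return 0;
    return cache_bytes / body_bytes;
}

/* Returns 1 if `copies` copies of the body fit in the cache, else 0. */
int fits_in_icache(size_t body_bytes, size_t copies, size_t cache_bytes)
{
    return body_bytes * copies <= cache_bytes;
}
```

For a 40-byte loop body, `max_unroll_factor(40, ICACHE_BYTES)` gives 1638 copies. This only bounds the code size; it says nothing about whether unrolling that far is actually faster, which still has to be measured.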