: : : : : Hello,
: : : : :
: : : : : Does anybody know of a complete list of instruction latencies (and throughputs) for Pentium 4 processors?
: : : : : I searched the Intel homepage and found the "IA-32 Intel(R) Architecture Optimization Reference Manual", but its list is incomplete: only the commonly used instructions are covered.
: : : : : I tried Intel support, but I only got automated e-mails, which did not help.
: : : : :
: : : : : Greets,
: : : : : Juergen
: : : : :
: : : :
: : : : Try this site.
: : : :
: : : :
: : : : http://www.sandpile.org/
: : : :
: : : : -jeff!
: : : :
: : : Thanks,
: : :
: : : I searched the site and found one document listed (P4 instruction latencies, in Japanese), but it cannot be downloaded (I know why...). I also searched Intel's download FTP server, but I couldn't find anything there, not even in Japanese...
: : :
: : : Greets,
: : : Juergen
: : :
: :
: : Just out of curiosity - why do you need such information?
: :
: : I am asking because, if the goal is to optimize your code for speed, this will not help much: the time an instruction takes now depends on so many things that squeezing out every last clock is not possible anyway. The speed of code mostly depends on how you keep your data (aligned or unaligned) and how you plan your logic. The CPU also has an internal cache, and its workings are kind of a mystery... well, not exactly, but very complex... Also, if your program's memory management is poor, you will get a lot of page faults, and one that has to be served from disk can take on the order of a million clocks (!), so whatever you think you saved is gone...
: :
: : So, why do you want this info?
: : For some cool research, or to optimize code?
: :
: :
: I'm optimizing the inner loop of a mathematical research project (see www.zetagrid.net). I have been working on this project for a while and found that, with the current version of the code, Athlon XPs are memory-bandwidth limited, whereas Pentium 4 processors are not limited by memory speed. What I want to do is schedule the instructions to break up dependency chains (it should be possible to calculate 4 chains "simultaneously"). Scheduling matters more on the P4 than on the Athlon XP, since the P4 has the weaker FPU and the higher memory performance...
:
: Greets,
: Juergen
:
:
:
I see...
In addition to scheduling, you should align the label at the start of the loop, and every label inside the loop, to a 32-byte boundary. This makes the loop larger, but it speeds up the jumps.
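
As for the scheduling itself, here is a minimal sketch in C of what splitting one dependency chain into four looks like. It is only an illustration: the dot product and the names dot_one_chain and dot_four_chains are made up for this post and are not taken from the ZetaGrid code. The same interleaving would then have to be carried over by hand into the assembly inner loop.

```c
#include <stddef.h>

/* Hypothetical example: one dependency chain.
   Every add into s has to wait for the previous add to finish. */
static double dot_one_chain(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Hypothetical example: four independent chains.
   s0..s3 never depend on each other inside the loop, so the
   add latencies of the four chains can overlap. */
static double dot_four_chains(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);   /* combine the chains once, at the end */
}
```

Calling both functions on the same arrays should give the same result up to rounding. Because floating-point addition is not associative, the four-chain version can differ in the last bits, which is also why a compiler will normally not make this change on its own.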