# Number of cycles

Hi.
I'm not quiet sure where if this is the right board to post this question, but I'll give it a try.

I'm working on finding an efficient implementation of an algorithm in C++. I have an algorithm to compare it with, so this far I've only been comparing them in runningtime. However I see that people often refer to the "number of cycles" when they talk about how much work is needed to do something. So; I'm not quiet sure what's ment by those cycles. Is this actual operations on the lowest level - like in the gates on hardware level? And are there some ways to measure the number of cycles used between two points in a C++ code?

I've been working on some assembly programming lately and I've seen authors speak about the number of cycles on assembly instructions as well (without further explanation..). So; is that how I measure the number of cycles on my algorithm? By checking the assemblycode generated when I compile my C++ program, and then counting the number of cycles for the instructions. Or is there a better way?.. Or am I just way of with all I've been saying here?

Best Regards,
Stian Karlsen

• : Hi.
: I'm not quiet sure where if this is the right board to post this question, but I'll give it a try.
:
: I'm working on finding an efficient implementation of an algorithm in C++. I have an algorithm to compare it with, so this far I've only been comparing them in runningtime. However I see that people often refer to the "number of cycles" when they talk about how much work is needed to do something. So; I'm not quiet sure what's ment by those cycles. Is this actual operations on the lowest level - like in the gates on hardware level? And are there some ways to measure the number of cycles used between two points in a C++ code?
:
: I've been working on some assembly programming lately and I've seen authors speak about the number of cycles on assembly instructions as well (without further explanation..). So; is that how I measure the number of cycles on my algorithm? By checking the assemblycode generated when I compile my C++ program, and then counting the number of cycles for the instructions. Or is there a better way?.. Or am I just way of with all I've been saying here?
:
:
:
: Best Regards,
: Stian Karlsen
:
:

"Cycles" refer to one tick of the system clock. One cycle is the fastest time the CPU can execute an instruction. If you have a CPU running on 10MHz, one cycle will take 100ns to execute (according to the formula [time = 1/frequency]). There are usually alot of other things involved in determing the speed (especially on a PC-CPU) but the system clock is the main factor.

If you check your assembler manual, it should tell the number of cycles every instruction takes to execute. "NOP" for example takes 1 cycle. Times usually vary depending on addressing mode (direct, indirect etc, what they are called is CPU-dependant).

So in theory, you could measure the time your program takes to execute by disassembling it and count the time for every instruction. This might be useful to determine which algorithm that is most effective for the target CPU.

However, in more complex algorithms, you don't want to do that. In a real-time application, you usually set an output-pin to a value and measure the execution time with an oscilloscope.
On a PC app, you will not get any timing accuracy whatsoever, so functions like GetTickCount() in Windows are usually good enough.
• :
: "Cycles" refer to one tick of the system clock. One cycle is the fastest time the CPU can execute an instruction. If you have a CPU running on 10MHz, one cycle will take 100ns to execute (according to the formula [time = 1/frequency]). There are usually alot of other things involved in determing the speed (especially on a PC-CPU) but the system clock is the main factor.
:
: If you check your assembler manual, it should tell the number of cycles every instruction takes to execute. "NOP" for example takes 1 cycle. Times usually vary depending on addressing mode (direct, indirect etc, what they are called is CPU-dependant).
:
: So in theory, you could measure the time your program takes to execute by disassembling it and count the time for every instruction. This might be useful to determine which algorithm that is most effective for the target CPU.
:
: However, in more complex algorithms, you don't want to do that. In a real-time application, you usually set an output-pin to a value and measure the execution time with an oscilloscope.
: On a PC app, you will not get any timing accuracy whatsoever, so functions like GetTickCount() in Windows are usually good enough.
:

Great, thanks!
So that means that if I measure the time used for the algorithm (witch I already do) I can calculate the number of cycles by taking my CPU frequency into calculation. If so this wount be any more accurate than the time measure, but still cycles is a more prefered way to calculate speed since it can easier be compared with algorithms being runned on different machines? But still; if the CPU is busy with something else this will give me slightly wrong results? But what you're saying is that the result will be good enough at least? If I want the exact result I have to count the number of cycles from the assembly code generated?

-Stian
• : :
: : "Cycles" refer to one tick of the system clock. One cycle is the fastest time the CPU can execute an instruction. If you have a CPU running on 10MHz, one cycle will take 100ns to execute (according to the formula [time = 1/frequency]). There are usually alot of other things involved in determing the speed (especially on a PC-CPU) but the system clock is the main factor.
: :
: : If you check your assembler manual, it should tell the number of cycles every instruction takes to execute. "NOP" for example takes 1 cycle. Times usually vary depending on addressing mode (direct, indirect etc, what they are called is CPU-dependant).
: :
: : So in theory, you could measure the time your program takes to execute by disassembling it and count the time for every instruction. This might be useful to determine which algorithm that is most effective for the target CPU.
: :
: : However, in more complex algorithms, you don't want to do that. In a real-time application, you usually set an output-pin to a value and measure the execution time with an oscilloscope.
: : On a PC app, you will not get any timing accuracy whatsoever, so functions like GetTickCount() in Windows are usually good enough.
: :
:
: Great, thanks!
: So that means that if I measure the time used for the algorithm (witch I already do) I can calculate the number of cycles by taking my CPU frequency into calculation. If so this wount be any more accurate than the time measure, but still cycles is a more prefered way to calculate speed since it can easier be compared with algorithms being runned on different machines?

[blue]As I said, system clock in only one factor. The more advanced CPUs have instruction queues and can run several instructions at once etc.

If a C algorithm is effective on one CPU, it is usually effective on other CPUs as well. It is not guaranteed, though. Especially when porting between 8/16/32/64 bit, effective code will look very different.[/blue]

: But still; if the CPU is busy with something else this will give me slightly wrong results? But what you're saying is that the result will be good enough at least? If I want the exact result I have to count the number of cycles from the assembly code generated?

[blue]On Windows, unix and such operative systems there are no guarantees when a certain line of code will be executed. There might be a context switch anywhere in code, stopping your program for the moment.

To measure the exact execution time you would need to throw out the OS and run the program without it... kind of complicated. There might be other ways, but I don't know why you would need it. There can be no real-time requirements on your program or you are using the wrong platform for it.
[/blue]
• :
: [blue]On Windows, unix and such operative systems there are no guarantees when a certain line of code will be executed. There might be a context switch anywhere in code, stopping your program for the moment.
:
: To measure the exact execution time you would need to throw out the OS and run the program without it... kind of complicated. There might be other ways, but I don't know why you would need it. There can be no real-time requirements on your program or you are using the wrong platform for it.
: [/blue]
:

Yeah, sure. Approximate results will do fine - I was just wondering whether there are any ways to get the exact number of cycles that should be needed - as this can be counted by disassembling the code (if I understood correctly).

Thanks again!

• [b][red]This message was edited by stober at 2006-1-16 7:36:24[/red][/b][hr]
:
: Yeah, sure. Approximate results will do fine - I was just wondering whether there are any ways to get the exact number of cycles that should be needed - as this can be counted by disassembling the code (if I understood correctly).
:

There are ways to do that -- manually calculate the time from a table of assembly code instructions that contain cpu cycle times for each instruction. You can get the assembly output for your own code but what about os api functions?

You may have to do recalculate the times manually every time you recompile the program because the compiler will change the assembly code to accommodate new c/c++ code. Its not usually worth the effort.

• : :
: : [blue]On Windows, unix and such operative systems there are no guarantees when a certain line of code will be executed. There might be a context switch anywhere in code, stopping your program for the moment.
: :
: : To measure the exact execution time you would need to throw out the OS and run the program without it... kind of complicated. There might be other ways, but I don't know why you would need it. There can be no real-time requirements on your program or you are using the wrong platform for it.
: : [/blue]
: :
:
: Yeah, sure. Approximate results will do fine - I was just wondering whether there are any ways to get the exact number of cycles that should be needed - as this can be counted by disassembling the code (if I understood correctly).
:
: Thanks again!
:

[blue]Yup, counting the assembler instructions will do that. I know no other way to get the -exact- number of cycles on a Windows program.
[/blue]

• :
: [blue]Yup, counting the assembler instructions will do that. I know no other way to get the -exact- number of cycles on a Windows program.
: [/blue]
:

Actually -- I don't think there is any reliable way to get that exact because cycle speed is processor dependent, much slower on my old 1.2 mz Zenith 100 computer from 20 years ago than on my 2.5Gig computer today. And I suspect the quality of electrical current going into the computer may affect computer speed.
• : : :
: : : [blue]On Windows, unix and such operative systems there are no guarantees when a certain line of code will be executed. There might be a context switch anywhere in code, stopping your program for the moment.
: : :
: : : To measure the exact execution time you would need to throw out the OS and run the program without it... kind of complicated. There might be other ways, but I don't know why you would need it. There can be no real-time requirements on your program or you are using the wrong platform for it.
: : : [/blue]
: : :
: :
: : Yeah, sure. Approximate results will do fine - I was just wondering whether there are any ways to get the exact number of cycles that should be needed - as this can be counted by disassembling the code (if I understood correctly).
: :
: : Thanks again!
: :
:
: [blue]Yup, counting the assembler instructions will do that. I know no other way to get the -exact- number of cycles on a Windows program.
: [/blue]
:
:
[green]
I would suggest sticking only to primitive instructions like bitwise, add, subtract, increment, or decrement. Stay away from multiplication or division and align your data on a multiple of 4 bytes with even addresses. Most compilers or assemblers, I think, do this mostly for you or you can specify. Odd addresses take extra cycles to compute and will slow down a algorithm somewhat. Trying to incorporate these into your algorithm should help out somewhat in trimming unnecessary cycles. If your really into the speed then pure assembler IMO is the way to go. Just my two cents.
[/green]

• : [green]
: I would suggest sticking only to primitive instructions like bitwise, add, subtract, increment, or decrement. Stay away from multiplication or division and align your data on a multiple of 4 bytes with even addresses. Most compilers or assemblers, I think, do this mostly for you or you can specify. Odd addresses take extra cycles to compute and will slow down a algorithm somewhat. Trying to incorporate these into your algorithm should help out somewhat in trimming unnecessary cycles. If your really into the speed then pure assembler IMO is the way to go. Just my two cents.
: [/green]

Thanks for the answers all of you!
I think I'll stick to the approximate result from measuring the time. Repeating this timetaking some times should give me a pretty correct result. Counting the assembly instructions is definatly not what I wanna do...

I try to stick to primitive instructions, but unfortunately I have to at least one multiplication. I know that multiplication is not the most efficient operation to do, so I'm trying to optimalize it in Assembly.

Best Regards,
Stian
• : I try to stick to primitive instructions, but unfortunately I have to at least one multiplication. I know that multiplication is not the most efficient operation to do, so I'm trying to optimalize it in Assembly.
:
: Best Regards,
: Stian
:

In this case assembly won't help you. The mult and div instructions are simply slow. In some cases you can replace them with shift instructions instead:

x*2 is the same as x << 1
x*4 is the same as x << 2
x*8 is the same as x << 3
and so on

x/2 is the same as x >> 1
x/4 is the same as x >> 2
x/8 is the same as x >> 3
and so on

But. If you are using lots of C++ library calls or if you are making a Windows program, don't worry too much about those things. They are good to know, but on such a slow system it doesn't matter. All too often do we se something like this:

[code]x >>= 3; // div by 8
printf("%d",x); [/code]

One earn 5 cycles or something by optimizing the division, but lost 10000 cycles by calling a slow library function. It is like buying new shoes to improve to the time it would take for one to walk around the earth. :-)

• : : I try to stick to primitive instructions, but unfortunately I have to at least one multiplication. I know that multiplication is not the most efficient operation to do, so I'm trying to optimalize it in Assembly.
: :
: : Best Regards,
: : Stian
: :
:
:
: In this case assembly won't help you. The mult and div instructions are simply slow. In some cases you can replace them with shift instructions instead:
:
: x*2 is the same as x << 1
: x*4 is the same as x << 2
: x*8 is the same as x << 3
: and so on
:
: x/2 is the same as x >> 1
: x/4 is the same as x >> 2
: x/8 is the same as x >> 3
: and so on
:

[green]
Yes, I know about these tricks. I also know (or at least assume) that the implementation of multiplication is optimal in c++, even though "optimal" doesn't mean that it is very good. What I need is however a possibility to multiply numbers of 128bits. This is to big for C++ to handle straight forward, so I've written some functions to able to do it. However, here I get alot of overhead and a slow system. So this is what I'm trying to optimize by writting it in Assembly. I know that there are libraries (like GMP) that does this for me, but I'd like to write the routines myself in order to learn it.
[/green]

:
: But. If you are using lots of C++ library calls or if you are making a Windows program, don't worry too much about those things. They are good to know, but on such a slow system it doesn't matter. All too often do we se something like this:
:
: [code]x >>= 3; // div by 8
: printf("%d",x); [/code]
:
: One earn 5 cycles or something by optimizing the division, but lost 10000 cycles by calling a slow library function. It is like buying new shoes to improve to the time it would take for one to walk around the earth. :-)
:

[green]
I see :-)
Well, don't worry about it. I only use neccecary operations in my algorithm. All prints and such extras are outside the algorithm. So when I measure speed this is the speed of the algorithm, not all the cout's.

Thanks again.
-Stian
[/green]
• [b][red]This message was edited by shaolin007 at 2006-1-17 9:7:55[/red][/b][hr]
: : : I try to stick to primitive instructions, but unfortunately I have to at least one multiplication. I know that multiplication is not the most efficient operation to do, so I'm trying to optimalize it in Assembly.
: : :
: : : Best Regards,
: : : Stian
: : :
: :
: :
: : In this case assembly won't help you. The mult and div instructions are simply slow. In some cases you can replace them with shift instructions instead:
: :
: : x*2 is the same as x << 1
: : x*4 is the same as x << 2
: : x*8 is the same as x << 3
: : and so on
: :
: : x/2 is the same as x >> 1
: : x/4 is the same as x >> 2
: : x/8 is the same as x >> 3
: : and so on
: :
:
: [green]
: Yes, I know about these tricks. I also know (or at least assume) that the implementation of multiplication is optimal in c++, even though "optimal" doesn't mean that it is very good. What I need is however a possibility to multiply numbers of 128bits. This is to big for C++ to handle straight forward, so I've written some functions to able to do it. However, here I get alot of overhead and a slow system. So this is what I'm trying to optimize by writting it in Assembly. I know that there are libraries (like GMP) that does this for me, but I'd like to write the routines myself in order to learn it.
: [/green]
:
: :
: : But. If you are using lots of C++ library calls or if you are making a Windows program, don't worry too much about those things. They are good to know, but on such a slow system it doesn't matter. All too often do we se something like this:
: :
: : [code]x >>= 3; // div by 8
: : printf("%d",x); [/code]
: :
: : One earn 5 cycles or something by optimizing the division, but lost 10000 cycles by calling a slow library function. It is like buying new shoes to improve to the time it would take for one to walk around the earth. :-)
: :
:
: [green]
: I see :-)
: Well, don't worry about it. I only use neccecary operations in my algorithm. All prints and such extras are outside the algorithm. So when I measure speed this is the speed of the algorithm, not all the cout's.
:
: Thanks again.
: -Stian
: [/green]
:

[green]
Forgot about the shift and rotate instructions, good point. One last thing, it's better to use decrement than increment since it takes extra instructions for a processor to check for 'truth'.
[/green]

[code]
*Loop construct is assumed

mov ax, 10
dec ax
jz where_ever ;ax=0 will set zero flag

*compared to

mov ax, 0
inc ax
cmp ax, 10 ;extra instruction due to check
je where_ever
[/code]