How long does each machine language instruction take to execute?
Do operations like set, read, move and compare all take the same time to execute?
If not: is there any way to find out how long each one takes?
Is there a name for what I mean, the speed at which a specific type of CPU executes the different assembly language instructions (move, read, etc.)?
4 Answers
The key terms you're probably looking for are latency and throughput.
These should be easy to google for. Basically, an instruction takes a certain number of cycles to execute (latency), but you can often execute multiple instructions simultaneously (throughput).
In general, no. Different instructions have different latencies and throughputs. For example, an addition is typically much faster than a division.
If you're interested in the actual values of different assembly instructions on modern processors, you can take a look at Agner Fog's tables.
That said, there's about a gazillion other factors that affect the performance of a computer, most of which are arguably more important than instruction latencies/throughputs.
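For a concrete feel for the distinction, here is a minimal C sketch (my own illustration, not part of the original answer): both loops perform the same number of additions, but the first forms one dependent chain while the second keeps four independent accumulators the pipeline can overlap. On an out-of-order CPU the second loop typically finishes noticeably faster, though compiler optimizations and other factors can distort the result.

    /* Latency vs. throughput: a dependent chain of adds versus four
       independent sums. Same total work; very different execution. */
    #include <stdio.h>
    #include <time.h>

    #define N 100000000L

    static double seconds(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        struct timespec t0, t1;
        volatile long sink;

        long a = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a += i;                          /* each add waits on the last: latency-bound */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = a;
        printf("dependent chain:  %.3f s\n", seconds(t0, t1));

        long b = 0, c = 0, d = 0, e = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i += 4) {
            b += i; c += i; d += i; e += i;  /* independent adds can overlap: throughput-bound */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = b + c + d + e;
        printf("independent sums: %.3f s\n", seconds(t0, t1));

        (void)sink;
        return 0;
    }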
Pipelining, caches, and the fact that the CPU itself is no longer the primary bottleneck have done two things to your question. One, CPUs today generally execute one instruction per clock; two, it can take many (dozens to hundreds of) clocks to feed the CPU an instruction. More modern processors, even if their instruction sets are old, rarely bother to mention per-instruction clock counts, because execution is nominally one clock each and the "real" execution speed is too hard to describe.
The caches and pipeline try to let the CPU run at this one-instruction-per-clock rate, but a read from memory, for example, has to wait for the response to come back. If the item is not in the cache, that can take hundreds of clock cycles: the memory system has to read several locations to fill a cache line, and then a few more clocks are needed to get the data back through the caches to the processor.
Now if you go back in time, or stay in the present but in the microcontroller world or some other system where the memory can respond in one clock, or at least a very deterministic number (say two clocks for EEPROM and one for RAM, that kind of thing), then you can very easily count the exact number of clocks. Processors like these often do publish a table of cycles per instruction. A read, for example, might be two clocks to fetch the instruction, then another clock to perform the read: three clocks minimum. Some instructions actually take more than one clock to execute, so that would be added in as well.
I highly recommend finding a (used) copy of Zen of Assembly Language by Michael Abrash. It was dated when it came out but is still an important work. Learning to juggle the relatively simple 8088/86 was tough enough; today's x86 and other systems are quite a bit more complicated.
If you are running Windows or Linux or something like that, trying to time your code won't necessarily get you where you want. Adding or removing a nop, which shifts the alignment of the code in memory by as little as a byte, can have dramatic effects on the performance of the rest of the code, even though nothing about that code has changed other than its location in RAM. Take that as a simple example of how complicated the problem is.
What processor or system are you interested in? The STM32F4 Discovery board, about $20, contains an ARM (Cortex-M) processor with instruction and data caches. It has the complications of a bigger system, but is at the same time simple enough (relative to a bigger system) to allow controlled experiments.
If you are familiar with the Microchip PIC world, developers there often count cycles to produce precise delays between events. It is a very deterministic environment (so long as you don't use interrupts).
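In that spirit, here is a sketch of counting cycles on a Cortex-M3/M4 part such as the STM32F4 Discovery mentioned above (my own illustration, not from the original answer). It uses the DWT cycle counter at the standard ARMv7-M debug addresses; flash wait states, caches, and pipeline effects will still shift the exact counts you observe.

    /* Count cycles for a short code sequence using the DWT cycle counter.
       The result includes the overhead of reading the counter itself. */
    #include <stdint.h>

    #define DEMCR      (*(volatile uint32_t *)0xE000EDFC)
    #define DWT_CTRL   (*(volatile uint32_t *)0xE0001000)
    #define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004)

    uint32_t cycles_for_four_nops(void) {
        DEMCR     |= (1u << 24);   /* TRCENA: power up the DWT unit      */
        DWT_CYCCNT = 0;
        DWT_CTRL  |= 1u;           /* CYCCNTENA: start the cycle counter */

        uint32_t start = DWT_CYCCNT;
        __asm__ volatile ("nop\n\tnop\n\tnop\n\tnop");  /* code under test */
        uint32_t end = DWT_CYCCNT;

        return end - start;
    }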
You will find this information in the CPU's assembly language manual from the CPU's manufacturer (e.g. Intel). Each CPU instruction usually gets a page or two, which tells you how many "cycles" it takes to execute ("cycles" are defined elsewhere in the manual). Instructions can take different times to execute depending on what they are given: a conditional jump may or may not jump, and a multiply by zero may (I assume) be faster than a multiply by 7.
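If you want to sanity-check such numbers yourself on x86, one rough approach (a sketch of mine, not something from the manual) is to average over many iterations with the time-stamp counter. On modern CPUs rdtsc ticks at a fixed reference rate rather than the core clock, and out-of-order execution blurs per-instruction timing, so this gives only a ballpark figure.

    /* Roughly time a multiply with the x86 time-stamp counter (GCC/Clang).
       Loop overhead is included in the measurement. */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    int main(void) {
        volatile uint64_t x = 12345;
        const int iters = 1000000;

        uint64_t start = __rdtsc();
        for (int i = 0; i < iters; i++)
            x = x * 7;                 /* operation under test */
        uint64_t end = __rdtsc();

        printf("~%.2f TSC ticks per iteration\n",
               (double)(end - start) / iters);
        return 0;
    }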
The answer is MIPS, million instructions per second (or just IPS, instructions per second), since you are talking about embedded systems.