Typical execution times of elementary functions

Posted 2024-11-30 04:57:37

It is well-known that the processor instruction for multiplication takes several times longer than addition, and division is even worse (UPD: this is no longer true, see below). What about more complex operations like the exponential? How expensive are they?

Motivation. I am interested because it would help, at an early stage of algorithm design, to estimate the performance-critical parts of an algorithm. Suppose I want to apply a set of filters to an image. One of them operates on the 3×3 neighborhood of each pixel, sums the values and takes atan of the result. Another one sums more neighbouring pixels, but does not use any complicated functions. Which one would take longer to execute?
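To make the comparison concrete, here is a minimal sketch of the two filters as I picture them (the Image type, the function names, and the 5×5 window for the second filter are my own illustrative assumptions; border handling is omitted):

```cpp
#include <cmath>
#include <vector>

// Hypothetical single-channel image: row-major, width * height floats.
struct Image {
    int width = 0, height = 0;
    std::vector<float> pixels;
    float at(int x, int y) const { return pixels[y * width + x]; }
};

// Filter A: sum a 3x3 neighborhood, then take atan of the sum
// (one transcendental call per pixel).
float filterA(const Image& img, int x, int y) {
    float s = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            s += img.at(x + dx, y + dy);
    return std::atan(s);
}

// Filter B: sum a larger (here 5x5) neighborhood, cheap operations only.
float filterB(const Image& img, int x, int y) {
    float s = 0.0f;
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx)
            s += img.at(x + dx, y + dy);
    return s;
}
```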

So, ideally I want approximate relative execution times for elementary operations, e.g. multiplication typically takes 5 times longer than addition, an exponential costs about 100 multiplications. Of course, this is a matter of orders of magnitude, not exact values. I understand that it depends on the hardware and on the arguments, so let's say we measure the average time (in some sense) for floating-point operations on modern x86/x64. For operations that are not implemented in hardware, I am interested in the typical running time of the C++ standard library.
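For what it's worth, here is a minimal benchmark sketch for getting such numbers on one's own machine (my own addition, not part of the original question). It times a data-dependent chain of operations, so it measures latency plus loop overhead; the results should only be read as relative orders of magnitude:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Returns nanoseconds per call for `op`, applied in a data-dependent chain.
// The chain keeps the compiler from collapsing the loop; the volatile sink
// keeps it from removing the work entirely.
template <class Op>
double ns_per_op(Op op, int n = 10'000'000) {
    volatile double seed = 1.000000119;
    double x = seed;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        x = op(x);
    auto t1 = std::chrono::steady_clock::now();
    volatile double sink = x;
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}

int main() {
    double add  = ns_per_op([](double x) { return x + 1.000000119; });
    double mul  = ns_per_op([](double x) { return x * 1.000000119; });
    double div  = ns_per_op([](double x) { return x / 1.000000119; });
    double expo = ns_per_op([](double x) { return std::exp(x * 1e-9); });
    double atn  = ns_per_op([](double x) { return std::atan(x); });
    std::printf("add %.2f  mul %.2f  div %.2f  exp %.2f  atan %.2f  (ns/op)\n",
                add, mul, div, expo, atn);
}
```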

Have you seen any sources where such things were analyzed? Does this question make sense at all? Or can no such rules of thumb be applied in practice?

Comments (3)

一场春暖 2024-12-07 04:57:37

First off, let's be clear. This:

It is well-known that processor instruction for multiplication takes
several times more time than addition

is no longer true in general. It hasn't been true for many, many years, and needs to stop being repeated. On most common architectures, integer multiplies are a couple cycles and integer adds are single-cycle; floating-point adds and multiplies tend to have nearly equal timing characteristics (typically around 4-6 cycles latency, with single-cycle throughput).
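The latency/throughput distinction matters in practice: a loop in which every multiply depends on the previous result pays the full latency each iteration, while a loop with several independent accumulators can approach the throughput figure. A minimal sketch of my own (not part of the original answer) that shows the difference:

```cpp
#include <chrono>
#include <cstdio>

int main() {
    const int n = 100'000'000;
    volatile double c_in = 1.0000001;   // volatile so the constant stays opaque to the compiler
    const double c = c_in;

    // Dependent chain: each multiply waits for the previous one (bounded by latency).
    double x = 1.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) x *= c;
    auto t1 = std::chrono::steady_clock::now();

    // Four independent chains: the CPU can overlap them (bounded by throughput).
    double a = 1.0, b = 1.0, d = 1.0, e = 1.0;
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i += 4) { a *= c; b *= c; d *= c; e *= c; }
    auto t3 = std::chrono::steady_clock::now();

    volatile double sink = x + a + b + d + e;   // keep the results alive
    (void)sink;
    std::printf("dependent:   %.2f ns/mul\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / n);
    std::printf("independent: %.2f ns/mul\n",
                std::chrono::duration<double, std::nano>(t3 - t2).count() / n);
}
```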

Now, to your actual question: it varies with both the architecture and the implementation. On a recent architecture, with a well written math library, simple elementary functions like exp and log usually require a few tens of cycles (20-50 cycles is a reasonable back-of-the-envelope figure). With a lower-quality library, you will sometimes see these operations require a few hundred cycles.

For more complicated functions, like pow, typical timings range from high tens into the hundreds of cycles.

凉月流沐 2024-12-07 04:57:37

You shouldn't be concerned about this. If I told you that a typical C library implementation of the transcendental functions tends to take around 10 times as long as a single floating-point addition/multiplication (or 50 floating-point additions/multiplications), and around 5 times as long as a floating-point division, it wouldn't be useful to you.

Indeed, the way your processor schedules memory accesses will interfere badly with any premature optimization you'd do.

If after profiling you find that a particular implementation using transcendental functions is too slow, you can contemplate setting up a polynomial interpolation scheme. This will include a table and therefore will incur extra cache issues, so make sure to measure and not guess.

This will likely involve Chebyshev approximation. Read up on it; it is a particularly useful technique in this kind of domain.
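For reference, here is a minimal sketch of that kind of scheme (my own illustration, not from the answer): a truncated Chebyshev fit of atan on [-1, 1], evaluated with the Clenshaw recurrence. The function, the interval and the number of coefficients are arbitrary choices, and whether this actually beats the library atan has to be measured, as noted above:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Chebyshev coefficients of f on [a, b], truncated to n terms.
std::vector<double> cheb_fit(double (*f)(double), double a, double b, int n) {
    const double pi = std::acos(-1.0);
    std::vector<double> fx(n), c(n);
    for (int k = 0; k < n; ++k) {
        double t = std::cos(pi * (k + 0.5) / n);          // Chebyshev node in [-1, 1]
        fx[k] = f(0.5 * (b - a) * t + 0.5 * (b + a));     // mapped back to [a, b]
    }
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int k = 0; k < n; ++k)
            s += fx[k] * std::cos(pi * j * (k + 0.5) / n);
        c[j] = 2.0 * s / n;
    }
    return c;
}

// Evaluate the truncated series with the Clenshaw recurrence.
double cheb_eval(const std::vector<double>& c, double a, double b, double x) {
    double t = (2.0 * x - a - b) / (b - a);               // map [a, b] -> [-1, 1]
    double d = 0.0, dd = 0.0;
    for (int j = (int)c.size() - 1; j >= 1; --j) {
        double tmp = d;
        d = 2.0 * t * d - dd + c[j];
        dd = tmp;
    }
    return t * d - dd + 0.5 * c[0];
}

int main() {
    auto c = cheb_fit([](double x) { return std::atan(x); }, -1.0, 1.0, 12);
    for (double x = -1.0; x <= 1.0; x += 0.25)
        std::printf("x=% .2f  approx=% .8f  std::atan=% .8f\n",
                    x, cheb_eval(c, -1.0, 1.0, x), std::atan(x));
}
```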

I have been told that compilers are quite bad at optimizing floating-point code. You may want to write custom assembly code.

Also, Intel Performance Primitives (if you are on an Intel CPU) are worth having if you are ready to trade off some accuracy for speed.

骄傲 2024-12-07 04:57:37

You could always start a second thread and time the operations. Most elementary operations don't have that much difference in execution time. The big difference is how many times they are executed. The O(n) complexity is generally what you should be thinking about.
