C++ speed of operators / simple math

Posted 2024-12-21 03:37:01

I'm working on a physics engine and feel it would help to have a better understanding of the speed and performance effects of performing many simple or complex math operations.

  1. A large part of a physics engine is weeding out unnecessary computations, but at what point are the computations small enough that comparison checks aren't worth it?

    • e.g. Testing whether two line segments intersect. Should there be a check on whether they're near each other before going straight into the simple math, or would the extra operation slow the process down in the long run?
  2. How much time do different mathematical calculations take?

    • e.g. (3+8) vs (5x4) vs (log(8)), etc.
  3. How much time do inequality checks take?

    • e.g. >, <, =


Comments (6)

丢了幸福的猪 2024-12-28 03:37:01

  1. You'll have to do profiling.

  2. Basic operations, such as addition or multiplication, should each take a single asm instruction.

    EDIT: As per the comments, although a multiplication is a single asm instruction, it can expand into several micro-operations.

    Logarithms take longer.

  3. Also one asm instruction.

Unless you profile your code, there's no way to tell where your bottlenecks are.

Unless you call math operations millions of times (and probably even if you do), a good choice of algorithm or some other high-level optimization will result in a bigger speed gain than optimizing the small stuff.

You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance should you start optimizing: first at the high level, and only afterwards at the low level.

You might also want to try dynamic programming or caching.
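As a rough illustration of the caching idea, here is a minimal sketch in which a body's rotation stores sin/cos of its angle and recomputes them only when the angle changes. The `Rotation` class and its members are hypothetical names made up for this example, not part of any particular engine.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical orientation type that caches sin/cos of its angle so the
// comparatively expensive trig calls run once per angle change rather
// than once per transformed point.
class Rotation {
public:
    void set_angle(float radians) {
        angle_ = radians;
        dirty_ = true;  // invalidate the cache; recompute lazily
    }

    // Rotate (x, y) about the origin, refreshing the cache only if needed.
    void apply(float& x, float& y) {
        if (dirty_) {
            sin_ = std::sin(angle_);
            cos_ = std::cos(angle_);
            dirty_ = false;
            ++recomputations_;
        }
        const float rx = cos_ * x - sin_ * y;
        const float ry = sin_ * x + cos_ * y;
        x = rx;
        y = ry;
    }

    // Number of times the trig functions were actually evaluated.
    int recomputations() const { return recomputations_; }

private:
    float angle_ = 0.0f;
    float sin_ = 0.0f;
    float cos_ = 1.0f;
    bool dirty_ = false;
    int recomputations_ = 0;
};
```

Whether a cache like this pays off still depends on how often the angle changes relative to how often points are transformed, so it is worth profiling with and without it.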

困倦 2024-12-28 03:37:01

Well, this depends on your hardware. A very nice set of tables with instruction latencies is at http://www.agner.org/optimize/instruction_tables.pdf

1. It depends a lot on the code. Also, don't forget that it depends not only on the computations but also on how well the comparison results can be predicted.

2. Generally, addition and subtraction are very fast, and multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it). Library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example, sqrt() and 1/sqrt() can each be computed using one SSE instruction (sqrtss, and the approximate rsqrtss).

3. From about one cycle to several dozen cycles. Current processors predict the outcome of conditions. If the prediction is right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.

That means that if you have code where a condition is met most of the time, it will be fast. Similarly, if you have code where the condition is not met most of the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
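Going back to the division tip in point 2, a minimal sketch of "precompute 1/c and multiply" could look like this. The function name `divide_all` is made up for illustration. Note that multiplying by the reciprocal can differ from a true division in the last bit, so this is only appropriate where that is acceptable.

```cpp
#include <cassert>
#include <vector>

// Divide every element by `divisor` using one division and n multiplications.
// x * (1/c) is not guaranteed to be bit-identical to x / c.
void divide_all(std::vector<float>& values, float divisor) {
    assert(divisor != 0.0f);
    const float inv = 1.0f / divisor;  // the only division in the function
    for (float& v : values) {
        v *= inv;
    }
}
```

A compiler will generally not make this substitution on its own for floats unless relaxed floating-point options are enabled, precisely because the results can differ slightly.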

暮凉 2024-12-28 03:37:01

Regarding 2 and 3, I can refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and throughput of various instructions.
However, unless you hand-write assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.

More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
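As a sketch of what a SIMD-friendly memory layout might mean in practice, the example below stores particle data as a structure of arrays and updates it in a simple loop. The `Particles` struct and `integrate` function are assumptions for illustration. Whether the compiler actually vectorizes the loop depends on the compiler and flags, so check the generated code or a profile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle storage in structure-of-arrays form: each field is
// contiguous in memory, so the update loop walks memory sequentially.
struct Particles {
    std::vector<float> x, y;    // positions
    std::vector<float> vx, vy;  // velocities
};

// Advance all positions by one explicit Euler step of length dt.
// The loop body has no branches and no dependency between iterations,
// which is the shape auto-vectorizers tend to handle well.
void integrate(Particles& p, float dt) {
    const std::size_t n = p.x.size();
    assert(p.y.size() == n && p.vx.size() == n && p.vy.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```

The alternative, an array of structs holding x, y, vx, vy together, interleaves the fields in memory, which tends to be less convenient for vector loads when a loop touches only some of them.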

However, as @Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.

As for 1, in the general case I'd say that if your algorithm has adjustable thresholds for when to execute certain tests, you could do some profiling, plot a performance graph of some kind, and determine the optimal values for those thresholds.

红ご颜醉 2024-12-28 03:37:01
  1. This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around a lot, or are they static? You will have to run tests. Possible data structures for fast proximity checks are kd-trees or locality-sensitive hashes (there may be others). I am not sure whether these are appropriate for your application; you'd have to check whether the maintenance cost of the data structure and the lookup cost are acceptable for you.
  2. You will have to run tests. Consider checking whether you can use vectorization, or even run some of the computations on a GPU using CUDA or something similar.
  3. Same as above: you have to test.
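To give a feel for the maintenance and lookup costs mentioned in point 1, here is a minimal sketch of a proximity structure. It is a plain uniform grid, not a kd-tree or a locality-sensitive hash, chosen only because it fits in a few lines. The `Point` and `Grid` types are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { float x, y; };

// Hypothetical uniform grid mapping a cell to the ids of the points in it.
class Grid {
public:
    explicit Grid(float cell_size) : cell_(cell_size) {
        assert(cell_size > 0.0f);
    }

    // Maintenance cost: one hash lookup and one push_back per inserted point.
    void insert(std::size_t id, Point p) {
        cells_[key(cell_of(p.x), cell_of(p.y))].push_back(id);
    }

    // Ids stored in the 3x3 block of cells around p. Any point within
    // cell_size of p lies in one of these cells; the caller still has to
    // run the exact distance or intersection test on the candidates.
    std::vector<std::size_t> candidates(Point p) const {
        std::vector<std::size_t> out;
        const std::int32_t cx = cell_of(p.x);
        const std::int32_t cy = cell_of(p.y);
        for (std::int32_t dx = -1; dx <= 1; ++dx) {
            for (std::int32_t dy = -1; dy <= 1; ++dy) {
                const auto it = cells_.find(key(cx + dx, cy + dy));
                if (it != cells_.end()) {
                    out.insert(out.end(), it->second.begin(), it->second.end());
                }
            }
        }
        return out;
    }

private:
    std::int32_t cell_of(float v) const {
        return static_cast<std::int32_t>(std::floor(v / cell_));
    }

    // Pack the two signed cell coordinates into one 64-bit hash key.
    static std::uint64_t key(std::int32_t cx, std::int32_t cy) {
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
               static_cast<std::uint32_t>(cy);
    }

    float cell_;
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> cells_;
};
```

For objects that move a lot, the grid has to be rebuilt or updated every step, which is exactly the kind of cost that needs to be measured against the tests it saves.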
沦落红尘 2024-12-28 03:37:01

You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.

Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.

As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
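The quick range check described above might look like this for 2D segments. The `Segment` struct and the function names are assumptions for illustration, and the exact test in this sketch only detects proper crossings (collinear and endpoint-touching cases are left out).

```cpp
#include <algorithm>
#include <cassert>

struct Segment { float x1, y1, x2, y2; };

// Cheap pre-test: do the X ranges and the Y ranges of a and b both overlap?
// If not, the segments cannot intersect.
bool ranges_overlap(const Segment& a, const Segment& b) {
    const bool x_ok = std::max(a.x1, a.x2) >= std::min(b.x1, b.x2) &&
                      std::max(b.x1, b.x2) >= std::min(a.x1, a.x2);
    const bool y_ok = std::max(a.y1, a.y2) >= std::min(b.y1, b.y2) &&
                      std::max(b.y1, b.y2) >= std::min(a.y1, a.y2);
    return x_ok && y_ok;
}

// Cross product of (b - a) and (c - a); its sign tells on which side of
// the line through a and b the point c lies.
float cross(float ax, float ay, float bx, float by, float cx, float cy) {
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
}

// True if the segments cross at a single interior point.
bool segments_cross(const Segment& a, const Segment& b) {
    if (!ranges_overlap(a, b)) {
        return false;  // quick reject, skips the four cross products
    }
    const float d1 = cross(a.x1, a.y1, a.x2, a.y2, b.x1, b.y1);
    const float d2 = cross(a.x1, a.y1, a.x2, a.y2, b.x2, b.y2);
    const float d3 = cross(b.x1, b.y1, b.x2, b.y2, a.x1, a.y1);
    const float d4 = cross(b.x1, b.y1, b.x2, b.y2, a.x2, a.y2);
    // Endpoints of each segment must lie strictly on opposite sides of the other.
    return d1 * d2 < 0.0f && d3 * d4 < 0.0f;
}
```

Since the exact test here is itself only a handful of multiplications, the pre-check mostly helps when the bulk of segment pairs are far apart. Timing both variants on representative data is the only reliable way to tell.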

小情绪 2024-12-28 03:37:01

As far as I know, all inequality checks take the same time.
For the rest of the calculations, I would advise you to run some tests like this:

  1. Take time stamp A.
  2. Perform 1,000,000 "+" calculations (or any other operation).
  3. Take time stamp B.
  4. Calculate the difference between A and B.

Then you can compare the results for different calculations.

Keep in mind:

  1. Using a different math library may change the result (some math libraries are more performance-oriented, others more precision-oriented).
  2. Compiler optimizations may change it.
  3. Each processor does it differently.
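The four steps above could be sketched with std::chrono roughly as follows. The helper name `time_op` and the `sink` parameter are made up for illustration. Each iteration feeds on the previous result so the optimizer is less likely to remove the loop, which also means the number reflects the latency of a dependent chain rather than peak throughput.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Apply op to a running value `iterations` times and return the elapsed
// time in nanoseconds. The final value is written to `sink` so the work
// stays observable to the compiler.
template <typename Op>
long long time_op(Op op, std::size_t iterations, double& sink) {
    using clock = std::chrono::steady_clock;
    double acc = 1.0;
    const auto start = clock::now();  // time stamp A
    for (std::size_t i = 0; i < iterations; ++i) {
        acc = op(acc);                // each step depends on the previous one
    }
    const auto stop = clock::now();   // time stamp B
    sink = acc;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

One might call it as `time_op([](double v) { return v + 1.0; }, 1000000, sink)` and again with a lambda that calls `std::log`, then compare the two durations. Treat the numbers as rough: they vary with compiler flags, CPU, and system load, and a tight artificial loop may not reflect how the operation behaves inside real code.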