Clang vs GCC - which produces faster binaries?

Posted on 2024-09-08 08:24:38

Comments (7)

心舞飞扬 2024-09-15 08:24:38

Here are some up-to-date albeit narrow findings of mine with GCC 4.7.2
and Clang 3.2 for C++.

UPDATE: GCC 4.8.1 v clang 3.3 comparison appended below.

UPDATE: GCC 4.8.2 v clang 3.4 comparison is appended to that.

I maintain an OSS tool that is built for Linux with both GCC and Clang,
and with Microsoft's compiler for Windows. The tool, coan, is a preprocessor
and analyser of C/C++ source files and codelines of such: its
computational profile majors on recursive-descent parsing and file-handling.
The development branch (to which these results pertain)
comprises at present around 11K LOC in about 90 files. It is coded,
now, in C++ that is rich in polymorphism and templates but is still
mired in many patches by its not-so-distant past in hacked-together C.
Move semantics are not expressly exploited. It is single-threaded. I
have devoted no serious effort to optimizing it, while the "architecture"
remains so largely ToDo.

I employed Clang prior to 3.2 only as an experimental compiler
because, despite its superior compilation speed and diagnostics, its
C++11 standard support lagged the contemporary GCC version in the
respects exercised by coan. With 3.2, this gap has been closed.

My Linux test harness for current coan development processes roughly
70K source files in a mixture of one-file parser test-cases, stress
tests consuming 1000s of files and scenario tests consuming < 1K files.

As well as reporting the test results, the harness accumulates and
displays the totals of files consumed and the run time consumed in coan (it just passes each coan command line to the Linux time command and captures and adds up the reported numbers). The timings are flattered by the fact that any number of tests which take 0 measurable time will all add up to 0, but the contribution of such tests is negligible. The timing stats are displayed at the end of make check like this:

coan_test_timer: info: coan processed 70844 input_files.
coan_test_timer: info: run time in coan: 16.4 secs.
coan_test_timer: info: Average processing time per input file: 0.000231 secs.

I compared the test harness performance as between GCC 4.7.2 and
Clang 3.2, all things being equal except the compilers. As of Clang 3.2,
I no longer require any preprocessor differentiation between code
tracts that GCC will compile and Clang alternatives. I built to the
same C++ library (GCC's) in each case and ran all the comparisons
consecutively in the same terminal session.

The default optimization level for my release build is -O2. I also
successfully tested builds at -O3. I tested each configuration 3
times back-to-back and averaged the 3 outcomes, with the following
results. The number in a data-cell is the average number of
microseconds consumed by the coan executable to process each of
the ~70K input files (read, parse and write output and diagnostics).

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.7.2 | 231 | 237 |0.97 |
----------|-----|-----|-----|
Clang-3.2 | 234 | 186 |1.25 |
----------|-----|-----|------
GCC/Clang |0.99 | 1.27|

Any particular application is very likely to have traits that play
unfairly to a compiler's strengths or weaknesses. Rigorous benchmarking
employs diverse applications. With that well in mind, the noteworthy
features of these data are:

  1. -O3 optimization was marginally detrimental to GCC
  2. -O3 optimization was importantly beneficial to Clang
  3. At -O2 optimization, GCC was faster than Clang by just a whisker
  4. At -O3 optimization, Clang was importantly faster than GCC.
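
For anyone who wants to re-derive the ratio cells, here is a minimal sketch. The names Timings, round2, o2_over_o3 and gcc_over_clang are made up for illustration. It works from the rounded cell values, so a recomputed ratio can differ in the last digit from the printed one (the Clang O2/O3 cell appears to be such a case), presumably because the printed ratios were taken from unrounded averages.

```cpp
#include <cmath>

// Timings for one compiler, in microseconds per input file.
// All names in this block are hypothetical.
struct Timings {
    double o2;
    double o3;
};

// Round to two decimal places, as the tables do.
inline double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}

// The O2/O3 column: above 1 means -O3 produced the faster executable.
inline double o2_over_o3(const Timings& t) {
    return round2(t.o2 / t.o3);
}

// The GCC/Clang row: above 1 means Clang produced the faster executable.
inline double gcc_over_clang(double gcc, double clang) {
    return round2(gcc / clang);
}
```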

A further interesting comparison of the two compilers emerged by accident
shortly after those findings. Coan liberally employs smart pointers and
one such is heavily exercised in the file handling. This particular
smart-pointer type had been typedef'd in prior releases for the sake of
compiler-differentiation, to be an std::unique_ptr<X> if the
configured compiler had sufficiently mature support for its usage as
that, and otherwise an std::shared_ptr<X>. The bias to std::unique_ptr was
foolish, since these pointers were in fact transferred around,
but std::unique_ptr looked like the fitter option for replacing
std::auto_ptr at a point when the C++11 variants were novel to me.
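
A compile-time switch of that kind might look roughly like the sketch below. It is not coan's code: SourceFile stands in for the type called X here, and HYPOTHETICAL_HAVE_UNIQUE_PTR, file_ptr, open_source and hand_over are invented names.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the pointee type (X in the text).
struct SourceFile {
    std::string name;
    explicit SourceFile(std::string n) : name(std::move(n)) {}
};

// HYPOTHETICAL_HAVE_UNIQUE_PTR is an invented configuration macro that a
// build system would define when the compiler handles std::unique_ptr well.
#if defined(HYPOTHETICAL_HAVE_UNIQUE_PTR)
typedef std::unique_ptr<SourceFile> file_ptr;
#else
typedef std::shared_ptr<SourceFile> file_ptr;
#endif

// Construction that compiles unchanged for either typedef.
inline file_ptr open_source(const std::string& name) {
    return file_ptr(new SourceFile(name));
}

// Passing the pointer on. With unique_ptr the caller must std::move into
// this function; with shared_ptr a move also avoids a reference-count bump.
inline file_ptr hand_over(file_ptr p) {
    return p;
}
```

Code written against file_ptr has to stick to operations both types support, so every transfer is spelled as a move even when the shared_ptr variant would also accept a copy.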

In the course of experimental builds to gauge Clang 3.2's continued need
for this and similar differentiation, I inadvertently built
std::shared_ptr<X> when I had intended to build std::unique_ptr<X>,
and was surprised to observe that the resulting executable, with default -O2
optimization, was the fastest I had seen, sometimes achieving 184
microseconds per input file. With this one change to the source code,
the corresponding results were these:

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.7.2 | 234 | 234 |1.00 |
----------|-----|-----|-----|
Clang-3.2 | 188 | 187 |1.00 |
----------|-----|-----|------
GCC/Clang |1.24 |1.25 |

The points of note here are:

  1. Neither compiler now benefits at all from -O3 optimization.
  2. Clang beats GCC just as importantly at each level of optimization.
  3. GCC's performance is only marginally affected by the smart-pointer type
    change.
  4. Clang's -O2 performance is importantly affected by the smart-pointer type
    change.

Before and after the smart-pointer type change, Clang is able to build a
substantially faster coan executable at -O3 optimisation, and it can
build an equally faster executable at -O2 and -O3 when that
pointer-type is the best one - std::shared_ptr<X> - for the job.

An obvious question that I am not competent to comment upon is why
Clang should be able to find a 25% -O2 speed-up in my application when
a heavily used smart-pointer-type is changed from unique to shared,
while GCC is indifferent to the same change. Nor do I know whether I should
cheer or boo the discovery that Clang's -O2 optimization harbours
such huge sensitivity to the wisdom of my smart-pointer choices.

UPDATE: GCC 4.8.1 v clang 3.3

The corresponding results now are:

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.1 | 442 | 443 |1.00 |
----------|-----|-----|-----|
Clang-3.3 | 374 | 370 |1.01 |
----------|-----|-----|------
GCC/Clang |1.18 |1.20 |

The fact that all four executables now take a much greater average time than previously to process
1 file does not reflect on the latest compilers' performance. It is due to the
fact that the later development branch of the test application has taken on a lot of
parsing sophistication in the meantime and pays for it in speed. Only the ratios are
significant.

The points of note now are not arrestingly novel:

  • GCC is indifferent to -O3 optimization
  • clang benefits very marginally from -O3 optimization
  • clang beats GCC by a similarly important margin at each level of optimization.

Comparing these results with those for GCC 4.7.2 and clang 3.2, it stands out that
GCC has clawed back about a quarter of clang's lead at each optimization level. But
since the test application has been heavily developed in the meantime one cannot
confidently attribute this to a catch-up in GCC's code-generation.
(This time, I have noted the application snapshot from which the timings were obtained
and can use it again.)

UPDATE: GCC 4.8.2 v clang 3.4

I finished the update for GCC 4.8.1 v Clang 3.3 saying that I would
stick to the same coan snapshot for further updates. But I decided
instead to test on that snapshot (rev. 301) and on the latest development
snapshot I have that passes its test suite (rev. 619). This gives the results a
bit of longitude, and I had another motive:

My original posting noted that I had devoted no effort to optimizing coan for
speed. This was still the case as of rev. 301. However, after I had built
the timing apparatus into the coan test harness, every time I ran the test suite
the performance impact of the latest changes stared me in the face. I saw that
it was often surprisingly big and that the trend was more steeply negative than
I felt to be merited by gains in functionality.

By rev. 308 the average processing time per input file in the test suite had
well more than doubled since the first posting here. At that point I made a
U-turn on my 10-year policy of not bothering about performance. In the intensive
spate of revisions up to 619 performance was always a consideration and a
large number of them went purely to rewriting key load-bearers on fundamentally
faster lines (though without using any non-standard compiler features to do so).
It would be interesting to see each compiler's reaction to this U-turn.

Here is the now familiar timings matrix for the latest two compilers' builds of rev.301:

coan - rev.301 results

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.2 | 428 | 428 |1.00 |
----------|-----|-----|-----|
Clang-3.4 | 390 | 365 |1.07 |
----------|-----|-----|------
GCC/Clang | 1.1 | 1.17|

The story here is only marginally changed from GCC-4.8.1 and Clang-3.3. GCC's showing
is a trifle better. Clang's is a trifle worse. Noise could well account for this.
Clang still comes out ahead by -O2 and -O3 margins that wouldn't matter in most
applications but would matter to quite a few.

And here is the matrix for rev. 619.

coan - rev.619 results

          | -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.2 | 210 | 208 |1.01 |
----------|-----|-----|-----|
Clang-3.4 | 252 | 250 |1.01 |
----------|-----|-----|------
GCC/Clang |0.83 | 0.83|

Taking the 301 and the 619 figures side by side, several points speak out.

  • I was aiming to write faster code, and both compilers emphatically vindicate
    my efforts. But:

  • GCC repays those efforts far more generously than Clang. At -O2
    optimization Clang's 619 build takes about 35% less time per file than its
    301 build (252 vs 390 microseconds); at -O3 Clang's improvement is about
    31% (250 vs 365). Good, but at each optimization level GCC's 619 build is
    more than twice as fast as its 301.

  • GCC more than reverses Clang's former superiority. And at each optimization
    level GCC now beats Clang by 17%.

  • Clang's ability in the 301 build to get more leverage than GCC from -O3 optimization
    is gone in the 619 build. Neither compiler gains meaningfully from -O3.

I was sufficiently surprised by this reversal of fortunes that I suspected I
might have accidentally made a sluggish build of clang 3.4 itself (since I built
it from source). So I re-ran the 619 test with my distro's stock Clang 3.3. The
results were practically the same as for 3.4.

So as regards reaction to the U-turn: On the numbers here, Clang has done much
better than GCC at wringing speed out of my C++ code when I was giving it no
help. When I put my mind to helping, GCC did a much better job than Clang.

I don't elevate that observation into a principle, but I take
the lesson that "Which compiler produces the better binaries?" is a question
that, even if you specify the test suite to which the answer shall be relative,
still is not a clear-cut matter of just timing the binaries.

Is your better binary the fastest binary, or is it the one that best
compensates for cheaply crafted code? Or best compensates for expensively
crafted code that prioritizes maintainability and reuse over speed? It depends on the
nature and relative weights of your motives for producing the binary, and of
the constraints under which you do so.

And in any case, if you deeply care about building "the best" binaries then you
had better keep checking how successive iterations of compilers deliver on your
idea of "the best" over successive iterations of your code.

遮了一弯 2024-09-15 08:24:38

Phoronix did some benchmarks about this, but it is about a snapshot version of Clang/LLVM from a few months back. The results being that things were more-or-less a push; neither GCC nor Clang is definitively better in all cases.

Since you'd use the latest Clang, it's maybe a little less relevant. Then again, GCC 4.6 is slated to have some major optimizations for Core 2 and Core i7, apparently.

I figure Clang's faster compilation speed will be nicer for original developers, and then when you push the code out into the world (Linux distributions, BSD, etc.), end users will use GCC for the faster binaries.

掩于岁月 2024-09-15 08:24:38

The fact that Clang compiles code faster may not be as important as the speed of the resulting binary. However, here is a series of benchmarks.

多彩岁月 2024-09-15 08:24:38

There is very little overall difference between GCC 4.8 and Clang 3.3 in terms of speed of the resulting binary. In most cases code generated by both compilers performs similarly. Neither of these two compilers dominates the other one.

Benchmarks showing a significant performance gap between GCC and Clang are coincidental.

Program performance is affected by the choice of the compiler. If a developer or a group of developers is exclusively using GCC then the program can be expected to run slightly faster with GCC than with Clang, and vice versa.

From a developer's viewpoint, a notable difference between GCC 4.8+ and Clang 3.3 is that GCC has the -Og command line option. This option enables optimizations that do not interfere with debugging, so for example it is always possible to get accurate stack traces. The absence of this option in Clang makes Clang harder to use as an optimizing compiler for some developers.

月亮坠入山谷 2024-09-15 08:24:38

A peculiar difference I have noted on GCC 5.2.1 and Clang 3.6.2 is
that if you have a critical loop like:

for (;;) {
    if (!visited) {
        ....
    }
    node++;
    if (!*node)
        break;
}

Then GCC will, when compiling with -O3 or -O2, speculatively
unroll the loop eight times. Clang will not unroll it at all. Through
trial and error I found that in my specific case with my program data,
the right amount of unrolling is five so GCC overshot and Clang
undershot. However, overshooting was more detrimental to performance, so GCC performed much worse here.

I have no idea if the unrolling difference is a general trend or
just something that was specific to my scenario.

A while back I wrote a few garbage collectors to teach myself more about performance optimization in C. The results I got are, in my mind, enough to slightly favor Clang, especially since garbage collection is mostly about pointer chasing and copying memory.

The results are (numbers in seconds):

+---------------------+-----+-----+
|Type                 |GCC  |Clang|
+---------------------+-----+-----+
|Copying GC           |22.46|22.55|
|Copying GC, optimized|22.01|20.22|
|Mark & Sweep         | 8.72| 8.38|
|Ref Counting/Cycles  |15.14|14.49|
|Ref Counting/Plain   | 9.94| 9.32|
+---------------------+-----+-----+

This is all pure C code, and I make no claim about either compiler's
performance when compiling C++ code.

Tested on Ubuntu 15.10 (Wily Werewolf), x86-64, with an AMD Phenom II X6 1090T processor.

陌伤浅笑 2024-09-15 08:24:38

The only way to determine this is to try it. FWIW, I have seen some really good improvements using Apple's LLVM GCC 4.2 compared to the regular GCC 4.2 (for x86-64 code with quite a lot of SSE), but YMMV for different code bases.

Assuming you're working with x86/x86-64 and that you really do care about the last few percent then you ought to try Intel's ICC too, as this can often beat GCC - you can get a 30-day evaluation license from intel.com and try it.

别在捏我脸啦 2024-09-15 08:24:38

Basically speaking, the answer is: it depends.
There are many benchmarks focusing on different kinds of applications.

My benchmark on my application is: GCC > ICC > Clang.

There is little I/O, but a lot of CPU floating-point and data-structure operations.

The compile flags are -Wall -g -DNDEBUG -O3.

https://github.com/zhangyafeikimi/ml-pack/blob/master/gbdt/profile/benchmark
