Efficiency of data structures in C99 (possibly affected by endianness)

Posted 2024-10-09 03:34:51

I have a couple of questions that are all inter-related. Basically, in the algorithm I am implementing, a word w is defined as four bytes, so it can be contained whole in a uint32_t.

However, during the operation of the algorithm I often need to access the various parts of the word. Now, I can do this in two ways:

uint32_t w = 0x11223344;
uint8_t a = (w & 0xff000000) >> 24;
uint8_t b = (w & 0x00ff0000) >> 16;
uint8_t c = (w & 0x0000ff00) >>  8;
uint8_t d = (w & 0x000000ff);

However, part of me thinks that isn't particularly efficient. I thought a better way would be to use union representation like so:

typedef union
{
    struct
    {
        uint8_t d;
        uint8_t c;
        uint8_t b;
        uint8_t a;
    };
    uint32_t n;
} word32;

Using this method I can assign word32 w = { .n = 0x11223344 }; then I can access the various parts as I require (w.a == 0x11 on a little-endian system).

However, at this stage I come up against endianness issues; namely, on big-endian systems my struct is defined incorrectly, so I need to re-order the bytes of the word before it is passed in.

This I can do without too much difficulty. My question, then, is: is the first approach (the various bitwise ANDs and shifts) as efficient as the implementation using a union? Is there any difference between the two in general? Which way should I go on a modern x86_64 processor? Is endianness just a red herring here?

I could inspect the assembly output of course, but my knowledge of compilers is not brilliant. I would have thought a union would be more efficient as it would essentially convert to memory offsets, like so:

mov eax, [r9+8]

Would a compiler realise that this is what is happening in the bit-shift case above?

If it matters, I'm using C99, specifically my compiler is clang (llvm).
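
Since clang (like gcc) predefines __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__, here is a minimal sketch of one way to handle the re-ordering, assuming those macros are available (other compilers may need a different test): pick the field order of the struct at compile time.

#include <stdint.h>

typedef union
{
    struct
    {
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
        uint8_t a;   /* most significant byte sits at the lowest address */
        uint8_t b;
        uint8_t c;
        uint8_t d;
#else
        uint8_t d;   /* least significant byte sits at the lowest address */
        uint8_t c;
        uint8_t b;
        uint8_t a;
#endif
    };
    uint32_t n;
} word32;

With this layout, w.a is always the most significant byte of n, whichever byte order the host uses.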

Thanks in advance.

Comments (4)

凉城 2024-10-16 03:34:52

Given that accessing bits using shifts and masks is a common operation, I'd expect compilers to be quite smart about it, especially if you're using constant shift counts and masks.

An option would be to use macros for bit set/get, so that you can pick the best strategy at configure time if, on a specific platform, the compiler happens to be on the dumb side (and wisely chosen macro names can also make the code clearer and more self-explanatory).
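
A minimal sketch of what such macros might look like, using the shift-and-mask strategy (the names GET_BYTE and SET_BYTE are illustrative only); byte 0 is the least significant byte, so the result does not depend on the host byte order:

#include <stdint.h>

/* Extract byte i (0 = least significant) from a 32-bit word. */
#define GET_BYTE(w, i)    ((uint8_t)(((w) >> (8 * (i))) & 0xffu))

/* Replace byte i of w with the 8-bit value v. */
#define SET_BYTE(w, i, v) ((w) = ((w) & ~(UINT32_C(0xff) << (8 * (i)))) | \
                                 ((uint32_t)(uint8_t)(v) << (8 * (i))))

For example, GET_BYTE(0x11223344, 3) yields 0x11; if the union turned out to be faster on some platform, only the macro bodies would need to change.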

心的位置 2024-10-16 03:34:51

If you need AES, why not use an existing implementation? This can be particularly beneficial on modern Intel processors with hardware support for AES.

The union trick can slow things down due to store-to-load-forwarding (STLF) failures. This may happen, depending on the processor model, if you write data to memory and read it back soon afterwards as a different data type (e.g. 32-bit vs. 8-bit).
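
A minimal sketch of the pattern being described, reusing the word32 union from the question (the function name is illustrative; whether the narrow reload actually stalls depends on the processor model, and with optimisation enabled the compiler may keep w in a register and avoid the store altogether):

#include <stdint.h>

typedef union { struct { uint8_t d, c, b, a; }; uint32_t n; } word32;

uint8_t top_byte(uint32_t x)
{
    word32 w;
    w.n = x;       /* 32-bit store                                   */
    return w.a;    /* 8-bit load from inside that store; on some     */
                   /* processor models this load cannot be forwarded */
                   /* and must wait for the store to complete.       */
}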

高冷爸爸 2024-10-16 03:34:51

Such a thing is hard to tell without being able to inspect the real use of these operations in your code:

  • The shift version will probably do better if you happen to have all your variables in registers anyway, and you then do intensive computations on them. Usually compilers (clang included) are relatively clever about issuing instructions for partial words and that sort of thing.
  • The union version would perhaps be more efficient if you have to load your bytes from memory most of the time.

In any case I would abstract the access operation into a macro, so that you can modify it easily once you have working code.

For my personal taste I would go for the shift version, since it is conceptually simpler, and only switch to the union if the produced assembler ultimately didn't look satisfactory.
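
A sketch of how that comparison could be made: put both strategies into small functions (the names here are illustrative) and look at what clang emits, e.g. with clang -std=c99 -O2 -S.

#include <stdint.h>

typedef union { struct { uint8_t d, c, b, a; }; uint32_t n; } word32;

/* Shift-and-mask version: extract byte 2, counting from the least significant byte. */
uint8_t byte2_shift(uint32_t w)
{
    return (uint8_t)((w >> 16) & 0xff);
}

/* Union version of the same extraction (little-endian field layout assumed). */
uint8_t byte2_union(uint32_t x)
{
    word32 w;
    w.n = x;
    return w.b;
}

With optimisation enabled, both typically compile down to the same couple of instructions, which matches the observation in the answer below.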

感情旳空白 2024-10-16 03:34:51

I would guess using a union may be more efficient. Of course, the compiler may be able to optimize the shifts into byte loads since they are known during compilation -- in which case both schemes will yield identical code.

Another option (also byte-order dependent) is to cast the word to a byte array and access the bytes directly, i.e. something like the following:

uint8_t b = ((uint8_t *)&w)[n];

I'm not sure you will see any difference on a real modern 32/64 bit processor, though.

EDIT: It seems like clang produces identical code in both cases.
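
A minimal, self-contained sketch of this cast-based approach (the function name is illustrative); the index needed for a given byte depends on the host byte order, and going through uint8_t, which is unsigned char on common platforms, keeps the access within the usual aliasing rules:

#include <stdint.h>

uint8_t nth_byte(const uint32_t *w, unsigned n)
{
    /* View the 32-bit word as an array of four bytes: on a      */
    /* little-endian machine index 0 is the least significant    */
    /* byte, on a big-endian machine it is the most significant. */
    return ((const uint8_t *)w)[n];
}

With uint32_t w = 0x11223344; a call such as nth_byte(&w, 3) yields 0x11 on a little-endian x86_64 machine.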
