Is there a standard macro to detect architectures that require aligned memory access?

Posted 2024-12-20 04:20:19

Assuming something like:

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  for(i=0; i<len; i++)
  {
     dest[i] = src[i] & mask[i];
  }
}

I can go faster on a non-aligned access machine (e.g. x86) by writing something like:

#include <stdint.h>  /* uint32_t */

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  unsigned int wordlen = len >> 2;
  for(i=0; i<wordlen; i++)
  {
    ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; // this raises SIGBUS on SPARC and other archs that require aligned access.
  }
  for(i=wordlen<<2; i<len; i++){
    dest[i] = src[i] & mask[i];
  }
}

However it needs to build on several architectures so I would like to do something like:

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  unsigned int wordlen = len >> 2;

#if defined(__ALIGNED2__) || defined(__ALIGNED4__) || defined(__ALIGNED8__)
  // go slow
  for(i=0; i<len; i++)
  {
     dest[i] = src[i] & mask[i];
  }
#else
  // go fast
  for(i=0; i<wordlen; i++)
  {
    // the following line will raise SIGBUS on SPARC and other archs that require aligned access.
    ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; 
  }
  for(i=wordlen<<2; i<len; i++){
    dest[i] = src[i] & mask[i];
  }
#endif
}

But I cannot find any good information on compiler-defined macros (like my hypothetical __ALIGNED4__ above) that specify alignment, or any clever ways of using the preprocessor to determine target architecture alignment. I could just test defined (__SVR4) && defined (__sun), but I would prefer something that will Just Work™ on other architectures requiring aligned memory accesses.
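The C standard defines no macro of this kind. One option is to invert the test: enable the word-at-a-time path only on targets known to tolerate misaligned loads, and keep the byte loop as the default everywhere else.

The following is a minimal sketch. It assumes these compiler-predefined macros:
- __i386__ / __x86_64__ (GCC, Clang)
- _M_IX86 / _M_X64 (MSVC)
- __ARM_FEATURE_UNALIGNED (from the ARM C Language Extensions)

The names HAVE_FAST_UNALIGNED and mask_bytes_fast are hypothetical. The sketch also replaces the uint32_t* casts with memcpy, because dereferencing a misaligned uint32_t* is undefined behaviour in C, even on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature macro: 1 only on targets known to allow unaligned loads. */
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64) \
    || defined(__ARM_FEATURE_UNALIGNED)
#define HAVE_FAST_UNALIGNED 1
#else
#define HAVE_FAST_UNALIGNED 0   /* SPARC, older ARM, unknown targets: stay safe */
#endif

void mask_bytes_fast(unsigned char *dest, const unsigned char *src,
                     const unsigned char *mask, size_t len)
{
    size_t i = 0;
#if HAVE_FAST_UNALIGNED
    /* Word-at-a-time path. memcpy keeps each access well-defined C; on these
       targets compilers typically emit a single load or store per memcpy. */
    for (; i + 4 <= len; i += 4) {
        uint32_t s, m, d;
        memcpy(&s, src + i, 4);
        memcpy(&m, mask + i, 4);
        d = s & m;
        memcpy(dest + i, &d, 4);
    }
#endif
    /* Remaining bytes, or the whole buffer on strict-alignment targets. */
    for (; i < len; i++)
        dest[i] = src[i] & mask[i];
}
```

Unknown architectures fall through to the safe path, so a new strict-alignment target needs no changes. The trade-off is that a tolerant target missing from the list simply runs the slower loop.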


Answers (3)

强者自强 2024-12-27 04:20:19

While x86 silently fixes up unaligned accesses, this is hardly optimal for performance. It is usually best to assume a certain alignment and perform fixups yourself:

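#include <stddef.h>
#include <stdint.h>

/* Word type used for the middle copy; 8 bytes on LP64 targets such as 64-bit Linux. */
typedef unsigned long word_t;
static const size_t alignment = sizeof(word_t);   /* or 16 with a SIMD type */

/* Renamed from memcpy: a user-defined memcpy would clash with the standard one. */
void copy_words(char *dst, char const *src, size_t size) {
    if (((uintptr_t)dst % alignment) != ((uintptr_t)src % alignment)) {
        /* no common alignment, copy as bytes (or shift words around) */
        while (size--) *dst++ = *src++;
        return;
    }
    /* copy bytes at the beginning until both pointers are aligned */
    for (; size > 0 && ((uintptr_t)dst % alignment) != 0; size--)
        *dst++ = *src++;
    /* copy words in the middle (type-punned access: build with
       -fno-strict-aliasing if your compiler's aliasing rules are a concern) */
    for (; size >= alignment; size -= alignment, dst += alignment, src += alignment)
        *(word_t *)dst = *(word_t const *)src;
    /* copy bytes at the end */
    while (size--) *dst++ = *src++;
}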

Also, take a look at SIMD instructions.
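Following up on the SIMD suggestion, here is a sketch of the question's mask_bytes using SSE2 intrinsics from <emmintrin.h>, processing 16 bytes per step. The name mask_bytes_simd is hypothetical.

_mm_loadu_si128 and _mm_storeu_si128 are the unaligned load/store forms, so no head/tail alignment fixups are needed. A scalar loop handles the leftover bytes. When __SSE2__ is not defined (for example on non-x86 targets), that scalar loop is the whole implementation.

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

void mask_bytes_simd(unsigned char *dest, const unsigned char *src,
                     const unsigned char *mask, size_t len)
{
    size_t i = 0;
#ifdef __SSE2__
    /* 16 bytes per iteration; loadu/storeu accept any alignment. */
    for (; i + 16 <= len; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i m = _mm_loadu_si128((const __m128i *)(mask + i));
        _mm_storeu_si128((__m128i *)(dest + i), _mm_and_si128(s, m));
    }
#endif
    /* Scalar tail, and the whole job on targets without SSE2. */
    for (; i < len; i++)
        dest[i] = src[i] & mask[i];
}
```

GCC and Clang can also auto-vectorize the plain byte loop at higher optimization levels, so it is worth inspecting the generated code before hand-writing intrinsics.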

我恋#小黄人 2024-12-27 04:20:19

The standard approach would be to have a configure script that runs a program to test for alignment issues. If the test program doesn't crash, the configure script defines a macro in a generated config header that allows for the faster implementation. The safer implementation is the default.

#include <stdint.h>  /* uint32_t */

void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
{
  unsigned int i;
  unsigned int wordlen = len >> 2;

#if defined(UNALIGNED)
  // go fast
  for(i=0; i<wordlen; i++)
  {
    // the following line will raise SIGBUS on SPARC and other archs that require aligned access.
    ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i]; 
  }
  for(i=wordlen<<2; i<len; i++){
    dest[i] = src[i] & mask[i];
  }
#else
  // go slow
  for(i=0; i<len; i++)
  {
     dest[i] = src[i] & mask[i];
  }
#endif
}
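
Below is a minimal sketch of the kind of probe such a configure script might compile and run. The name probe_unaligned is hypothetical, and this is not any particular build system's API. The probe program's main would return probe_unaligned() ? 0 : 1, and on success the script would write a definition of UNALIGNED into the generated header.

Two caveats apply:
- A run-time probe cannot be executed when cross-compiling.
- A passing probe only shows that misaligned accesses do not trap. Some kernels (for example Linux on some ARM and MIPS systems) catch the trap and emulate the access, which is correct but slow.

```c
#include <stdint.h>

/* Hypothetical probe: a deliberately misaligned 32-bit store and load.
   On strict-alignment targets (e.g. SPARC) this raises SIGBUS and the
   process dies, which the configure script treats as "no". Misaligned
   access is undefined behaviour in C, so this is a platform probe only. */
int probe_unaligned(void)
{
    /* The union makes the buffer at least uint32_t-aligned, so
       bytes + 1 is misaligned for uint32_t. */
    static union { uint32_t word[2]; unsigned char bytes[8]; } buf;
    /* volatile discourages the compiler from optimizing the access away */
    volatile uint32_t *p = (volatile uint32_t *)(buf.bytes + 1);
    *p = 0x01020304u;
    return *p == 0x01020304u;
}
```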

感受沵的脚步 2024-12-27 04:20:19

(I find it weird that you have src and mask when really these commute. I renamed mask_bytes to memand. But anyways...)

Another option is to use different functions that take advantage of types in C. For instance:

#include <stddef.h>

void memand_bytes(char *dest, char *src1, char *src2, size_t len)
{
    size_t i;   /* same type as len, so large lengths cannot overflow the index */
    for (i = 0; i < len; i++)
        dest[i] = src1[i] & src2[i];
}

void memand_ints(int *dest, int *src1, int *src2, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        dest[i] = src1[i] & src2[i];
}

This way you let the programmer decide.
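To illustrate the trade-off: when the data is already stored in int arrays, the type itself guarantees alignment, so memand_ints never needs an alignment check.

The sketch below assumes a fixed-size bitset; the names bitset, bitset_and and NBITS are hypothetical. The memand_ints here adds const to the source pointers but is otherwise the answer's function.

```c
#include <limits.h>
#include <stddef.h>

#define NBITS 256
#define BITS_PER_INT (CHAR_BIT * sizeof(int))
#define NWORDS ((NBITS + BITS_PER_INT - 1) / BITS_PER_INT)

/* Hypothetical bitset: storing the bits as ints makes every word int-aligned. */
typedef struct { int words[NWORDS]; } bitset;

void memand_ints(int *dest, const int *src1, const int *src2, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        dest[i] = src1[i] & src2[i];
}

/* Intersect two bitsets one int at a time; no alignment check is required. */
void bitset_and(bitset *out, const bitset *a, const bitset *b)
{
    memand_ints(out->words, a->words, b->words, NWORDS);
}
```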
