Is there a way to force Clang to use unaligned load/store x86 instructions?

Posted on 2025-02-05 15:30:50 | 1418 characters | 2 views | 0 comments

I'm trying to use Clang in a large Visual Studio project. There's a lot of MS-specific code, including C++/CLI and MSTest, that can't be compiled with Clang, so the project is a mix of libraries compiled by the Microsoft compiler (version 17.2 / VS 2022) and by clang-cl (13.0.2).

Existing code uses AVX to optimize performance-critical bottlenecks, so there are several classes that store aligned data like

struct tx
{
  alignas(32) double m_data[12];
};

The problem is that the Microsoft compiler does not always honor alignment requirements. Most of the time it aligns the data properly, but sometimes (usually for temporary variables) it allocates non-aligned structs. For example,

struct edge_object
{
  ...
  tx m_pos;
};

int c = sizeof(edge_object);  // 256
int a = alignof(edge_object); // 32
int b = offsetof(edge_object, m_pos); // 160

std::vector<edge_object> edges;
for (int i = 0; i < n - 1; ++i)
{
    edges.push_back(edge_object( (edge_id_t)i, test_cost_0, lower_v[i], lower_v[i + 1], tx ));
    edges.push_back(edge_object( (edge_id_t)(n + i), test_cost_0, upper_v[i], upper_v[i + 1], tx ));
}

In this code snippet, the MS compiler aligns the first temporary edge_object properly (e.g. it moves it by 32 bytes if I allocate a few additional variables on the stack), but it places the second temporary edge_object in a totally weird location (shifted 78h bytes from the position of the first temporary, for some reason). MS gets away with this because it always issues unaligned load/store instructions (even when explicitly told to use aligned load/store), so the generated code still works even if the object is not aligned. Clang, on the other hand, issues aligned load instructions. I started by replacing all intrinsics like _mm256_load_ps with _mm256_loadu_ps in my own vectorized code, but sadly Clang is smart enough to issue its own aligned loads when it sees that alignas(32).
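To pin down which temporaries end up misaligned, a runtime check along these lines could be dropped into the constructor or the vectorized routines. This is only a minimal sketch: `tx_like` is a hypothetical stand-in for the `tx` struct above, and `is_aligned` is a helper name invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the tx struct from the question.
struct tx_like
{
  alignas(32) double m_data[12];
};

// What the type promises at compile time.
static_assert(alignof(tx_like) == 32, "tx_like requests 32-byte alignment");
static_assert(sizeof(tx_like) == 96, "12 doubles, already a multiple of 32");

// Returns true when the address p is a multiple of `alignment`.
// `alignment` must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment)
{
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Convenience wrapper: does this particular object actually sit at the
// alignment its type requires?
inline bool is_properly_aligned(const tx_like& t)
{
  return is_aligned(&t, alignof(tx_like));
}
```

Calling `assert(is_properly_aligned(m_pos))` wherever an `edge_object` is constructed or copied would flag the misplaced temporary in a debug build, before an aligned load faults on it.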

So I'm wondering: is there a way to force Clang to issue only unaligned loads/stores, like the MSVC and ICC compilers do? As a potential workaround, I can force Clang to do so by changing the alignment to 8 instead of 32, but this will hurt performance. The MS approach, on the other hand, is almost as fast when it manages to align the data properly (VMOVUPS and VMOVAPS have almost the same performance on modern CPUs for properly aligned addresses) but does not crash when the alignment is wrong due to a compiler bug. Any suggestions?
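For reference, the alignment-8 workaround mentioned above might look roughly like the sketch below. `tx_relaxed` and `sum_via_aligned_copy` are hypothetical names, not part of the real code base. With the declared alignment lowered to 8, the compiler can no longer assume 32-byte alignment for accesses through the struct, and a hot loop that still wants guaranteed-aligned data has to copy into a local buffer first, which is where the performance cost comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical variant of tx with the declared alignment lowered to 8,
// so the compiler may only assume 8-byte alignment for m_data.
struct tx_relaxed
{
  alignas(8) double m_data[12];
};

static_assert(alignof(tx_relaxed) == 8, "only 8-byte alignment is promised");
static_assert(sizeof(tx_relaxed) == 96, "layout is otherwise unchanged");

// Illustrative hot-path helper: copy into a local 32-byte-aligned buffer,
// then work on the buffer. The extra copy is the price of the workaround.
inline double sum_via_aligned_copy(const tx_relaxed& t)
{
  alignas(32) double tmp[12];
  std::memcpy(tmp, t.m_data, sizeof(tmp));

  double s = 0.0;
  for (std::size_t i = 0; i < 12; ++i)
    s += tmp[i];
  return s;
}
```

This keeps the code correct regardless of where a temporary lands, but it gives up exactly the alignment guarantee that the AVX code was relying on for speed.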
