Unaligned access performance on Intel x86 vs. AMD x86 CPUs


I have implemented a simple linear probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler: key and value are 64-bit integers, but the flag is only a single bool, so the 17 bytes of payload get padded to 24 bytes. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
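
To make the density argument concrete, here is a minimal standalone sketch (not part of the original benchmark) that just checks the two layouts; the struct names mirror the ones used in the benchmark below, and the 24/17-byte sizes assume a typical x86-64 ABI with GCC:

#include <cstdint>
#include <iostream>

// Padded layout: 8 (key) + 8 (value) + 1 (flag) + 7 bytes of compiler padding = 24 bytes.
struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;
};

// Packed layout: 17 bytes, no padding, so array elements are no longer 8-byte aligned.
struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;
} __attribute__((__packed__));

static_assert(sizeof(PaddedStruct) == 24, "expected 24 B on x86-64");
static_assert(sizeof(PackedStruct) == 17, "expected 17 B when packed");

int main() {
  // For the 2^27-element arrays used in the benchmark below, the padded array
  // occupies 24 B * 134217728 = 3.0 GiB, the packed one 17 B * 134217728 = 2.125 GiB.
  std::cout << sizeof(PaddedStruct) << " vs. " << sizeof(PackedStruct) << " bytes per entry\n";
}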

When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25% and 50%, for different successful query rates on the x axis (0, 25, 50, 75, 100):

[Figure: throughput over successful query rate for load factors 25% and 50%, padded vs. packed structs, on Intel and AMD]

As you can see, on Intel the grey and blue (circle and square) lines almost overlap; the benefit of struct packing is marginal. On AMD, however, the line representing the unaligned/packed structs is consistently higher, i.e., we have more throughput.

In order to investigate this, I tried to build a smaller microbenchmark. In this microbenchmark, we perform a similar workload, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little from there). Please find the benchmark here:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }

template <typename T>
void doNotOptimize(T const& val) {
  asm volatile("" : : "r,m"(val) : "memory");
}

struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;

  PaddedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
};

struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;

  PackedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
} __attribute__((__packed__));

int main() {
  const uint64_t size = 134217728;
  uint16_t repetitions = 0;
  uint16_t advancement = 0;

  std::cin >> repetitions;
  std::cout << "Got " << repetitions << std::endl;
  std::cin >> advancement;
  std::cout << "Got " << advancement << std::endl;
  std::cout << "Initializing." << std::endl;

  std::vector<PaddedStruct> padded(size);
  std::vector<PackedStruct> unaligned(size);
  std::vector<uint64_t> queries(size);

  // Initialize the structs with random values + prefault
  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
  std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};

  for (uint64_t i = 0; i < padded.size(); ++i) {
    padded[i].key = dist(gen);
    padded[i].value = dist(gen);
    padded[i].is_valid = 1;
  }

  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    unaligned[i].key = padded[i].key;
    unaligned[i].value = padded[i].value;
    unaligned[i].is_valid = 1;
  }

  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    queries[i] = dist2(gen);
  }

  std::cout << "Running benchmark." << std::endl;

  ClobberMemory();
  auto start_padded = std::chrono::high_resolution_clock::now();
  PaddedStruct* padded_ptr = nullptr;
  uint64_t sum = 0;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        padded_ptr = &padded[query + i];
        if (padded_ptr->is_valid) [[likely]] {
          sum += padded_ptr->value;
        }
      }
      doNotOptimize(sum);
    }
  }

  ClobberMemory();
  auto end_padded = std::chrono::high_resolution_clock::now();
  uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
  std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl;  // print sum to avoid that it gets optimized out

  ClobberMemory();
  auto start_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t sum2 = 0;
  PackedStruct* packed_ptr = nullptr;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        packed_ptr = &unaligned[query + i];
        if (packed_ptr->is_valid) [[likely]] {
          sum2 += packed_ptr->value;
        }
      }
      doNotOptimize(sum2);
    }
  }
  ClobberMemory();
  auto end_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
  std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}

When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press newline) and then enter 5 and press enter/newline. I updated the source code to (a) avoid loop unrolling by the compiler, which happened when repetitions/advancement were hardcoded, and (b) switch to pointers into the vector, as that more closely resembles what the hash map is doing.

On the Intel CPU, I get:

Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185

On the AMD CPU, I get:

Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926

So while in this microbenchmark Intel still benefits a little from the unaligned access, for the AMD CPU both the absolute and the relative improvement are higher. I cannot explain this. In general, from what I have learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for the Linus Torvalds mail, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might somehow be related to this.
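
To get a feel for how often the packed layout actually produces a line-crossing access, here is a small brute-force check of my own (not part of the benchmark above); it assumes the array starts on a cache-line boundary and counts how many element offsets place the 8-byte value field across a 64-byte line boundary:

#include <cstdint>
#include <iostream>

// Counts how often the 8-byte value field (at offset 8 inside the struct)
// straddles a 64-byte cache-line boundary, for a given element stride.
// The byte offsets repeat with period lcm(stride, 64), so a multiple of that
// many elements gives the exact fraction.
uint64_t countSplitValueLoads(uint64_t stride, uint64_t elements) {
  uint64_t splits = 0;
  for (uint64_t i = 0; i < elements; ++i) {
    const uint64_t value_begin = i * stride + 8;  // first byte of the value field
    const uint64_t value_last = value_begin + 7;  // last byte of the value field
    if (value_begin / 64 != value_last / 64) ++splits;
  }
  return splits;
}

int main() {
  const uint64_t n = 64 * 24;  // multiple of both periods
  std::cout << "padded (24 B stride): " << countSplitValueLoads(24, n) << " of " << n << " loads split\n";
  std::cout << "packed (17 B stride): " << countSplitValueLoads(17, n) << " of " << n << " loads split\n";
  // Expected output: 0 of 1536 for the padded layout, 168 of 1536 (~11 %) for the packed one.
}

So in the packed layout roughly one in nine value loads is a split load, whereas the padded layout never splits a value load.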

Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.

(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?

(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779

Comments (2)

心的憧憬 2025-02-10 18:13:04

The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.

The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.

Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.

I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
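
As a rough illustration of the cycles-per-update part of that methodology (my sketch, not the answer author's code; instruction counts would additionally need hardware performance counters, e.g. via perf), one could time the inner loop with the TSC and divide by the number of updates, assuming the core frequency is pinned to the nominal TSC frequency:

#include <cstdint>
#include <iostream>
#include <x86intrin.h>  // __rdtsc

// Minimal sketch: measure TSC (reference) cycles per update for a workload.
// If the core is pinned to the nominal TSC frequency, TSC cycles approximate
// core cycles; otherwise the two clocks diverge.
template <typename Workload>
double tscCyclesPerUpdate(Workload&& work, uint64_t updates) {
  const uint64_t start = __rdtsc();
  work();
  const uint64_t end = __rdtsc();
  return static_cast<double>(end - start) / static_cast<double>(updates);
}

int main() {
  // Hypothetical toy workload: sum over a small array, one "update" per element.
  constexpr uint64_t n = 1 << 20;
  static uint64_t data[n] = {};
  uint64_t sum = 0;
  const double cycles = tscCyclesPerUpdate([&] {
    for (uint64_t i = 0; i < n; ++i) sum += data[i];
  }, n);
  std::cout << "~" << cycles << " TSC cycles per update (sum = " << sum << ")\n";
}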

Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.

幼儿园老大 2025-02-10 18:13:04

The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.

Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
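
A quick back-of-the-envelope check of that claim (my own sketch, not from the blog post; it assumes the array starts on a cache-line boundary) counts how many elements of an array span two 64-byte cache lines for each element size:

#include <cstdint>
#include <iostream>

// Fraction of array elements that span two 64-byte cache lines, assuming the
// array itself starts on a cache-line boundary.
double fractionSpanningTwoLines(uint64_t element_size) {
  const uint64_t period_bytes = 64 * element_size;  // covers a whole number of lcm(64, size) periods
  uint64_t spanning = 0, total = 0;
  for (uint64_t offset = 0; offset < period_bytes; offset += element_size, ++total) {
    const uint64_t first_line = offset / 64;
    const uint64_t last_line = (offset + element_size - 1) / 64;
    if (first_line != last_line) ++spanning;
  }
  return static_cast<double>(spanning) / static_cast<double>(total);
}

int main() {
  // Both layouts end up with 25 % of elements straddling a line boundary, so a
  // random single-element lookup touches about 1.25 cache lines in either case.
  std::cout << "24-byte elements: " << fractionSpanningTwoLines(24) << '\n';  // prints 0.25
  std::cout << "17-byte elements: " << fractionSpanningTwoLines(17) << '\n';  // prints 0.25
}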
