Memory bandwidth performance for modern machines

Posted 2024-08-25 09:07:22 · 596 words · 13 views · 0 comments

I'm designing a real-time system that occasionally has to duplicate a large amount of memory. The memory consists of non-tiny regions, so I expect the copying performance to be fairly close to the maximum bandwidth the relevant components (CPU, RAM, motherboard) can deliver. This led me to wonder: what kind of raw memory bandwidth can a modern commodity machine muster?

My aging Core2Duo gives me 1.5 GB/s if I use one thread to memcpy() (and, understandably, less if I memcpy() with both cores simultaneously). While 1.5 GB is a fair amount of data, the real-time application I'm working on will have something like 1/50th of a second for the copy, which means 30 MB. Basically, almost nothing. And perhaps worst of all, as I add more cores, I can process a lot more data, but the required duplication step gets no faster.
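For anyone who wants to reproduce a figure like the 1.5 GB/s above, here is a minimal single-threaded sketch in C. It is not the benchmark used in the question; the function names `measure_memcpy_rate` and `bytes_per_frame` are made up for illustration, and it assumes a POSIX system that provides `clock_gettime`. The second function just spells out the frame-budget arithmetic (1.5 GB/s over 1/50 s is 30 MB).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock seconds from a monotonic clock (POSIX clock_gettime). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copy `bytes` bytes `reps` times with memcpy() and return the best
 * observed rate in bytes per second, or 0.0 if allocation fails. */
double measure_memcpy_rate(size_t bytes, int reps) {
    assert(bytes > 0 && reps > 0);
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return 0.0; }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 2, bytes);

    double best = 0.0;
    for (int i = 0; i < reps; i++) {
        double t0 = now_seconds();
        memcpy(dst, src, bytes);
        double dt = now_seconds() - t0;
        /* Read the destination so the compiler has less reason to drop the copy. */
        volatile char sink = dst[bytes - 1];
        (void)sink;
        if (dt > 0.0 && (double)bytes / dt > best) best = (double)bytes / dt;
    }
    free(src);
    free(dst);
    return best;
}

/* Bytes that can be copied within one frame at a given copy rate. */
double bytes_per_frame(double bytes_per_second, double frames_per_second) {
    return bytes_per_second / frames_per_second;
}
```

Calling `measure_memcpy_rate((size_t)256 << 20, 5)` from a `main()` and dividing the result by 1e9 gives a GB/s figure. Use buffers much larger than the last-level cache, otherwise the number reflects cache bandwidth rather than RAM bandwidth.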

But a low-end Core2Duo isn't exactly hot stuff these days. Are there any sites with information, such as actual benchmarks, on raw memory bandwidth on current and near-future hardware?

Furthermore, for duplicating large amounts of data in memory, are there any shortcuts, or is memcpy() as good as it will get?

Given a bunch of cores with nothing to do but duplicate as much memory as possible in a short amount of time, what's the best I can do?

EDIT: I'm still looking for good information on raw memory copy performance. I just ran my old memcpy() benchmark again. Same machine and settings, and it now gives 2.5 GB/s...

Comments (2)

独﹏钓一江月 2024-09-01 09:07:22

On newer CPUs such as the Nehalem, and on AMD's since the Opteron, the memory is "local" to one CPU, where a single CPU may have multiple cores. That is, it takes a certain amount of time for a core to access the local memory attached to its CPU, and more time for the core to access remote memory, where remote memory is memory that is local to other CPUs. This is called non-uniform memory access, or NUMA. For the best memcpy performance, you want to set your BIOS to NUMA mode, pin your threads to cores, and always access local memory. Find out more about NUMA on Wikipedia.
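As a rough illustration of "pin your threads to cores, and always access local memory", here is a Linux/glibc-specific sketch in C. It uses `sched_setaffinity` to pin the calling thread, then allocates and writes the buffers from that same thread. It assumes the kernel's default policy of placing a page on the node of the CPU that first touches it, which holds on typical Linux setups but can be overridden by a NUMA policy. The helper names are made up for this example, and other operating systems use different APIs.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Pin the calling thread to one logical CPU (Linux-specific).
 * Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Allocate a buffer and write every byte from the calling thread.
 * Assumption: with the default first-touch policy, this places the pages
 * on the NUMA node of the CPU the thread is currently pinned to. */
char *alloc_first_touch(size_t bytes, int fill) {
    char *p = malloc(bytes);
    if (p) memset(p, fill, bytes);
    return p;
}

/* Pin to `cpu`, create both buffers locally, copy, and verify.
 * Returns 0 if the copy succeeded, -1 on any failure. */
int pinned_copy_demo(int cpu, size_t bytes) {
    if (pin_to_cpu(cpu) != 0) return -1;
    char *src = alloc_first_touch(bytes, 0x5A);
    char *dst = alloc_first_touch(bytes, 0xA5);
    int ok = -1;
    if (src && dst) {
        memcpy(dst, src, bytes);
        ok = (memcmp(dst, src, bytes) == 0) ? 0 : -1;
    }
    free(src);
    free(dst);
    return ok;
}
```

In a multi-threaded copier each worker would call `pin_to_cpu` with its own CPU number before touching its share of the data. Whether this helps at all depends on the machine: on a single-socket system there is only one memory node, so pinning mainly reduces thread migration.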

Unfortunately, I do not know of a site or recent papers on memcpy performance on recent CPUs and chipsets. Your best bet is probably to test it yourself.

As for memcpy() performance, there are wide variations, depending on the implementation. The Intel C library (or possibly the compiler itself) has a memcpy() that is much faster than the one provided with Visual Studio 2005, for instance. At least on Intel machines.

The best memory copy you will be able to do will depend on the alignment of your data, whether you are able to use vector instructions, the page size, and so on. Implementing a good memcpy() is surprisingly involved, so I recommend finding and testing as many implementations as possible before writing your own. If you know more specifics about your copy, such as alignment and size, you might be able to implement something faster than Intel's memcpy(). If you want to get into the details, you might start with the Intel and AMD optimization guides, or Agner Fog's software optimization pages.
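To make "finding and testing as many implementations as possible" concrete, here is a small harness sketch in C. It checks a candidate copy routine against many sizes and misalignments, then times it. `copy_words` is a deliberately simple stand-in candidate written for this example, not a tuned routine and not anyone's library implementation; the harness assumes POSIX `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy()'s signature can be plugged into the harness. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Illustrative candidate: copy 8 bytes at a time, then the tail.
 * The inner memcpy calls avoid unaligned-access and aliasing problems. */
void *copy_words(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        memcpy(d + i, &w, 8);
    }
    for (; i < n; i++) d[i] = s[i];
    return dst;
}

/* Returns 1 if `fn` copies correctly for sizes 0..256 at offsets 0..15
 * and leaves the bytes around the destination untouched, else 0. */
int copy_is_correct(copy_fn fn) {
    enum { MAX = 256, PAD = 16 };
    unsigned char src[MAX + PAD], dst[MAX + 2 * PAD];
    for (size_t i = 0; i < sizeof src; i++) src[i] = (unsigned char)(i * 7 + 1);
    for (size_t off = 0; off < PAD; off++) {
        for (size_t n = 0; n <= MAX; n++) {
            memset(dst, 0xEE, sizeof dst);
            fn(dst + off, src + off, n);
            if (memcmp(dst + off, src + off, n) != 0) return 0;
            for (size_t i = 0; i < off; i++)
                if (dst[i] != 0xEE) return 0;
            for (size_t i = off + n; i < sizeof dst; i++)
                if (dst[i] != 0xEE) return 0;
        }
    }
    return 1;
}

/* Best-of-`reps` time in seconds for one copy of `bytes` (> 0) bytes,
 * or -1.0 on allocation failure or a wrong result. */
double best_copy_seconds(copy_fn fn, size_t bytes, int reps) {
    unsigned char *src = malloc(bytes), *dst = malloc(bytes);
    double best = -1.0;
    if (src && dst) {
        memset(src, 1, bytes);   /* pre-fault the pages */
        memset(dst, 2, bytes);
        for (int r = 0; r < reps; r++) {
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            fn(dst, src, bytes);
            clock_gettime(CLOCK_MONOTONIC, &b);
            double dt = (double)(b.tv_sec - a.tv_sec)
                      + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
            if (best < 0.0 || dt < best) best = dt;
        }
        if (dst[bytes - 1] != 1) best = -1.0;  /* sanity check on the result */
    }
    free(src);
    free(dst);
    return best;
}
```

The library `memcpy` itself can be passed as a `copy_fn`, so `best_copy_seconds(memcpy, n, reps)` and `best_copy_seconds(copy_words, n, reps)` give a like-for-like comparison on the sizes and alignments that matter for your workload.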

春花秋月 2024-09-01 09:07:22

I think you're approaching the problem the wrong way. The goal, I assume, is to export a consistent snapshot of your data without destroying your real-time performance. Don't use hardware, use an algorithm.

What you want to do is define a journaling system on top of your data. When you start your in-memory transfer, you have two threads: the original that does work and thinks it is modifying the data (but is actually only writing to the journal), and a new thread that copies the old (unjournaled) data to a separate spot so it can slowly write it out.

When the new thread is done, you put it to work merging the data set with the journal until the journal is empty. When it's complete, the old thread can go back to interacting directly with the data instead of reading/writing through the journal-modified version.

Finally, the new thread can go over to the copied data and start slowly passing it off to a remote destination.

If you set up a system like this, you can get essentially instant snapshotting of arbitrarily large amounts of data in a running system, as long as you can finish the in-memory copy before the journal gets so full that the real-time system can't keep up with its processing demands.
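The data flow of this journaling scheme can be sketched in C as follows. This is a single-threaded toy model written to show the steps, not a definitive implementation: the `store` type, all function names, and the fixed sizes are invented for illustration. A real version would run the worker and the copier on separate threads with proper synchronization, and would keep draining the journal while the worker is still appending to it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { N_CELLS = 8, JOURNAL_CAP = 16 };

typedef struct { size_t index; int value; } journal_entry;

typedef struct {
    int data[N_CELLS];                    /* live data set */
    int snapshot[N_CELLS];                /* consistent copy to be exported */
    journal_entry journal[JOURNAL_CAP];   /* writes deferred during a snapshot */
    size_t journal_len;
    int journaling;                       /* nonzero while a snapshot is in progress */
} store;

/* Worker write. During a snapshot the write goes to the journal instead of
 * the data. Returns 0 on success, -1 if the journal is full. */
int store_write(store *s, size_t i, int v) {
    assert(i < N_CELLS);
    if (!s->journaling) { s->data[i] = v; return 0; }
    if (s->journal_len == JOURNAL_CAP) return -1;
    s->journal[s->journal_len++] = (journal_entry){ i, v };
    return 0;
}

/* Worker read: the newest journal entry for the cell wins, otherwise the
 * value comes from the underlying data. */
int store_read(const store *s, size_t i) {
    assert(i < N_CELLS);
    for (size_t k = s->journal_len; k-- > 0; )
        if (s->journal[k].index == i) return s->journal[k].value;
    return s->data[i];
}

/* Start a snapshot: from here on, writes are redirected to the journal. */
void snapshot_begin(store *s) { s->journaling = 1; }

/* Copier step: the data is frozen while journaling is on, so a plain
 * memcpy() yields a consistent image. */
void snapshot_copy(store *s) { memcpy(s->snapshot, s->data, sizeof s->data); }

/* Merge step: replay the journal oldest-first into the data, then let the
 * worker write directly again. */
void snapshot_end(store *s) {
    for (size_t k = 0; k < s->journal_len; k++)
        s->data[s->journal[k].index] = s->journal[k].value;
    s->journal_len = 0;
    s->journaling = 0;
}
```

The call order is `snapshot_begin`, then `snapshot_copy` while the worker keeps calling `store_write` and `store_read`, then `snapshot_end`. The `-1` return from `store_write` corresponds to the failure case described above, where the journal fills up before the in-memory copy finishes.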
