How to memory map a huge matrix?
Suppose you got a huge (40+ GB) feature value (floating-point) matrix, rows are different features and columns are the samples/images.
The table is precomputed column-wise.
Then it is completely accessed row-wise and multi-threaded (each thread loads a whole row) several times.
What would be the best way to handle this matrix? I'm especially pondering over 5 points:
- Since it's run on an x64 PC I could memory map the whole matrix at once but would that make sense?
- What about the effects of multithreading (multithreaded initial computation as well?)?
- How should the matrix be laid out: row-major or column-major?
- Would it help to mark the matrix as read-only after the precomputation has been finished?
- Could something like http://www.kernel.org/doc/man-pages/online/pages/man2/madvise.2.html be used to speed it up?
Memory mapping the whole file could make the process much easier.
You want to lay out your data to optimize for the most common access pattern. It sounds like the data is going to be written once (column-wise) and read several times (row-wise). That suggests the data should be stored in row-major order.
Marking the matrix read-only once the pre-computation is done probably won't help performance (there are some possible low-level optimizations, but I don't think anything implements them), but it will prevent bugs from accidentally writing to data you don't intend to. Might as well.
madvise could end up being useful, once you've got your application written and working. My overall advice: write the program in the simplest way you can, sequentially at first, and then put timers around the whole thing and the various major operations. Make sure the major operation times sum to the overall time, so you can be sure you're not missing anything. Then target your performance improvement efforts toward the components that are actually taking the most time.
Per JimR's mention of 4MB pages in his comment, you may end up wanting to look into hugetlbfs or using a Linux Kernel release with transparent huge page support (merged for 2.6.38, could probably be patched into earlier versions). This would likely save you a whole lot of TLB misses, and convince the kernel to do the disk IO in sufficiently large chunks to amortize any seek overhead.
Re 3: If you have, e.g., 8 CPUs but do not have enough RAM to load 8 rows, you should make each thread process its row sequentially in manageable chunks. In this case, block-layout of a matrix would make sense. If the thread MUST have the whole row in memory to process it, I'm afraid that you can't use all the CPUs, as the process will start thrashing, i.e., kicking out some subset of the matrix out of the ram and reloading another needed subset. This is slightly less bad than full swapping as the matrix is never modified, so the contents of the pages do not need to be written to the swap file before being kicked out. But it still hurts performance badly.
Also, doing random access I/O from multiple threads is a bad idea, which is what you'll end up doing if you use mmap(). You have (presumably) only a single disk, and parallel I/O will just make it slower. So mmap() might not make sense and you could achieve better I/O performance by reading data sequentially into RAM.
Note that 40 GB is approximately 10.5 million pages of 4096 bytes. By doing mmap(), you will, in the worst case, slow down computation by that many hard disk seeks. At 8 ms per seek (taken from Wikipedia), you'd end up wasting about 83,886 seconds, i.e., almost a whole day!
If you could fit the whole thing into main memory, then yes: memory map it all, and it doesn't matter whether it's column-major or row-major. However, at 40+ GB, I'm sure it's too big for main memory. In which case:
I think you are going to have a performance problem with a naive implementation. Either the computer will thrash while writing (if you store it row-major) or it will thrash while querying (if you store it column-major). The latter is presumably worse, but it's a problem both ways.
The right solution is to use an intermediate representation which is neither row-major nor column-major but 'large squares'. Take the first 50,000 columns and store them in a memory-mapped file (phase 1). It doesn't matter if it's column major or row major since it'll be purely memory resident. Then, take each row and write it into the final row-major memory-mapped file (phase 2). Then repeat the cycle for the next 50,000 columns, and so on.