Efficient re-ordering of a large dataset to maximize memory cache effectiveness

Posted 2024-07-12 22:44:57

I've been working on a problem which I thought people might find interesting (and perhaps someone is aware of a pre-existing solution).

I have a large dataset consisting of a long list of pairs of pointers to objects, something like this:

[
  (a8576, b3295), 
  (a7856, b2365), 
  (a3566, b5464),
  ...
]

There are way too many objects to keep in memory at any one time (potentially hundreds of gigabytes), so they need to be stored on disk, but can be cached in memory (probably using an LRU cache).

I need to run through this list processing every pair, which requires that both objects in the pair be loaded into memory (if they aren't already cached there).

So, the question: is there a way to reorder the pairs in the list to maximize the effectiveness of an in-memory cache (in other words: minimize the number of cache misses)?
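To make "number of cache misses" concrete, here is a minimal sketch that counts object loads for a given ordering. It assumes an LRU cache holding a fixed number of objects; `count_misses`, the capacity, and the sample ids are hypothetical and only for illustration.

```python
from collections import OrderedDict

def count_misses(pairs, capacity):
    """Count object loads when pairs are processed in order with an LRU cache."""
    cache = OrderedDict()  # object id -> True, most recently used last
    misses = 0
    for pair in pairs:
        for obj in pair:
            if obj in cache:
                cache.move_to_end(obj)         # hit: refresh recency
            else:
                misses += 1                    # miss: object must be loaded
                cache[obj] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses

# Hypothetical ids: the same four pairs in two different orders.
original = [("a1", "b1"), ("a2", "b2"), ("a1", "b2"), ("a2", "b1")]
reordered = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a2", "b1")]
print(count_misses(original, 2))   # 7
print(count_misses(reordered, 2))  # 5
```

Any candidate re-ordering heuristic can be scored this way on a sample of the list before committing to a full pass over the data.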

Notes

Obviously, the re-ordering algorithm should be as fast as possible, and shouldn't depend on being able to have the entire list in memory at once (since we don't have enough RAM for that) - but it could iterate over the list several times if necessary.
If we were dealing with individual objects, not pairs, then the simple answer would be to sort them. This obviously won't work in this situation because you need to consider both elements in the pair.
The problem may be related to that of finding a minimum graph cut, but even if the problems are equivalent, I don't think solutions to min-cut meet
My assumption is that the heuristic would stream the data off the disk, and write it back in chunks in a better order. It may need to iterate over this several times.
Actually it may not just be pairs, it could be triplets, quadruplets, or more. I'm hoping that an algorithm that does this for pairs can be easily generalized.

泼猴你往哪里跑 2024-07-19 22:44:57

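Your problem is related to a similar one for computer graphics hardware:

When rendering indexed vertices in a triangle mesh, the hardware typically keeps a cache of the most recently transformed vertices (~128 the last time I had to worry about it, but I suspect the number is larger these days). Vertices that are not cached need a relatively expensive transform operation to calculate. "Mesh optimisation", which restructures triangle meshes to optimise cache usage, used to be a pretty hot research topic. Googling "vertex cache optimisation" (or optimization :^) might find you some interesting material relevant to your problem. As other posters suggest, I suspect doing this effectively will depend on exploiting any inherent coherence in your data.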
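In the spirit of vertex cache optimisation, a greedy pass over one in-memory chunk of the list might look like the sketch below. This is not a published mesh-optimisation algorithm; `greedy_reorder`, the simulated LRU cache, and the scoring rule (prefer the pair with the most objects already cached) are assumptions made for illustration. The scan is quadratic in the chunk size, so it only suits chunks small enough to hold in memory.

```python
from collections import OrderedDict

def greedy_reorder(pairs, capacity):
    """Reorder one chunk so consecutive tuples tend to reuse cached objects."""
    remaining = list(pairs)
    cache = OrderedDict()  # simulated LRU cache, most recently used last
    ordered = []
    while remaining:
        # Pick the tuple with the most objects already in the simulated cache;
        # ties go to the earliest remaining tuple.
        best = max(range(len(remaining)),
                   key=lambda i: sum(obj in cache for obj in remaining[i]))
        pair = remaining.pop(best)
        ordered.append(pair)
        for obj in pair:
            cache[obj] = True
            cache.move_to_end(obj)
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return ordered

chunk = [("a1", "b1"), ("a2", "b2"), ("a1", "b2"), ("a2", "b1")]
print(greedy_reorder(chunk, 2))
```

Because the score just counts cached members of a tuple, the same function accepts triplets or larger groups without change.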

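Another thing to bear in mind: as an LRU cache becomes overloaded, it can be well worth changing to an MRU replacement strategy so that at least some of the items stay in memory (rather than turning over the entire cache on each pass). I seem to remember John Carmack has written some good stuff on this subject in connection with Direct3D texture caching strategies.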
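The LRU versus MRU point can be illustrated with a small simulation. This is a sketch under simple assumptions: repeated sequential passes over a working set slightly larger than the cache, with a hypothetical `misses` helper that is not taken from any library.

```python
def misses(accesses, capacity, policy):
    """Count cache misses for an eviction policy of 'lru' or 'mru'."""
    cache = []  # least recently used first, most recently used last
    count = 0
    for obj in accesses:
        if obj in cache:
            cache.remove(obj)          # hit: re-appended below as most recent
        else:
            count += 1
            if len(cache) >= capacity:
                # LRU evicts the oldest entry, MRU evicts the newest one.
                cache.pop(0 if policy == "lru" else -1)
        cache.append(obj)
    return count

scan = list(range(5)) * 3             # three passes over 5 objects
print(misses(scan, 4, "lru"))         # 15: every access misses
print(misses(scan, 4, "mru"))         # 7: most of the set stays resident
```

With a cache of 4 and a cyclic scan over 5 objects, LRU always evicts exactly the object that is needed next, while MRU sacrifices one slot and keeps the rest.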

落在眉间の轻吻 2024-07-19 22:44:57

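For a start, you could mmap the list. That works if there is enough address space, not memory, e.g. on 64-bit CPUs. This makes it easier to access the elements in order.

You could sort that list according to a minimum distance in the cache that considers both elements, which works well if the objects are in a contiguous space. The sorting function could be something like: compare (a, b) to (c, d) = (a - c) + (b - d) (which looks like a Hamming distance). Then you pull in slices of the object store and process them according to the list.

EDIT: fixed a mistake in the distance.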
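Read literally as a comparator, the function above can be sketched as follows. This assumes the pointers are integer offsets into a contiguous object store, which the answer implies but does not state; the sample values reuse the digits from the question purely as hypothetical offsets.

```python
from functools import cmp_to_key

def compare(p, q):
    """Comparator from the answer: negative means p sorts before q."""
    (a, b), (c, d) = p, q
    return (a - c) + (b - d)

# Hypothetical integer offsets for three pairs.
pairs = [(8576, 3295), (7856, 2365), (3566, 5464)]
ordered = sorted(pairs, key=cmp_to_key(compare))
print(ordered)  # [(3566, 5464), (7856, 2365), (8576, 3295)]
```

Taken at face value this comparator orders pairs by the sum a + b, so `sorted(pairs, key=sum)` gives the same result. Whether that ordering actually keeps both members of each pair close to their neighbours depends on how the objects are laid out, so it is worth measuring misses on a sample before relying on it.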

霓裳挽歌倾城醉 2024-07-19 22:44:57

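Even though you are not simply sorting this list, the general pattern of a multiway merge sort might be applicable. That is, think about some (possibly recursive) decomposition of the set into smaller sets that can be processed in memory separately, followed by a second phase in which small chunks of the previously processed sets are combined. Even without knowing the specific nature of what you are doing with the pairs, it is safe to say that many algorithmic problems become much simpler when you are dealing with sorted data (including graph problems, which might be what you have on your hands here).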
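The two-phase pattern described above can be sketched as a plain external merge sort over the pair list. This is a minimal sketch, assuming pairs can be pickled and that temporary files are acceptable scratch space; `external_sort`, `chunk_size`, and the helper names are hypothetical, and the sort key would be whatever locality measure suits the data.

```python
import heapq
import os
import pickle
import tempfile

def _write_run(sorted_pairs):
    """Write one sorted run to a temporary file and return its path."""
    f = tempfile.NamedTemporaryFile(delete=False)
    with f:
        for pair in sorted_pairs:
            pickle.dump(pair, f)
    return f.name

def _read_run(path):
    """Stream pairs back from a run file one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def external_sort(pairs, chunk_size, key=None):
    """Yield pairs in sorted order; phase 1 holds at most chunk_size pairs."""
    paths, chunk = [], []
    for pair in pairs:                 # phase 1: sort chunks that fit in memory
        chunk.append(pair)
        if len(chunk) >= chunk_size:
            paths.append(_write_run(sorted(chunk, key=key)))
            chunk = []
    if chunk:
        paths.append(_write_run(sorted(chunk, key=key)))
    try:                               # phase 2: multiway merge of the runs
        yield from heapq.merge(*(_read_run(p) for p in paths), key=key)
    finally:
        for p in paths:
            os.remove(p)

data = [(5, 1), (3, 2), (9, 0), (1, 7), (4, 4)]
print(list(external_sort(data, chunk_size=2)))
```

The merge phase keeps only one pending pair per run in memory, so the list itself never has to fit in RAM. The in-memory `sorted` call in phase 1 could be swapped for any per-chunk re-ordering heuristic, though the merge step then needs a key that is consistent with it.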