Memory management with low memory: finding and tracking duplicates in a random function's return values

Posted 2024-11-08 11:00:23

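Suppose I have a function that takes a 32-bit integer and returns a random 32-bit integer.

Now I want to see how many duplicate values, and which ones, this function returns over all possible inputs from 0 to 2^32-1. This would be easy with more than 4 GB of free RAM, but I have no more than 1 GB.

I tried mapping the computed values to disk, using a 4 GB file in which one byte per value counts how many duplicates it has, but at my HDD's speed the estimated completion time was 25 days away! (I had to use an SSD, for fear of breaking my HDD...)

So the next step is to compute all of this in RAM without touching the disk, but I hit a wall when thinking about how to solve it elegantly. The only method I could come up with is to loop over the function (2^32)*(2^32) times, which is obviously even slower than my HDD method.

What I need now is some nasty ideas to speed this up!

Edit: The function is not truly random, only similar to a random function, but you do not need to know anything about the function; that is not the problem here. I want to see all the duplicates with my own eyes, not just a mathematical estimate of how many there might be. Why am I doing this? Out of curiosity :)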


北凤男飞 2024-11-15 11:00:23

To check for 2^32 possible duplicates you need only 4 gigabits, which is 512 MB, because a single bit per value is enough. The first hit on a zero bit sets it to 1; every hit on a bit that is already 1 means you have found a duplicate, which you can print or handle however you like.

For example, you can do something like this:

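// One bit per possible value: 2^32 bits = 0x08000000 32-bit words = 512 MB.
// The trailing () zero-initializes the array.
static unsigned int *bits = new unsigned int[ 0x08000000 ]();

unsigned int value = nextValue(...);

// Word index (value / 32) and bit mask (value % 32) within that word.
unsigned int idx = value >> 5, bit = 1u << ( value & 31 );
if( bits[ idx ] & bit )
{
    // duplicate: this value has been returned before
}
else
{
    bits[ idx ] |= bit;   // first occurrence: mark the value as seen
}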
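
The snippet above handles a single value. A minimal sketch of the full scan might look like the following; the names SeenBits and countRepeats are made up for illustration, and the function under test is passed in as a parameter f because the question does not show it. The value range is also a parameter so that the code can be tried on a small range first.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bit set with one bit per possible return value (hypothetical helper).
class SeenBits {
public:
    // valueCount: number of distinct values to track (2^32 for the full problem).
    explicit SeenBits(std::uint64_t valueCount)
        : words_(static_cast<std::size_t>((valueCount + 31) / 32), 0u) {}

    // Marks value as seen; returns true if it had already been seen before.
    bool testAndSet(std::uint32_t value) {
        std::uint32_t& word = words_[value >> 5];       // value / 32
        const std::uint32_t bit = 1u << (value & 31);   // value % 32
        const bool seen = (word & bit) != 0;
        word |= bit;
        return seen;
    }

private:
    std::vector<std::uint32_t> words_;
};

// Calls f on the inputs 0 .. inputCount-1 and counts how many calls returned
// a value that an earlier call had already returned.
// Assumption: f only returns values below valueCount.
template <typename F>
std::uint64_t countRepeats(F f, std::uint64_t inputCount, std::uint64_t valueCount) {
    SeenBits seen(valueCount);
    std::uint64_t repeats = 0;
    // A 64-bit counter is used because a 32-bit one can never reach 2^32.
    for (std::uint64_t x = 0; x < inputCount; ++x) {
        if (seen.testAndSet(f(static_cast<std::uint32_t>(x)))) {
            ++repeats;   // a duplicate: print or record the value here
        }
    }
    return repeats;
}
```

For the full problem the call would be countRepeats(f, 1ull << 32, 1ull << 32), which allocates the 512 MB bit set mentioned above.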

In response to your comments:

Yes, putting the duplicates into a map is a good idea if there are not too many duplicates and not too many distinct duplicated values. The worst case is 2^31 entries, if every second value appears exactly twice. If the map becomes too large to be held in memory at once, you can partition it, e.g. by only allowing values in a certain range, such as a quarter of the entire number space. The map would then be only about 1/4 of the size of the full map, provided the duplicates are distributed fairly evenly. You would of course need to run the program 4 times, once per quarter, to find all duplicates.
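
One way to sketch this partitioning is shown below. The function name forEachDuplicate, the report callback, and the parameter f are assumptions made for illustration. The sketch also restricts the bit set to the current range, which goes slightly beyond what is described above, where only the map is partitioned.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Splits the value space [0, valueCount) into `parts` equal ranges and scans
// all inputs once per range. Only values inside the current range are
// tracked, so the bit set and the map cover one range at a time.
// For each value returned more than once, report(value, timesReturned) is called.
// Assumptions: valueCount is a multiple of parts, f returns values below valueCount.
template <typename F, typename Report>
void forEachDuplicate(F f, std::uint64_t inputCount, std::uint64_t valueCount,
                      unsigned parts, Report report) {
    const std::uint64_t rangeSize = valueCount / parts;
    for (unsigned p = 0; p < parts; ++p) {
        const std::uint64_t lo = p * rangeSize;
        const std::uint64_t hi = lo + rangeSize;

        std::vector<bool> seen(static_cast<std::size_t>(rangeSize), false);
        std::map<std::uint32_t, std::uint32_t> extra;   // value -> repeats after the first

        for (std::uint64_t x = 0; x < inputCount; ++x) {
            const std::uint32_t v = f(static_cast<std::uint32_t>(x));
            if (v < lo || v >= hi) continue;            // belongs to another pass
            const std::size_t i = static_cast<std::size_t>(v - lo);
            if (seen[i]) ++extra[v];
            else seen[i] = true;
        }

        // Hand over the results of this range; the map is discarded afterwards.
        for (const auto& entry : extra) {
            report(entry.first, entry.second + 1);      // +1 for the first occurrence
        }
    }
}
```

The price of the smaller memory footprint is that the function is evaluated parts times for every input, so four quarters mean roughly four times the running time.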

To also find the first occurrence of each duplicate, you can run it in two passes. In the first pass you use the bitmap to find the duplicates and put them into the map. In the second pass you skip the bitmap and add the values into the map if there is already an entry in the map and the value is not yet there.
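
The two passes could be sketched as follows. This is one possible reading of the description: the map stores, for each duplicated value, every input that produces it, so the first occurrence is included. The name duplicateInputs and the parameter f are hypothetical, and f is assumed to be deterministic so that both passes see the same values.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Maps every value that f returns more than once to all inputs producing it,
// in increasing input order.
// Assumptions: f is deterministic and returns values below valueCount; for
// the full 2^32 range a 64-bit std::size_t is needed.
template <typename F>
std::map<std::uint32_t, std::vector<std::uint32_t>>
duplicateInputs(F f, std::uint64_t inputCount, std::uint64_t valueCount) {
    std::map<std::uint32_t, std::vector<std::uint32_t>> dups;

    // Pass 1: the bit set detects repeated values; each gets an empty map entry.
    {
        std::vector<bool> seen(static_cast<std::size_t>(valueCount), false);
        for (std::uint64_t x = 0; x < inputCount; ++x) {
            const std::uint32_t v = f(static_cast<std::uint32_t>(x));
            if (seen[v]) {
                dups.emplace(v, std::vector<std::uint32_t>{});  // no-op if present
            } else {
                seen[v] = true;
            }
        }
    }   // the bit set is released here

    // Pass 2: no bit set; record each input whose value has a map entry.
    for (std::uint64_t x = 0; x < inputCount; ++x) {
        const std::uint32_t input = static_cast<std::uint32_t>(x);
        const auto it = dups.find(f(input));
        if (it != dups.end()) {
            it->second.push_back(input);
        }
    }
    return dups;
}
```

Because the bit set is freed before the second pass, the map only has to share memory with it during the first pass, when its entries are still empty.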

No, there is no good reason to prefer an int array over an unsigned int array. You can just as well use unsigned int, which would actually be more appropriate here.
