Memory management with low memory: finding and tracking duplicates among a random function's return values

Posted 2024-11-08 11:00:23

Let's assume I have a function that takes a 32-bit integer as input and returns a random 32-bit integer.

Now I want to see how many and which duplicate values this function returns over all possible input values from 0 to 2^32-1. This would be easy if I had more than 4 GB of free RAM, but I have no more than 1 GB.

I tried to map the calculated values on disk, using a 4 GB file in which one byte recorded how many duplicates a value had, but the estimated finishing time was 25 days away at my HDD speeds! (I had to use an SSD for fear of breaking my HDD...)

So the next step is to calculate all of this in RAM without using the disk at all, but I hit a wall when thinking about how to solve this elegantly. The only method I could think of was to loop over the function (2^32)*(2^32) times, but this is obviously even slower than my HDD method.

What I need now is some nasty ideas to speed this up!

Edit: The function isn't really a random function, just similar to one, but you don't need to know anything about the function; it's not the problem here. I want to see all the duplicates with my own eyes, not just a mathematical estimate of how many there could be. Why am I doing this? Out of curiosity :)

Comments (2)

北凤男飞 2024-11-15 11:00:23

To check for 2^32 possible duplicates you need only 4 gigabits, which is 512 MB, since a single bit per value is enough. The first hit on a zero bit sets it to 1, and on every hit on a 1 bit you know you have a duplicate and can print it out or do whatever you want with it.

For example, you can do something like this:

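// unsigned, so that value >> 5 can never become a negative (out-of-range) index
unsigned int value = nextValue(...);
// 0x08000000 ints * 32 bits each = 2^32 bits (512 MB), zero-initialized by the trailing ()
static int *bits = new int[ 0x08000000 ]();

// idx selects the 32-bit word, bit selects the position inside that word
unsigned int idx = value >> 5, bit = 1u << ( value & 31 );
if( bits[ idx ] & bit )
{
    // duplicate: this value has been returned before
}
else
    bits[ idx ] |= bit;   // first occurrence: mark the value as seen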
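
The fragment above can be wrapped into a complete loop over all inputs. What follows is only a minimal sketch, not the poster's actual program: `SeenBitmap`, `count_repeated_outputs` and the callable `f` are hypothetical names introduced for illustration, and the domain sizes are parameters so the same code can be tried on a tiny domain first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per possible output value; every bit starts at 0 ("not seen yet").
class SeenBitmap {
public:
    explicit SeenBitmap(std::uint64_t value_count)
        : words_(static_cast<std::size_t>((value_count + 31) / 32), 0u) {}

    // Marks `value` as seen and returns true if it had been seen before.
    bool test_and_set(std::uint32_t value) {
        std::uint32_t& word = words_[value >> 5];                    // which 32-bit word
        const std::uint32_t bit = std::uint32_t{1} << (value & 31);  // which bit in it
        const bool seen_before = (word & bit) != 0;
        word |= bit;
        return seen_before;
    }

private:
    std::vector<std::uint32_t> words_;
};

// Counts how many of the inputs 0 .. input_count-1 produce an output
// that some earlier input has already produced.
// `f` is a placeholder for the function under test (an assumption).
template <typename F>
std::uint64_t count_repeated_outputs(F f, std::uint64_t input_count,
                                     std::uint64_t output_count) {
    SeenBitmap seen(output_count);
    std::uint64_t repeats = 0;
    for (std::uint64_t x = 0; x < input_count; ++x) {
        if (seen.test_and_set(f(static_cast<std::uint32_t>(x)))) {
            ++repeats;  // a real program could print x and f(x) here
        }
    }
    return repeats;
}
```

For the full problem both sizes would be `1ull << 32`, which makes the bitmap 2^27 words of 32 bits, i.e. the 512 MB mentioned above. Note that this pass only flags the second and later occurrences of a value, not the first one.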

In response to your comments:

Yes, putting the duplicates into a map is a good idea if there are not too many duplicates and not too many distinct duplicated values. The worst case is 2^31 entries, if every second value appears exactly twice. If the map becomes too large to be held in memory at once, you can partition it, e.g. by only allowing values in a certain range, such as a quarter of the entire number space. Each map would then be only about 1/4 of the size of the full map, provided the duplicates are distributed fairly evenly. You would of course need to run the program 4 times, once for each quarter, to find all duplicates.
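
One way to picture the partitioning is a function that handles a single value range per run. This is a rough sketch under my own assumptions: `duplicates_in_range` and `f` are hypothetical names, and the sketch also shrinks the bitmap to the current range, which goes slightly beyond what is described above (there, only the map is partitioned).

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Returns, for every output value in [range_begin, range_end) that occurs
// more than once among the inputs 0 .. input_count-1, how often it occurs.
// Outputs outside the range are ignored; they belong to another run.
template <typename F>
std::map<std::uint32_t, std::uint32_t> duplicates_in_range(
    F f, std::uint64_t input_count,
    std::uint64_t range_begin, std::uint64_t range_end) {
    // One flag per value of the current range only.
    std::vector<bool> seen(static_cast<std::size_t>(range_end - range_begin), false);
    std::map<std::uint32_t, std::uint32_t> counts;

    for (std::uint64_t x = 0; x < input_count; ++x) {
        const std::uint32_t y = f(static_cast<std::uint32_t>(x));
        if (y < range_begin || y >= range_end) {
            continue;  // handled by a different run
        }
        const std::size_t slot = static_cast<std::size_t>(y - range_begin);
        if (!seen[slot]) {
            seen[slot] = true;  // first occurrence, nothing to record yet
        } else {
            // Second occurrence creates the entry with count 2,
            // every later occurrence increments it.
            auto inserted = counts.emplace(y, 2u);
            if (!inserted.second) {
                ++inserted.first->second;
            }
        }
    }
    return counts;
}
```

With quarters of the 32-bit space, the four runs would use the ranges starting at 0, `1ull << 30`, `2ull << 30` and `3ull << 30`, each `1ull << 30` values wide. The price is that the function under test is evaluated for every input in every run.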

To also find the first occurrence of each duplicate (which the bitmap pass does not report), you can run it in two passes: in the first pass you use the bitmap to find the duplicates and put them into the map. In the second pass you skip the bitmap and add a value to the map if the map already has an entry for it and the value is not yet there.
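
A possible shape for the two passes is sketched below. It is a simplified variant rather than a literal transcription: pass 1 only records which output values are duplicated, and pass 2 then collects every input that produces one of them, so no "is it already there" check is needed. `inputs_of_duplicates` and `f` are hypothetical names, and the sketch assumes the resulting map fits in memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Maps every output value that occurs more than once to the list of all
// inputs (in increasing order) that produce it, including the first one.
template <typename F>
std::map<std::uint32_t, std::vector<std::uint32_t>> inputs_of_duplicates(
    F f, std::uint64_t input_count, std::uint64_t output_count) {
    std::map<std::uint32_t, std::vector<std::uint32_t>> groups;

    // Pass 1: use the bitmap; an output that hits an already-set bit
    // becomes a key in the map (with an empty input list for now).
    {
        std::vector<bool> seen(static_cast<std::size_t>(output_count), false);
        for (std::uint64_t x = 0; x < input_count; ++x) {
            const std::uint32_t y = f(static_cast<std::uint32_t>(x));
            if (seen[y]) {
                groups.emplace(y, std::vector<std::uint32_t>());  // no-op if present
            } else {
                seen[y] = true;
            }
        }
    }  // the bitmap is released here, before the map starts to grow

    // Pass 2: no bitmap; record every input whose output is a known duplicate.
    for (std::uint64_t x = 0; x < input_count; ++x) {
        auto it = groups.find(f(static_cast<std::uint32_t>(x)));
        if (it != groups.end()) {
            it->second.push_back(static_cast<std::uint32_t>(x));
        }
    }
    return groups;
}
```

If the map gets too large, the range partitioning described earlier can be applied to this function in the same way.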

No, there is no good reason to use an int array rather than an unsigned int array. You can just as well use unsigned int, which would actually be more appropriate here.

魔法唧唧 2024-11-15 11:00:23

The unaskable question: why? What are you trying to achieve?

Is this some kind of Monte Carlo experiment?

If not, just look up the implementation algorithm of your (P)RNG and it will tell you exactly what the distribution of values is going to be.

Have a look at Boost.Random for more choices than you can fathom; it has e.g. uniform_int<> and variate generators that can limit your output range while still giving well-defined guarantees on the distribution of values across the output domain.
