Keeping a large list in C for membership tests
Each item is an array of 17 32-bit integers. I can probably produce 120-bit unique hashes for them.
I have an algorithm that produces 9,731,643,264 of these items, and want to see how many of these are unique. I speculate that at most 1/36th of these will be unique but can't be sure.
At this size, I can't really do this in memory (as I only have 4 gigs), so I need a way to persist a list of these, do membership tests, and add each new one if it's not already there.
I am working in C (gcc) on Linux, so it would be good if the solution can work from there.
Any ideas?
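For concreteness, here is a minimal sketch of the data layout described above (the type names are hypothetical; each item is 17 × 4 = 68 bytes, and a 120-bit hash fits in 15 bytes):

```c
#include <stdint.h>

/* Hypothetical layout of one item: 17 x 32-bit integers = 68 bytes. */
typedef struct {
    uint32_t words[17];
} item_t;

/* A 120-bit hash packed into 15 bytes. */
typedef struct {
    uint8_t bytes[15];
} hash120_t;
```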
2 Answers
This reminds me of some of the problems I faced working on a solution to "Knight's Tour" many years ago. (A math problem which is now solved, but not by me.)
Even your hashes aren't that much help... at nearly the size of a GUID, they could easily be unique across the entire known universe.
It will take approximately 0.75 terabytes just to hold the list on disk... 4 gigs of memory or not, you'd still need a huge disk just to hold them. And you'd need double that much disk or more to do the sort/merge solution I talk about below.
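For a rough sense of scale: 9,731,643,264 items × 68 bytes each comes to about 0.66 TB of raw data before any file overhead, and even keeping only the 15-byte (120-bit) hashes would take roughly 146 GB, so some form of on-disk processing is unavoidable.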
If you could SORT that list, then you could just go through the list one item at a time, looking for duplicates sitting next to each other. Of course, sorting that much data would require a custom sort routine (one you wrote), since it is binary (converting to hex would double the size of your data, but would allow you to use standard routines)... though even there they would probably choke on that much data... so you are back to your own custom routines.
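To illustrate the adjacent-duplicates scan, here is a minimal sketch that counts distinct items in an already-sorted flat binary file (the file name and fixed-size record layout are assumptions, not something given in the question):

```c
#include <stdio.h>
#include <string.h>

#define ITEM_BYTES 68   /* 17 x 32-bit integers per item */

/* Count distinct items in an already-sorted flat binary file of fixed-size
 * records.  In sorted order all copies of a value are adjacent, so a single
 * linear pass comparing each record with the previous one is enough. */
int main(void)
{
    FILE *f = fopen("items.sorted", "rb");   /* hypothetical file name */
    if (!f) { perror("fopen"); return 1; }

    unsigned char prev[ITEM_BYTES], cur[ITEM_BYTES];
    unsigned long long total = 0, distinct = 0;

    while (fread(cur, 1, ITEM_BYTES, f) == ITEM_BYTES) {
        if (total == 0 || memcmp(cur, prev, ITEM_BYTES) != 0)
            distinct++;                      /* first copy of a new value */
        memcpy(prev, cur, ITEM_BYTES);
        total++;
    }
    fclose(f);

    printf("%llu items, %llu distinct\n", total, distinct);
    return 0;
}
```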
Some things to think about:
Sorting that much data will take weeks, months or perhaps years. While you can do a nice heap sort or whatever in memory, because you only have so much disk space, you will likely be doing a "bubble" sort of the files regardless of what you do in memory.
Depending on what your generation algorithm looks like, you could generate "one memory load" worth of data, sort it in place, then write it out to a file on disk (already sorted). Once that is done, you just have to "merge" all of those individual sorted files, which is a much easier task (even though there would be thousands of files, it would still be a relatively easy task); see the sketch after this list.
If your generator can tell you ANYTHING about your data, use that to your advantage. For instance, in my case, as I processed the Knight's moves, I knew my output values were constantly getting bigger (because I was always adding one bit per move); that small piece of knowledge allowed me to optimize my sort in some unique ways. Look at your data and see if you know anything similar.
Making the data smaller is always good, of course. For instance, you talk about a 120-bit hash, but is that hash reversible? If so, sort the hashes, since they are smaller. If not, the hash might not be that much help (at least for my sorting solutions).
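As a rough sketch of the generate-sort-write idea from the list above (the run size, file naming, placeholder generator, and memcmp-based comparator are all assumptions):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITEM_BYTES    68                        /* 17 x 32-bit integers         */
#define ITEMS_PER_RUN (16u * 1024 * 1024)       /* ~1.1 GB per run; tune to RAM */

/* Order items as raw byte strings; any consistent total order will do. */
static int cmp_items(const void *a, const void *b)
{
    return memcmp(a, b, ITEM_BYTES);
}

/* Placeholder so the sketch compiles: the real generator would fill `buf`
 * with up to `max` freshly produced items and return how many it wrote. */
static size_t generate_items(unsigned char *buf, size_t max)
{
    (void)buf; (void)max;
    return 0;
}

int main(void)
{
    unsigned char *buf = malloc((size_t)ITEMS_PER_RUN * ITEM_BYTES);
    if (!buf) { perror("malloc"); return 1; }

    for (unsigned run = 0; ; run++) {
        size_t n = generate_items(buf, ITEMS_PER_RUN);
        if (n == 0)
            break;                               /* generator exhausted */

        qsort(buf, n, ITEM_BYTES, cmp_items);    /* sort this run in memory */

        char name[64];
        snprintf(name, sizeof name, "run-%05u.bin", run);   /* hypothetical naming */
        FILE *out = fopen(name, "wb");
        if (!out) { perror("fopen"); free(buf); return 1; }
        fwrite(buf, ITEM_BYTES, n, out);
        fclose(out);
    }
    free(buf);
    return 0;   /* the sorted run files are now ready to be merged */
}
```

Each run file is internally sorted, so the final pass is a straightforward k-way merge that can also drop duplicates as it goes.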
I am interested in the mechanics of issues like this, and I'd be happy to exchange emails on this subject just to bang around ideas and possible solutions.
You can probably make your life a lot easier if you can place some restrictions on your input data: even assuming only 120 significant bits, the high number of duplicate values suggests an uneven distribution, since an even distribution would make duplicates unlikely for a given sample size of 10^10. If you have continuous clusters (instead of sparse but repeated values), you can gain a lot by operating on ranges instead of atomic values.
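For example, a run of consecutive values can be stored as a single inclusive [lo, hi] pair instead of as individual entries; a minimal sketch of such a range record (the exact layout is an assumption):

```c
#include <stdint.h>

/* One inclusive range of 120-bit values, each bound packed into 15 bytes
 * (big-endian, so memcmp() gives the numeric order).  A file of these,
 * sorted by `lo`, can replace a much larger file of individual values
 * whenever the data forms contiguous clusters. */
typedef struct {
    uint8_t lo[15];   /* first value in the range */
    uint8_t hi[15];   /* last value in the range  */
} range120_t;
```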
What I would do:
Then, you need to merge the individual files, which can be done online, i.e. as the files become available, the same way a stack-based mergesort operates: associate with each file a counter equal to the number of ranges in the file and push each new file onto a stack. When the file on top of the stack has a counter greater than or equal to that of the file below it, merge the two files into a new file whose counter is the number of ranges in the merged file.
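A minimal sketch of that stack discipline (the merge_two() helper is a stub here; the real version would stream both input files and coalesce adjacent or overlapping ranges):

```c
#include <stdio.h>

#define MAX_RUNS 64    /* the ">=" merge rule keeps the stack shallow */

typedef struct {
    char   name[64];   /* file of sorted, non-overlapping ranges */
    size_t nranges;    /* counter: number of ranges in that file */
} run_t;

static run_t stack[MAX_RUNS];
static int   top = -1;          /* index of the topmost run, -1 when empty */

/* Stub standing in for the real work: stream-merge two sorted range files
 * into `out`, coalescing adjacent/overlapping ranges, and return how many
 * ranges were written.  Here it just returns an upper bound so it compiles. */
static size_t merge_two(const run_t *a, const run_t *b, const char *out)
{
    (void)out;
    return a->nranges + b->nranges;
}

/* Push a freshly written run file, then keep merging while the file on top
 * has at least as many ranges as the file beneath it (the rule above). */
void push_run(const char *name, size_t nranges)
{
    ++top;
    snprintf(stack[top].name, sizeof stack[top].name, "%s", name);
    stack[top].nranges = nranges;

    while (top > 0 && stack[top].nranges >= stack[top - 1].nranges) {
        char merged[64];
        snprintf(merged, sizeof merged, "merged-%d.bin", top);  /* hypothetical name */
        size_t n = merge_two(&stack[top - 1], &stack[top], merged);

        --top;                                   /* pop both inputs ...     */
        snprintf(stack[top].name, sizeof stack[top].name, "%s", merged);
        stack[top].nranges = n;                  /* ... and push the result */
    }
}
/* Once the generator finishes, merge whatever runs remain, top-down. */
```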