Creating a unique list from a dataset too large to fit in memory
I have a list of 120 million records of around 40/50 bytes each, which is about 5.5/6 gigabytes of raw memory space, not including any extra storage required to keep an array in memory.
I'd like to make sure this list is unique. The way I have tried to do it is to create a HashSet<string> and add all the entries to it one by one.
When I get to about 33 million records I'm out of memory and the list creation slows to a crawl.
Is there a better way to sort this massive list of entries in a timely manner? The only solution I can think of is using an Amazon EC2 High-Memory Quadruple Extra Large Instance for an hour.
Thanks
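
For reference, the approach described above boils down to something like the following minimal sketch (the input file name and the one-record-per-line format are my assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class NaiveDedup
{
    static void Main()
    {
        // Hypothetical input: one record per line in records.txt.
        var seen = new HashSet<string>();
        foreach (var line in File.ReadLines("records.txt"))
        {
            seen.Add(line); // every distinct record stays in memory for the whole run
        }
        Console.WriteLine($"Unique records: {seen.Count}");
    }
}
```

The whole set of distinct strings has to stay resident at once, which is why it falls over around the 33 million mark.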
3 Answers
If you're just trying to check for uniqueness, I would simply split the input sequence into buckets, and then check each bucket separately.
For example, assuming you're loading the data from a file, you could stream the input in, and write it out to 26 different files, one for each letter that the record starts with (I'm naively assuming each record starts with A-Z - please adjust for your real situation). Then you can check each of those smaller files for uniqueness using something like your existing code - because none of them will be too large to fit into memory at a time. The initial bucketing guarantees that there won't be any duplicate entries that end up in different buckets.
Of course, there are various different ways you could perform the bucketing, and different approaches will be effective for different data sets. You could bucket by hash code, for example - take the bottom 5 bits of the hash code to create 32 different buckets. That's likely to get a reasonably equal distribution of records between buckets, and doesn't make any assumptions about the input data. I only mentioned the "take the first letter approach" above as it's a simpler way of grasping the concept :)
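
Here's a rough sketch of the hash-code bucketing variant, assuming one record per line; the file names (records.txt, bucket_N.txt) are placeholders of my own:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class BucketedUniquenessCheck
{
    const int BucketCount = 32; // bottom 5 bits of the hash code

    static void Main()
    {
        // Pass 1: stream the input and append each record to one of 32 bucket files.
        var writers = new StreamWriter[BucketCount];
        for (int i = 0; i < BucketCount; i++)
            writers[i] = new StreamWriter($"bucket_{i}.txt");

        foreach (var line in File.ReadLines("records.txt"))
        {
            // string.GetHashCode() is only stable within a single process run,
            // which is fine here because both passes run in the same process.
            int bucket = line.GetHashCode() & (BucketCount - 1);
            writers[bucket].WriteLine(line);
        }
        foreach (var writer in writers)
            writer.Dispose();

        // Pass 2: duplicates can only occur within a bucket, never across buckets,
        // so each (much smaller) bucket can be checked in memory on its own.
        bool allUnique = true;
        for (int i = 0; i < BucketCount; i++)
        {
            var seen = new HashSet<string>();
            foreach (var line in File.ReadLines($"bucket_{i}.txt"))
            {
                if (!seen.Add(line))
                {
                    Console.WriteLine($"Duplicate: {line}");
                    allUnique = false;
                }
            }
        }
        Console.WriteLine(allUnique ? "All records are unique." : "Duplicates found.");
    }
}
```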
Use bucket sort to sort the list, flushing some of the contents of the buckets out to disk regularly to avoid running out of memory. Then load each flushed bucket in sequence and either use your HashSet approach or sort it and check it that way.
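
If you go with the "sort it and check it" variant, one flushed bucket can be verified roughly like this (the bucket file name is a placeholder, and each bucket is assumed to fit in memory once loaded):

```csharp
using System;
using System.IO;

class SortedBucketCheck
{
    static void Main()
    {
        // Hypothetical bucket file produced by an earlier bucketing/flushing pass.
        var records = File.ReadAllLines("bucket_0.txt");
        Array.Sort(records, StringComparer.Ordinal);

        // After sorting, any duplicate sits right next to its twin.
        for (int i = 1; i < records.Length; i++)
        {
            if (records[i] == records[i - 1])
                Console.WriteLine($"Duplicate: {records[i]}");
        }
    }
}
```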
You could always work in a sqlite database with a unique index as it may help for further processing on the dataset.
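
As a sketch of that idea, using the Microsoft.Data.Sqlite package (my choice of driver; System.Data.SQLite would look similar) and placeholder file names: a UNIQUE column plus INSERT OR IGNORE silently drops duplicates, and a final count tells you how many distinct records remain.

```csharp
using System;
using System.IO;
using Microsoft.Data.Sqlite; // NuGet package: Microsoft.Data.Sqlite

class SqliteDedup
{
    static void Main()
    {
        using var connection = new SqliteConnection("Data Source=dedup.db");
        connection.Open();

        using (var create = connection.CreateCommand())
        {
            // The UNIQUE constraint builds the index that enforces distinctness.
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS records (value TEXT NOT NULL UNIQUE)";
            create.ExecuteNonQuery();
        }

        // A single transaction keeps the inserts fast; INSERT OR IGNORE
        // skips any record the unique index has already seen.
        using (var transaction = connection.BeginTransaction())
        using (var insert = connection.CreateCommand())
        {
            insert.Transaction = transaction;
            insert.CommandText =
                "INSERT OR IGNORE INTO records (value) VALUES ($value)";
            var valueParam = insert.CreateParameter();
            valueParam.ParameterName = "$value";
            insert.Parameters.Add(valueParam);

            foreach (var line in File.ReadLines("records.txt"))
            {
                valueParam.Value = line;
                insert.ExecuteNonQuery();
            }
            transaction.Commit();
        }

        using var count = connection.CreateCommand();
        count.CommandText = "SELECT COUNT(*) FROM records";
        Console.WriteLine($"Distinct records: {count.ExecuteScalar()}");
    }
}
```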