Finding duplicate entries in a large file
I have a really large file with approximately 15 million entries.
Each line in the file contains a single string (call it a key).
I need to find the duplicate entries in the file using Java.
I tried to use a HashMap and detect duplicate entries.
Apparently that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.
How can I solve this problem?
I think I could increase the heap space and try it, but I wanted to know if there are more efficient solutions that don't require tweaking the heap space.
9 Answers
The key is that your data will not fit into memory. You can use an external merge sort for this:
Partition your file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).
Merge the chunks and eliminate the duplicates again while merging. Since you will have an n-way merge here, you can keep the next k elements from each chunk in memory; once the items of a chunk are depleted (they have already been merged), grab more from disk.
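A minimal Java sketch of that approach, assuming one key per line as in the question (the chunk size, temp-file handling, and duplicate reporting are my own placeholder choices, not part of the original answer):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalSortDedup {

    // Phase 1: read the big file in chunks that fit in memory, sort each
    // chunk, report/drop duplicates inside the chunk, and write it out.
    static List<Path> writeSortedChunks(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>(chunkSize);
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == chunkSize) {
                    chunks.add(flushChunk(buffer));
                }
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(flushChunk(buffer));
        }
        return chunks;
    }

    static Path flushChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
            String previous = null;
            for (String key : buffer) {
                if (key.equals(previous)) {
                    System.out.println("duplicate: " + key);  // duplicate within this chunk
                } else {
                    out.write(key);
                    out.newLine();
                }
                previous = key;
            }
        }
        buffer.clear();
        return chunk;
    }

    // Phase 2: n-way merge of the sorted chunks. Equal keys coming from
    // different chunks appear back to back in the merged order, so they are
    // reported here. (A key duplicated both inside a chunk and across chunks
    // may be reported more than once.)
    static void mergeAndReportDuplicates(List<Path> chunks) throws IOException {
        PriorityQueue<ChunkReader> queue = new PriorityQueue<>();
        for (Path chunk : chunks) {
            ChunkReader reader = new ChunkReader(chunk);
            if (reader.head != null) queue.add(reader);
        }
        String previous = null;
        while (!queue.isEmpty()) {
            ChunkReader reader = queue.poll();
            if (reader.head.equals(previous)) {
                System.out.println("duplicate: " + reader.head);
            }
            previous = reader.head;
            if (reader.advance()) queue.add(reader);
        }
    }

    // One open chunk file plus its current (smallest unread) line.
    static class ChunkReader implements Comparable<ChunkReader> {
        final BufferedReader in;
        String head;

        ChunkReader(Path chunk) throws IOException {
            in = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
            head = in.readLine();
        }

        boolean advance() throws IOException {
            head = in.readLine();
            if (head == null) in.close();
            return head != null;
        }

        public int compareTo(ChunkReader other) {
            return head.compareTo(other.head);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = writeSortedChunks(Paths.get(args[0]), 1_000_000);
        mergeAndReportDuplicates(chunks);
    }
}
```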
I'm not sure if you'd consider doing this outside of Java, but if so, it's very simple in a shell, for example with something like sort file.txt | uniq -d (uniq -d prints only the lines that occur more than once in the sorted output).
You probably can't load the entire file at one time, but you can store the hash and line number in a HashSet without a problem.
Pseudo code...
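One way to flesh this out in Java (a sketch of my own reading of the suggestion, not the answer's original pseudo code): remember only 32-bit hash codes in a first pass, then keep the actual strings and line numbers only for hashes that occurred more than once.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class HashFirstPass {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: remember only the 32-bit hash of each key. Hashes seen more
        // than once mark *candidate* duplicates (two different keys can share
        // a hash, so this is not proof yet). Boxed Integers are still fairly
        // heavy for 15 million keys; a primitive-int set or a bitmap would be
        // cheaper, but this keeps the sketch close to the suggestion.
        Set<Integer> seenHashes = new HashSet<>();
        Set<Integer> candidateHashes = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                int h = key.hashCode();
                if (!seenHashes.add(h)) {
                    candidateHashes.add(h);
                }
            }
        }

        // Pass 2: keep the actual strings only for candidate hashes, which
        // should be a small fraction of the keys, and report the ones that
        // really repeat, together with their line numbers.
        Map<String, Integer> firstLineOfKey = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            int lineNumber = 0;
            while ((key = in.readLine()) != null) {
                lineNumber++;
                if (!candidateHashes.contains(key.hashCode())) {
                    continue;
                }
                Integer firstLine = firstLineOfKey.putIfAbsent(key, lineNumber);
                if (firstLine != null) {
                    System.out.println("duplicate \"" + key + "\" at lines "
                            + firstLine + " and " + lineNumber);
                }
            }
        }
    }
}
```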
I don't think you need to sort the data to eliminate duplicates. Just use a quicksort-inspired approach.
Note that k can be equal to 1.
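The answer's original pseudo code isn't shown, but one plausible reading (my assumption) is a recursive, quicksort-style partitioning of the file on disk, where k would be the number of pivots used per step; the sketch below uses a single pivot (k = 1). Duplicates of the pivot are found while partitioning, and each partition is processed recursively until it fits in memory, so the data is never fully sorted.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashSet;
import java.util.Set;

public class PartitionDedup {

    // Pieces at or below this size are handled with a plain in-memory HashSet.
    static final long IN_MEMORY_BYTES = 64L * 1024 * 1024;

    static void findDuplicates(Path file) throws IOException {
        if (Files.size(file) <= IN_MEMORY_BYTES) {
            Set<String> seen = new HashSet<>();
            for (String key : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (!seen.add(key)) {
                    System.out.println("duplicate: " + key);
                }
            }
            return;
        }
        // Too big: take the first line as the pivot (a randomly chosen line
        // would be a safer pivot) and split the rest into a "less" and a
        // "greater" file, like one step of a 3-way quicksort. Keys equal to
        // the pivot are only counted: every occurrence beyond the first one
        // is a duplicate.
        Path less = Files.createTempFile("less", ".txt");
        Path greater = Files.createTempFile("greater", ".txt");
        String pivot;
        long equalCount = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter lessOut = Files.newBufferedWriter(less, StandardCharsets.UTF_8);
             BufferedWriter greaterOut = Files.newBufferedWriter(greater, StandardCharsets.UTF_8)) {
            pivot = in.readLine();
            equalCount = 1;
            String key;
            while ((key = in.readLine()) != null) {
                int cmp = key.compareTo(pivot);
                if (cmp < 0) {
                    lessOut.write(key);
                    lessOut.newLine();
                } else if (cmp > 0) {
                    greaterOut.write(key);
                    greaterOut.newLine();
                } else {
                    equalCount++;
                }
            }
        }
        for (long i = 1; i < equalCount; i++) {
            System.out.println("duplicate: " + pivot);
        }
        findDuplicates(less);
        findDuplicates(greater);
        Files.delete(less);
        Files.delete(greater);
    }

    public static void main(String[] args) throws IOException {
        findDuplicates(Paths.get(args[0]));
    }
}
```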
One way I can imagine solving this is to first use an external sorting algorithm to sort the file (searching for
external sort java
yields lots of results with code). Then you can iterate over the file line by line; duplicates will now obviously follow each other directly, so you only need to remember the previous line while iterating.
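Once the file has been sorted externally (see the merge-sort sketch above), the scan itself is short; a sketch, assuming the sorted file path is passed as the first argument:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedFileScan {

    // Assumes the file has already been sorted, so equal keys are adjacent.
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String previous = null;
            String key;
            while ((key = in.readLine()) != null) {
                if (key.equals(previous)) {
                    System.out.println("duplicate: " + key);
                }
                previous = key;  // only the previous line is kept in memory
            }
        }
    }
}
```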
If you cannot build up a complete list because you don't have enough memory, you might try doing it in loops. I.e., create a hashmap but only store a small portion of the items (for example, those starting with 'A'). Then gather the duplicates, then continue with 'B', etc.
Of course you can select any kind of 'grouping' (e.g. the first 3 characters, the first 6, etc.).
It will just take (many) more iterations.
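A sketch of this multi-pass idea, assuming keys are grouped by their first character (here only 'A' through 'Z' for brevity; a real grouping would have to cover every character a key can start with):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PrefixPassDedup {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // One full pass over the file per group; only keys belonging to the
        // current group are held in memory at any time. A longer prefix
        // (first 2-3 characters) means more passes but smaller sets.
        for (char group = 'A'; group <= 'Z'; group++) {
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String key;
                while ((key = in.readLine()) != null) {
                    if (key.isEmpty() || Character.toUpperCase(key.charAt(0)) != group) {
                        continue;  // not part of this pass
                    }
                    if (!seen.add(key)) {
                        System.out.println("duplicate: " + key);
                    }
                }
            }
        }
    }
}
```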
You might try a Bloom filter, if you're willing to accept a certain amount of statistical error. Guava provides one, but there's a pretty major bug in it right now that should probably be fixed next week with release 11.0.2.
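A sketch using Guava's BloomFilter (assuming a recent Guava release where that bug is fixed; the expected-insertions count and the 1% false-positive rate are my own choices):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BloomDedup {

    public static void main(String[] args) throws IOException {
        // Sized for the ~15 million keys from the question; the 1% false
        // positive rate is the "statistical error" mentioned above: a small
        // fraction of unique keys will be flagged as possible duplicates.
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 15_000_000, 0.01);

        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                if (seen.mightContain(key)) {
                    // Probably a duplicate (or a false positive); an exact
                    // second pass over just these keys could confirm it.
                    System.out.println("possible duplicate: " + key);
                } else {
                    seen.put(key);
                }
            }
        }
    }
}
```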
"The key is that your data will not fit into memory." (BrokenGlass)
With enough memory to store a Map of key hash values to something that locates the key, such as an offset for RandomAccessFile.seek() or a line number as Andrew White suggested, you can process non-unique keys as they are identified.
Otherwise, establish a map of hash values to "maybe seen before" (e.g. a 3 MB bitmap indexed with key.hashCode() % (3<<23)) in a first pass, and in a second pass handle only keys from buckets that were hit at least twice.
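A sketch of the second variant (the bitmap pre-pass), with two adaptations of my own: java.util.BitSet stands in for the raw bitmap, and a second bitmap tracks "hit at least twice" so the second pass knows which buckets to keep:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class BitmapTwoPass {

    // 3<<23 buckets: roughly 25 million bits, i.e. ~3 MB per bitmap.
    private static final int BUCKETS = 3 << 23;

    private static int bucketOf(String key) {
        // Math.floorMod avoids a negative index for negative hash codes.
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: mark which hash buckets are hit once and which at least twice.
        BitSet seenOnce = new BitSet(BUCKETS);
        BitSet seenTwice = new BitSet(BUCKETS);
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                int bucket = bucketOf(key);
                if (seenOnce.get(bucket)) {
                    seenTwice.set(bucket);
                } else {
                    seenOnce.set(bucket);
                }
            }
        }

        // Pass 2: only keys whose bucket was hit at least twice can be
        // duplicates, so only those strings are kept in memory and compared.
        Set<String> candidates = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                if (!seenTwice.get(bucketOf(key))) {
                    continue;
                }
                if (!candidates.add(key)) {
                    System.out.println("duplicate: " + key);
                }
            }
        }
    }
}
```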