Checking unique line data in a 5 million row file in Java
I have a big file with rows like ID|VALUE that I need to process in one pass. If an ID repeats, the line must be ignored. How can I make this check efficient?

Added: the ID is a long (8 bytes). I need a solution that uses a minimum of memory.

Thanks for the help, guys. I was able to increase the heap space and use a Set.
6 Answers
You can store the data in a TLongObjectHashMap or use a TLongHashSet. These classes store primitive-based information efficiently.
Five million long values will use less than 60 MB in a TLongHashSet; a TLongObjectHashMap will also store your values efficiently.
To find out more about these classes:
http://www.google.co.uk/search?q=TLongHashSet
http://www.google.co.uk/search?q=TLongObjectHashMap
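A minimal sketch of this approach, assuming Trove 3 is on the classpath (the file names and the pipe-delimited parsing are placeholders for your actual setup):

    import gnu.trove.set.hash.TLongHashSet;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class TroveDedup {
        public static void main(String[] args) throws IOException {
            // Pre-size for ~5 million IDs to avoid rehashing.
            TLongHashSet seen = new TLongHashSet(5_000_000);
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                 PrintWriter out = new PrintWriter("output.txt")) {
                String line;
                while ((line = in.readLine()) != null) {
                    long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                    // add() returns false if the ID was already in the set.
                    if (seen.add(id)) {
                        out.println(line);
                    }
                }
            }
        }
    }

Because the set stores raw longs rather than boxed Long objects, the per-element overhead stays close to the 8 bytes of the value itself.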
You'll have to store the IDs somewhere anyway in order to detect duplicates. Here I'd use a HashSet<String> and its contains method.
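A sketch of that idea, one pass over the file, using the ID portion of each line as the set key (file names are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashSet;
    import java.util.Set;

    public class HashSetDedup {
        public static void main(String[] args) throws IOException {
            Set<String> seenIds = new HashSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                 PrintWriter out = new PrintWriter("output.txt")) {
                String line;
                while ((line = in.readLine()) != null) {
                    String id = line.substring(0, line.indexOf('|'));
                    if (!seenIds.contains(id)) {
                        seenIds.add(id);
                        out.println(line);
                    }
                }
            }
        }
    }

Since the IDs are 8-byte longs, parsing them and using a HashSet<Long> avoids retaining the strings, though each boxed Long still carries object overhead; that is why primitive collections like TLongHashSet above use noticeably less memory.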
You have to read the entire file, one line at a time. You have to keep a Set of IDs and compare the incoming one to the values already in the Set. If a value appears, skip that line.
You wrote the use case yourself; there's no magic here.
This looks like a typical database task to me. If your app already uses a database, you could utilize it for this. Create a table with a UNIQUE INTEGER field and start adding rows; you'll get an exception on duplicated IDs. The database engine will take care of cursor windowing and caching, so it fits in your memory budget. Then just drop the table when you're done.
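A sketch of that idea over JDBC, assuming an embedded H2 database is on the classpath (the JDBC URL and table name are placeholders; BIGINT is used because the IDs are 8-byte longs):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DbDedup {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection("jdbc:h2:mem:dedup")) {
                try (Statement st = con.createStatement()) {
                    st.execute("CREATE TABLE seen_ids (id BIGINT PRIMARY KEY)");
                }
                try (PreparedStatement insert =
                         con.prepareStatement("INSERT INTO seen_ids (id) VALUES (?)")) {
                    long[] ids = {42L, 7L, 42L};   // stand-in for IDs parsed from the file
                    for (long id : ids) {
                        try {
                            insert.setLong(1, id);
                            insert.executeUpdate();
                            // Insert succeeded: first time this ID appears, keep the line.
                        } catch (SQLException duplicate) {
                            // Unique-constraint violation: repeated ID, ignore the line.
                        }
                    }
                }
            }
        }
    }

Using the exception as the duplicate signal keeps the Java side trivial, at the cost of one round trip per row; a batch insert with ON CONFLICT/MERGE semantics would be faster if the database supports it.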
There are two basic solutions.
First, as suggested by duffymo and Andreas_D above, you can store all the values in a Set. This gives you O(n) time complexity and O(n) memory usage.
Second, if O(n) memory is too much, you can do it in O(1) memory by sacrificing speed: for each line in the file, read all the lines before it, and discard the line if its ID already appeared earlier.
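A sketch of the second, constant-memory variant, re-scanning the file from the start for every line (file name is a placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ConstantMemoryDedup {
        public static void main(String[] args) throws IOException {
            try (BufferedReader outer = new BufferedReader(new FileReader("input.txt"))) {
                String line;
                long lineNo = 0;
                while ((line = outer.readLine()) != null) {
                    String id = line.substring(0, line.indexOf('|'));
                    if (!seenEarlier(id, lineNo)) {
                        System.out.println(line);   // first occurrence: keep the line
                    }
                    lineNo++;
                }
            }
        }

        // Re-scan the first 'limit' lines of the file looking for the same ID.
        private static boolean seenEarlier(String id, long limit) throws IOException {
            try (BufferedReader inner = new BufferedReader(new FileReader("input.txt"))) {
                for (long i = 0; i < limit; i++) {
                    String line = inner.readLine();
                    if (line == null) {
                        return false;
                    }
                    if (id.equals(line.substring(0, line.indexOf('|')))) {
                        return true;
                    }
                }
            }
            return false;
        }
    }

Note the cost: this is O(n^2) line reads, on the order of 1.25 x 10^13 for 5 million lines, so it is only practical for much smaller files.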
What about probabilistic algorithms?
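The answer doesn't name one, but a Bloom filter is the usual probabilistic fit here. A sketch using Guava's BloomFilter (assuming Guava is on the classpath; the sample IDs are placeholders). The trade-off: false positives mean a small fraction of genuinely unique lines would be wrongly discarded, so this only works if that is acceptable.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class BloomDedup {
        public static void main(String[] args) {
            // ~5 million expected IDs at a 1% false-positive rate costs only a few MB.
            BloomFilter<Long> seen =
                    BloomFilter.create(Funnels.longFunnel(), 5_000_000, 0.01);
            long[] ids = {42L, 7L, 42L};   // stand-in for IDs parsed from the file
            for (long id : ids) {
                // put() returns true only when the ID was definitely not seen before;
                // false means a probable duplicate (or a false positive), so skip it.
                if (seen.put(id)) {
                    System.out.println("keep line with ID " + id);
                }
            }
        }
    }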