Removing duplicate data from a file
I have a problem coming up with an algorithm. Can you guys help me out here?
I have a file which is huge and thus cannot be loaded at once. It contains duplicate data (generic data, possibly strings). I need to remove the duplicates.
One easy but slow solution is to read the first gigabyte into a HashSet, then read the rest of the file sequentially and drop every string that is already in the set. Then read the second gigabyte into memory (a HashSet) and remove its duplicates from the file, and so on, again and again...
It's quite easy to program, and if you only need to do it once it could be enough.
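A rough sketch of this multi-pass idea in Java, assuming line-oriented string records; the file name, chunk size, and the in-memory size estimate are placeholders, not part of the original answer:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

public class ChunkedDedup {
    // Rough limit on how much line data to hold in memory per pass (assumption, ~1 GB).
    static final long CHUNK_BYTES = 1L << 30;

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("data.txt"); // placeholder input file
        long start = 0;                    // index of the first line of the current chunk
        while (start >= 0) {
            start = dedupPass(file, start); // one pass per in-memory chunk
        }
    }

    // Loads one chunk of lines (from line index 'start') into a HashSet, then copies
    // the file to a temp file while dropping every later line that is already in the
    // set, and replaces the original. Lines before 'start' were handled by earlier
    // passes. Returns the line index where the next chunk begins, or -1 at end of file.
    static long dedupPass(Path file, long start) throws IOException {
        Set<String> chunk = new HashSet<>();
        Path tmp = Files.createTempFile("dedup", ".txt");
        long index = 0, kept = 0, chunkBytes = 0, nextStart = -1;
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                boolean keep = true;
                if (index >= start) {
                    if (nextStart < 0 && chunkBytes < CHUNK_BYTES) {
                        keep = chunk.add(line);              // also drops duplicates inside the chunk
                        if (keep) chunkBytes += 2L * line.length(); // rough size estimate
                    } else {
                        if (nextStart < 0) nextStart = kept; // next pass starts after this chunk
                        keep = !chunk.contains(line);        // drop duplicates of chunk lines
                    }
                }
                if (keep) {
                    w.write(line);
                    w.newLine();
                    kept++;
                }
                index++;
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        return nextStart;
    }
}
```

Each pass rewrites the whole file, so the cost is roughly (file size / chunk size) full scans, which matches the "easy but slow" trade-off described above.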
You can calculate a hash for each record and keep it in a Map<hash, Set<file position>>.
Read through the file building the map; if you find that a hash key already exists in the map, seek to the stored position to double-check (and if the records are not equal, add the new location to the mapped set).
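A minimal sketch of that idea in Java, assuming one record per line; the map type Map<Integer, Set<Long>> (record hash to file offsets) is a reconstruction of the generics stripped from the answer, and the file names are placeholders:

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HashIndexDedup {
    public static void main(String[] args) throws IOException {
        // Placeholder file names; records are assumed to be one per line.
        try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r");
             BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("deduped.txt"), StandardCharsets.UTF_8))) {
            // hash of a record -> file offsets of distinct records already seen with that hash
            Map<Integer, Set<Long>> index = new HashMap<>();
            long pos = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                Set<Long> sameHash = index.computeIfAbsent(line.hashCode(), k -> new HashSet<>());
                boolean duplicate = false;
                for (long candidate : sameHash) {
                    // Same hash is not enough: seek back and compare the actual records.
                    long resume = raf.getFilePointer();
                    raf.seek(candidate);
                    String other = raf.readLine();
                    raf.seek(resume);
                    if (line.equals(other)) {
                        duplicate = true;
                        break;
                    }
                }
                if (!duplicate) {
                    sameHash.add(pos);   // remember where this distinct record lives
                    w.write(line);
                    w.newLine();
                }
                pos = raf.getFilePointer();
            }
        }
    }
}
```

Only hashes and offsets stay in memory, which is much smaller than the records themselves, and the seek-and-compare step guards against hash collisions. (Note that RandomAccessFile.readLine decodes bytes one-to-one, so this sketch is only safe for ASCII-like data.)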
Second solution:
Depending on how the input is laid out in the file, i.e. if each line can be represented as a row of data:
Another way is to use a database server: insert your data into a database table with a unique value column, reading from the file and inserting into the database. At the end, the database will contain all the unique lines/rows.
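A possible sketch of this database route using JDBC; it assumes a SQLite JDBC driver (e.g. org.xerial's sqlite-jdbc) is on the classpath, and the database, table, column, and file names are placeholders. The UNIQUE column plus INSERT OR IGNORE makes the database silently drop duplicate rows:

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class DbDedup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:dedup.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS lines (value TEXT UNIQUE)");
            }
            con.setAutoCommit(false); // one transaction, flushed in batches below
            try (PreparedStatement ps =
                         con.prepareStatement("INSERT OR IGNORE INTO lines(value) VALUES (?)");
                 BufferedReader r = Files.newBufferedReader(Paths.get("data.txt"),
                         StandardCharsets.UTF_8)) {
                String line;
                int pending = 0;
                while ((line = r.readLine()) != null) {
                    ps.setString(1, line);        // the UNIQUE constraint rejects duplicates
                    ps.addBatch();
                    if (++pending == 10_000) {    // flush periodically to keep the batch small
                        ps.executeBatch();
                        pending = 0;
                    }
                }
                ps.executeBatch();
            }
            con.commit();
            // 'lines' now holds one copy of every distinct line; export it with
            // SELECT value FROM lines to rebuild the deduplicated file.
        }
    }
}
```

The database handles the on-disk indexing for you, which is the main appeal of this approach when the data does not fit in memory.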