Remove duplicate lines from a txt file
I am processing large text files (~20MB) containing line-delimited data.
Most data entries are duplicated, and I want to remove these duplicates, keeping only one copy.
Also, to make the problem slightly more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.
e.g.
I need to go from this:
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
to this:
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
NB. the final order doesn't matter.
What is an efficient way to do this?
I can use awk, Python or any standard Linux command-line tool.
Thanks.
8 Answers
How about the following (in Python):
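A sketch along the lines the answer describes: sort so that an entry and any extended variant of it become adjacent, then keep only the longest of each run. Reading from stdin is an assumption.

    import sys

    # Sort so that an entry and any extended variant of it
    # ("BOB 123 1DB" and "BOB 123 1DB EXTRA BITS") end up adjacent.
    lines = sorted(line.rstrip('\n') for line in sys.stdin)

    unique = []
    for line in lines:
        # If the last kept line is a prefix of this one, this line is
        # the same entry with extra bits, so it supersedes it.
        # (Exact duplicates are absorbed the same way.)
        if unique and unique[-1] and line.startswith(unique[-1]):
            unique[-1] = line
        else:
            unique.append(line)

    print('\n'.join(unique))

Note that a plain prefix test can over-merge when one entry happens to be a literal prefix of a different entry, so treat this as a sketch rather than a drop-in solution.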
If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
    awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'
If you need to specify the number of columns for different files:
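Presumably a parameterized form of the same one-liner, along these lines (ncols and file.txt are assumed names):

    awk -v ncols=3 '
      {
        # Build the key from the first ncols fields.
        key = ""
        for (i = 1; i <= ncols; i++) key = key $i " "
        # As in the one-liner above, a later line with the same key
        # overwrites an earlier one, so the EXTRA BITS version wins
        # if it comes last.
        x[key] = $0
      }
      END { for (y in x) print x[y] }
    ' file.txt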
This variation on glenn jackman's answer should work regardless of the position of lines with extra bits:
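A sketch of such a variation, keeping the longest line seen for each key so that the EXTRA BITS line can appear anywhere (file.txt is an assumed name):

    awk '{
      key = $1 " " $2 " " $3
      # Keep the longest line per key; length(x[key]) is 0 for keys
      # that have not been seen yet.
      if (length($0) > length(x[key])) x[key] = $0
    }
    END { for (k in x) print x[k] }' file.txt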
Or
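presumably a condensed pattern-action form of the same idea:

    awk '{ key = $1 " " $2 " " $3 }
         length($0) > length(x[key]) { x[key] = $0 }
         END { for (k in x) print x[k] }' file.txt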
This or a slight variant should do:
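One variant matching that description, assuming the first three whitespace-separated fields identify an entry and input arrives on stdin:

    import sys

    best = {}  # key (first three fields) -> longest line seen so far
    for line in sys.stdin:
        line = line.rstrip('\n')
        key = tuple(line.split()[:3])
        if len(line) > len(best.get(key, '')):
            best[key] = line

    for line in best.values():
        print(line)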
outputs:
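For the sample input in the question, the sketch above would print (in insertion order; per the question, the final order doesn't matter):

    BOB 123 1DB EXTRA BITS
    JIM 456 3DB AX
    DAVE 789 1DB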
You'll have to define a function to split your line into important bits and extra bits, then you can do:
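A sketch, with split_line standing in for that function; its first-three-fields rule is a placeholder you would replace with your real format:

    def split_line(line):
        # Placeholder splitter: the first three fields are the
        # important bits, anything after them is the extra bits.
        fields = line.split()
        return ' '.join(fields[:3]), ' '.join(fields[3:])

    def dedupe(lines):
        best = {}  # important bits -> (full line, extra bits)
        for line in lines:
            line = line.rstrip('\n')
            important, extra = split_line(line)
            # Prefer the version carrying the most extra info.
            if important not in best or len(extra) > len(best[important][1]):
                best[important] = (line, extra)
        return [full for full, _ in best.values()]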
Since you need the extra bits, the fastest way is to create a set of unique entries (sort -u will do), and then you must compare each entry against every other, e.g.
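presumably a prefix test along these lines (unique.txt, holding the sort -u output, is an assumed name):

    # entries is the output of sort -u: unique and sorted
    entries = open('unique.txt').read().splitlines()

    keep = set(entries)
    for x in entries:
        for y in entries:
            # y is a prefix of x: x is the same entry plus extra bits
            if x != y and x.startswith(y):
                keep.discard(y)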
and just leave x and discard y.
If you have perl and want only the last entry to be preserved:
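The one-liner would presumably look something like this, assuming the first three fields form the key (file.txt is an assumed name):

    perl -ane '$seen{"@F[0..2]"} = $_; END { print values %seen }' file.txt

Here -a autosplits each line into @F, so "@F[0..2]" interpolates the first three fields as the key, and assigning unconditionally means the last occurrence wins.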
The function find_unique_lines will work for a file object or a list of strings.
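A sketch consistent with that description, again assuming the first three fields identify an entry:

    def find_unique_lines(lines):
        # Accepts a file object or a list of strings: both are simply
        # iterables of lines.
        best = {}
        for line in lines:
            line = line.rstrip('\n')
            key = ' '.join(line.split()[:3])
            # Keep the longest version of each entry.
            if len(line) > len(best.get(key, '')):
                best[key] = line
        return list(best.values())

    for line in find_unique_lines(open('file.txt')):
        print(line)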