Sort & uniq in Linux shell
What is the difference between the following two commands?
sort -u FILE
sort FILE | uniq
Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files, as it can eliminate duplicates while it is sorting each set. If the data is highly duplicative, this could be beneficial; if there are in fact few duplicates, it won't make much difference (definitely a second-order performance effect, compared to the first-order effect of the pipe).
Note that there are times when the piping is appropriate. For example:
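The command for this example did not survive in this copy of the answer. A pipeline consistent with the description that follows is sort FILE | uniq -c | sort -n; the sketch below runs it on made-up data. The temporary directory, the sample lines, and the variable name counted are assumptions for illustration only.

```shell
# Hypothetical sample data standing in for FILE.
tmpdir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\npear\n' > "$tmpdir/FILE"

# Group identical lines, prefix each distinct line with its count,
# then order numerically by that count (ascending), so the most
# frequent line ends up last.
counted=$(sort "$tmpdir/FILE" | uniq -c | sort -n)
printf '%s\n' "$counted"

rm -rf "$tmpdir"
```

The first sort is needed because uniq only collapses adjacent duplicates; the second sort orders by the counts that uniq -c adds, which sort -u alone cannot produce.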
This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
There are times when not using the pipe is important. For example:
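The command for this example is also missing from this copy. Going by the description that follows (output file given by -o FILE), it was presumably sort -u -o FILE FILE. A minimal sketch on made-up data, where the temporary directory, sample lines, and the variable in_place are illustrative assumptions:

```shell
# Hypothetical sample data standing in for FILE.
tmpdir=$(mktemp -d)
printf 'b\na\nb\nc\na\n' > "$tmpdir/FILE"

# Sort, drop duplicate lines, and write the result back over the
# same file that was read as input.
sort -u -o "$tmpdir/FILE" "$tmpdir/FILE"

in_place=$(cat "$tmpdir/FILE")
printf '%s\n' "$in_place"

rm -rf "$tmpdir"
```

By contrast, a pipeline that redirects into its own input, such as sort FILE | uniq > FILE, risks losing the data, because the shell can truncate FILE before sort has read it.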
This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).
There is one slight difference: the return code.
Unless set -o pipefail is in effect, the return code of a pipeline is the return code of its last command, and uniq returns zero (success) here even when sort has failed, because reading empty input is not an error for uniq. Try examining the exit code with pipefail unset: if sort fails (for example, because the input file does not exist), sort -u reports a non-zero status, whereas sort | uniq reports 0.
Other than this, the commands are equivalent.
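The exit-code difference can be checked with a short sketch like the one below. The path in missing is a hypothetical file that is assumed not to exist, and the status_* variable names are illustrative. The exact non-zero value that sort returns depends on the implementation, so only zero versus non-zero is compared. The pipefail part assumes a shell that supports set -o pipefail, such as bash.

```shell
# Hypothetical path to a file that does not exist.
tmpdir=$(mktemp -d)
missing="$tmpdir/no_such_file"

# Single command: the failure of sort is what the caller sees.
status_single=0
sort -u "$missing" 2>/dev/null || status_single=$?

# Pipeline without pipefail: the status is that of uniq, the last command.
status_pipeline=0
sort "$missing" 2>/dev/null | uniq || status_pipeline=$?

# Pipeline with pipefail, enabled only inside a subshell:
# the failure of sort now propagates.
status_pipefail=0
( set -o pipefail; sort "$missing" 2>/dev/null | uniq ) || status_pipefail=$?

echo "sort -u:                 $status_single"
echo "sort | uniq:             $status_pipeline"
echo "sort | uniq (pipefail):  $status_pipefail"

rm -rf "$tmpdir"
```

This matters mostly in scripts that branch on the exit status, where a missing input file would otherwise go unnoticed.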
Beware! While it's true that "sort -u" and "sort|uniq" are equivalent, any additional options to sort can break the equivalence. Here's an example from the coreutils manual:
For example, 'sort -n -u' inspects only the value of the initial numeric string when checking for uniqueness, whereas 'sort -n | uniq' inspects the entire line.
Similarly, if you sort on key fields, the uniqueness test used by sort won't necessarily look at the entire line anymore. After being bitten by that bug in the past, these days I tend to use "sort|uniq" when writing Bash scripts. I'd rather have higher I/O overhead than run the risk that someone else in the shop won't know about that particular pitfall when they modify my code to add additional sort parameters.
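A small sketch of the pitfall, using made-up input in which two different lines share the same leading number. The sample data and variable names are assumptions for illustration.

```shell
# Two distinct lines ("1 apple", "1 banana") compare equal under -n,
# because only the leading number is used as the sort key.
data=$(printf '1 apple\n1 banana\n2 cherry\n')

# -u applies the same numeric comparison when testing for uniqueness,
# so only one of the two "1 ..." lines is kept.
with_u=$(printf '%s\n' "$data" | sort -n -u)

# uniq compares whole lines, so both "1 ..." lines are kept.
with_pipe=$(printf '%s\n' "$data" | sort -n | uniq)

count_u=$(printf '%s\n' "$with_u" | grep -c '')
count_pipe=$(printf '%s\n' "$with_pipe" | grep -c '')

echo "sort -n -u     keeps $count_u lines"
echo "sort -n | uniq keeps $count_pipe lines"
```

The same effect appears with -k key fields or -f: whatever comparison sort uses for ordering is also the one -u uses to decide which lines count as duplicates.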
sort -u will be slightly faster, because it does not need to pipe the output between two commands.
Also see my question on the topic, "calling uniq and sort in different orders in shell".
I have worked on some servers where sort doesn't support the '-u' option. There we had to use sort | uniq.
Nothing; they will produce the same result.