Sort & uniq in Linux shell

Posted on 2024-09-12 08:04:22

What is the difference between the following two commands?

sort -u FILE
sort FILE | uniq

Comments (6)

旧人九事 2024-09-19 08:04:22

Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files as it could eliminate duplicates as it is sorting each set. If the data is highly duplicative, this could be beneficial; if there are few duplicates in fact, it won't make much difference (definitely a second order performance effect, compared to the first order effect of the pipe).

Note that there are times when the piping is appropriate. For example:

sort FILE | uniq -c | sort -n

This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
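
As a rough illustration of that pipeline, here is a minimal sketch on made-up sample data (the temporary file and its contents are assumptions for the demo, not part of the original question). The width of the count column printed by uniq -c differs between implementations, so only the ordering should be relied on.

```shell
# Hypothetical sample data: 'b' appears three times, 'a' twice, 'c' once.
tmp=$(mktemp)
printf '%s\n' b a b c a b > "$tmp"

# Group identical lines, count each group, then order by the count (ascending),
# so the most frequent line ends up last.
counts=$(sort "$tmp" | uniq -c | sort -n)
printf '%s\n' "$counts"

rm -f "$tmp"
```

Appending `| tail -n 1` to the pipeline would keep only the most frequent line.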

There are times when not using the pipe is important. For example:

sort -u -o FILE FILE

This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).
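
To make the contrast concrete, here is a small sketch on hypothetical data (the temporary file and the fruit names are assumptions). It compares -o with the tempting shell redirection, which is not safe because the shell truncates the target before sort starts reading it.

```shell
tmp=$(mktemp)
printf '%s\n' pear apple pear fig > "$tmp"

# Safe: sort reads its input before it opens the -o target for writing.
sort -u -o "$tmp" "$tmp"
in_place=$(cat "$tmp")

# Unsafe by contrast: the redirection empties the file before sort runs,
# so sort sees no input and the data is lost.
printf '%s\n' pear apple pear fig > "$tmp"
sort -u "$tmp" > "$tmp"
clobbered=$(cat "$tmp")

printf 'with -o:\n%s\n' "$in_place"
printf 'with redirection: [%s]\n' "$clobbered"
rm -f "$tmp"
```

With -o the file ends up holding the three distinct lines in sorted order; with the redirection it ends up empty.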

不必你懂 2024-09-19 08:04:22

There is one slight difference: return code.

Unless pipefail is enabled (set -o pipefail in Bash), the return code of a pipeline is the return code of its last command. Here uniq returns zero (success), because it has no trouble processing the empty input it receives from the failed sort. Examine the exit codes and you'll see something like this (pipefail is not set here):

pavel@lonely ~ $ sort -u file_that_doesnt_exist ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
2
pavel@lonely ~ $ sort file_that_doesnt_exist | uniq ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
0

Other than this, the commands are equivalent.
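
For scripts that need the failure to propagate, a sketch along these lines shows the effect of pipefail. It assumes bash is installed, and the path is a deliberately nonexistent placeholder. The exact nonzero status reported by sort may vary between implementations.

```shell
# Hypothetical path that is assumed not to exist.
missing=/nonexistent/file_that_doesnt_exist

# Default behaviour: the pipeline reports the status of uniq, the last command.
status_default=$(bash -c 'sort "$1" 2>/dev/null | uniq; echo $?' _ "$missing")

# With pipefail, the pipeline reports the failing sort instead.
status_pipefail=$(bash -c 'set -o pipefail; sort "$1" 2>/dev/null | uniq; echo $?' _ "$missing")

echo "default:  $status_default"
echo "pipefail: $status_pipefail"
```

In Bash, the PIPESTATUS array is another way to inspect the status of each stage without changing shell options.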

一张白纸 2024-09-19 08:04:22

Beware! While it's true that "sort -u" and "sort|uniq" are equivalent, any additional options to sort can break the equivalence. Here's an example from the coreutils manual:

For example, 'sort -n -u' inspects only the value of the initial numeric string when checking for uniqueness, whereas 'sort -n | uniq' inspects the entire line.

Similarly, if you sort on key fields, the uniqueness test used by sort won't necessarily look at the entire line anymore. After being bitten by that bug in the past, these days I tend to use "sort|uniq" when writing Bash scripts. I'd rather have higher I/O overhead than run the risk that someone else in the shop won't know about that particular pitfall when they modify my code to add additional sort parameters.
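
A quick way to see the pitfall is the sketch below, which uses invented sample lines (the data is an assumption for illustration). Two different lines share the leading number 1, so the two approaches disagree on how many lines survive.

```shell
tmp=$(mktemp)
printf '%s\n' '1 apple' '1 banana' '2 cherry' '1 apple' > "$tmp"

# -u applies the numeric comparison: every line starting with 1 counts as a duplicate.
with_u=$(sort -n -u "$tmp")

# uniq compares entire lines, so only the repeated '1 apple' is dropped.
with_uniq=$(sort -n "$tmp" | uniq)

printf 'sort -n -u:\n%s\n\n' "$with_u"
printf 'sort -n | uniq:\n%s\n' "$with_uniq"
rm -f "$tmp"
```

Here sort -n -u keeps two lines while sort -n | uniq keeps three, so one of the lines that start with 1 disappears from the -u output.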

神仙妹妹 2024-09-19 08:04:22

sort -u will be slightly faster, because it does not need to pipe the output between two commands.

Also see my question on the topic: calling uniq and sort in different orders in shell

随遇而安 2024-09-19 08:04:22

I have worked on some servers where sort doesn't support the '-u' option. There we have to use

sort xyz | uniq

固执像三岁 2024-09-19 08:04:22

Nothing, they will produce the same result
