How to delete duplicate lines in a file in Unix without sorting it
Is there a way to delete duplicate lines in a file in Unix?
I can do it with the sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
Comments (9)
seen is an associative array that AWK will pass every line of the file to. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found, and then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
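The answer's code block is missing in this copy. Judging from the later comments in this thread, the relevant one-liners from that page are presumably the following (the second is reproduced from memory of the page, so verify against the link):
Delete duplicate, consecutive lines from a file (emulates uniq):
sed '$!N; /^\(.*\)\n\1$/!P; D'
Delete duplicate, nonconsecutive lines from a file (keeps previously seen lines in the hold space):
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'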
Perl one-liner similar to jonas's AWK solution:
This variation removes trailing white space before comparing:
This variation edits the file in-place:
This variation edits the file in-place, and makes a backup file.bak:
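The actual commands are missing in this copy; plausible forms of the four one-liners described above, in the same order (my reconstruction, with file as a placeholder filename):
perl -ne 'print unless $seen{$_}++' file
perl -lne 's/\s+$//; print unless $seen{$_}++' file
perl -i -ne 'print unless $seen{$_}++' file
perl -i.bak -ne 'print unless $seen{$_}++' file
Note that the second form prints the trimmed lines, since it strips trailing whitespace before both comparing and printing.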
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works except for recent versions of sed when the input file ends with a blank line and no characters. On my Mac, my CPU just spins.
This is an infinite loop if the last line is blank and doesn't have any characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The first solution is also from http://sed.sourceforge.net/sed1line.txt.
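The one-liner itself is missing in this copy; from the explanation below, it is the consecutive-duplicates command already quoted earlier in this thread (the explanation writes the regex in extended syntax, as with sed -r):
sed '$!N; /^\(.*\)\n\1$/!P; D' file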
The core idea is: print each group of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the contents of the current pattern space are two duplicate strings separated by \n, the next line is the same as the current line, so according to our core idea we do not print it; otherwise, the current line is the last appearance of its run of duplicate consecutive lines, and we use the P command to print the characters in the current pattern space up to \n (the \n is also printed).
D: the D command deletes the characters in the current pattern space up to \n (the \n is also deleted), so the content of the pattern space becomes the next line. The D command also forces sed to jump back to its first command, $!N, but does not read the next line from the file or standard input stream.
The second solution is easy to understand (it is my own):
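The script is also missing here; a script matching the description below (my own reconstruction, with the regex written in basic sed syntax and file as a placeholder) would be:
sed '
p
:loop
$!N
s/^\(.*\)\n\1$/\1/
tloop
D
' file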
The core idea is: print each group of duplicate consecutive lines only once, at its first appearance, and use a label together with the t command to implement the loop.
Explanation:
:loop sets a label named loop.
N reads the next line into the pattern space.
s/^(.*)\n\1$/\1/ deletes the current line if the next line is the same as the current line; we use the s command to do the delete action.
If the s command succeeds, the tloop command forces sed to jump back to the label named loop, which repeats the same loop on the following lines until there is no duplicate consecutive line of the most recently printed line; otherwise, the D command deletes the line that is the same as the most recently printed line and forces sed to jump back to the first command, which is the p command. The content of the current pattern space is then the next new line.
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparisons, I am trimming all trailing spaces and tabs before the comparison.
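The poster's command is missing here; one way to express this idea in awk (my own sketch, which compares on a key with trailing spaces and tabs stripped while printing the original line; file is a placeholder) is:
awk '{ key = $0; sub(/[ \t]+$/, "", key) } !seen[key]++' file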
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 on Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work; it fails at the character set match.
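For illustration only (I have not verified that it changes the behaviour described above), the braced form this comment refers to could be written as:
sed '$!{N;}; /^\(.*\)\n\1$/!P; D' file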
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
pastebin
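The pastebin contents are not reproduced here; sketches of what such variations could look like in awk (my own, not necessarily the poster's code, with file as a placeholder):
Eliminate adjacent repeated lines (like uniq):
awk 'NR == 1 || $0 != prev; { prev = $0 }' file
Eliminate repeated lines wherever they occur (keep the first instance):
awk '!seen[$0]++' file
Eliminate all but the last instance of each line (two passes over the same file):
awk 'NR == FNR { last[$0] = FNR; next } FNR == last[$0]' file file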
This can be achieved using AWK.
The line below will display the unique values:
You can output these unique values to a new file:
The new file uniq_file_name will contain only unique values, without any duplicates.
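The commands themselves are missing in this copy; assuming an input file named file_name (a placeholder), the answer presumably uses the same !seen[$0]++ idiom, first printed to the terminal and then redirected to the new file:
awk '!seen[$0]++' file_name
awk '!seen[$0]++' file_name > uniq_file_name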
Use:
It deletes the duplicate lines using AWK.
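The command is not shown in this copy; given the description, it is presumably the same awk idiom again, for example written back to the original file through a temporary file (file is a placeholder):
awk '!seen[$0]++' file > file.tmp && mv file.tmp file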