How to delete duplicate lines in a file in Unix without sorting it

Posted on 2024-08-04 16:38:03

Is there a way to delete duplicate lines in a file in Unix?

I can do it with sort -u and uniq commands, but I want to use sed or awk.

Is that possible?

Comments (9)

家住魔仙堡 2024-08-11 16:38:03

awk '!seen[$0]++' file.txt

seen is an associative array, and each input line ($0) is used as a key into it. If a line has not been seen before, seen[$0] is unset and evaluates to false. The ! is the logical NOT operator and inverts that false to true. AWK prints every line for which the expression evaluates to true.

The ++ is a post-increment: it increments seen[$0] after the old value has been tested, so that seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (empty string) to true. When a line that is already in seen appears again, !seen[$0] evaluates to false and the line is not written to the output.
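
To make each step of the one-liner visible, it can be unpacked into a long-hand form. The following is a minimal sketch that should behave the same way; the function name dedupe_keep_first is made up for illustration and is not part of AWK.

```shell
# dedupe_keep_first is a hypothetical helper name, not part of any tool.
dedupe_keep_first() {
  awk '
    {
      # seen[$0] is unset (compares equal to 0) the first time a line appears.
      if (seen[$0] == 0) print
      # Count this occurrence so that later copies are skipped.
      seen[$0]++
    }
  ' "$@"
}

printf 'b\na\nb\nc\na\n' | dedupe_keep_first
# prints b, a, c (one per line): first occurrences, in original order
```

Because every distinct line is kept in the array, memory use grows with the number of unique lines in the input.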

没有你我更好 2024-08-11 16:38:03

From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )

 # delete duplicate, consecutive lines from a file (emulates "uniq").
 # First line in a set of duplicate lines is kept, rest are deleted.
 sed '$!N; /^\(.*\)\n\1$/!P; D'

 # delete duplicate, nonconsecutive lines from a file. Beware not to
 # overflow the buffer size of the hold space, or else use GNU sed.
 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
断爱 2024-08-11 16:38:03

Perl one-liner similar to jonas's AWK solution:

perl -ne 'print if ! $x{$_}++' file

This variation removes trailing white space before comparing:

perl -lne 's/\s*$//; print if ! $x{$_}++' file

This variation edits the file in-place:

perl -i -ne 'print if ! $x{$_}++' file

This variation edits the file in-place, and makes a backup file.bak:

perl -i.bak -ne 'print if ! $x{$_}++' file
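
For readers who want the in-place behavior with AWK instead of Perl, POSIX awk has no in-place option, so one common workaround is a temporary file. The following is only a sketch under assumptions: dedupe_in_place is a hypothetical helper name, and mktemp is assumed to be available on the system.

```shell
# dedupe_in_place is a hypothetical helper; it rewrites the named file.
dedupe_in_place() {
  file=$1
  tmp=$(mktemp) || return 1
  # Write the deduplicated lines to a temporary file first, and only
  # overwrite the original if awk succeeded.
  if awk '!seen[$0]++' "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

Usage would be dedupe_in_place file. Copying the contents back with cat keeps the original file's permissions; make a backup first if the data matters.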
上课铃就是安魂曲 2024-08-11 16:38:03

An alternative way using Vim (Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq

Delete duplicate, nonconsecutive and nonempty lines from a file:

vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq

予囚 2024-08-11 16:38:03

The one-liner that Andre Miller posted works except for recent versions of sed when the input file ends with a blank line and no characters. On my Mac my CPU just spins.

This is an infinite loop if the last line is blank and doesn't have any characters:

sed '$!N; /^\(.*\)\n\1$/!P; D'

This version doesn't hang, but you lose the last line:

sed '$d;N; /^\(.*\)\n\1$/!P; D'

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
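
To find out which of the two N behaviors a given sed has, a small probe can help. This is a rough sketch, not taken from the FAQ; sed_n_at_eof is a hypothetical helper name.

```shell
# sed_n_at_eof is a hypothetical helper name.
# It feeds sed a single line, so N has no next line to append.
sed_n_at_eof() {
  if [ -n "$(printf 'x\n' | sed N)" ]; then
    echo "prints"    # the pattern space was printed when N hit end of input
  else
    echo "deletes"   # the pattern space was discarded when N hit end of input
  fi
}

sed_n_at_eof
```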

他夏了夏天 2024-08-11 16:38:03

The first solution is also from http://sed.sourceforge.net/sed1line.txt

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5

The core idea is:

Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.

Explanation:

  1. $!N;: if the current line is not the last line, use the N command to append the next line to the pattern space.
  2. /^(.*)\n\1$/!P: if the pattern space consists of two identical strings separated by \n, the next line is the same as the current line, so according to the core idea we do not print it. Otherwise, the current line is the last appearance in its run of duplicate consecutive lines, and the P command prints the pattern space up to and including the first \n.
  3. D: the D command deletes the pattern space up to and including the first \n, so the pattern space then contains only the next line.
  4. The D command also forces sed to jump back to its first command, $!N, without reading a new line from the file or standard input stream (as long as a line remains in the pattern space).
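
The four steps above can also be written as a multi-line sed script with one comment per step. This is a sketch that assumes a sed accepting comments and leading blanks inside the script (GNU sed does); uniq_keep_last_of_run is a hypothetical name, and the regex is written in basic syntax so that -r is not needed.

```shell
# uniq_keep_last_of_run is a hypothetical name for this sketch.
uniq_keep_last_of_run() {
  sed -n '
    # Step 1: unless this is the last input line, append the next line.
    $!N
    # Step 2: print the first line unless both lines are identical.
    /^\(.*\)\n\1$/!P
    # Steps 3 and 4: drop the first line and restart the script.
    D
  '
}

printf '1\n2\n2\n3\n3\n3\n' | uniq_keep_last_of_run
# prints 1, 2, 3 (one per line)
```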

The second solution is easy to understand (it is my own):

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5

The core idea is:

Print each run of duplicate consecutive lines only once, at its first appearance, and use the : and t commands to implement the loop.

Explanation:

  1. Read a new line from the input stream or file and print it once.
  2. Use the :loop command to set a label named loop.
  3. Use N to append the next line to the pattern space.
  4. Use s/^(.*)\n\1$/\1/ to collapse the two lines into one if the next line is the same as the current line. The s command performs the deletion here.
  5. If the s command made a substitution, the tloop command forces sed to jump to the label named loop, which repeats the same steps for the following lines until no consecutive duplicates of the most recently printed line remain. Otherwise, the D command deletes the already-printed line from the pattern space and forces sed to jump to the first command, which is the p command. The pattern space then contains the next new line.
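
The same "print at first appearance" idea can be expressed in AWK without a loop, by remembering the previous line. This is a swapped-in technique and not the sed script above; uniq_keep_first_of_run is a hypothetical name.

```shell
# uniq_keep_first_of_run is a hypothetical name for this sketch.
uniq_keep_first_of_run() {
  awk '
    # Print a line when it is the first line or differs from the previous one.
    NR == 1 || ($0 "") != prev { print }
    # Appending "" stores the line as a string, which forces a string
    # comparison so that lines such as 1 and 1.0 are not treated as equal.
    { prev = $0 "" }
  ' "$@"
}

printf '1\n2\n2\n3\n3\n3\n' | uniq_keep_first_of_run
# prints 1, 2, 3 (one per line)
```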
抚笙 2024-08-11 16:38:03

uniq would be fooled by trailing spaces and tabs. To emulate how a human compares lines, I trim all trailing spaces and tabs before comparison.

I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.

I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work for me; it failed at the character set match.

There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.

pastebin

# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.

dedupe() {
 sed -E '
  $!{
   N;
   s/[ \t]+$//;
   /^(.*)\n\1$/!P;
   D;
  }
 ';
}

# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one

norepeat() {
 sed -n -E '
  s/[ \t]+$//;
  G;
  /^(\n){2,}/d;
  /^([^\n]+).*\n\1(\n|$)/d;
  h;
  P;
  ';
}

lastrepeat() {
 sed -n -E '
  s/[ \t]+$//;
  /^$/{
   H;
   d;
  };
  G;
  # delete previous repeated line if found
  s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
  # after searching for previous repeat, move tested last line to end
  s/^([^\n]+)(\n)(.*)/\3\2\1/;
  $!{
   h;
   d;
  };
  # squeeze blank lines to one
  s/(\n){3,}/\n\n/g;
  s/^\n//;
  p;
 ';
}
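
For comparison, the "keep only the last instance" idea can be sketched in AWK by buffering the input. This is a different technique from the sed function above: it does not trim trailing whitespace or squeeze blank lines, it holds the whole input in memory, and keep_last is a hypothetical name.

```shell
# keep_last is a hypothetical name for this sketch.
keep_last() {
  awk '
    # Remember every line and the number of the last record where it occurred.
    { line[NR] = $0; last[$0] = NR }
    # Print a line only at the position of its final occurrence.
    END {
      for (i = 1; i <= NR; i++)
        if (last[line[i]] == i) print line[i]
    }
  ' "$@"
}

printf 'a\nb\na\nc\nb\n' | keep_last
# prints a, c, b (one per line)
```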
谁与争疯 2024-08-11 16:38:03

This can be achieved using AWK.

The line below displays the unique values. Note that uniq only removes adjacent duplicate lines:

awk '{ print }' file_name | uniq

You can output these unique values to a new file:

awk '{ print }' file_name | uniq > uniq_file_name

Provided the duplicates are adjacent, the new file uniq_file_name will contain only unique values, without any duplicates.
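
As a quick check of how uniq treats duplicates that are not adjacent, the sketch below compares it with the AWK idiom on the same input. The sample function is a made-up helper that only produces demo data.

```shell
# sample is a hypothetical helper; its duplicate "a" lines are not adjacent.
sample() { printf 'a\nb\na\n'; }

sample | uniq                 # prints a, b, a: no adjacent duplicates, nothing removed
sample | awk '!seen[$0]++'    # prints a, b: the later copy of a is dropped
```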

长途伴 2024-08-11 16:38:03

Use:

cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'

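It sorts the input and then uses AWK to print only the lines that occur exactly once, so every copy of a duplicated line is dropped. Note that print $2 outputs only the first whitespace-separated field of each such line.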
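
If the goal is to keep only lines that occur exactly once while preserving the original order and the full line, one possible sketch in AWK alone is shown below. It is an alternative to the pipeline and not the same command; only_singletons is a hypothetical name, and the whole input is held in memory.

```shell
# only_singletons is a hypothetical name for this sketch.
only_singletons() {
  awk '
    # Count each line and remember the input order.
    { count[$0]++; line[NR] = $0 }
    # Print, in input order, the lines that were seen exactly once.
    END {
      for (i = 1; i <= NR; i++)
        if (count[line[i]] == 1) print line[i]
    }
  ' "$@"
}

printf 'b\na x\nb\nc\n' | only_singletons
# prints "a x" and "c" (one per line)
```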
