Find duplicate lines in a file and count how many times each line is duplicated?
Suppose I have a file similar to the following:
123
123
234
234
123
345
I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc.
So ideally, the output would be like:
123 3
234 2
345 1
Comments (7)
On Windows, using Windows PowerShell, I used the command below to achieve this:
We can also use the Where-Object cmdlet to filter the result:
Assuming you've got access to a standard Unix shell and/or Cygwin environment:
Basically: convert all space characters to line breaks, then sort the translated output, feed it to uniq, and count the duplicate lines.
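The pipeline described above can be sketched as follows. This is a minimal sketch rather than the answerer's exact command: the sample data and the `input` and `token_counts` names are assumptions made for illustration.

```shell
# Hypothetical sample input: values separated by spaces and newlines,
# with a trailing space on one line.
input=$(mktemp)
printf '123 123\n234 234 \n123\n345\n' > "$input"

# tr turns every space into a newline and (-s) squeezes runs of newlines,
# so each value ends up on its own line. sort groups identical values,
# and uniq -c prefixes each distinct value with its count.
token_counts=$(tr -s ' ' '\n' < "$input" | sort | uniq -c)
printf '%s\n' "$token_counts"

rm -f "$input"
```

Each output line holds the count first and the value second; the amount of leading padding depends on the uniq implementation.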
Via awk:
In the `dups[$1]++` part of the command, the variable `$1` holds the contents of the first column, and the square brackets are array access. So, for the first column of each line in the `data` file, the matching element of the array named `dups` is incremented.
At the end, we loop over the `dups` array with `num` as the loop variable and print each saved number first, followed by how many times it occurred, `dups[num]`.
Note that some lines of your input file have trailing spaces; if you clean those up, you can use `$0` in place of `$1` in the command above :)
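A minimal sketch of the awk approach described here, using the `dups` and `num` names from the explanation. The sample data, the `input` and `awk_counts` names, and the trailing `sort` are assumptions added for illustration; awk does not guarantee any iteration order for `for (num in dups)`, so the sort only makes the output stable.

```shell
# Hypothetical sample input; one line carries a trailing space on purpose.
input=$(mktemp)
printf '123\n123 \n234\n234\n123\n345\n' > "$input"

# For every line, increment the counter stored under the first field.
# After the last line, print each value followed by its count.
awk_counts=$(awk '{ dups[$1]++ } END { for (num in dups) print num, dups[num] }' "$input" | sort)
printf '%s\n' "$awk_counts"

rm -f "$input"
```

Because `$1` is the first whitespace-separated field, the line with the trailing space is still counted as `123`.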
To find duplicate counts, use this command:
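One way to get the "value count" layout the question asks for is to count with `uniq -c` and then swap the two columns. This is a sketch under the assumption that no line contains whitespace; the sample data and the `input` and `value_counts` names are hypothetical.

```shell
# Hypothetical sample input matching the question.
input=$(mktemp)
printf '123\n123\n234\n234\n123\n345\n' > "$input"

# uniq -c prints "count value"; awk swaps the columns to "value count".
value_counts=$(sort "$input" | uniq -c | awk '{ print $2, $1 }')
printf '%s\n' "$value_counts"

rm -f "$input"
```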
To find and count duplicate lines in multiple files, you can try the following command:
or:
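Counting across several files can be sketched as follows. The file names `a.txt` and `b.txt`, the `dir` and `multi_counts` names, and the sample data are assumptions for illustration. `sort` accepts several files and merges their lines, so the counts cover all files together.

```shell
# Two hypothetical input files in a scratch directory.
dir=$(mktemp -d)
printf '123\n234\n' > "$dir/a.txt"
printf '123\n345\n123\n' > "$dir/b.txt"

# sort reads both files as one stream, uniq -c counts each distinct line,
# and sort -nr puts the highest counts first.
multi_counts=$(sort "$dir/a.txt" "$dir/b.txt" | uniq -c | sort -nr)
printf '%s\n' "$multi_counts"

rm -rf "$dir"
```

Piping `cat "$dir/a.txt" "$dir/b.txt"` into `sort` would give the same counts.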
This will print duplicate lines only, with counts:
or, with GNU long options (on Linux):
on BSD and OSX you have to use grep to filter out unique lines:
For the given example, the result would be:
If you want to print counts for all lines including those that appear only once:
or, with GNU long options (on Linux):
For the given input, the output is:
In order to sort the output with the most frequent lines on top, you can do the following (to get all results):
or, to get only duplicate lines, most frequent first:
on OSX and BSD the final one becomes:
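The variants described in this answer can be sketched as follows. This is a minimal sketch, not necessarily the answerer's exact commands; the sample data and the variable names are assumptions for illustration.

```shell
# Hypothetical sample input matching the question.
input=$(mktemp)
printf '123\n123\n234\n234\n123\n345\n' > "$input"

# Duplicated lines only, with counts: -c counts, -d keeps repeated lines.
dups_only=$(sort "$input" | uniq -cd)

# Fallback for a uniq that cannot combine -c with -d:
# count everything, then drop the lines whose count is exactly 1.
dups_only_grep=$(sort "$input" | uniq -c | grep -v '^ *1 ')

# All lines with counts, most frequent first (numeric, reversed sort).
by_frequency=$(sort "$input" | uniq -c | sort -nr)

printf '%s\n' "$dups_only" "$dups_only_grep" "$by_frequency"
rm -f "$input"
```

Appending `| sort -nr` to either duplicates-only pipeline gives the duplicated lines ordered by frequency.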
Assuming there is one number per line:
You can also use the more verbose `--count` flag with the GNU version, e.g., on Linux:
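A short sketch of both spellings, assuming one number per line. The sample data and the `input`, `short_form`, and `long_form` names are hypothetical, and the fallback to `-c` is an addition for systems whose uniq does not recognize the long option.

```shell
# Hypothetical sample input, one number per line.
input=$(mktemp)
printf '123\n123\n234\n234\n123\n345\n' > "$input"

# Short flag, available in POSIX uniq.
short_form=$(sort "$input" | uniq -c)

# GNU long option; reuse the short-flag result if --count is rejected.
long_form=$(sort "$input" | uniq --count 2>/dev/null) || long_form=$short_form

printf '%s\n' "$long_form"
rm -f "$input"
```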