How to delete duplicate words from a plain text file using Linux commands
I have a plain text file with words separated by commas, for example:
word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3
I want to delete the duplicates so that it becomes:
word1, word2, word3, word4, word5, word6, word7
Any ideas? I think egrep can help me, but I'm not sure how to use it exactly...
Comments (10)
Assuming that the words are one per line, and the file is already sorted:
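    # e.g., with the list in a file called words.txt (name assumed):
    uniq words.txt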
If the file's not sorted:
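    sort words.txt | uniq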
If they're not one per line, and you don't mind them being one per line:
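    # whitespace only; commas stay attached to the words
    tr -s '[:space:]' '\n' < words.txt | sort | uniq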
That doesn't remove punctuation, though, so maybe you want:
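    # also treat punctuation as separators
    tr -s '[:space:][:punct:]' '\n' < words.txt | sort | uniq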
But that removes the hyphen from hyphenated words. "man tr" for more options.
    ruby -pi.bak -e '$_ = $_.chomp.split(",").uniq.join(",") + "\n"' filename
I'll admit the two kinds of quotation marks are ugly.
Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that (sketched below). The harder part is putting this on one line again with commas as separators and not terminators. I used a Perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)
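A minimal sketch of both steps, assuming the list is in words.txt; the Perl part replaces every newline except the last with a comma and a space:

    # step 1: one entry per line, then dedup
    sed 's/, /\n/g' words.txt | sort | uniq

    # step 2: the same, rejoined into a single comma-separated line
    sed 's/, /\n/g' words.txt | sort | uniq | perl -pe 's/\n/, / unless eof'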
Here's an awk script that will leave each line intact, only removing the duplicate words:
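A sketch of what such a script could look like, assuming comma-separated input as in the question:

    awk -F', *' '{
        split("", seen)               # clear the per-line lookup table
        sep = ""
        for (i = 1; i <= NF; i++)
            if (!seen[$i]++) {        # keep only the first occurrence on this line
                printf "%s%s", sep, $i
                sep = ", "
            }
        printf "\n"
    }' words.txt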
I had the very same problem today: a word list with 238,000 words, but about 40,000 of those were duplicates. I already had them on individual lines by doing
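    # something along these lines (file names are placeholders):
    tr -s ', ' '\n' < wordlist.txt > one-per-line.txt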
To remove the duplicates, I simply did
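    # sort -u sorts and drops duplicate lines in one step
    sort -u one-per-line.txt > deduped.txt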
Worked perfectly, no errors, and now my file is down from 1.45MB to 1.01MB.
I'd think you'll want to replace the spaces with newlines, use the uniq command to find unique lines, then replace the newlines with spaces again.
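Spelled out as a pipeline, that could look something like this (uniq only collapses adjacent lines, hence the sort in the middle):

    tr -s ', ' '\n' < words.txt | sort | uniq | tr '\n' ' '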
I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.
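A sketch of such a script, assuming comma-separated input as in the question:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # usage: perl uniq_words.pl words.txt
    while (<>) {
        chomp;
        my %seen;    # per-line lookup table of words already printed
        print join(", ", grep { !$seen{$_}++ } split /,\s*/), "\n";
    }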
If you want uniqueness over the whole file, you can just move the %seen hash outside the while () {} loop.
Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of doubles. Also, many non-standard characters. I didn't really need them sorted, but it seemed that was gonna be necessary for uniq.
I tried:
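    # presumably a plain sort/uniq attempt along these lines:
    sort passwords.txt | uniq > deduped.txt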
Tried:
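    # or in two steps, via an intermediate sorted file:
    sort passwords.txt > sorted.txt
    uniq sorted.txt > deduped.txt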
And even tried passing it through cat first, just so I could see if we were getting a proper input.
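    cat passwords.txt | sort | uniq > deduped.txt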
I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t/203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".
What finally worked for me was:
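    # an order-preserving dedup matching that description (file names assumed):
    # prints each line only the first time it is seen, no sorting involved
    awk '!seen[$0]++' passwords.txt > deduped.txt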
It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
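    # e.g., prefix each unique word with its occurrence count:
    tr -s ', ' '\n' < words.txt | sort | uniq -c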
Open the file with vim (vim filename) and run the sort command with the unique flag (:sort u).