Shell: filtering prohibited words out of a file

Posted 2024-09-28 17:40:55

Good day, shell lovers!

Basically, I have two files:

frequency.txt (multiple lines, space-separated, each holding a word and its frequency):

de 1711
a 936
et 762
la 530
les 482
pour 439
le 425
...

And I have a file containing "prohibited" words:

stopwords.txt (a single line, space-separated):

 au aux avec le ces dans ...

So I want to delete from frequency.txt every line containing a word found in stopwords.txt.

How could I do that? I'm thinking it could be done with awk, something like:

awk 'match($0,SOMETHING_MAGICAL_HERE) == 0 {print $0}' frequency.txt > new.txt

But I'm not really sure. Any ideas? Thanks in advance.

留一抹残留的笑 2024-10-05 17:40:55

$ awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop.txt freq.txt
de 1711
a 936
et 762
la 530
les 482
pour 439
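
For readers unfamiliar with the idiom, here is the same one-liner spread over several lines with comments. This is only an annotated sketch: the temporary directory, the `workdir` and `filtered` variable names, and the sample files built with printf are assumptions for illustration, with the data copied from the question.

```shell
# Sample data taken from the question; file names match the command above.
workdir=$(mktemp -d)
printf '%s\n' 'de 1711' 'a 936' 'et 762' 'la 530' 'les 482' 'pour 439' 'le 425' > "$workdir/freq.txt"
printf '%s\n' ' au aux avec le ces dans' > "$workdir/stop.txt"

filtered=$(awk '
    # First file (stop.txt): FNR equals NR only while reading it.
    FNR == NR {
        for (i = 1; i <= NF; i++)   # every space-separated stop word...
            w[$i]                   # ...becomes a key of the array w
        next                        # skip the filter rule below for this file
    }
    # Second file (freq.txt): print lines whose first field is not a stop word.
    !($1 in w)
' "$workdir/stop.txt" "$workdir/freq.txt")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

Because the comparison is on the first field only, `les` survives even though `le` is a stop word, and the frequency column is never tested.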

紫南 2024-10-05 17:40:55

This will do it for you:

tr ' ' '\n' <stopwords.txt | grep -v -w -F -f - frequency.txt

-v inverts the match
-w matches whole words only
-F indicates that the patterns are a set of newline-separated fixed strings
-f - reads the pattern strings from standard input (here, the contents of stopwords.txt)

Because stopwords.txt is space-delimited, tr is used to replace the spaces with newlines before the list reaches grep.
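
One detail worth checking: the sample stopwords.txt in the question starts with a space, so tr emits an empty first line, which then reaches grep as an empty pattern. A hedged variant that drops empty lines before they reach grep might look like this. The `workdir` and `filtered` names and the sample files are assumptions for illustration, and `tr -s` plus the extra `grep -v '^$'` are additions, not part of the answer above.

```shell
# Sample data copied from the question, including the leading space.
workdir=$(mktemp -d)
printf '%s\n' 'de 1711' 'a 936' 'et 762' 'la 530' 'les 482' 'pour 439' 'le 425' > "$workdir/frequency.txt"
printf '%s\n' ' au aux avec le ces dans' > "$workdir/stopwords.txt"

# tr -s turns each run of spaces into a single newline;
# grep -v '^$' then discards the empty line left by the leading space,
# so the final grep only ever sees real words as patterns.
filtered=$(tr -s ' ' '\n' < "$workdir/stopwords.txt" \
    | grep -v '^$' \
    | grep -v -w -F -f - "$workdir/frequency.txt")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

Note that, unlike an approach keyed on the first column, grep looks at the whole line, so a stop word that happened to equal a frequency value would also remove that line.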

另类 2024-10-05 17:40:55

tr ' ' '\n' < stopwords.txt | grep -vwFf - frequency.txt

The -w option to grep is crucial: without it, le in stopwords.txt would also remove words that merely contain le, such as less or little.
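
To see the effect on the question's own data, a small comparison can help. The sketch below assumes a cut-down frequency.txt and a single stop word `le`; the `workdir`, `with_w`, and `without_w` names are made up for the demo. Here `les` plays the role that less and little play in the explanation above.

```shell
# Four lines from the question's frequency.txt are enough for the demo.
workdir=$(mktemp -d)
printf '%s\n' 'la 530' 'les 482' 'pour 439' 'le 425' > "$workdir/frequency.txt"

# Same filter twice, once with -w and once without, using one stop word.
with_w=$(printf '%s\n' 'le' | grep -vwFf - "$workdir/frequency.txt")
without_w=$(printf '%s\n' 'le' | grep -vFf - "$workdir/frequency.txt")

printf 'with -w:\n%s\n\nwithout -w:\n%s\n' "$with_w" "$without_w"
rm -rf "$workdir"
```

With -w only the `le` line is dropped; without it, `les` is dropped as well because it contains `le` as a substring.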

樱娆 2024-10-05 17:40:55

join -v1 <(sort frequency.txt) <(tr ' ' '\n' <stopwords.txt|sort) | sort -k2,2rn
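
The same idea as a step-by-step sketch, for anyone who wants to follow what join is doing. Assumptions not in the answer above: `LC_ALL=C` is set so that sort and join agree on ordering, temporary files replace the `<(...)` process substitutions (which need bash or a similar shell), empty lines are dropped from the stop word list, and the `workdir` and `filtered` names are invented.

```shell
export LC_ALL=C   # assumption: byte-wise collation so sort and join agree
workdir=$(mktemp -d)
printf '%s\n' 'de 1711' 'a 936' 'et 762' 'la 530' 'les 482' 'pour 439' 'le 425' > "$workdir/frequency.txt"
printf '%s\n' ' au aux avec le ces dans' > "$workdir/stopwords.txt"

# join needs both inputs sorted on the join field (the word).
sort -k1,1 "$workdir/frequency.txt" > "$workdir/freq.sorted"
tr -s ' ' '\n' < "$workdir/stopwords.txt" | sed '/^$/d' | sort > "$workdir/stop.sorted"

# -v 1 prints only the lines of the first file with no match in the second;
# the final sort restores descending order by frequency.
filtered=$(join -v 1 "$workdir/freq.sorted" "$workdir/stop.sorted" | sort -k2,2rn)

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

The extra sorting makes this heavier than the awk or grep approaches, but it matches on the word column only and leaves the result ordered by frequency.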

About the author

浪推晚风
