Unix script for searching text files that must contain two specific keywords
SOLUTION FOUND (thanks to Zsolt Botykai and Mike Ryan):
The exact translation of the script below into an awk one-liner is:
find /home/data/ -type f -exec awk '/PATTERN1/ {c++} /PATTERN2/ {d++} c>0 && d>0 {print ARGV[1] ; exit 0 } END { if (! c || ! d) {exit 1}}' \{\} \; > assetsToDelete.txt 2>&1
see https://stackoverflow.com/a/9442764/356815
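The one-liner above starts a separate awk process for every file. As a rough sketch only (not the accepted solution), the same two-pattern check could be run with one awk process over many files by using find's `{} +` form and resetting the flags at the first line of each file. The temporary directory, the sample file names, and the `seen` flag are assumptions made for this illustration.

```shell
#!/bin/sh
# Hypothetical demo data: three small files in a temporary directory.
dir=$(mktemp -d)
printf 'PATTERN1\nfoo\nPATTERN2\n' > "$dir/both.txt"
printf 'PATTERN1\nfoo\n' > "$dir/only1.txt"
printf 'bar\nPATTERN2\n' > "$dir/only2.txt"

# One awk process reads many files. The flags are reset on the first
# line of each file (FNR == 1), and "seen" ensures that each file name
# is printed at most once.
matches=$(find "$dir" -type f -exec awk '
    FNR == 1          { c = 0; d = 0; seen = 0 }
    /PATTERN1/        { c = 1 }
    /PATTERN2/        { d = 1 }
    c && d && !seen   { print FILENAME; seen = 1 }
' {} +)

echo "$matches"

# Remove the demo data again.
rm -rf "$dir"
```

Unlike the one-liner, this variant keeps reading a file after both patterns have been found, so it trades early exit for far fewer process starts. Which is faster likely depends on file sizes.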
ORIGINAL QUESTION:
The question is simple, but I have not found a way to write a fast script for it.
I have 100'000 text files and I need to find all those that fulfill two conditions.
My script looks like this, but it is slow as hell... any better ideas?
echo Searching for first criteria...
date
grep -rl 'PATTERN1' /home/data/assets/ > assets.txt
file=assets.txt
echo Now filtering for second criteria
date
for i in `cat $file`
do
grep -l 'PATTERN2' $i >> assetsToDelete.txt
done
echo DONE
date
So I'm looking for a way to do something like this:
Search a directory and select all files that fulfill condition1 AND condition2 in one step. The conditions are usually pattern matches, but they occur on different lines within the file's content.
Comments (3)
Well, with awk you can do something like:
Now you can use it like:
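The script and its usage from this answer are not preserved here. A minimal sketch of the general idea, assuming a hypothetical shell function `has_both` and hypothetical sample files, might look like the following: awk sets a flag per pattern, stops reading as soon as both have been seen, and reports the result through its exit status.

```shell
#!/bin/sh
# has_both FILE: succeed if FILE contains both patterns, on any lines.
# (Hypothetical helper name, chosen for this illustration.)
has_both() {
    awk '
        /PATTERN1/ { c = 1 }
        /PATTERN2/ { d = 1 }
        c && d     { exit }              # both seen: stop reading early
        END        { exit !(c && d) }    # status 0 only if both were seen
    ' "$1"
}

# Hypothetical demo data in a temporary directory.
dir=$(mktemp -d)
printf 'PATTERN2\nfoo\nPATTERN1\n' > "$dir/both.txt"
printf 'PATTERN1\nfoo\n' > "$dir/only1.txt"
printf 'bar\nPATTERN2\n' > "$dir/only2.txt"

# Usage: print the name of every file for which the check succeeds.
matches=$(for f in "$dir"/*.txt; do
    if has_both "$f"; then
        printf '%s\n' "$f"
    fi
done)

echo "$matches"
rm -rf "$dir"
```

Because the patterns are tested independently on every line, their order within the file does not matter.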
You can do this...
...but it's not going to be blazingly fast or anything, because you're still scanning the files twice.
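The command this answer refers to is not preserved here. One plausible form of such a two-pass approach, shown only as a sketch, pipes the file list from a first grep into a second grep through xargs. The demo directory and file names are assumptions. `--null` and `-0` are used so that file names containing spaces survive the pipe.

```shell
#!/bin/sh
# Hypothetical demo data in a temporary directory.
dir=$(mktemp -d)
printf 'PATTERN1\nfoo\nPATTERN2\n' > "$dir/both.txt"
printf 'PATTERN1\nfoo\n' > "$dir/only1.txt"
printf 'bar\nPATTERN2\n' > "$dir/only2.txt"

# Pass 1: list files containing PATTERN1 (NUL-separated names).
# Pass 2: of those files, list the ones that also contain PATTERN2.
matches=$(grep -rl --null 'PATTERN1' "$dir" | xargs -0 grep -l 'PATTERN2')

echo "$matches"
rm -rf "$dir"
```

Only the files that match the first pattern are read a second time, and xargs passes many file names to each grep call, which avoids the one-process-per-file loop from the question.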
The exact awk one-liner that does exactly the same as the script in the question is the following:
Thanks everyone for helping me find this!
c=0 and d=0 are important so that awk does not print the same filename multiple times into the output file assetsToDelete.txt.