Optimize shell and awk script
I am using a combination of a shell script, an awk script, and a find command to perform multiple text replacements in hundreds of files. The file sizes vary between a few hundred bytes and 20 KB.
I am looking for a way to speed up this script.
I am using Cygwin.
I posted this question on Super User earlier, but I think this forum is more appropriate.
The shell script:
    #!/bin/bash
    if [ $# = 0 ]; then
        echo "Argument expected"
        exit 1
    fi
    while [ $# -ge 1 ]
    do
        if [ ! -f "$1" ]; then
            echo "No such file as $1"
            exit 1
        fi
        # Write the result to a temp file named after this shell's PID ($$)
        awk -f ~/scripts/parse.awk "$1" > "${1}.$$"
        if [ $? != 0 ]; then
            echo "Something went wrong with the script"
            rm "${1}.$$"
            exit 1
        fi
        # Replace the original file with the rewritten copy
        mv "${1}.$$" "$1"
        shift
    done
The awk script (simplified):
    #! /usr/bin/awk -f
    /HHH.Web/{
        if ( index($0,"Email") == 0) {
            sub(/HHH.Web/,"HHH.Web.Email");
        }
        printf("%s\r\n",$0);
        next;
    }
The command line:
    find . -type f | xargs ~/scripts/run_parser.sh
Comments (3)
If you know which files need to be processed, you can add the -iname option to the find command so that only those files are passed on.
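As a small illustration of the -iname suggestion, here is a sketch that narrows the file list before anything is handed to the parser. The demo directory, the file names, and the '*.config' pattern are all assumptions made up for this example; the original post does not say which files are relevant.

```shell
#!/bin/bash
# Hypothetical demo tree; none of these names come from the original post.
demo=$(mktemp -d)
touch "$demo/Web.Config" "$demo/app.config" "$demo/readme.txt"

# -iname matches the file name case-insensitively, so both config files
# are selected and readme.txt is never handed to the parser.
matched=$(find "$demo" -type f -iname '*.config' | sort)
printf '%s\n' "$matched"
```

Fewer files in the list means fewer processes started further down the pipeline, which is where the time goes on Cygwin.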
On Cygwin, the most important thing is to avoid fork()-exec() as much as possible.
Windows, by design, is not built to handle many processes the way Linux is.
It does not have fork(), and copy-on-write (COW) is broken.
Therefore, when writing a script, try to perform as much as possible from a single process.
In this case, we want awk, and only one awk. Avoid xargs at all costs.
Another thing: if you have to scan multiple files, the disk cache in Windows is just a joke.
Instead of accessing all files, a better approach is to let grep find only the files that match the given requirements, so that you would have grep supply the list of files for awk to work on.
And within "~/scripts/parse.awk", you must open and close() the files within awk to speed things up.
Avoid system() as much as possible.
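The command from this answer did not survive in the post, so the following is only a sketch of the idea as described: one grep selects the files, and one awk reads the file names on stdin, then opens, rewrites, and closes each file itself. The demo directory and file contents are assumptions for illustration. It also assumes the files are small enough to hold in memory (the question says at most 20 KB) and that rewriting in place without a temp file is acceptable.

```shell
#!/bin/bash
# Hypothetical demo files; the names and contents are made up.
workdir=$(mktemp -d)
printf 'using HHH.Web;\r\nother line\r\n' > "$workdir/a.txt"
printf 'using HHH.Web.Email;\r\n'        > "$workdir/b.txt"
printf 'nothing to change\r\n'           > "$workdir/c.txt"

# grep -rl prints only the names of files that contain the pattern.
# The single awk process reads those names, one per line, on stdin.
grep -rl 'HHH\.Web' "$workdir" | awk '
{
    src = $0
    n = 0
    # Read the whole file into memory, applying the replacement per line.
    while ((getline line < src) > 0) {
        sub(/\r$/, "", line)               # strip an existing carriage return
        if (line ~ /HHH\.Web/ && index(line, "Email") == 0)
            sub(/HHH\.Web/, "HHH.Web.Email", line)
        lines[++n] = line
    }
    close(src)                             # finished reading
    if (n == 0) next                       # unreadable or empty: leave it alone
    # Reopen the same file for writing (this truncates it) and write it back.
    for (i = 1; i <= n; i++)
        printf "%s\r\n", lines[i] > src
    close(src)                             # release the handle before the next file
}'

# Show the result (carriage returns removed for display only).
grep -r 'HHH' "$workdir" | tr -d '\r'
```

Only two processes do the work regardless of the number of files, and no mv or system() call is needed. The trade-off is that an interrupted run can leave a file half written, so a temp file plus rename is the safer choice when that matters.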
You're spawning a new awk process for each file. I would think it is possible to have a single awk invocation that is given many files at once, where awk_script.awk checks for the file (not common practice, I know). You could probably do the mv ${f}.$$ $f there too, but you could also do that as a separate pass from bash.
Hope this helps.
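The example command is missing from this answer as well, so here is one possible reading of it rather than the author's exact code. A single awk invocation receives many files, notices each new file through FNR == 1, and switches its output to a per-file temp name; the renames then happen in a separate bash pass. The demo directory, file contents, and the use of find's "-exec ... {} +" form are assumptions.

```shell
#!/bin/bash
# Hypothetical demo files; the names and contents are made up.
workdir=$(mktemp -d)
printf 'using HHH.Web;\nother line\n' > "$workdir/a.txt"
printf 'using HHH.Web.Email;\n'      > "$workdir/b.txt"

suffix=".$$"    # same temp-file naming scheme as the original shell script

# "-exec ... {} +" hands many file names to each awk invocation, so awk is
# started once per batch of files rather than once per file.
find "$workdir" -type f ! -name "*$suffix" -exec awk -v suffix="$suffix" '
FNR == 1 {                        # first line of a new input file
    if (out != "") close(out)     # finish the previous output file
    out = FILENAME suffix
}
/HHH\.Web/ && index($0, "Email") == 0 {
    sub(/HHH\.Web/, "HHH.Web.Email")
}
{
    sub(/\r$/, "")                # avoid doubling an existing carriage return
    printf "%s\r\n", $0 > out
}
' {} +

# Separate pass from bash: move each temp file over its original.
find "$workdir" -type f -name "*$suffix" | while IFS= read -r tmp; do
    mv "$tmp" "${tmp%"$suffix"}"
done
```

The rename pass still starts one mv per file, so on Cygwin it is likely the most expensive part that remains. Empty input files never trigger FNR == 1, so they get no temp file and are left untouched.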