Optimizing shell and awk scripts

Posted 2024-10-18 02:05:37

I am using a combination of a shell script, an awk script and a find command to perform multiple text replacements in hundreds of files. The file sizes vary between a few hundred bytes and 20 KB.

I am looking for a way to speed up this script.

I am using Cygwin.

I posted this question on Super User earlier, but I think this forum is more appropriate.

The shell script -

#!/bin/bash

if [ $# = 0 ]; then
    echo "Argument expected"
    exit 1
fi

while [ $# -ge 1 ]
do
    if [ ! -f $1 ]; then
        echo "No such file as $1"
        exit 1
    fi

    awk -f ~/scripts/parse.awk $1 > ${1}.$$

    if [ $? != 0 ]; then
        echo "Something went wrong with the script"
        rm ${1}.$$
        exit 1
    fi
    mv ${1}.$$ $1
    shift
done

The awk script (simplified) -

#! /usr/bin/awk -f

/HHH.Web/{
    if ( index($0,"Email") == 0)  {
        sub(/HHH.Web/,"HHH.Web.Email");
    }
    printf("%s\r\n",$0); 
    next;
}

The command line

find .  -type f  | xargs ~/scripts/run_parser.sh

九公里浅绿 2024-10-25 02:05:37

find . -type f | while IFS= read -r file
do
  awk '/HHH.Web/ && !/Email/ {
     sub(/HHH.Web/,"HHH.Web.Email");
     printf("%s\r\n",$0);
     next;
  }
  ' "$file" > "${file}.$$" && mv "${file}.$$" "$file"
done

If you know which files you need to process, you can narrow the search with find's -iname option.
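As a rough illustration of the -iname suggestion, here is a small sketch. The scratch directory, the file names and the `*.aspx` pattern are all made up for the demo; the question does not say which extensions are involved. -iname matches the name case-insensitively, so both spellings of the extension are found while the unrelated file is skipped.

```shell
#!/bin/bash
# Scratch directory with hypothetical file names (assumption, not from the question).
demo_dir=$(mktemp -d)
touch "$demo_dir/Page.ASPX" "$demo_dir/site.aspx" "$demo_dir/notes.txt"

# -iname compares the file name case-insensitively against the glob.
matches=$(find "$demo_dir" -type f -iname '*.aspx' | sort)
printf '%s\n' "$matches"
```

Fewer files listed by find means fewer files for the loop to touch, which helps regardless of how the replacement itself is run.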

冰葑 2024-10-25 02:05:37

On Cygwin, the most important thing is to avoid fork()-exec() as much as possible.
Windows, by design, is not built to handle many processes the way Linux is.
It does not have fork(), and copy-on-write (COW) is broken.
Therefore, when writing a script, try to do as much as possible from a single process.

In this case we want awk, and only one awk. Avoid xargs at all costs.
Another point: if you have to scan many files, the disk cache in Windows is just a joke.
Instead of opening every file, a better approach is to let grep
find only the files that match the given requirements,
so you would have

grep -rl "some-pattern-perhaps-HHH.Web-or-so" "/dir/to/where/you/have/millions/of/files/" |awk -f ~/scripts/parse.awk

Within "~/scripts/parse.awk", open and close() the files from inside awk itself to speed things up.
Avoid system() as much as possible.

#!/usr/bin/awk -f
BEGIN{
    id=PROCINFO["pid"]; # PROCINFO is a gawk extension
}
# int staticlibs_codesize_grep( option, regexp, filepath, returnArray, returnArray_linenum  )
# small code size
# Code size is chosen over speed. Search may be slow on large files
# "-n" option supported
function staticlibs_codesize_grep(o, re, p, B, C, this, r, v, c){
 if(c=o~"-n")C[0]=0;B[0]=0;while((getline r<p)>0){if(o~"-o"){while(match(r,re)){
 B[B[0]+=1]=substr(r,RSTART,RLENGTH);r=substr(r,RSTART+RLENGTH);if(c)C[C[0]+=1]=c;}
 }else{if(!match(r,re)!=!(o~"-v")){B[B[0]+=1]=r;if(c)C[C[0]+=1]=c;}}c++}return B[0]}
# Total: 293 byte , Codesize: > 276 byte, Depend: 0 byte

{
    file = $0;
    outfile = $0"."id; # Whatever.
    # If you have multiple replacements, or multiline replacements,
    # be careful about the order in which you replace. Writing a k-map for efficient condition branching is a must.
    # Also, try to unroll the loop.

    # The unrolling can be anything; it trades code size for speed.
    # Here is an example of an unrolled loop:
    # instead of having while((getline r<file)>0){if(file~html){print "foo";}else{print "bar";};};
    # we have moved the condition outside of the while() loop.
    if(file~/\.htm$/){
        while((getline r<file)>0){
            # Try to perform the minimum replacement required for the given file.
            # Try to avoid branching with if(){}else{} inside a loop.
            # Keep it minimal and small.
            print "foo" > outfile;
        }
    }else{
        while((getline r<file)>0){
            # Here, as an example, we unrolled the loop into two: one for .htm files, one for other files.
            print "bar" > outfile;
            # if a condition is required, match() is better
            if(match(r,"some-pattern-you-want-to-match")){
                # do whatever complex replacement you want. We reuse RSTART and RLENGTH from match()
                before_match = substr(r,1,RSTART-1);
                matched_data = substr(r,RSTART,RLENGTH);
                after_match = substr(r,RSTART+RLENGTH);
                # if you want further matches, like grep -o, extracting only the match
                a=r;
                while(match(a,"some-pattern-you-want-to-match")){
                    B[B[0]+=1]=substr(a,RSTART,RLENGTH);
                    a=substr(a,RSTART+RLENGTH);
                }
                # The above stores multiple matches from a single line in B
            }
            # If you want to perform even more complex matches, try the grep() option.
            # staticlibs_codesize_grep() handles the -o, -n and -v options. It should satisfy most daily needs.
            # for grep-like output, use printf("%4s\t\b:%s\n", returnArray_linenum[index] , returnArray[index] );

            # Example of multiple matches, against data that may or may not have been replaced by the previous condition.
            if(match(r,"another-pattern-you-want-to-match")){
                # whatever
                # if you decide that replacing is not a good idea, you can abort
                if(for_whatever_reason_we_want_to_abort){
                    break;
                }
            }
            # notice that we always need to output a line.
            print r > outfile;
        }
    }
    # If we forget to close the files, we will run out of file descriptors
    close(file);
    close(outfile);
    # Now we could move the file; however, I would not do it here.
    # First, system() is a very heavy operation; second, our replacement may be incomplete because of human error.
    # system("mv \""outfile"\" \""file"\" ")
    # I would advise writing the command to another file, to be run later by bash or any other shell.
    # NOTE[*1]
    print "mv \""outfile"\" \""file"\" " > "files.to.update.list";
}
END{
    # Assuming we are all good, we should have a log file that records what has been modified
    close("files.to.update.list");
}

# Now when all is ready, meaning you have checked the result and it is what you want, run
#  source "files.to.update.list"
# inside a terminal, or
#  cat "files.to.update.list" |bash
# and you are done.
# NOTE[*1] wrapping the names in \x22 (double quotes) is an incomplete escape: names containing
# \x22, $, ` or \ are still interpreted by the shell (\x27 is harmless inside double quotes).
# Always check "files.to.update.list" for such names to avoid problems,
# perhaps
#  grep -v -- '[$`\\]' "files.to.update.list" > "files.to.update.list.safe"
# then
#  grep -- '[$`\\]' "files.to.update.list" > "files.to.update.list.unsafe"
# may be a good idea.
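
To make the outline above concrete, here is a minimal sketch of the same flow applied to the HHH.Web replacement from the question. It is not the script above: the scratch directory, the `*.cs` file names, the `.new` suffix and the `list` variable are assumptions for the demo, and `--include` is a GNU grep option. grep -rl prints only the names of files that contain the pattern, a single awk reads and rewrites each of them with getline, and the mv commands are collected in a list that can be reviewed before it is run.

```shell
#!/bin/bash
# Scratch data (hypothetical names and contents), with CRLF line endings.
demo_dir=$(mktemp -d)
printf 'using HHH.Web;\r\n' > "$demo_dir/hit.cs"
printf 'nothing here\r\n'   > "$demo_dir/miss.cs"
list_file="$demo_dir/files.to.update.list"

# --include keeps grep away from the .new files and the list that awk creates.
grep -rl --include='*.cs' 'HHH\.Web' "$demo_dir" | awk -v list="$list_file" '
{
    file = $0                      # each input line is a file name
    outfile = file ".new"
    while ((getline line < file) > 0) {
        if (line ~ /HHH\.Web/ && line !~ /Email/)
            sub(/HHH\.Web/, "HHH.Web.Email", line)
        print line > outfile       # always write the line, changed or not
    }
    close(file)
    close(outfile)
    # Record the move instead of calling system() for every file.
    print "mv \"" outfile "\" \"" file "\"" > list
}
END { close(list) }
'

cat "$list_file"    # review the planned moves first
bash "$list_file"   # then apply them in one shell process
```

Because the trailing carriage return stays part of each line read by getline, the CRLF endings survive without an explicit "\r\n" in the print.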

雨的味道风的声音 2024-10-25 02:05:37

You're spawning a new awk process for each file. I would think it is possible to have something like

find . -type f | xargs ./awk_script.awk

where awk_script.awk checks that each file exists (not common practice, I know). It could probably do the mv ${f}.$$ $f too, but you could do that as a separate pass from bash.

Hope this helps.
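
A minimal sketch of this idea, assuming an awk that supports FILENAME and FNR (any POSIX awk) and a find/xargs pair with -print0/-0 (GNU findutils, as shipped with Cygwin). The scratch directory, the sample files and the `.new` suffix are invented for the demo. One awk invocation receives many file names, writes each rewritten file next to its original, and a separate pass in bash moves the copies into place.

```shell
#!/bin/bash
# Scratch data (hypothetical file names and contents).
demo_dir=$(mktemp -d)
printf 'using HHH.Web;\nkeep me\n' > "$demo_dir/a.txt"
printf 'using HHH.Web.Email;\n'    > "$demo_dir/b.txt"

# xargs hands many file names to a single awk process.
# Excluding *.new keeps find from listing files awk is still creating.
find "$demo_dir" -type f ! -name '*.new' -print0 | xargs -0 awk '
    FNR == 1 {                        # first line of a new input file
        if (out != "") close(out)     # release the previous output file
        out = FILENAME ".new"
    }
    /HHH\.Web/ && !/Email/ { sub(/HHH\.Web/, "HHH.Web.Email") }
    { print > out }                   # every line is written, changed or not
'

# Separate pass: move each rewritten copy over its original.
find "$demo_dir" -type f -name '*.new' -print0 |
while IFS= read -r -d '' tmp; do
    mv "$tmp" "${tmp%.new}"
done

cat "$demo_dir/a.txt" "$demo_dir/b.txt"
```

An empty input file never triggers the FNR == 1 rule, so it gets no `.new` copy and is simply left untouched.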
