Bash script: variable expansion with process substitution

Posted 2025-01-04 22:04:24 · 2820 words · 4 views · 0 comments

I'm trying to run an analysis on each column of a multi-column file and then use paste to rejoin the columns. I don't know a priori how many columns there are, so I use "wc -w" and a loop to define an array of commands. Each command is then a process substitution. The following script shows what I'm trying, and the output is shown after it. Notably, if I echo the array of commands to the terminal and then cut-and-paste it with the mouse, it works fine, so the problem must be the order of variable expansion and process substitution.

In short, I need to have a process substitution inside a shell variable. Any ideas? Thanks in advance.

-------------- script.sh ----------------

#!/bin/bash
f="file.txt";
echo "File contents"
cat $f; 
         # simple solution
echo; echo "First try";
paste <(cat $f) <(tac $f)
         # now define cmd[1] and cmd[2] and merge together with paste
echo; echo "Second try";
cmd[0]="paste";
cmd[1]="cat $f";
cmd[2]="tac $f";
${cmd[0]} <(${cmd[1]}) <(${cmd[1]})
         # but what I really want is something like:
echo; echo "Third try";
cmd[1]="<(cat $f)";
cmd[2]="<(tac $f)";
${cmd[0]} ${cmd[1]} ${cmd[2]}
         # or even better:
echo; echo "Fourth try";
${cmd[*]}
echo; echo "Show the array";
echo ${cmd[*]}

------------- output --------------------

$ ./script.sh
File contents
A B C
D E F
G H I

First try
A B C   G H I
D E F   D E F
G H I   A B C

Second try
A B C   A B C
D E F   D E F
G H I   G H I

Third try
paste: <(cat: No such file or directory

Fourth try
paste: <(cat: No such file or directory

Show the array
paste <(cat file.txt) <(tac file.txt)
$ paste <(cat file.txt) <(tac file.txt)
A B C   G H I
D E F   D E F
G H I   A B C
$

In reply to shellter, here is some sample input.

    7.74336e-08 7.30689e-08 0.359106        19.981796       -0.160611       0.027
    7.74336e-08 7.30689e-08 0.363938        19.985069       0.041319        0.035
    7.74336e-08 7.30689e-08 0.363133        19.982094       0.041319        0.068
    7.74336e-08 7.30689e-08 0.360716        19.981796       -0.160611       0.006
    7.74336e-08 7.30689e-08 0.361522        19.981796       0.243249        0.049
    7.74336e-08 7.30689e-08 0.357897        19.986260       0.041319        0.035

There might be 100 million lines of data like this. I need to separate off each column, split each column into blocks of (say) 1000, average the corresponding elements across the blocks, then merge the averaged columns back together again. For the sample above, if I were averaging over just 2 blocks of 3 elements each (instead of 100K blocks of 1000 each), then the output from column 6 would be:

0.0165     # =(0.027+0.006)/2 - 1st row from each size-3 block
0.042      # =(0.035+0.049)/2 - 2nd row
0.0515     # =(0.068+0.035)/2 - 3rd row
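
The block-averaging arithmetic above can be checked with a short script. This is only a sketch: `block_avg` is a hypothetical helper name, not the asker's Some_Complicated_Analysis, and it assumes one value per input line.

```bash
#!/bin/bash
# block_avg N: hypothetical stand-in for the averaging step.
# Reads one number per line on stdin, treats the input as consecutive
# blocks of N lines, and prints the average of the k-th element of
# every block, for k = 1..N.
block_avg() {
  awk -v n="$1" '
    { k = (NR - 1) % n + 1; sum[k] += $1; cnt[k]++ }
    END { for (k = 1; k <= n; k++) if (cnt[k]) print sum[k] / cnt[k] }
  '
}

# Column 6 of the six sample rows, averaged over 2 blocks of 3 elements.
printf '%s\n' 0.027 0.035 0.068 0.006 0.049 0.035 | block_avg 3
```

Run on the six sample values, this should print the same three averages listed above, one per line.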

I already have the program to do this averaging (that's "Some_Complicated_Analysis") and it works fine. So all I need is for my script to separate off the columns, feed each one into Some_Complicated_Analysis, then merge the various outputs back into columns with paste. But the files are very large, and I don't know a priori how many columns there are. If I knew there would be only 2 columns, then paste <(${cmd[1]}) <(${cmd[2]}) would work fine.

SOLUTION FOUND

UPDATE: an answer has been found; it is shown in the updated reply by glenn jackman below. The paste command must be preceded by eval. I don't know exactly why this is necessary, but without it the variable expansion ${cmd[]} messes up the process substitution <(...). That answer also puts double quotes around the array expansion "${cmd[*]}". These seem less important, though without them some other expansions within cmd[] might fail. The eval, however, is necessary.
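
A likely explanation, offered as a sketch rather than a definitive account: bash recognizes <(...) while it parses the command line, before it expands variables. Text that arrives through a variable expansion is only word-split and glob-expanded, so `<(cat` reaches the command as a literal file name. eval makes bash parse the expanded string a second time. The snippet below uses hypothetical names (`demo_file`, `sub`) and assumes bash with process substitution enabled and GNU tac available.

```bash
#!/bin/bash
# Hypothetical demo file; any small text file would do.
demo_file=$(mktemp)
printf 'A B C\nD E F\n' > "$demo_file"

# A process substitution stored as plain text in a variable.
sub="<(tac $demo_file)"

# Without eval: the expansion is split into the two words "<(tac" and
# "<path>)", which cat treats as file names, so it fails.
if cat $sub 2>/dev/null; then
  without_eval="worked"
else
  without_eval="failed"
fi

# With eval: the string is parsed again, so <(...) is recognized.
with_eval=$(eval cat "$sub")

echo "without eval: $without_eval"
echo "with eval:"
echo "$with_eval"
rm -f "$demo_file"
```

This mirrors the "Third try" error in the question: paste complained about a file called `<(cat` because that is the literal word it received.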

天气好吗我好吗 2025-01-11 22:04:25

Define each of "cmd1" and "cmd2" as its own array:

$ cmd1=(cat $f)
$ cmd2=(tac $f)
$ paste <("${cmd1[@]}") <("${cmd2[@]}")
A B C   G H I
D E F   D E F
G H I   A B C

Update: you just need to eval your constructed process substitutions:

cols=$(head -n 1 $f|wc -w)
for (( i=1 ; i<=cols ; i++ )); do
  cmd[i]="<(cat $f|cut -f$i|Some_Complicated_Analysis)"
done
eval paste "${cmd[*]}"   # quotes are important here
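
The loop above can be tried end to end with a stand-in for the analysis step. In this sketch, `analysis` is a hypothetical placeholder (it just reverses the line order with tac) for Some_Complicated_Analysis, `paste_columns` is an illustrative name, and awk selects the column so that both spaces and tabs count as separators, which is an assumption about the data. Because eval re-parses the strings it is given, it should only be used when the file name and commands are trusted.

```bash
#!/bin/bash
# Hypothetical placeholder for Some_Complicated_Analysis.
analysis() { tac; }

# paste_columns FILE: run `analysis` on each whitespace-separated
# column of FILE and paste the results side by side.
paste_columns() {
  local f=$1 cols i
  local -a cmd=()
  cols=$(head -n 1 "$f" | wc -w)
  for (( i = 1; i <= cols; i++ )); do
    # Inside double quotes <(...) is plain text. Only $i is expanded
    # here; the escaped \$ and \"\$f\" stay literal until eval
    # parses the string.
    cmd+=( "<(awk '{ print \$$i }' \"\$f\" | analysis)" )
  done
  # The quotes keep each element intact until eval parses the line.
  eval paste "${cmd[*]}"
}

demo=$(mktemp)
printf 'A B C\nD E F\nG H I\n' > "$demo"
paste_columns "$demo"
rm -f "$demo"
```

With the three-line demo file, each column comes back reversed and tab-separated, so the first output line should read G, H, I.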

年华零落成诗 2025-01-11 22:04:25

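If your script still doesn't work, this may also be useful. Still follow glenn jackman's answer, but in addition you may want to run this inside your script:

set +o posix

See http://www.linuxjournal.com/content/shell-process-redirection

From the link:
"Process substitution is not a POSIX compliant feature and so it may have to be enabled via: set +o posix"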
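
Whether this matters may depend on the bash version and on how the script is started (for example, bash invoked as sh starts in POSIX mode). A small sketch for checking and switching the mode inside a script, using only the set and shopt builtins; `posix_state`, `before` and `after` are illustrative names:

```bash
#!/bin/bash
# Report whether POSIX mode is currently on.
posix_state() {
  if shopt -qo posix; then echo on; else echo off; fi
}

set -o posix          # simulate a shell that started in POSIX mode
before=$(posix_state)

set +o posix          # the fix suggested above
after=$(posix_state)

echo "before: $before, after: $after"

# With POSIX mode off, process substitution is available.
cat <(echo "process substitution works")
```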

萌梦深 2025-01-11 22:04:25

Try this:

{
  cat -<<EOS
10 7.74336e-08 7.30689e-08 0.359106        19.981796       -0.160611       0.027
10 7.74336e-08 7.30689e-08 0.363938        19.985069       0.041319        0.035
10 7.74336e-08 7.30689e-08 0.363133        19.982094       0.041319        0.068
10 7.74336e-08 7.30689e-08 0.360716        19.981796       -0.160611       0.006
10 7.74336e-08 7.30689e-08 0.361522        19.981796       0.243249        0.049
10 7.74336e-08 7.30689e-08 0.357897        19.986260       0.041319        0.035
EOS
} |
awk '
  BEGIN { binSz=3; binSzLim=binSz ; binSzLim++ }
  NR==1{
    # base error checking on number of cols in first record
    maxCols=NF
    maxColsLim=maxCols ; maxColsLim++
    r=0
  }
  {
    if (NF != maxCols) {
      print "Skipping record, Mismatch in data, expected " maxCols ", found " NF " recs at " NR ":" $0
      next
    }
    r++
    #dbg print "r="r" NR=" NR ":$0=" $0

    # load data into temp arr[] by column
    for (c=1;c<maxColsLim;c++) {
      arr[r,c]+=$c
      #dbg printf ("arr["r","c"]=" arr[r,c] " " )

      avgArr[r,c]++
      #dbg print "avgArr["r","c"]="avgArr[r,c]
    }

    if(r>=binSz) {
      r=0
    }
  }
  END {
    for (r=1;r<binSzLim;r++) {
      #dbg print "r=" r " binSzLim=" binSzLim " " (r<binSzLim) "\t"
      for (c=1;c<maxColsLim;c++) {
        #dbg printf("arr["r","c"]=" arr[r,c] "\tavg=" arr[r,c]/binSz "\t")
        printf("%s ", arr[r,c]/avgArr[r,c])
      }
      printf "\n"
    }
  }
'

Produces output

10 7.74336e-08 7.30689e-08 0.359911 19.9818 -0.160611 0.0165
10 7.74336e-08 7.30689e-08 0.36273 19.9834 0.142284 0.042
10 7.74336e-08 7.30689e-08 0.360515 19.9842 0.041319 0.0515

I added the first column with 10 into the data to make it easy to debug that sum and avg were working correctly.

An interesting problem, thanks for posting and thanks for your good will in continuing to answer my questions;-)

The { cat -<<EOS ... EOS } | wrapper is just there to make it easy to run the whole thing in one copy/paste and see that it works. You can put the awk code in a file with #!/bin/awk -f at the top, chmod 755 myScript.awk, and run it as myScript.awk BigFile > AvgsFile.

You should only need to change to binSz=1000 in the BEGIN block to process your files as you intended.

I'd be interested to know what the timing is on this version.
