I'm trying to perform analyses on each column of a multi-column file and then use paste to rejoin the columns. I don't know a priori how many columns there are, so I use "wc -w" and a loop to define an array of commands. Each command is then a process substitution. The following script shows what I'm trying, followed by its output. Notably, if I echo the array of commands to the terminal and then copy and paste it with the mouse, it works fine, so the problem must be the order of variable expansion and process substitution.
In short, I need to have a process substitution inside a shell variable. Any ideas? Thanks in advance.
-------------- script.sh ----------------
#!/bin/bash
f="file.txt";
echo "File contents"
cat $f;
# simple solution
echo; echo "First try";
paste <(cat $f) <(tac $f)
# now define cmd[1] and cmd[2] and merge together with paste
echo; echo "Second try";
cmd[0]="paste";
cmd[1]="cat $f";
cmd[2]="tac $f";
${cmd[0]} <(${cmd[1]}) <(${cmd[2]})
# but what I really want is something like:
echo; echo "Third try";
cmd[1]="<(cat $f)";
cmd[2]="<(tac $f)";
${cmd[0]} ${cmd[1]} ${cmd[2]}
# or even better:
echo; echo "Fourth try";
${cmd[*]}
echo; echo "Show the array";
echo ${cmd[*]}
------------- output --------------------
$ ./script.sh
File contents
A B C
D E F
G H I
First try
A B C G H I
D E F D E F
G H I A B C
Second try
A B C G H I
D E F D E F
G H I A B C
Third try
paste: <(cat: No such file or directory
Fourth try
paste: <(cat: No such file or directory
Show the array
paste <(cat file.txt) <(tac file.txt)
$ paste <(cat file.txt) <(tac file.txt)
A B C G H I
D E F D E F
G H I A B C
$
In reply to shellter, here is some sample input.
7.74336e-08 7.30689e-08 0.359106 19.981796 -0.160611 0.027
7.74336e-08 7.30689e-08 0.363938 19.985069 0.041319 0.035
7.74336e-08 7.30689e-08 0.363133 19.982094 0.041319 0.068
7.74336e-08 7.30689e-08 0.360716 19.981796 -0.160611 0.006
7.74336e-08 7.30689e-08 0.361522 19.981796 0.243249 0.049
7.74336e-08 7.30689e-08 0.357897 19.986260 0.041319 0.035
There might be 100 million lines of data like this. I need to separate off each column, split each column into blocks of (say) 1000 elements, average the corresponding elements across the blocks, then merge the averaged columns back together again. For the sample above, if I were averaging over just 2 blocks of 3 elements each (instead of 100K blocks of 1000 each), then the output for column 6 would be:
0.0165 # =(0.027+0.006)/2 - 1st row from each size-3 block
0.042 # =(0.035+0.049)/2 - 2nd row
0.0515 # =(0.068+0.035)/2 - 3rd row
I already have the program to do this averaging (that's "Some_Complicated_Analysis") and it works fine. So all I need is for my script to separate off the columns, feed each one into Some_Complicated_Analysis, then merge the various outputs back into columns again with paste. But the files are very large, and I don't know a priori how many columns there are. If I knew there would be only 2 columns, then paste <(${cmd[1]}) <(${cmd[2]}) would work fine.
SOLUTION FOUND
UPDATE: an answer has been found, as shown in the reply update by glenn jackman below. The paste command must be preceded by eval. I don't know exactly why this is necessary, but without it the variable expansion ${cmd[]} messes up the process substitution <(...). The answer also puts double quotes around the array expansion "${cmd[*]}". These seem less important, though without them some other expansions within cmd[] might fail. The eval, however, is necessary.
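A likely reason eval is required (this is a reading of how bash parses a command line, not something stated in the thread): bash recognizes <(...) while parsing the line, before ${cmd[...]} is expanded. The text produced by the expansion is only split into words and is never parsed again, so paste receives the literal word <(cat as a file name. eval forces a second parse of the expanded string. A minimal sketch, using a temporary file as a stand-in for file.txt:

```bash
#!/bin/bash
# Sketch: a process substitution stored as text needs a second parse.
f=$(mktemp)                                  # stand-in for file.txt
printf 'A B C\nD E F\nG H I\n' > "$f"

cmd=(paste "<(cat $f)" "<(tac $f)")

# Without eval: the words "<(cat", "<(tac", ... reach paste literally, so it fails.
plain=$(${cmd[*]} 2>&1)
plain_status=$?

# With eval: the joined string is parsed again and <(...) is recognized.
evaled=$(eval "${cmd[*]}")

echo "without eval (exit status $plain_status): $plain"
echo "with eval:"
echo "$evaled"
rm -f "$f"
```

Because eval runs whatever ends up in the string, this is only reasonable when the file names and commands in cmd[] are trusted.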
Comments (3)
Define each of "cmd1" and "cmd2" as individual arrays.
Update: you just need to eval your constructed process substitutions.
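The code from this answer did not survive on this page. As a hedged sketch of the idea for an unknown number of columns: build one process substitution per column as text, then let eval parse the joined command. The function analyse is a hypothetical stand-in for the real per-column program, and the use of cut with a space delimiter is an assumption about the input format.

```bash
#!/bin/bash
# Sketch: one process substitution per column, joined by paste via eval.
# "analyse" is a hypothetical placeholder for the real per-column analysis.
analyse() { tac; }

f=$(mktemp)                                  # stand-in for the real data file
printf '1 2 3\n4 5 6\n7 8 9\n' > "$f"

ncols=$(head -n 1 "$f" | wc -w)              # column count from the first line

cmd=(paste)
for ((i = 1; i <= ncols; i++)); do
    # Stored as plain text; bash only sees <(...) when eval re-parses it.
    cmd+=("<(cut -d' ' -f$i '$f' | analyse)")
done

result=$(eval "${cmd[*]}")
echo "$result"
rm -f "$f"
```

Each column is read from the file independently, so the large input is scanned once per column. That cost is the price of keeping the per-column program unchanged.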
This may also help if your script is still not working. Still follow glenn jackman's answer, but in addition you may want to run this inside the script:
set +o posix
http://www.linuxjournal.com/content/shell-process-redirection
From the link:
"Process substitution is not a POSIX compliant feature and so it may have to be enabled via: set +o posix"
Try this:
Produces output
I added the first column with 10 into the data to make it easy to debug that sum and avg were working correctly.
An interesting problem, thanks for posting and thanks for your good will in continuing to answer my questions;-)
The { cat - <<EOS ... EOS } | wrapper was just to make it easy to run the whole thing in one copy/paste and see that it is working. You can put the awk code in a file with #!/bin/awk -f at the top, chmod 755 myScript.awk, and run it as myScript.awk BigFile > AvgsFile.
You should only need to change to binSz=1000 in the BEGIN block to process your files as you intended.
I'd be interested to know what the timing is on this version.
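The awk code from this answer is not preserved on this page, so the following is not the original. It is a minimal sketch of the averaging described in the question, handling all columns in one pass. Only the name binSz comes from the answer. The function name avg_blocks, the output precision, and passing binSz with -v instead of setting it in a BEGIN block are assumptions.

```bash
#!/bin/bash
# Sketch only: average the k-th row of every block, for all columns at once.
# binSz is the block size named in the answer; everything else is assumed.
avg_blocks() {
    awk -v binSz="$1" '
    {
        r = (NR - 1) % binSz                 # position of this row inside its block
        for (c = 1; c <= NF; c++) sum[r, c] += $c
        cnt[r]++
        if (NF > ncols) ncols = NF
    }
    END {
        for (r = 0; r < binSz && (r in cnt); r++) {
            for (c = 1; c <= ncols; c++)
                printf "%s%.10g", (c > 1 ? " " : ""), sum[r, c] / cnt[r]
            print ""
        }
    }'
}

# The six sample rows from the question, averaged over blocks of 3 rows.
sample='7.74336e-08 7.30689e-08 0.359106 19.981796 -0.160611 0.027
7.74336e-08 7.30689e-08 0.363938 19.985069 0.041319 0.035
7.74336e-08 7.30689e-08 0.363133 19.982094 0.041319 0.068
7.74336e-08 7.30689e-08 0.360716 19.981796 -0.160611 0.006
7.74336e-08 7.30689e-08 0.361522 19.981796 0.243249 0.049
7.74336e-08 7.30689e-08 0.357897 19.986260 0.041319 0.035'

averages=$(printf '%s\n' "$sample" | avg_blocks 3)
echo "$averages"
```

Only binSz running sums per column are kept in memory, so the input can be streamed once regardless of how many lines it has, and no paste or process substitution is needed.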