Bash: cat based on an array variable
I want to concatenate two or more files, depending on whether their names contain elements from an array.
I am reading this kind of file line by line (proteome.pisa):
2PJY_p chain=(B C) hresname=() hresnumber=() hatom=() model=() altconf=()
2Q7N_p chain=(A E F G H I J K L) hresname=(FUC MAN NAG) hresnumber=() hatom=() model=() altconf=()
For each line, the script extracts the string on the first column and defines it as the variable pdbid. Then it takes the second column and defines it as an array (chain of elements $c). Then it checks if a file called ${pdbid}_${c}_p.pdb exists and, if it does, it merges its content into the file ${pdbid}_p_${chains}.pdb
This is the script:
while read line ; do
echo "$line" > pdb.line
cut -f1 pdb.line > pdb.list
sed -i 's/.*/\"&\"/' pdb.list
sed -i 's/_p//g' pdb.list
awk '{ printf "pdbid="; print }' pdb.list > pdbid.list
cut -f2 pdb.line > chain.list
source pdbid.list
source chain.list
chains=`printf "%s" "${chain[@]}"`
for c in ${chain[@]} ; do
if [ ${#chain[@]} -gt 1 ] && \
[ -f ${pdbid}_${c}_p.pdb ] ; then
cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb
fi
done
done < proteome.pisa
The expected behaviour is, for the first row for instance, to merge 2PJY_p_B.pdb and 2PJY_p_C.pdb into a file called 2PJY_p_BC.pdb. What it actually does is merge the first file twice, and I cannot understand why.
Comments (3)
This is a great question, for it demonstrates that bash cannot do everything on its own. Instead, it needs helpers such as awk, cut, ... I looked through your solution, and it seems that after the two source lines you expect the variables pdbid, chain, and chains to be set. However, your script does not set them correctly, and I can help with that part. I don't know Perl that well, but I think Perl will work nicely in this case. Here is makevars.pl:
And here is the shell script:
I hope this helps.
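For comparison, the three variables can also be set in bash itself, without temporary files or source. This is only a minimal sketch and not the makevars.pl approach mentioned above: parse_line is a hypothetical helper name, and the sketch assumes every line starts with an ID ending in _p, followed by whitespace and chain=(...).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): parse one proteome.pisa
# line and set the globals pdbid, chain (array) and chains (joined letters).
parse_line() {
    local line=$1
    # Group 1: ID without the trailing _p. Group 2: text inside chain=(...).
    local re='^([^[:space:]]+)_p[[:space:]]+chain=\(([^)]*)\)'
    [[ $line =~ $re ]] || return 1
    pdbid=${BASH_REMATCH[1]}
    # Split the chain letters on whitespace into an array.
    read -r -a chain <<< "${BASH_REMATCH[2]}"
    # Join the array elements with no separator, e.g. (B C) becomes BC.
    chains=$(printf '%s' "${chain[@]}")
}

parse_line '2PJY_p chain=(B C) hresname=() hresnumber=() hatom=() model=() altconf=()'
echo "$pdbid ${#chain[@]} $chains"   # prints: 2PJY 2 BC
```

Because nothing is written to disk or sourced, a malformed line simply makes parse_line return non-zero instead of executing whatever the line happens to contain.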
I would suggest preprocessing the input into a simpler form with sed, then looping over that. This assumes that chain=(...) is always the first such attribute on a line.
This avoids the temporary files that riddled your first script; sourcing a generated file also looks rather startling, if not alarming (usually you can use backticks for that sort of thing, but they are not really required here).
There are multiple variants of sed; some (e.g. Linux) want a literal parenthesis to be backslashed, others (e.g. Mac OSX) don't. If this doesn't work, try taking out the backslashes.
The read builtin, given multiple variable names, splits the input on whitespace so that the first variable name receives the first token, and so on; the last named variable receives whatever is left, without additional whitespace splitting.
The continue statement jumps to the next iteration of the enclosing for or while loop. Other than that, this should be fairly self-explanatory. If you are really pressed to do it all in pure Bourne shell, the sed replacement at the beginning could probably be replaced with something involving string substitutions.
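A minimal sketch of the sed-then-loop approach described above, which is not necessarily the code originally posted with that answer. The function name merge_chains is hypothetical, the file naming follows the question's script, and the sed expression is written as a POSIX basic regular expression, where \( \) delimit groups and a bare ( is literal.

```bash
#!/usr/bin/env bash
# Hypothetical function (an illustration only): merge per-chain files for
# every line of the proteome.pisa-style file given as $1.
merge_chains() {
    local pdbid chainlist chains c
    local -a chain
    # Reduce "2PJY_p chain=(B C) hresname=() ..." to "2PJY B C".
    # -n plus the p flag drops lines that do not match the pattern.
    sed -n 's/^\([^[:space:]]*\)_p[[:space:]][[:space:]]*chain=(\([^)]*\)).*/\1 \2/p' "$1" |
    while read -r pdbid chainlist; do
        # pdbid gets the first token; chainlist gets the rest, unsplit.
        read -r -a chain <<< "$chainlist"
        # As in the question, only merge when there is more than one chain.
        [ "${#chain[@]}" -gt 1 ] || continue
        chains=$(printf '%s' "${chain[@]}")
        for c in "${chain[@]}"; do
            # Skip chains that have no file on disk.
            [ -f "${pdbid}_${c}_p.pdb" ] || continue
            cat "${pdbid}_${c}_p.pdb" >> "${pdbid}_p_${chains}.pdb"
        done
    done
}
```

It would be called as merge_chains proteome.pisa from the directory holding the .pdb files. Since the output is opened with >>, running it twice appends the same content again, so remove stale merged files first.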
The problem appears to be the definition of the array in this line:
Changing it to:
appears to solve the problem.
In addition, I have double-quoted all occurrences of "${chain[@]}".
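One mechanism that produces exactly the reported symptom can be checked directly. It is offered as a sketch and not as the change proposed above: in bash, the subscript of an indexed array is evaluated arithmetically, so in ${chain[$c]} the letters B and C are treated as names of unset variables and both evaluate to 0. The function name show_lookup is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical demo function: print what each subscript lookup returns.
show_lookup() {
    local -a chain=(B C)
    local c
    # Ensure no shell variables named B or C exist, as in the question.
    unset B C
    for c in "${chain[@]}"; do
        # $c is a letter, not an index: chain[B] and chain[C] both mean chain[0].
        printf '%s -> %s\n' "$c" "${chain[$c]}"
    done
}

show_lookup
# prints:
# B -> B
# C -> B
```

In the question's loop this means cat ${pdbid}_${chain[$c]}_p.pdb reads the first chain's file on every pass. Since $c already holds the chain letter, cat "${pdbid}_${c}_p.pdb" names the intended file.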