Bash: cat based on an array variable

Posted on 2024-12-03 00:35:25


I want to concatenate two or more files, depending on whether their names contain elements from an array.

I am reading this kind of file line by line (proteome.pisa):

2PJY_p  chain=(B C) hresname=() hresnumber=()   hatom=()    model=()    altconf=()
2Q7N_p  chain=(A E F G H I J K L)   hresname=(FUC MAN NAG)  hresnumber=()   hatom=()    model=()    altconf=()

For each line, the script extracts the string in the first column and stores it in the variable pdbid. It then takes the second column and defines it as an array (chain, with elements $c). Finally, it checks whether a file called ${pdbid}_${c}_p.pdb exists and, if it does, appends its contents to the file ${pdbid}_p_${chains}.pdb.

This is the script:

while read line ; do

echo "$line" > pdb.line
cut -f1 pdb.line > pdb.list
sed -i 's/.*/\"&\"/' pdb.list
sed -i 's/_p//g' pdb.list
awk '{ printf "pdbid="; print }' pdb.list > pdbid.list

cut -f2 pdb.line > chain.list

source pdbid.list
source chain.list

chains=`printf "%s" "${chain[@]}"`

for c in ${chain[@]} ; do
if [ ${#chain[@]} -gt 1 ] && \
   [ -f ${pdbid}_${c}_p.pdb ] ; then  
cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb
fi
done

done < proteome.pisa

The expected behaviour is, for instance, that the first row merges 2PJY_p_B.pdb and 2PJY_p_C.pdb into a file called 2PJY_p_BC.pdb. However, what it actually does is merge the first file twice. I cannot understand why...

演多会厌 2024-12-10 00:35:25


This is a great question, because it shows that bash does not do everything on its own; it relies on helpers such as awk, cut, and so on. I looked through your solution, and it seems that after the two source lines you expect the variables pdbid, chain, and chains to be set. However, your script does not set them correctly, and I can help with that part. I don't know Perl very well, but I think it works nicely in this case. Here is makevars.pl:

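use strict;
use warnings;

# Read proteome.pisa lines from stdin and print shell assignments for each.
while (my $line = <STDIN>) {
    if ($line =~ /^(.*)_p.*chain=\((.*)\).*hresname.*$/) {
        print "pdbid=$1\n";       # e.g. pdbid=2PJY
        print "chain=($2)\n";     # e.g. chain=(B C)
        my $chains = $2;
        $chains =~ s/ //g;        # drop the spaces: "B C" becomes "BC"
        print "chains=$chains\n";
    }
}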

And here is the shell script:

while read line
do

    echo "$line" | perl makevars.pl >setvars.sh
    source setvars.sh
    # Now, pdbid, chain, and chains are set, do your things

done < proteome.pisa

I hope this helps.
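
One way the "do your things" placeholder could be filled in is sketched below. It assumes pdbid, chain, and chains have just been set by sourcing setvars.sh; here they are assigned by hand for the first sample line, and the function name merge_chains is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: append every existing ${pdbid}_${c}_p.pdb to
# ${pdbid}_p_${chains}.pdb. Relies on pdbid, chain and chains being set.
merge_chains() {
    local c out="${pdbid}_p_${chains}.pdb"
    (( ${#chain[@]} > 1 )) || return 0    # a single chain needs no merging
    for c in "${chain[@]}"; do
        # Use the loop value $c directly, not ${chain[$c]}.
        [ -f "${pdbid}_${c}_p.pdb" ] && cat "${pdbid}_${c}_p.pdb" >> "$out"
    done
    return 0
}

# Values as setvars.sh would define them for the first sample line.
pdbid=2PJY
chain=(B C)
chains=BC
merge_chains
```

Because the output is opened with >>, running this twice appends the chains again; removing the output file first (or redirecting the whole loop with >) would avoid that.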


肩上的翅膀 2024-12-10 00:35:25


I would suggest preprocessing the input into a simpler form with sed, then looping over that. This is assuming the chain=(...) is always the first such attribute on a line.

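#!/bin/bash

# Replace 2ICQ_p chain=(A B C ... Z) attribs= ...   with
# 2ICQ_p A B C ... Z
# In sed's default (basic) regex syntax an unescaped ( is a literal character.
sed 's/[[:space:]]*chain=(/ /;s/).*//' <proteome.pisa |
while read -r pdbid chain; do
    pdbid=${pdbid%_p}       # 2ICQ_p -> 2ICQ, as the question's script does
    chains=${chain// /}     # remove every space: "A B C" -> "ABC"
    for c in $chain; do
        test -e "${pdbid}_${c}_p.pdb" || continue
        cat "${pdbid}_${c}_p.pdb"
    done >"${pdbid}_p_${chains}.pdb"
done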

This avoids the use of temporary files which riddled your first script; sourcing a generated file also looks rather startling, if not alarming (usually you can use backticks for that sort of thing, but they are not really required here).

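A note on the parentheses: in sed's default basic regular expressions, a bare parenthesis is a literal character and \( ... \) forms a group; with extended regular expressions (sed -E) it is the other way around. If sed complains about an unmatched parenthesis, check which of the two forms the expression is using.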

read with multiple variable names splits the input on whitespace so that the first variable name receives the first token, and so on; the last named variable receives whatever is left, without additional whitespace splitting. continue jumps to the next iteration of the enclosing for or while loop. Other than that, this should be fairly self-explanatory. If you are really pressed to do it all in the shell, the sed replacement at the beginning could probably be replaced with something involving string substitutions.
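
The splitting behaviour of read described above can be checked in isolation. This is a minimal sketch, assuming bash (the here-string and the ${var// /} expansion are not available in a plain Bourne shell); the sample line and the variable names all_removed and first_removed are illustrative only.

```shell
#!/bin/bash
# One preprocessed line, in the "ID followed by chain letters" form (sample data).
line='2Q7N_p A E F G'

# The first word goes to pdbid; everything left over goes to chain.
read -r pdbid chain <<< "$line"

# ${var// /} deletes every space; ${var/ /} deletes only the first one.
all_removed=${chain// /}
first_removed=${chain/ /}

printf '%s | %s | %s | %s\n' "$pdbid" "$chain" "$all_removed" "$first_removed"
# prints: 2Q7N_p | A E F G | AEFG | AE F G
```

The difference between the two expansions matters once a line lists more than two chains.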


ま昔日黯然 2024-12-10 00:35:25


The problem appears to be how the array is subscripted in this line:

cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb

Changing it to:

cat ${pdbid}_${c}_p.pdb >> ${pdbid}_p_${chains}.pdb

appears to solve the problem.

In addition, I have double-quoted all occurrences of "${chain[@]}".
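
The reason the first file was merged twice can be reproduced in a few lines. In bash, the subscript of an indexed array is evaluated as an arithmetic expression, so a letter such as B or C is read as a variable name, and an unset variable evaluates to 0. This is a small demonstration, assuming no variables named B or C are set in the shell; the array picked is introduced only for the demo.

```shell
#!/bin/bash
# Demonstration only: why ${chain[$c]} always yields the first element.
chain=(B C)

picked=()
for c in "${chain[@]}"; do
    # $c expands to a letter; as an arithmetic subscript it is treated as
    # the name of an (unset) variable, which evaluates to 0.
    picked+=("${chain[$c]}")
done

printf 'via ${chain[$c]}: %s\n' "${picked[*]}"
printf 'via $c:           %s\n' "${chain[*]}"
```

Both loop iterations pick up element 0 through ${chain[$c]}, which is why the B file was concatenated twice and the C file never appeared.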
