Is it possible to pass a bash array as a variable to awk?

Posted 2024-12-01 13:27:37

I have a large amount of data that I am importing from text files. The files are preformatted so that I can import each column as a bash array:

2GYS chain=(A B) hresname=(BMA FUC NAG NDG) hresnumber=( ) hatom=( )

Now I would like to extract information from files containing several lines formatted like this:

ATOM 1 N THR A 4 30.127 13.123 1.297 1.00 39.96 N

For instance, I would like to extract all lines in which the first column is ATOM and the fifth column matches the chain array (in this case, it would be both A and B).

UPDATE. This is what I have tried:

for c in "${chain[@]}" ; do
  awk -v pdbid="$pdbid" -v c="$c" '{ if($1 == "ATOM" && $5==c) { print $0 } }' ${pdbid}.pdb >> ../../properpdb/${pdbid}_${c}.pdb
done

for c in "${chain[@]}" ; do
 for r in "${hresname[@]}" ; do
   awk -v pdbid="$pdbid" -v c="$c" -v r="$r" '{ if($1 == "HETATM" && $5==c && $4==r) { print $0 } }' ${pdbid}.pdb >> ../../properpdb/${pdbid}_${c}.pdb
 done
done

The problem is that, as expected, this produces one file for chain A and one for chain B, but not a single file containing both. In addition, it does not produce all possible combinations of the arrays "chain" and "hresname"; it just adds the "hresname" lines to the files that each hold only one "chain".

回眸一遍 2024-12-08 13:27:37

My solution would be to build part of your awk script in bash, specifically the matching condition.

You seem to want the lines that match $1 == "ATOM" && ($5==c[0] || $5==c[1] ...) written to the file.

In bash, construct the matching condition as:

cmatch="\$5==\"${chain[0]}\""
for ((i = 1; i < ${#chain[@]}; i++)); do cmatch+=" || \$5==\"${chain[i]}\""; done
# cmatch should now be of the form: $5=="A" || $5=="B"

# do the same thing for rmatch
rmatch="\$4==\"${hresname[0]}\""
for ((i = 1; i < ${#hresname[@]}; i++)); do rmatch+=" || \$4==\"${hresname[i]}\""; done
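To see what such a condition string looks like before handing it to awk, here is a minimal sketch using the sample arrays from the question. The helper name build_or is made up for illustration and is not part of the answer above; it builds the same kind of OR-list for any field number.

```shell
#!/usr/bin/env bash
# Sample arrays taken from the question.
chain=(A B)
hresname=(BMA FUC NAG NDG)

# build_or FIELD VALUE... (hypothetical helper)
# Joins one comparison per value with "||", e.g. $5=="A" || $5=="B".
build_or() {
  local field=$1 expr="" v
  shift
  for v in "$@"; do
    # ${expr:+ || } adds the separator only when expr is already non-empty.
    expr+="${expr:+ || }\$${field}==\"${v}\""
  done
  printf '%s\n' "$expr"
}

cmatch=$(build_or 5 "${chain[@]}")
rmatch=$(build_or 4 "${hresname[@]}")

echo "$cmatch"
echo "$rmatch"
```

Note that an empty array (such as hresnumber in the question) gives an empty string here, which would leave awk with an empty pair of parentheses, so it is probably worth checking for that case before running awk.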

Now your awk scripts can be adjusted to include the needed bits. (Quoting continues to be a pain: $1 has to reach awk unexpanded, while $cmatch has to be expanded by bash.)

rmatch='$1=="HETATM" && ('"$cmatch"') && ('"$rmatch"')'  # order matters: build rmatch before cmatch is overwritten
cmatch='$1=="ATOM" && ('"$cmatch"')'

So now your matching script should be complete.

awk "$cmatch" ${pdbid}.pdb >> ../../properpdb/${pdbid}_c.pdb
awk "$rmatch" ${pdbid}.pdb >> ../../properpdb/${pdbid}_c.pdb
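A quick way to convince yourself that this does what you want is to run it on a few made-up records. The sketch below is self-contained; the sample lines, the temporary directory, and the output name ${pdbid}_combined.pdb are assumptions for the demo, and it assumes whitespace-separated fields as in the line shown in the question.

```shell
#!/usr/bin/env bash
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

pdbid=2GYS
chain=(A B)
hresname=(BMA FUC NAG NDG)

# Made-up records in the whitespace-separated layout from the question.
cat > "$tmp/$pdbid.pdb" <<'EOF'
ATOM 1 N THR A 4 30.127 13.123 1.297 1.00 39.96 N
ATOM 2 CA THR B 4 31.000 14.000 2.000 1.00 40.00 C
ATOM 3 N GLY C 5 32.000 15.000 3.000 1.00 41.00 N
HETATM 4 C1 NAG A 501 33.000 16.000 4.000 1.00 42.00 C
HETATM 5 O HOH A 601 34.000 17.000 5.000 1.00 43.00 O
HETATM 6 C1 FUC C 502 35.000 18.000 6.000 1.00 44.00 C
EOF

# Build the OR-lists: one comparison per array element.
cmatch=""
for c in "${chain[@]}"; do cmatch+="${cmatch:+ || }\$5==\"$c\""; done
rmatch=""
for r in "${hresname[@]}"; do rmatch+="${rmatch:+ || }\$4==\"$r\""; done

# Wrap them in the record-type tests.
atom_cond='$1=="ATOM" && ('"$cmatch"')'
het_cond='$1=="HETATM" && ('"$cmatch"') && ('"$rmatch"')'

# One combined file: ATOM lines for chains A and B, then matching HETATM lines.
out="$tmp/${pdbid}_combined.pdb"
awk "$atom_cond" "$tmp/$pdbid.pdb" >  "$out"
awk "$het_cond"  "$tmp/$pdbid.pdb" >> "$out"

result=$(cat "$out")
printf '%s\n' "$result"
```

With this sample, records 1 and 2 (ATOM, chains A and B) and record 4 (NAG on chain A) are kept; the chain C lines and the HOH residue are dropped.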

I don't really understand the output file name, ../../properpdb/${pdbid}_${c}.pdb, since that would seem to indicate separate files for each element of chain, which is what you don't want?

If you do want the output divided by the elements of chain, then it's slightly simpler: construct the rmatch string as above (just the $4 comparisons, before they are combined with the $1 and $5 tests), and then do something like

for c in "${chain[@]}" ; do
  awk -v c="$c" '$1=="ATOM" && $5==c' ${pdbid}.pdb  >> ../../properpdb/${pdbid}_${c}.pdb
  awk -v c="$c" '$1=="HETATM" && $5==c && ('"$rmatch"')' ${pdbid}.pdb  >> ../../properpdb/${pdbid}_${c}.pdb
done

if you want all ATOM records first, followed by the matching HETATM records. Or:

for c in "${chain[@]}" ; do
  awk -v c="$c" '$5==c && ($1=="ATOM" || ($1=="HETATM" && ('"$rmatch"')))' ${pdbid}.pdb  >> ../../properpdb/${pdbid}_${c}.pdb
done

if you want them intermixed, in the order they appear in the input file.
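As for the literal question in the title: awk cannot receive a bash array directly, but one common workaround is to pass the elements as a single joined string with -v and split it inside awk. This is a different technique from the generated conditions above, shown only as a sketch; the function name filter_pdb and the sample records are made up, and it assumes the array elements contain no whitespace.

```shell
#!/usr/bin/env bash
chain=(A B)
hresname=(BMA FUC NAG NDG)

# filter_pdb [FILE...] (hypothetical helper): reads the given files, or stdin.
filter_pdb() {
  # "${chain[*]}" joins the elements with spaces (default IFS).
  awk -v chains="${chain[*]}" -v residues="${hresname[*]}" '
    BEGIN {
      # split() fills a numerically indexed array; re-key it by value
      # so that membership can be tested with "in".
      n = split(chains, tmp, " ")
      for (i = 1; i <= n; i++) want_chain[tmp[i]] = 1
      n = split(residues, tmp, " ")
      for (i = 1; i <= n; i++) want_res[tmp[i]] = 1
    }
    # A pattern without an action prints the matching line.
    $1 == "ATOM"   && ($5 in want_chain)
    $1 == "HETATM" && ($5 in want_chain) && ($4 in want_res)
  ' "$@"
}

# Made-up records in the whitespace-separated layout from the question.
sample='ATOM 1 N THR A 4 30.127 13.123 1.297 1.00 39.96 N
ATOM 2 CA THR B 4 31.000 14.000 2.000 1.00 40.00 C
ATOM 3 N GLY C 5 32.000 15.000 3.000 1.00 41.00 N
HETATM 4 C1 NAG A 501 33.000 16.000 4.000 1.00 42.00 C
HETATM 5 O HOH A 601 34.000 17.000 5.000 1.00 43.00 O'

result=$(printf '%s\n' "$sample" | filter_pdb)
printf '%s\n' "$result"
```

Because everything happens in one awk pass, the ATOM and HETATM lines come out intermixed in input order, and the awk program itself stays fixed no matter how many elements the arrays hold.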
