我如何从文件中随机删除集合数量的字符数？

发布于 2025-02-06 16:56:52 字数 1183 浏览 2 评论 0原文

我正在尝试随机在FastA文件中删除10％，15％和20％的核苷酸。

因此，假设我有一个这样的fasta文件...

>GCA_900186885_1_000000000001
ATGCAAACATTTGTAAAAAACTTAATCGAT

我想随机选择并删除10％的核苷酸，在这种情况下为3，导致带有相同标头的Fasta文件，但较少3个核苷酸：

>GCA_900186885_1_000000000001
ATGAAACATTGTAAAAACTTAATCGAT

以上是一个简单的示例，可以轻松手动进行，但是我有一个带有2132142核苷酸的大型Fasta文件，因此希望使用原始文件生成三个新的Fasta文件，但是具有1918928、1812321和1705714核苷酸，代表10个10。％，15％和20％的降低。

我已经搜索了像Stackoverflow和Biostars这样的论坛上找到了一些相关问题，但没有发现有用。

我尝试了以下对另一个用户的建议的改编，以从文件中随机删除行，但这无效。

filename=/Users/home/DETECTION/GCA_900186885.1_48903_D01_genomic_reformatted.fa
number=1918928

NT_count="$(grep -v ">" $filename | grep -E -o "G|C|T|A|N" | wc -l)"
NT_nums_to_delete="$(shuf -i "1-$NT_count" -n "$number")"
sed_script="$(printf '%dd;' $NT_nums_to_delete)"
sed -i.bak -e "$sed_script" "$filename"

sed_script =“ $（printf'％dd;'$ nt_nums_to_delete）命令导致以下错误，我无法弄清楚。

zsh: bad math expression: operator expected at `2122589\n98...'

任何洞察力都将不胜感激。谢谢你！

原文

I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.

So let's say I have a fasta file like this...

>GCA_900186885_1_000000000001
ATGCAAACATTTGTAAAAAACTTAATCGAT

I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:

>GCA_900186885_1_000000000001
ATGAAACATTGTAAAAACTTAATCGAT

The above is a simple example, which could easily be done manually, but I have a large fasta file with 2132142 nucleotides and thus want to generate three new fasta files using the original, but with 1918928, 1812321, and 1705714 nucleotides, representing a 10%, 15% and 20% reduction.

I have searched forums like stackoverflow and biostars for some related questions, but have not found anything useful.

I tried the following adaptation of a suggestion from another user to randomly delete lines from a file, but it didn't work.

filename=/Users/home/DETECTION/GCA_900186885.1_48903_D01_genomic_reformatted.fa
number=1918928

NT_count="$(grep -v ">" $filename | grep -E -o "G|C|T|A|N" | wc -l)"
NT_nums_to_delete="$(shuf -i "1-$NT_count" -n "$number")"
sed_script="$(printf '%dd;' $NT_nums_to_delete)"
sed -i.bak -e "$sed_script" "$filename"

The sed_script="$(printf '%dd;' $NT_nums_to_delete)" command resulted in the following error that I could not figure out.

zsh: bad math expression: operator expected at `2122589\n98...'

Any insight would be much appreciated. Thank you!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

与风相奔跑 2025-02-13 16:56:52

当您标记为bash时，这是一个尴尬：

$ awk -v s=$RANDOM 'BEGIN {
    if(s)                                      # if seed provided as var s
        srand(s)                               # use s as seed 
    else                                       # or not
        srand()
}
!/^>/ {                                        # process non > starting records
    i=int(0.1*(length($0)))                    # 10 % of length to remove
    for(j=1;j<=i;j++) {
        p=int(rand()*length($0))+1             # p the point from where to remove
        $0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char 
    }
}
1' file                                        # output

样本输出：

>GCA_900186885_1_000000000001
ATGCAAACATGTAAAAAACTTAACGAT

As you tagged bash, here's one in awk:

$ awk -v s=$RANDOM 'BEGIN {
    if(s)                                      # if seed provided as var s
        srand(s)                               # use s as seed 
    else                                       # or not
        srand()
}
!/^>/ {                                        # process non > starting records
    i=int(0.1*(length($0)))                    # 10 % of length to remove
    for(j=1;j<=i;j++) {
        p=int(rand()*length($0))+1             # p the point from where to remove
        $0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char 
    }
}
1' file                                        # output

Sample output:

>GCA_900186885_1_000000000001
ATGCAAACATGTAAAAAACTTAACGAT

回复收藏 0 原文

破晓 2025-02-13 16:56:52

20％独立的Indels听起来比在生物学上相关的率要高得多。
您指定您的独立删除发生在
核苷酸水平而不是3碱密码子水平，
而且您没有提及任何维修 / reSync /外显子检测。

这是您的三个示例删除：

ATGCAAACATTTGTAAAAAACTTAATCGAT
ATG AAACATT GTAAAAA CTTAATCGAT

它们看起来好像他们彼此独立，
没有结块。

对于您提供的30台示例，我们可能会这样做：

from random import seed, shuffle
seed(42)  # arbitrary value, for reproducibility across runs
bases = list('ATGCAAACATTTGTAAAAAACTTAATCGAT')
idx = list(range(len(bases)))
shuffle(idx)
num_deletions = int(.10 * len(bases))  # 3
for i in idx[:num_deletions]:
    del bases[i]
return ''.join(bases)

删除索引并删除前3个非常简单，很简单，
但是它具有二次性能，所以您不想运行它
在200万基地上。

这是线性性能的同一件事：

...
deletions = set(idx[:num_deletions])
for i, base in enumerate(bases):
    if i not in deletions:
        yield base

这仍然是分配200万个整数的缺点。

因此，您可能更喜欢删除大约
正确数量的基础：

seed(42)
for base in bases:
    if random() >= .1:
        yield base

要准确获取正确的数字我们需要
根据输出字符串是否动态调整.1
看起来太长或太短的ATM。
实际上，在大块中进行会更简单
以某些精确的洗牌方法结束
末端方便的基础数量很少。

20% independent indels sounds like a much higher rate than would be biologically relevant.
You specified that your independent deletions happen at
the nucleotide level rather than the 3-base codon level,
and you didn't mention any repair / resync / exon detection.

Here are your three example deletions:

ATGCAAACATTTGTAAAAAACTTAATCGAT
ATG AAACATT GTAAAAA CTTAATCGAT

They look like they're independent of one another,
no clumping.

For the 30-base example you offered, we might do this:

from random import seed, shuffle
seed(42)  # arbitrary value, for reproducibility across runs
bases = list('ATGCAAACATTTGTAAAAAACTTAATCGAT')
idx = list(range(len(bases)))
shuffle(idx)
num_deletions = int(.10 * len(bases))  # 3
for i in idx[:num_deletions]:
    del bases[i]
return ''.join(bases)

Permuting the indexes and deleting the first 3 is very simple,
but it has quadratic performance so you wouldn't want to run it
on two million bases.

Here's the same thing with linear performance:

...
deletions = set(idx[:num_deletions])
for i, base in enumerate(bases):
    if i not in deletions:
        yield base

This still has the downside of allocating two million integers.

So you would probably prefer to delete approximately the
right number of bases:

seed(42)
for base in bases:
    if random() >= .1:
        yield base

To get exactly the right number we would need to
dynamically adjust that .1 according to whether the output string
is looking too long or too short ATM.
In practice it would be simpler to proceed in chunks,
finishing up with the exact shuffle approach on some
conveniently small number of bases at the end.

回复收藏 0 原文

~没有更多了~