How do I join two or more of the dictionaries created by Bio.SeqIO.index?

Posted 2024-12-12 05:17:25


I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,

indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)

produces the following error:

indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)

I have tried using,

indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)

which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.

The final option I have explored is:

indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)

which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?


Comments (2)

忘东忘西忘不掉你 2024-12-19 05:17:25


SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).

The best approach is to either use index_db, which will be slower but only needs to index the files once, or to define a higher-level object which acts like a dictionary over your multiple files. Here is a simple example:

from Bio import SeqIO

class MultiIndexDict:
    """Read-only, dictionary-like view over several SeqIO indexes."""
    def __init__(self, *indexes):
        self._indexes = indexes

    def __getitem__(self, key):
        # Return the record from the first index that contains the key.
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)

print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])  # a missing key raises KeyError
始终不够 2024-12-19 05:17:25


If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:" like so:

indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)

where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").

This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
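
For completeness, a minimal sketch of what using the in-memory index could look like end to end; the file names and the record ID below are placeholders rather than anything from your question:

from Bio import SeqIO

# Placeholder inputs; substitute your own infile, pairfile and infmt.
infile, pairfile, infmt = "reads_1.fasta", "reads_2.fasta", "fasta"

# One combined, read-only index over both files, backed by an in-memory SQLite database.
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)

print(len(indata))                 # total number of records across both files
record = indata["some_record_id"]  # hypothetical ID; only this record is parsed from disk
print(record.description)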
