How do I join two or more dictionaries created by Bio.SeqIO.index?
I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
Comments (2)
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository). The best approach is to either use index_db, which will be slower but only needs to index the files once, or to define a higher-level object which acts like a dictionary over your multiple files. Here is a simple example:
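A minimal sketch of such a higher-level object follows. It is not necessarily the example originally posted with this answer. The class name MultiIndex and the key "some_id" are hypothetical, and the sketch assumes only that each wrapped index raises KeyError for a missing key, as an ordinary dictionary does.

```python
from collections.abc import Mapping


class MultiIndex(Mapping):
    """Read-only dictionary-like view over several indexes (hypothetical helper)."""

    def __init__(self, *indexes):
        # Each index can be any mapping, e.g. an object returned by SeqIO.index().
        self._indexes = indexes

    def __getitem__(self, key):
        # Try each index in turn; the first one that contains the key wins.
        for index in self._indexes:
            try:
                return index[key]
            except KeyError:
                continue
        raise KeyError(key)

    def __iter__(self):
        # Yield every key once, even if it occurs in more than one index.
        seen = set()
        for index in self._indexes:
            for key in index:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        # Number of distinct keys across all wrapped indexes.
        return sum(1 for _ in self)


# Intended usage with Biopython (assumed, not executed here):
# from Bio import SeqIO
# records = MultiIndex(SeqIO.index(infile, infmt), SeqIO.index(pairfile, infmt))
# record = records["some_id"]
```

Unlike dict.update, where the later dictionary overrides the earlier one, this sketch returns the record from the first index that holds a duplicated key. Only the keys are held in memory; records are still read from the files on demand by the underlying indexes.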
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:" like so:
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
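The in-memory index described above might be written as in the sketch below. The function name index_in_memory is a hypothetical wrapper, and Biopython is imported inside it so the rest of the block runs without it. The second half illustrates the general sqlite3 trick with a made-up table; it is not the schema Biopython actually uses.

```python
import sqlite3


def index_in_memory(filenames, fmt):
    """Index several sequence files into an in-memory SQLite database.

    Hypothetical helper; requires Biopython to be installed.
    """
    from Bio import SeqIO

    # ":memory:" replaces the on-disk index filename.
    return SeqIO.index_db(":memory:", filenames, fmt)


# The same special name works directly with Python's sqlite3 module.
# The table below is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positions (key TEXT PRIMARY KEY, file_number INTEGER, position INTEGER)"
)
conn.execute("INSERT INTO positions VALUES (?, ?, ?)", ("seq1", 0, 128))
row = conn.execute(
    "SELECT file_number, position FROM positions WHERE key = ?", ("seq1",)
).fetchone()
```

With the question's variables, the call would be index_in_memory([infile, pairfile], infmt). The index disappears when the program exits, so it has to be rebuilt on every run.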