Substring index on disk
I have a file (a FASTA file, to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find its location in the original FASTA file.
This would be easy to do in many cases, using a Trie or substring array; unfortunately, the strings I need to index are 800+ MB, which means doing it in memory is unacceptable, so I'm looking for a reasonable way to create this index on disk, with minimal memory usage.
(edit for clarification)
I am only interested in the headers of the proteins, so for the largest database I'm interested in, this is about 800 MB of text.
I would like to be able to find an exact substring in O(n) time in the length of the input string. This must be usable on 32-bit machines, as it will be shipped to random people who are not expected to have 64-bit machines.
I want to be able to index against any word break within a line, to the end of the line (though lines can be several MB long).
Hopefully this clarifies what is needed and why the solutions given so far are not illuminating.
I should also add that this needs to be done from within Java, and it must run on client computers on various operating systems, so I can't use any OS-specific solution; it must be a programmatic solution.
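For concreteness, one shape such an index could take (a rough sketch with hypothetical file names, not a settled design): keep the byte offset of every word break, sorted by the suffix it points to, as 8-byte longs in a side file, and binary-search that file, reading comparison bytes directly from the FASTA file. An m-byte query then costs O(m log n) byte reads over n word breaks with constant heap, which fits the 32-bit constraint even though it is not strictly O(n).

    import java.io.RandomAccessFile;

    // Sketch only: binary-search a pre-built side file of sorted 8-byte
    // word-break offsets (the index layout is hypothetical), comparing the
    // query against the FASTA file itself at each probe. Heap use is constant.
    public class OffsetIndexSearch {

        public static long find(RandomAccessFile fasta, RandomAccessFile idx,
                                byte[] query) throws Exception {
            long lo = 0, hi = idx.length() / 8 - 1;  // one long per word break
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                idx.seek(mid * 8);
                long offset = idx.readLong();        // suffix start in the FASTA file
                int cmp = compareAt(fasta, offset, query);
                if (cmp == 0) return offset;         // query occurs at this offset
                if (cmp < 0) lo = mid + 1;           // file suffix < query: go right
                else hi = mid - 1;                   // file suffix > query: go left
            }
            return -1;                               // no word break matches
        }

        // Lexicographically compare the file bytes starting at offset against
        // query; 0 means query is a prefix of the suffix at that offset.
        private static int compareAt(RandomAccessFile fasta, long offset,
                                     byte[] query) throws Exception {
            fasta.seek(offset);
            for (byte q : query) {
                int c = fasta.read();                // returns -1 at end of file
                if (c == -1 || c < (q & 0xFF)) return -1;
                if (c > (q & 0xFF)) return 1;
            }
            return 0;
        }
    }

Building the side file requires an external sort of the offsets by the suffixes they point at, which can be done in bounded memory with chunked sorting and a multiway merge.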
4 Answers
In some languages, programmers have access to "direct byte arrays" or "memory maps", which are provided by the OS. In Java we have java.nio.MappedByteBuffer. This allows one to work with the data as if it were a byte array in memory, when in fact it is on disk. The size of the file one can work with is limited only by the OS's virtual memory capabilities, typically somewhat under 4 GB on 32-bit computers. 64-bit? In theory 16 exabytes (about 17 billion GB), but modern CPUs are limited to a 40-bit (1 TB) or 48-bit (128 TB) address space.
This would let you easily work with the one big file.
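A minimal sketch of what that looks like (the file name is a placeholder):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapFasta {
        public static void main(String[] args) throws Exception {
            // "proteins.fasta" is a placeholder path for the 800 MB file.
            try (RandomAccessFile raf = new RandomAccessFile("proteins.fasta", "r");
                 FileChannel ch = raf.getChannel()) {
                // A single MappedByteBuffer is capped at Integer.MAX_VALUE bytes
                // (~2 GB), so an 800 MB file fits in one mapping; larger files
                // would need several mappings over successive regions.
                MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // Random access anywhere in the file; the OS pages it in on demand.
                System.out.println("first byte: " + (char) buf.get(0));
            }
        }
    }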
The FASTA file format is very sparse. The first thing I would do is generate a compact binary format and index that - it should be maybe 20-30% the size of your current file, and the process of encoding/decoding the data should be fast enough (even with 4 GB) that it won't be an issue.
At that point your file should fit in memory, even on a 32-bit machine. Let the OS page it, or make a RAM disk if you want to be certain it's all in memory.
Keep in mind that memory is only around $30 a GB (and getting cheaper), so if you have a 64-bit OS you can even deal with the complete file in memory without encoding it into a more compact format.
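As a rough sketch of what such a compact encoding could look like for nucleotide data (2 bits per base lands at ~25%, within the 20-30% estimate above; protein sequences would need more bits per residue, and this toy class ignores headers and ambiguity codes):

    // Hypothetical 2-bit packing for nucleotide sequences: A=0, C=1, G=2, T=3.
    // Four bases per byte is ~25% of the original size; this is a toy sketch,
    // not a full FASTA codec (headers, N's, lower case are not handled).
    public class TwoBitPacker {

        private static int code(char base) {
            switch (base) {
                case 'A': return 0;
                case 'C': return 1;
                case 'G': return 2;
                case 'T': return 3;
                default:  throw new IllegalArgumentException("base: " + base);
            }
        }

        public static byte[] pack(String seq) {
            byte[] out = new byte[(seq.length() + 3) / 4];
            for (int i = 0; i < seq.length(); i++) {
                // Place base i in bits (2*(i%4)) .. (2*(i%4)+1) of byte i/4.
                out[i / 4] |= code(seq.charAt(i)) << ((i % 4) * 2);
            }
            return out;
        }
    }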
Good luck!
-Adam
I talked to a few co-workers, and they just use VIM/grep to search when they need to. Most of the time, though, I wouldn't expect someone to search for a substring like this.
But I don't see why MS Desktop Search, Spotlight, or Google's equivalent can't help you here.
My recommendation is splitting the file up by gene or species; hopefully the input sequences aren't interleaved.
I don't imagine that the original poster still has this problem, but anyone needing FASTA file indexing and subsequence extraction should check out fastahack: http://github.com/ekg/fastahack
It uses an index file that records newline counts and sequence start offsets. Once the index is generated, you can rapidly extract subsequences; the extraction is driven by fseek64.
It will work very, very well in the case that your sequences are as long as the poster's. However, if you have many thousands or millions of sequences in your FASTA file (as is the case with the output of short-read sequencing or some de novo assemblies), you will want to use another solution, such as a disk-backed key-value store.
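The extraction idea is simple to sketch in Java, where RandomAccessFile.seek plays the role of fseek64 (this illustrates the approach rather than reproducing fastahack's code, and it assumes a fixed line width within the sequence):

    import java.io.RandomAccessFile;

    // Illustration of the idea, not fastahack's code: given a sequence's start
    // offset and a fixed line width (values an index file would supply), seek
    // straight to the subsequence. RandomAccessFile.seek takes a 64-bit offset.
    public class FastaExtract {

        // Extract `length` bases starting at 0-based base position `pos` of a
        // sequence whose bases begin at byte `seqOffset`, with `lineWidth`
        // bases per line in the file.
        public static String subsequence(RandomAccessFile fasta, long seqOffset,
                                         int lineWidth, long pos, int length)
                throws Exception {
            // pos / lineWidth newlines precede base `pos` within the sequence.
            fasta.seek(seqOffset + pos + pos / lineWidth);
            StringBuilder sb = new StringBuilder(length);
            while (sb.length() < length) {
                int c = fasta.read();
                if (c == -1) break;                  // end of file
                if (c != '\n') sb.append((char) c);  // skip newlines we cross
            }
            return sb.toString();
        }
    }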