I have the following format:
SOLEXA3_1:3:5:1473:616/1 gi|7367913151|ref|NC_007367.1| 100.00 23 0 0 27 49 3404561 3404539 1e-5 46.1
SOLEXA3_1:3:5:1473:616/1 gi|73921565|ref|NC_007367.1| 100.00 23 0 0 27 49 3404561 3404539 1e-5 46.1
SOLEXA3_1:3:5:1474:616/1 gi|32140171|ref|NC_007367.1| 100.00 23 0 0 27 49 3404561 3404539 1e-2 46.1
SOLEXA3_1:3:5:1474:616/1 gi|7354921565|ref|NC_007367.1| 100.00 23 0 0 27 49 3404561 3404539 1e-5 46.1
SOLEXA3_1:3:5:1475:616/1 gi|73921565|ref|NC_007367.1| 100.00 23 0 0 27 49 3404561 3404539 1e-5 46.1
SOLEXA3_1:3:5:1475:616/1 gi|73921565|ref|NC_007367.1| 100.00 23 0 0 27 49 3404561 3404539 1e-5 46.1
Basically it is a tab-delimited file. There will be multiple hits per input (the input being the first field, e.g. SOLEXA3_1:3:5:1474:616/1); for that example input the hits are 32140171 and 7354921565. What I want to do is build some sort of in-memory representation of all hits for a particular read along with the quality associated with every hit - that is the penultimate field, 1e-2 and 1e-5 for the aforementioned two hits. So what I have done is the following:

I have a Map<String, ArrayList<TObjectDoubleMap<String>>>, where every String key is basically the input ID and the ArrayList consists of maps from the Trove library, each holding String/double pairs - the String being the ID of a hit and the double its score. My input file is around 18 million lines, and with a heap of -Xmx12g I'm running out of heap memory. Any ideas how I can optimize the memory usage? Bear in mind that the actual scores vary, so I don't think sharing them is feasible.
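A minimal sketch of what that looks like (the parsing details, field positions and file name here are assumptions, not the actual code):

    import gnu.trove.map.TObjectDoubleMap;
    import gnu.trove.map.hash.TObjectDoubleHashMap;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;

    public class CurrentApproach {
        // read ID -> list of single-entry Trove maps (hit ID -> quality)
        static Map<String, ArrayList<TObjectDoubleMap<String>>> hits = new HashMap<>();

        static void load(String file) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t");
                    TObjectDoubleMap<String> hit = new TObjectDoubleHashMap<>();
                    hit.put(f[1], Double.parseDouble(f[10])); // hit ID -> penultimate field
                    hits.computeIfAbsent(f[0], k -> new ArrayList<>()).add(hit);
                }
            }
        }
    }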
I would use:
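A minimal sketch of the structure (the plain HashMap is an assumption; any string-keyed map would do):

    // key = read ID + hit ID; value = raw bytes of the (quality, score) pairs for that key
    Map<String, ByteArrayOutputStream> hits = new HashMap<>();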
where the key is simply a concatenation of the two fields, and you write the qualities and scores to the ByteArrayOutputStream.
The resulting data structure would look something like:
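Conceptually, something like this (the quality is shown here as its e-value exponent; the exact byte layout is an assumption):

    "SOLEXA3_1:3:5:1474:616/1gi|32140171|ref|NC_007367.1|"  ->  [ -2, 46.1 ]
    "SOLEXA3_1:3:5:1475:616/1gi|73921565|ref|NC_007367.1|"  ->  [ -5, 46.1, -5, 46.1 ]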
Then when reading the qualities and scores you just use readByte() and readDouble() until you get to the end of the stream.
Of course, doing it this way makes querying a little trickier, but you will save hugely on memory allocation.
Ex:
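A runnable sketch under those assumptions (the file name, field positions, and storing the e-value as its exponent in a single byte are all choices made here for illustration):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    public class HitStore {
        // key (read ID + hit ID) -> packed (quality byte, score double) pairs
        private final Map<String, ByteArrayOutputStream> hits = new HashMap<>();

        public void load(String file) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t");
                    String key = f[0] + f[1];
                    // store the e-value as its exponent (1e-5 -> -5) in a single byte
                    byte quality = (byte) Math.round(Math.log10(Double.parseDouble(f[10])));
                    double score = Double.parseDouble(f[11]);
                    ByteArrayOutputStream buf =
                            hits.computeIfAbsent(key, k -> new ByteArrayOutputStream());
                    DataOutputStream out = new DataOutputStream(buf);
                    out.writeByte(quality);
                    out.writeDouble(score);
                }
            }
        }

        // Read back every (quality, score) pair stored for a key.
        public void dump(String key) throws IOException {
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(hits.get(key).toByteArray()));
            while (in.available() > 0) {
                byte quality = in.readByte();
                double score = in.readDouble();
                System.out.println("1e" + quality + "\t" + score);
            }
        }
    }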
Using this method I can read and store ~20 million records in <1GB of memory (in around 10 seconds on a MacBook Pro).
I think your approach of using a map of lists is basically good, but could be engineered to be more compact.
Firstly, make sure you are canonicalising the read names. That is, there should only be one instance of a string with the characters "SOLEXA3_1:3:5:1473:616/1" in memory; use a map to reduce the names to a canonical instance before using them.
Secondly, are the hit identifiers always integers? If so, store them as such (as longs, since some are evidently too big to fit in ints).
Thirdly, I think you can store the hits and their scores in a very compact structure, as long as you're prepared to do some work, by manually packing both into a long (!). Then you can just store a sorted array of longs for each input.
Here's how I'd canonicalise the read names:
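A minimal sketch of that canonicalisation (the class and method names are made up here):

    import java.util.HashMap;
    import java.util.Map;

    public class Canonicaliser {
        private final Map<String, String> names = new HashMap<>();

        // Returns a single shared String instance for every distinct read name.
        public String canonicalise(String name) {
            String canonical = names.get(name);
            if (canonical == null) {
                names.put(name, name);
                canonical = name;
            }
            return canonical;
        }
    }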
I can think of at least one smarter way to do this, but this will do for now.
Here's how I'd handle the hit scoring:
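A sketch of the packing; exactly how the score maps onto the 0 to 10000 range mentioned below is an assumption (here it is stored in tenths, e.g. 46.1 -> 461):

    public class HitPacking {
        // Pack a numeric hit ID (up to 50 bits) and its score into one long:
        // the top 50 bits hold the ID, the bottom 14 bits hold the scaled score.
        public static long pack(long hitId, double score) {
            int scaledScore = (int) Math.round(score * 10); // assumed scaling to 0..10000
            return (hitId << 14) | scaledScore;
        }

        public static long unpackId(long packed) {
            return packed >>> 14;
        }

        public static double unpackScore(long packed) {
            return (packed & 0x3fff) / 10.0;
        }
    }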
The 14 there is because 14 binary bits are big enough to hold values up to 16383, which is enough to contain the range 0 to 10000. Note that you only get 50 bits of storage for the ID, so it would be worth checking that no ID is bigger than 1125899906842623.
Since you're using Trove, you can store the packed longs in a TLongArrayList. Keep the list sorted by using binarySearch to find the appropriate place for each long joining the list, and insert to put it there. To look up a value in the list, use binarySearch again.
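A sketch of that sorted-insert bookkeeping with Trove (the class name is made up; this version drops exact duplicates):

    import gnu.trove.list.array.TLongArrayList;

    public class SortedHits {
        private final TLongArrayList hits = new TLongArrayList();

        // Keeps the list sorted as packed longs are added.
        public void add(long packed) {
            int index = hits.binarySearch(packed);
            if (index < 0) {
                hits.insert(-index - 1, packed);
            }
            // if index >= 0 the value is already present and is skipped here
        }

        public boolean contains(long packed) {
            return hits.binarySearch(packed) >= 0;
        }
    }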
There are several options available for this, but I would use an embedded database (but only if a "traditional" database is out of the question) - H2, for instance. Unless you're doing some very heavy computations on the resulting data, that would be a safe bet.
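A minimal sketch of the embedded-H2 route (the table layout, database file name, and column types are assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class HitDb {
        public static void main(String[] args) throws Exception {
            // Embedded, file-backed H2 database; the data no longer has to fit on the heap.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./hits")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS hit(" +
                               "read_id VARCHAR, hit_id VARCHAR, quality DOUBLE, score DOUBLE)");
                }
                try (PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO hit VALUES (?, ?, ?, ?)")) {
                    // one row from the sample data above
                    insert.setString(1, "SOLEXA3_1:3:5:1474:616/1");
                    insert.setString(2, "gi|32140171|ref|NC_007367.1|");
                    insert.setDouble(3, 1e-2);
                    insert.setDouble(4, 46.1);
                    insert.executeUpdate();
                }
            }
        }
    }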
Just a quick list of other options:
You may even use a combination of them all.