Comparing two large files in Clojure (i.e., finding unmapped reads in a tophat alignment)

Posted 2024-11-30 04:04:17


Problem: find ids that are in one file but not in another. Each file is about 6.5 GB. Specifically (for those in the bioinformatics domain), one file is a fastq file of sequencing reads and the other is a sam alignment file from a tophat run. I would like to determine which reads in the fastq file are not in the sam alignment file.

I am getting java.lang.OutOfMemory: Java heap space errors. As suggested (ref1, ref2) I am using lazy sequences. However, I am still running out of memory. I have looked at this tutorial, but I don't quite understand it yet. So I am posting my less sophisticated attempt at a solution with the hope that I am only making a minor mistake.

My attempt:

Since neither file will fit into memory, the lines from the sam file are read a chunk at a time and the ids of each line in the chunk are put into a set. A lazy list of fastq ids is then filtered against the set of sam ids, keeping only those ids that are not in the set. This is repeated with the next chunk of sam lines and the remaining fastq ids.

(defn ids-not-in-sam 
  [ids samlines chunk-size]
  (lazy-seq
    (if (seq samlines)
      (ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
                      (drop chunk-size samlines) chunk-size)
      ids)))

not-in determines which ids are not in the set.

(defn not-in 
  ; Return the elements x of xs which are not in the set s
  [s xs]
  (filter (complement s) xs))

qnames gets the id field from a line in the sam file.

(defn qnames [samlines]
  (map #(first (.split #"\t" %)) samlines))

Finally, it's put together with IO (using read-lines and write-lines from clojure.contrib.io).

(defn write-fq-not-in-sam [fqfile samfile fout chunk-size] 
    (io/write-lines fout (ids-not-in-sam (map fq-id (read-fastq fqfile))
                                         (read-sam samfile) chunk-size))) 

I am pretty sure I am doing everything in a lazy manner. But I may be holding onto the head of a sequence somewhere that I do not notice.

Is there an error in the code above that is causing the heap to fill up? More importantly, is my approach to the problem all wrong? Is this an appropriate use for lazy sequences, or am I expecting too much?
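For comparison, a memory shape that is easier to reason about is building one set of all sam qnames up front and making a single lazy pass over the fastq ids. This is a hypothetical variant, not the poster's code, and it only helps if the qnames alone fit in the heap (which the chunked version deliberately does not assume); `ids-not-in-sam-single-set` is a name invented here:

```clojure
(require '[clojure.string :as str])

;; Hypothetical single-set variant: build one set of all SAM qnames
;; eagerly, then filter the fastq ids in a single lazy pass.  `into`
;; with a transducer consumes samlines incrementally without stacking
;; up one filter closure (and one retained set) per chunk.
(defn ids-not-in-sam-single-set [ids samlines]
  (let [qname-set (into #{} (map #(first (str/split % #"\t"))) samlines)]
    (remove qname-set ids)))
```

The trade-off is explicit: one set whose size is bounded by the number of distinct sam qnames, versus the chunked version's chain of nested lazy filters, each of which keeps its own chunk set reachable.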

(The errors could be in the read-sam and read-fastq functions, but my post is already a bit long. I can show those later if need be).
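The post never shows read-fastq, read-sam, or fq-id, so here is only a minimal sketch of what lazy readers along these lines might look like (all three definitions are hypothetical; they assume uncompressed files, 4-line fastq records, and SAM header lines beginning with "@"). Note that a line-seq over an open reader must be fully consumed before the reader is closed, which is itself a common source of subtle bugs with large files:

```clojure
(require '[clojure.java.io :as jio]
         '[clojure.string :as str])

;; Hypothetical sketches -- the post elides the real definitions.
;; A fastq record is 4 lines; the first (header) line carries the read id.
(defn read-fastq [f]
  (partition 4 (line-seq (jio/reader f))))

;; "@SRR001.1 extra comment" -> "SRR001.1"
;; (drops the leading "@" and anything after the first whitespace)
(defn fq-id [[header & _]]
  (subs (first (str/split header #"\s+")) 1))

;; SAM header lines start with "@"; alignment records follow.
(defn read-sam [f]
  (remove #(str/starts-with? % "@") (line-seq (jio/reader f))))
```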


Comments (1)

美羊羊 2024-12-07 04:04:17
  1. Sort your data sets (read chunks into memory, sort, write to temp files, merge the sorted files).
  2. Iterate over both sorted sets to find intersecting/missing elements (I hope the algorithm is clear).
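The merge step (step 2) can be sketched as a lazy walk over two sorted id sequences. This assumes step 1, the external sort, has already produced the two sorted streams; `ids-only-in-a` is a name invented here:

```clojure
;; Lazy merge-difference over two *sorted* seqs of ids: returns the ids
;; present in `a` but absent from `b`.  Only the current head of each
;; seq is needed at any moment, so memory stays constant regardless of
;; file size (provided nothing else retains the heads).
(defn ids-only-in-a [a b]
  (lazy-seq
    (cond
      (empty? a) nil
      (empty? b) a
      :else (let [c (compare (first a) (first b))]
              (cond
                (neg? c)  (cons (first a) (ids-only-in-a (rest a) b))
                (zero? c) (ids-only-in-a (rest a) (rest b))
                :else     (ids-only-in-a a (rest b)))))))
```

Because both inputs are sorted, each comparison advances at least one stream, so the whole difference is a single O(n + m) pass.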