改进迭代文本解析的 clojure lazy-seq 使用

发布于 2024-09-11 01:02:51 字数 2259 浏览 5 评论 0原文

我正在编写此编码的 Clojure 实现挑战，尝试查找 Fasta 格式的序列记录的平均长度：

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

有关更多背景信息，请参阅此有关 Erlang 解决方案的相关 StackOverflow 帖子。

我的 Clojure 初学者尝试使用lazy-seq 尝试一次读入文件中的一条记录，以便它将扩展到大文件。然而，它相当消耗内存并且速度很慢，所以我怀疑它没有得到最佳实现。这是使用 BioJava 库来抽象记录解析的解决方案：

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths [seq-iter]
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

以及无需外部库的等效方法

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths [lines cur-length]
  "Retrieve lengths of sequences in the file using line lengths"
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (= nil cur-line) [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))

：当前的实现在大文件上需要 44 秒，而 Python 实现则需要 7 秒。您能提供一些关于加快代码速度并使其更直观的建议吗？使用lazy-seq是否可以按预期正确地逐条记录地解析文件？

原文

I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background see this related StackOverflow post about an Erlang solution.

My beginner Clojure attempt uses lazy-seq to attempt to read in the file one record at a time so it will scale to large files. However it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths [seq-iter]
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

and an equivalent approach without external libraries:

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths [lines cur-length]
  "Retrieve lengths of sequences in the file using line lengths"
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (= nil cur-line) [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))

The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

安静被遗忘 2024-09-18 01:02:52

这可能并不重要，但平均值占据了长度序列的头部。
以下是一种完全未经测试但更懒惰的方式来做我认为你想要的事情。

(use 'clojure.java.io) ;' since 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt)))

(defn fasta-avg [f]
  (->> (reader f) 
    line-seq
    (filter #(not (.startsWith % ">")))
    (map #(.length %))
    lazy-avg))

It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier way to do what I think you want.

(use 'clojure.java.io) ;' since 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt)))

(defn fasta-avg [f]
  (->> (reader f) 
    line-seq
    (filter #(not (.startsWith % ">")))
    (map #(.length %))
    lazy-avg))

回复收藏 0 原文

呢古 2024-09-18 01:02:52

您的 average 函数是非惰性的——它需要在保持其头部的同时实现整个 coll 参数。 更新：刚刚意识到我原来的答案包含了一个关于如何解决上述问题的荒谬建议......啊。幸运的是 ataggart 此后发布了正确的解决方案。

除此之外，您的代码乍一看确实很懒，尽管目前不鼓励使用 read-lines （使用 line-seq 代替）。

如果文件非常大并且您的函数将被调用很多次，请在 seq-length 的参数向量中键入提示 seq-iter -- ^NameOfBiojavaSeqIterClass seq-iter，如果您使用的是 Clojure 1.1，请使用 #^ 代替 ^ ——可能会产生显着差异。事实上，(set! *warn-on-reflection* true)，然后编译代码并添加类型提示以删除所有反射警告。