Improving Clojure lazy-seq usage for iterative text parsing
I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:
>1
GATCGA
GTC
>2
GCA
>3
AAAAA
For more background see this related StackOverflow post about an Erlang solution.
My beginner Clojure attempt uses lazy-seq to read the file one record at a time so that it scales to large files. However, it is fairly memory hungry and slow, so I suspect it is not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:
    (import '(org.biojava.bio.seq.io SeqIOTools))
    (use '[clojure.contrib.duck-streams :only (reader)])

    (defn seq-lengths [seq-iter]
      "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
      (lazy-seq
        (if (.hasNext seq-iter)
          (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

    (defn fasta-to-lengths [in-file seq-type]
      "Use BioJava to read a Fasta input file as a StreamReader of sequences"
      (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

    (defn average [coll]
      (/ (reduce + coll) (count coll)))

    (when *command-line-args*
      (println
        (average (apply fasta-to-lengths *command-line-args*))))
and an equivalent approach without external libraries:
    (use '[clojure.contrib.duck-streams :only (read-lines)])

    (defn seq-lengths [lines cur-length]
      "Retrieve lengths of sequences in the file using line lengths"
      (lazy-seq
        (let [cur-line (first lines)
              remain-lines (rest lines)]
          (if (= nil cur-line) [cur-length]
            (if (= \> (first cur-line))
              (cons cur-length (seq-lengths remain-lines 0))
              (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

    (defn fasta-to-lengths-bland [in-file seq-type]
      ; pop off first item since it will be everything up to the first >
      (rest (seq-lengths (read-lines in-file) 0)))

    (defn average [coll]
      (/ (reduce + coll) (count coll)))

    (when *command-line-args*
      (println
        (average (apply fasta-to-lengths-bland *command-line-args*))))
The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?
Comments (2)
It probably doesn't matter, but average is holding onto the head of the seq of lengths. The following is a wholly untested, but lazier way to do what I think you want.
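One way this could look, as a minimal sketch rather than the snippet the answer originally carried: accumulate the sum and the count in a single reduce, so coll is consumed exactly once and nothing needs to keep a reference to the start of the seq. Returning nil for an empty collection is an assumption made here; the average in the question would throw a divide-by-zero error instead.

```clojure
(defn average
  "Single-pass mean. coll is handed to reduce once, so a lazy seq of
  lengths can be garbage-collected as it is consumed.
  Returns nil for an empty coll (an assumption of this sketch)."
  [coll]
  (let [[sum n] (reduce (fn [[s c] x] [(+ s x) (inc c)])
                        [0 0]
                        coll)]
    (when (pos? n)
      (/ sum n))))
```

With integer inputs the result is an exact ratio; wrap it in double if a floating-point value is preferred.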
Your average function is non-lazy: it needs to realise the entire coll argument while holding onto its head. Update: Just realised that my original answer included a nonsensical suggestion as to how to solve the above problem... argh. Fortunately ataggart has since posted a correct solution.

Other than that, your code does seem lazy at first glance, though the use of read-lines is currently discouraged (use line-seq instead).

If the file is really large and your functions will be called a large number of times, type-hinting seq-iter in the argument vector of seq-lengths (^NameOfBiojavaSeqIterClass seq-iter; use #^ in place of ^ if you're on Clojure 1.1) might make a significant difference. In fact, evaluate (set! *warn-on-reflection* true), then compile your code and add type hints to remove all reflection warnings.
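The line-seq and type-hint suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not the answerer's code: it assumes Clojure 1.2 or later (for clojure.java.io and the ^ hint syntax), the names header? and average-length are hypothetical, and any lines before the first > header are counted toward the total rather than dropped as in the question.

```clojure
(require '[clojure.java.io :as io])

;; Ask the compiler to warn wherever it falls back to reflection.
(set! *warn-on-reflection* true)

(defn header?
  "True when a FASTA line starts a new record. The ^String hint lets
  .startsWith compile to a direct method call instead of reflection."
  [^String line]
  (.startsWith line ">"))

(defn average-length
  "Hypothetical helper: mean sequence length of the FASTA file at path.
  Reads lazily with line-seq and folds total length and record count
  in one pass. Returns nil when the file holds no records."
  [path]
  (with-open [rdr (io/reader path)]
    (let [[total n] (reduce (fn [[total n] ^String line]
                              (if (header? line)
                                [total (inc n)]
                                [(+ total (.length line)) n]))
                            [0 0]
                            (line-seq rdr))]
      (when (pos? n)
        (/ total n)))))
```

Because the reduce runs inside with-open, the whole file is consumed before the reader is closed, and only the running total and count are retained.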