"Mean length of the sequences in a fasta file": can you improve this Erlang code?
I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answer this question using the following Erlang code:
-module(golf).
-export([test/0]).
line([],{Sequences,Total}) -> {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.
scanLines(S,Sequences,Total)->
    case io:get_line(S,'') of
        eof -> {Sequences,Total};
        {error,_} ->{Sequences,Total};
        Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
    end.
test()->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p\n",[Total/(1.0*Sequences)]),
    halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
This code seems to work fine for a small fasta file, but it takes hours to parse a larger one (>100 MB). Why? I'm an Erlang newbie; can you please improve this code?
If you need really fast IO, then you have to do a little more trickery than usual.
This is the fastest IO that I know of, but note the -noshell -noinput flags. Compile with erlc +native +"{hipe, [o3]}" g.erl but with -smp disable
and run:
With -smp enable but native it takes:
Byte code but with -smp disable (almost on par with native, because most of the work is done in the port!):
Just for completeness, byte code with smp:
For comparison, sarnold's version gives me the wrong answer and takes longer on the same hardware:
EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little surprised. It has 3824397 rows and is 232 MB. That means the -smp disable version can handle 1.18 million text lines per second (71 MB/s in line-oriented IO).
I too am learning Erlang; thanks for the fun question.
I understand that working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead, you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.
Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)
I'd expect read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".
EDIT
Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)
Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.
Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.
Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:
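A minimal sketch of the raw, buffered, line-by-line approach described in this answer; it is not the answerer's original listing. It assumes OTP 20 or later for string:trim/1, and the module name fasta_mean and function mean_length/1 are hypothetical names chosen for illustration.

```erlang
-module(fasta_mean).
-export([mean_length/1]).

%% Hypothetical helper: mean sequence length of the FASTA file at Path.
%% raw bypasses the io server; read_ahead adds a read buffer.
mean_length(Path) ->
    {ok, Fd} = file:open(Path, [read, raw, read_ahead]),
    {Sequences, Total} = scan(Fd, 0, 0),
    ok = file:close(Fd),
    %% Fails with badarith if the file contains no title line.
    Total / Sequences.

%% Read errors are not handled here; {error, _} crashes with case_clause.
scan(Fd, Sequences, Total) ->
    case file:read_line(Fd) of
        eof ->
            {Sequences, Total};
        {ok, ">" ++ _Title} ->
            %% A title line starts a new sequence.
            scan(Fd, Sequences + 1, Total);
        {ok, Line} ->
            %% string:trim/1 drops surrounding whitespace, newline included.
            scan(Fd, Sequences, Total + length(string:trim(Line)))
    end.
```

After compiling with erlc fasta_mean.erl, it could be called from the shell as fasta_mean:mean_length("uniprot_sprot.fasta"). Timings will depend on the machine, so benchmark before drawing conclusions.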
It looks like your big performance problems have been solved by opening the file in raw mode, but here are some more thoughts if you need to optimise that code further.
Learn and use fprof.
You're using string:strip/1 primarily to remove the trailing newline. As Erlang values are immutable, you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count; otherwise I'd try writing a length function that counts the number of relevant characters and ignores irrelevant ones.
I'm wary of advice that says binaries are better than lists, but given how little processing you do, it's probably the case here. The first steps are to open the file in binary mode and use erlang:size/1 to find the length.
It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division. Erlang division works correctly.
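The binary-mode suggestion could look roughly like the sketch below. It is an illustration under assumptions, not code from this answer: the module name fasta_bin and the functions mean_length/1, count/2 and chomped_size/1 are hypothetical, and byte_size/1 is used as the binary-specific counterpart of erlang:size/1.

```erlang
-module(fasta_bin).
-export([mean_length/1, count/2]).

%% Hypothetical helper: with the binary option, read_line returns binaries.
mean_length(Path) ->
    {ok, Fd} = file:open(Path, [read, raw, binary, read_ahead]),
    {Sequences, Total} = scan(Fd, {0, 0}),
    ok = file:close(Fd),
    %% "/" always returns a float in Erlang, so no 1.0* is needed.
    Total / Sequences.

scan(Fd, Acc) ->
    case file:read_line(Fd) of
        eof -> Acc;
        {ok, Line} -> scan(Fd, count(Line, Acc))
    end.

%% Mirrors line/2 from the question, but on binaries.
count(<<$>, _/binary>>, {Sequences, Total}) ->
    {Sequences + 1, Total};
count(Line, {Sequences, Total}) ->
    {Sequences, Total + chomped_size(Line)}.

%% Size without copying: subtract one only if the line ends in a newline.
chomped_size(<<>>) -> 0;
chomped_size(Line) ->
    case binary:last(Line) of
        $\n -> byte_size(Line) - 1;
        _ -> byte_size(Line)
    end.
```

Checking the last byte rather than always subtracting one keeps the count right when the final line of the file has no trailing newline.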
The call string:len(string:strip(L)) traverses the list at least twice (I'm unaware of the string:strip implementation). Instead you could write a simple function to count the line length without the spaces:
The same method can be applied to binaries as well.
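One way such a single-pass counter might be written, for both lists and binaries, is sketched here. The module name line_len and the function len/1 are hypothetical, and treating space, tab, carriage return and newline as the characters to skip is an assumption about what counts as irrelevant.

```erlang
-module(line_len).
-export([len/1]).

%% Count the non-whitespace characters of a line in one traversal,
%% without building a stripped copy first.
len(Line) when is_list(Line) -> len_list(Line, 0);
len(Line) when is_binary(Line) -> len_bin(Line, 0).

len_list([], N) -> N;
len_list([C | T], N) when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    len_list(T, N);
len_list([_ | T], N) ->
    len_list(T, N + 1).

len_bin(<<>>, N) -> N;
len_bin(<<C, T/binary>>, N) when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    len_bin(T, N);
len_bin(<<_, T/binary>>, N) ->
    len_bin(T, N + 1).
```

Both helpers are tail recursive, so they run in constant stack space regardless of line length.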
Did you try Elixir (elixir-lang.org), which runs on top of Erlang and has a syntax similar to Ruby? Elixir solves String problems in the following way:
I just wonder whether Elixir would be faster.