"Mean length of the sequences in a fasta file": can you improve this Erlang code?

Posted on 2024-09-10 17:12:50

I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)

I tried to answer this question using the following Erlang code:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case io:get_line(S,'') of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p\n",[Total/(1.0*Sequences)]),
    halt().

Compilation/Execution:

erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16

This code seems to work fine for a small fasta file, but it takes hours to parse a larger one (>100 MB). Why? I'm an Erlang newbie; can you please improve this code?

離人涙 2024-09-17 17:12:50

If you need really fast IO, then you have to do a little bit more trickery than usual.

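-module(g).
-export([s/0]).

%% Read stdin (fd 0) through a port, one line per message, as binaries.
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().

%% C counts header lines, L accumulates the bytes of the sequence lines.
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->   % header line starting with ">"
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->                 % sequence line, delivered without its newline
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->                      % port closed at end of input
      io:format("~p~n",[L/C])
  end.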

This is the fastest IO I know of, but note that it needs -noshell -noinput.
Compile as you would with erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:

erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl

and run:

time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s

With -smp enabled and native code it takes:

$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m5.103s
user    0m4.944s
sys     0m0.112s

Byte code with -smp disable (almost on par with native, because most of the work is done in the port!):

$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.565s
user    0m3.436s
sys     0m0.104s

Just for completeness, byte code with SMP:

$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m5.433s
user    0m5.236s
sys     0m0.128s

For comparison, sarnold's version gives me the wrong answer and takes longer on the same hardware:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776

real    0m17.569s
user    0m16.749s
sys     0m0.664s

EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little bit surprised. It has 3824397 lines and is 232MB. This means that the -smp disable version can handle 1.18 million text lines per second (71MB/s in line-oriented IO).
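
The throughput figures in the EDIT above can be re-derived from the numbers quoted in this answer. This is only a small arithmetic sketch. The module name `throughput` is made up for illustration, and it assumes the 3.241 s wall-clock time of the -smp disable native run.

```erlang
-module(throughput).
-export([rates/0]).

%% Figures quoted in the answer: 3824397 lines, 232 MB, 3.241 s wall time.
%% Returns {LinesPerSecond, MegabytesPerSecond}.
rates() ->
    Lines = 3824397,
    Megabytes = 232,
    Seconds = 3.241,
    {Lines / Seconds, Megabytes / Seconds}.
```

Calling `throughput:rates()` gives roughly 1.18 million lines per second and a little over 71 MB per second, in line with the figures stated in the EDIT.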

动次打次papapa 2024-09-17 17:12:50

I too am learning Erlang, thanks for the fun question.

I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.

Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)

I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".

EDIT

Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case file:read_line(S) of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            {ok, Line} -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
    case F of
    { ok, File } -> 
        {Sequences,Total}=scanLines(File,0,0),
        io:format("~p\n",[Total/(1.0*Sequences)]);
    { error, Reason } ->
            io:format("~p~n", [Reason])
    end,
    halt().
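
As a possible answer to the "binaries instead of lists" remark above, here is a minimal sketch that adds the `binary` mode to the same `file:open/2` call, so that `file:read_line/1` returns binaries. The module name `golf_bin`, the function names and the path argument are illustrative and not from the original code. It assumes LF line endings and at least one header line, and it lets read errors crash rather than handling them.

```erlang
-module(golf_bin).
-export([mean_length/1]).

%% Mean sequence length of the fasta file at Path (Path is supplied by the caller).
mean_length(Path) ->
    {ok, File} = file:open(Path, [read, raw, binary, read_ahead]),
    {Sequences, Total} = scan(File, 0, 0),
    ok = file:close(File),
    Total / Sequences.

scan(File, Sequences, Total) ->
    case file:read_line(File) of
        {ok, <<$>, _/binary>>} ->          % header line: one more sequence
            scan(File, Sequences + 1, Total);
        {ok, Line} ->                      % sequence line: add its length
            scan(File, Sequences, Total + residues(Line));
        eof ->
            {Sequences, Total}
    end.

%% Size of a line without its trailing "\n", if it has one.
residues(Line) ->
    Size = byte_size(Line),
    case Size > 0 andalso binary:last(Line) =:= $\n of
        true  -> Size - 1;
        false -> Size
    end.
```

`byte_size/1` is a constant-time operation on a binary, so no per-character traversal is needed. Whether this beats the list version on a given machine is something to measure rather than assume.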

洒一地阳光 2024-09-17 17:12:50

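It looks like your big performance problem has been solved by opening the file in raw mode, but here are some more thoughts if you need to optimise that code further.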

Learn and use fprof.
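
For readers who have not used fprof before, a minimal profiling session might look like the sketch below. The module `prof_sketch`, the workload function `count/1` and the output file name are made up for illustration. The fprof calls themselves (`fprof:apply/2`, `fprof:profile/0`, `fprof:analyse/1`) come from the `tools` application shipped with Erlang/OTP.

```erlang
-module(prof_sketch).
-export([run/0, count/1]).

%% Hypothetical workload: sum the lengths of many list-based lines.
count(Lines) ->
    lists:sum([length(L) || L <- Lines]).

run() ->
    Lines = lists:duplicate(1000, "ATGACTAGCTAGCAGCGATCGACCGTCGTACGC\n"),
    fprof:apply(fun ?MODULE:count/1, [Lines]),        % run the call under trace
    fprof:profile(),                                  % turn the trace into profile data
    fprof:analyse([{dest, "prof_sketch.analysis"}]).  % write the report to a file
```

The report lists, per function, how often it was called and how much time it accounted for, which is usually enough to see whether the time goes into IO or into list handling.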

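You're using string:strip/1 primarily to remove the trailing newline. As Erlang values are immutable, you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count; otherwise I'd try writing a length function that counts the number of relevant characters and ignores the irrelevant ones.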
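
One detail worth checking here: called with a single argument, string:strip/1 removes only leading and trailing spaces, so the "\n" that io:get_line/2 and file:read_line/1 keep at the end of each line is still counted. That is one plausible reason why the list-based version in this thread reports a larger mean than the port-based one. A sketch of the "subtract one" variant follows, assuming every line ends in a single "\n". The module name `line_len` is illustrative.

```erlang
-module(line_len).
-export([line/2]).

%% Header line: count one more sequence.
line(">" ++ _Title, {Sequences, Total}) ->
    {Sequences + 1, Total};
%% Sequence line: add its length minus the trailing newline, without copying the list.
line(Line, {Sequences, Total}) ->
    {Sequences, Total + length(Line) - 1}.
```

This keeps the `{Sequences, Total}` accumulator shape of the original `line/2`, so it could be dropped into the existing `scanLines/3` loop.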

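I'm wary of advice that says binaries are better than lists, but given how little processing you do, it's probably the case here. The first steps are to open the file in binary mode and use erlang:size/1 to find the length.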

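It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division. Erlang division works correctly.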
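
To illustrate the point about division: in Erlang the `/` operator always returns a float, even for two integers, while integer division is spelled `div`. A tiny sketch, with an illustrative module name:

```erlang
-module(div_demo).
-export([mean/2, whole/2]).

%% "/" yields a float even when both operands are integers.
mean(Total, Sequences) -> Total / Sequences.

%% "div" is the integer (truncating) division.
whole(Total, Sequences) -> Total div Sequences.
```

For example, `div_demo:mean(7, 2)` is 3.5 and `div_demo:whole(7, 2)` is 3.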

嘦怹 2024-09-17 17:12:50

The call string:len(string:strip(L)) traverses the list at least twice (I'm not familiar with the string:strip implementation). Instead, you could write a simple function that counts the line length without the spaces:

stripped_len(L) ->
  stripped_len(L, 0).

stripped_len([$ |L], Len) ->
  stripped_len(L, Len);

stripped_len([_C|L], Len) ->
  stripped_len(L, Len + 1);

stripped_len([], Len) ->
  Len.

The same method can be applied to binaries as well.
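
A sketch of what the binary variant could look like, assuming the lines are read as binaries. The module name `stripped_bin` is illustrative. Like the list version above it skips only spaces, so a trailing newline would still be counted.

```erlang
-module(stripped_bin).
-export([stripped_len/1]).

stripped_len(Bin) when is_binary(Bin) ->
    stripped_len(Bin, 0).

%% Skip spaces, count every other byte.
stripped_len(<<$\s, Rest/binary>>, Len) ->
    stripped_len(Rest, Len);
stripped_len(<<_, Rest/binary>>, Len) ->
    stripped_len(Rest, Len + 1);
stripped_len(<<>>, Len) ->
    Len.
```

For example, `stripped_bin:stripped_len(<<"A C G">>)` returns 3.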

你怎么这么可爱啊 2024-09-17 17:12:50

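Have you tried Elixir (elixir-lang.org)? It runs on top of Erlang and has a syntax similar to Ruby. Elixir solves the string problem in the following way:

"Elixir strings are UTF8 binaries, with all the raw speed and memory savings that brings. Elixir has a String module with Unicode functionality built in and is a great example of writing code that writes code. String.Unicode reads various Unicode database dumps such as UnicodeData.txt to dynamically generate Unicode functions for the String module built straight from that data!" (http://devintorr.es/blog/2013/01/22/the-excitement-of-elixir/)

Just wondering if Elixir would be faster?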
