Erlang 中 io:fread 的意外行为

发布于 2024-07-12 12:39:07 字数 1942 浏览 5 评论 0原文

这是一个 Erlang 问题。

我遇到了 io:fread 的一些意外行为。

我想知道是否有人可以检查我使用 io:fread 的方式是否有问题或者 io:fread 是否存在错误。

我有一个文本文件，其中包含一个“数字三角形”，如下所示：

59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...

每对数字之间有一个空格，每行以回车换行符对结尾。

我使用以下 Erlang 程序将该文件读入列表中。

-module(euler67).
-author('Cayle Spandon').

-export([solve/0]).

solve() ->
    {ok, File} = file:open("triangle.txt", [read]),
    Data = read_file(File),
    ok = file:close(File),
    Data.

read_file(File) ->
    read_file(File, []).

read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} -> 
            read_file(File, [N | Data]);
        eof ->
            lists:reverse(Data)
    end.

该程序的输出是：

([email protected])30> euler67:solve().
[59,73,41,52,40,9,26,53,6,3410,51,87,86,8161,95,66,57,25,
 6890,81,80,38,92,67,7330,28,51,76,81|...]

请注意第四行的最后一个数字 (34) 和第五行的第一个数字 (10) 是如何合并为单个数字 3410 的。

当我使用“od”转储文本文件时这些行没有什么特别的；它们以 cr-nl 结尾，就像任何其他行一样：

> od -t a triangle.txt
0000000    5   9  cr  nl   7   3  sp   4   1  cr  nl   5   2  sp   4   0
0000020   sp   0   9  cr  nl   2   6  sp   5   3  sp   0   6  sp   3   4
0000040   cr  nl   1   0  sp   5   1  sp   8   7  sp   8   6  sp   8   1
0000060   cr  nl   6   1  sp   9   5  sp   6   6  sp   5   7  sp   2   5
0000100   sp   6   8  cr  nl   9   0  sp   8   1  sp   8   0  sp   3   8
0000120   sp   9   2  sp   6   7  sp   7   3  cr  nl   3   0  sp   2   8
0000140   sp   5   1  sp   7   6  sp   8   1  sp   1   8  sp   7   5  sp
0000160    4   4  cr  nl   8   4  sp   1   4  sp   9   5  sp   8   7  sp

一个有趣的观察是，出现问题的某些数字恰好位于文本文件中的 16 字节边界上（但不是全部，例如 6890）。

原文

This is an Erlang question.

I have run into some unexpected behavior by io:fread.

I was wondering if someone could check whether there is something wrong with the way I use io:fread or whether there is a bug in io:fread.

I have a text file which contains a "triangle of numbers"as follows:

59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...

There is a single space between each pair of numbers and each line ends with a carriage-return new-line pair.

I use the following Erlang program to read this file into a list.

-module(euler67).
-author('Cayle Spandon').

-export([solve/0]).

solve() ->
    {ok, File} = file:open("triangle.txt", [read]),
    Data = read_file(File),
    ok = file:close(File),
    Data.

read_file(File) ->
    read_file(File, []).

read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} -> 
            read_file(File, [N | Data]);
        eof ->
            lists:reverse(Data)
    end.

The output of this program is:

([email protected])30> euler67:solve().
[59,73,41,52,40,9,26,53,6,3410,51,87,86,8161,95,66,57,25,
 6890,81,80,38,92,67,7330,28,51,76,81|...]

Note how the last number of the fourth line (34) and the first number of the fifth line (10) have been merged into a single number 3410.

When I dump the text file using "od" there is nothing special about those lines; they end with cr-nl just like any other line:

> od -t a triangle.txt
0000000    5   9  cr  nl   7   3  sp   4   1  cr  nl   5   2  sp   4   0
0000020   sp   0   9  cr  nl   2   6  sp   5   3  sp   0   6  sp   3   4
0000040   cr  nl   1   0  sp   5   1  sp   8   7  sp   8   6  sp   8   1
0000060   cr  nl   6   1  sp   9   5  sp   6   6  sp   5   7  sp   2   5
0000100   sp   6   8  cr  nl   9   0  sp   8   1  sp   8   0  sp   3   8
0000120   sp   9   2  sp   6   7  sp   7   3  cr  nl   3   0  sp   2   8
0000140   sp   5   1  sp   7   6  sp   8   1  sp   1   8  sp   7   5  sp
0000160    4   4  cr  nl   8   4  sp   1   4  sp   9   5  sp   8   7  sp

One interesting observation is that some of the numbers for which the problem occurs happen to be on 16-byte boundary in the text file (but not all, for example 6890).

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

青萝楚歌 2024-07-19 12:39:07

我认为它也是 Erlang 中的一个错误，而且是一个奇怪的错误。将格式字符串更改为“~2s”会产生同样奇怪的结果：

["59","73","4","15","2","40","0","92","6","53","0","6","34",
 "10","5","1","87","8","6","81","61","9","5","66","5","7",
 "25","6",
 [...]|...]

因此，它似乎将换行符作为常规字符进行计数以进行计数，但在生成输出时则不然。简直太疯狂了。

经过一周的 Erlang 编程，我已经开始深入研究源代码了。这对我来说可能是一个新记录...

编辑

更多的调查已证实这是一个错误。调用 fread 中使用的内部方法之一：

> io_lib_fread:fread([], "12 13\n14 15 16\n17 18 19 20\n", "~d").           
{done,{ok,"\f"}," 1314 15 16\n17 18 19 20\n"}

基本上，如果要读取多个值，则换行符，第一个换行符会在字符串的“仍待读取”部分中被吃掉。其他测试表明，如果您在前面添加一个空格，那就没问题，如果您用换行符引导字符串，则它会要求更多。

我要深入了解这一点，天哪...（咧嘴笑）没有那么多代码需要完成，而且没有太多专门处理换行符的内容，所以应该不会花太长时间缩小范围并修复它。

编辑^2

哈哈！得到了小坏蛋。

这是您想要的 stdlib 补丁（记住重新编译并将新的 Beam 文件放在旧文件的顶部）：

--- ../erlang/erlang-12.b.3-dfsg/lib/stdlib/src/io_lib_fread.erl
+++ ./io_lib_fread.erl
@@ -35,9 +35,9 @@
     fread_collect(MoreChars, [], Rest, RestFormat, N, Inputs).

 fread_collect([$\r|More], Stack, Rest, RestFormat, N, Inputs) ->
-    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\r|More]);
 fread_collect([$\n|More], Stack, Rest, RestFormat, N, Inputs) ->
-    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\n|More]);
 fread_collect([C|More], Stack, Rest, RestFormat, N, Inputs) ->
     fread_collect(More, [C|Stack], Rest, RestFormat, N, Inputs);
 fread_collect([], Stack, Rest, RestFormat, N, Inputs) ->
@@ -55,8 +55,8 @@
                eof ->
                    fread(RestFormat,eof,N,Inputs,eof);
                _ ->
-                   %% Don't forget to count the newline.
-                   {more,{More,RestFormat,N+1,Inputs}}
+                   %% Don't forget to strip and count the newline.
+                   {more,{tl(More),RestFormat,N+1,Inputs}}
            end;
        Other ->                                %An error has occurred
            {done,Other,More}

现在将我的补丁提交到 erlang-patches，并获得由此产生的名声和荣耀......

I'm going to go with it being a bug in Erlang, too, and a weird one. Changing the format string to "~2s" gives equally weird results:

["59","73","4","15","2","40","0","92","6","53","0","6","34",
 "10","5","1","87","8","6","81","61","9","5","66","5","7",
 "25","6",
 [...]|...]

So it appears that it's counting a newline character as a regular character for the purposes of counting, but not when it comes to producing the output. Loopy as all hell.

A week of Erlang programming, and I'm already delving into the source. That might be a new record for me...

EDIT

A bit more investigation has confirmed for me that this is a bug. Calling one of the internal methods that's used in fread:

> io_lib_fread:fread([], "12 13\n14 15 16\n17 18 19 20\n", "~d").           
{done,{ok,"\f"}," 1314 15 16\n17 18 19 20\n"}

Basically, if there's multiple values to be read, then a newline, the first newline gets eaten in the "still to be read" part of the string. Other testing suggests that if you prepend a space it's OK, and if you lead the string with a newline it asks for more.

I'm going to get to the bottom of this, gosh-darn-it... (grin) There's not that much code to go through, and not much of it deals specifically with newlines, so it shouldn't take too long to narrow it down and fix it.

EDIT^2

HA HA! Got the little blighter.

Here's the patch to the stdlib that you want (remember to recompile and drop the new beam file over the top of the old one):

--- ../erlang/erlang-12.b.3-dfsg/lib/stdlib/src/io_lib_fread.erl
+++ ./io_lib_fread.erl
@@ -35,9 +35,9 @@
     fread_collect(MoreChars, [], Rest, RestFormat, N, Inputs).

 fread_collect([$\r|More], Stack, Rest, RestFormat, N, Inputs) ->
-    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\r|More]);
 fread_collect([$\n|More], Stack, Rest, RestFormat, N, Inputs) ->
-    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+    fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\n|More]);
 fread_collect([C|More], Stack, Rest, RestFormat, N, Inputs) ->
     fread_collect(More, [C|Stack], Rest, RestFormat, N, Inputs);
 fread_collect([], Stack, Rest, RestFormat, N, Inputs) ->
@@ -55,8 +55,8 @@
                eof ->
                    fread(RestFormat,eof,N,Inputs,eof);
                _ ->
-                   %% Don't forget to count the newline.
-                   {more,{More,RestFormat,N+1,Inputs}}
+                   %% Don't forget to strip and count the newline.
+                   {more,{tl(More),RestFormat,N+1,Inputs}}
            end;
        Other ->                                %An error has occurred
            {done,Other,More}

Now to submit my patch to erlang-patches, and reap the resulting fame and glory...

回复收藏 0 原文

旧瑾黎汐 2024-07-19 12:39:07

除了这似乎是 erlang 库之一中的错误这一事实之外，我认为您可以（非常）轻松地规避该问题。

鉴于您的文件是面向行的，我认为最佳实践是您也逐行处理它。

考虑以下构造。它在未打补丁的 erlang 上工作得很好，并且因为它使用惰性求值，所以它可以处理任意长度的文件，而无需先将其全部读入内存。该模块包含一个应用于每一行的函数示例 - 将一行整数文本表示转换为整数列表。


-module(liner).
-author("Harro Verkouter").
-export([liner/2, integerize/0, lazyfile/1]).

% Applies a function to all lines of the file
% before reducing (foldl).
liner(File, Fun) ->
    lists:foldl(fun(X, Acc) -> Acc++Fun(X) end, [], lazyfile(File)).

% Reads the lines of a file in a lazy fashion
lazyfile(File) ->
    {ok, Fd} = file:open(File, [read]),
    lazylines(Fd).
% Actually, this one does the lazy read ;)
lazylines(Fd) ->
    case io:get_line(Fd, "") of
        eof -> file:close(Fd), [];
        {error, Reason} ->
            file:close(Fd), exit(Reason);
        L ->
            [L|lazylines(Fd)]
    end.

% Take a line of space separated integers (string) and transform
% them into a list of integers
integerize() ->
    fun(X) ->
        lists:map(fun(Y) -> list_to_integer(Y) end,
                string:tokens(X, " \n")) end.


Example usage:
Eshell V5.6.5  (abort with ^G)
1> c(liner).
{ok,liner}
2> liner:liner("triangle.txt", liner:integerize()).
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
 68,90,81,80,38,92,67,73,30|...]

And as a bonus, you can easily fold over the lines of any (lineoriented) file w/o running out of memory :)

6> lists:foldl( fun(X, Acc) -> 
6>                  io:format("~.2w: ~s", [Acc,X]), Acc+1
6>                  end,
6>              1,  
6>              liner:lazyfile("triangle.txt")).                                        
 1: 59
 2: 73 41
 3: 52 40 09
 4: 26 53 06 34
 5: 10 51 87 86 81
 6: 61 95 66 57 25 68
 7: 90 81 80 38 92 67 73
 8: 30 28 51 76 81 18 75 44

干杯，
H。

Besides the fact that it seems to be a bug in one of the erlang libs I think you could (very) easily circumvent the problem.

Given the fact your file is line-oriented I think best practice is that you process it line-by-line as well.

Consider the following construction. It works nicely on an unpatched erlang and because it uses lazy evaluation it can handle files of arbitrary length without having to read all of it into memory first. The module contains an example of a function to apply to each line - turning a line of text-representations of integers into a list of integers.


-module(liner).
-author("Harro Verkouter").
-export([liner/2, integerize/0, lazyfile/1]).

% Applies a function to all lines of the file
% before reducing (foldl).
liner(File, Fun) ->
    lists:foldl(fun(X, Acc) -> Acc++Fun(X) end, [], lazyfile(File)).

% Reads the lines of a file in a lazy fashion
lazyfile(File) ->
    {ok, Fd} = file:open(File, [read]),
    lazylines(Fd).
% Actually, this one does the lazy read ;)
lazylines(Fd) ->
    case io:get_line(Fd, "") of
        eof -> file:close(Fd), [];
        {error, Reason} ->
            file:close(Fd), exit(Reason);
        L ->
            [L|lazylines(Fd)]
    end.

% Take a line of space separated integers (string) and transform
% them into a list of integers
integerize() ->
    fun(X) ->
        lists:map(fun(Y) -> list_to_integer(Y) end,
                string:tokens(X, " \n")) end.


Example usage:
Eshell V5.6.5  (abort with ^G)
1> c(liner).
{ok,liner}
2> liner:liner("triangle.txt", liner:integerize()).
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
 68,90,81,80,38,92,67,73,30|...]

And as a bonus, you can easily fold over the lines of any (lineoriented) file w/o running out of memory :)

6> lists:foldl( fun(X, Acc) -> 
6>                  io:format("~.2w: ~s", [Acc,X]), Acc+1
6>                  end,
6>              1,  
6>              liner:lazyfile("triangle.txt")).                                        
 1: 59
 2: 73 41
 3: 52 40 09
 4: 26 53 06 34
 5: 10 51 87 86 81
 6: 61 95 66 57 25 68
 7: 90 81 80 38 92 67 73
 8: 30 28 51 76 81 18 75 44

Cheers,
h.

回复收藏 0 原文

喵星人汪星人 2024-07-19 12:39:07

我注意到有多个实例合并了两个数字，并且它似乎位于从第四行及以后开始的每行的行边界处。

我发现，如果您在从第五行开始的每一行的开头添加一个空格字符，即：

59
73 41
52 40 09
26 53 06 34
 10 51 87 86 81
 61 95 66 57 25 68
 90 81 80 38 92 67 73
 30 28 51 76 81 18 75 44
...

数字会被正确解析：

39> euler67:solve().
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
 68,90,81,80,38,92,67,73,30|...]

如果您也将空格添加到前四行的开头，它也会起作用。

它更像是一种解决方法，而不是实际的解决方案，但它确实有效。我想弄清楚如何设置 io:fread 的格式字符串，这样我们就不必这样做。

更新
这是一个不会强迫您更改文件的解决方法。假设所有数字都是两个字符（< 100）：

read_file(File, Data) ->
case io:fread(File, "", "~d") of
    {ok, [N] } -> 
        if
            N > 100 ->
                First = N div 100,
                Second = N - (First * 100),
                read_file(File, [First , Second | Data]);

            true ->
                read_file(File, [N | Data])
        end;
    eof ->
        lists:reverse(Data)
end.

基本上，代码捕获任何通过换行符连接的两个数字并将它们分成两个。

同样，这是一个暗示 io:fread 中可能存在错误的拼凑，但这应该可以解决。

再次更新 上面的内容仅适用于两位数输入，但由于该示例将所有数字（甚至小于 10 的数字）打包为两位数格式，因此这适用于本示例。

I noticed that there are multiple instances where two numbers are merged, and it appears to be at the line boundaries on every line starting at the fourth line and beyond.

I found that if you add a whitespace character to the beginning of every line starting at the fifth, that is:

59
73 41
52 40 09
26 53 06 34
 10 51 87 86 81
 61 95 66 57 25 68
 90 81 80 38 92 67 73
 30 28 51 76 81 18 75 44
...

The numbers get parsed properly:

39> euler67:solve().
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
 68,90,81,80,38,92,67,73,30|...]

It also works if you add the whitespace to the beginning of the first four lines, as well.

It's more of a workaround than an actual solution, but it works. I'd like to figure out how to set up the format string for io:fread such that we wouldn't have to do this.

UPDATE
Here's a workaround that won't force you to change the file. This assumes that all digits are two characters (< 100):

read_file(File, Data) ->
case io:fread(File, "", "~d") of
    {ok, [N] } -> 
        if
            N > 100 ->
                First = N div 100,
                Second = N - (First * 100),
                read_file(File, [First , Second | Data]);

            true ->
                read_file(File, [N | Data])
        end;
    eof ->
        lists:reverse(Data)
end.

Basically, the code catches any of the numbers which are the concatenation of two across a newline and splits them into two.

Again, it's a kludge that implies a possible bug in io:fread, but that should do it.

UPDATE AGAIN The above will only work for two-digit inputs, but since the example packs all digits (even those < 10) into a two-digit format, that will work for this example.

回复收藏 0 原文

~没有更多了~