Erlang word counter
I'm attempting to make a multi-threaded(?) program in Erlang that:
- Reads in a large file (600mb)
- Fires messages to a group of created processes that contain the lines read from the file
- The created processes process the word and store in a hashtree
- created process then fires the hashtree back to the master
- master prints the tree.
So far, I think I have the first three done... I can't figure out how to test my tree by printing out each key-hash pair each time they're inserted.
Master thread:
-module(lineread).
-export([read/1]).
read(Wordlength) ->
    {ok, Input} = file:open("/home/ml/datasets/tweets/first60kTweets.txt", [read]),
    Threads = makeThreads(Wordlength),
    read_lines(Input, Threads).

read_lines(Input, Threads) ->
    case file:read_line(Input) of
        eof ->
            file:close(Input);
        {ok, Line} ->
            send_to_all(Threads, Line),
            read_lines(Input, Threads)
    end.

send_to_all(Threads, Line) ->
    lists:foreach(fun(Pid) -> Pid ! {line, Line} end, Threads).

makeThreads(NumThreads) ->
    [spawn(counter, run, [N]) || N <- lists:seq(1, NumThreads)].
Other thread:
-module(counter).
-export([run/1]).
%% entry point for the code
run(K) ->
    loop(K, gb_trees:empty()).

%% loops for a received message
loop(Size, Tree) ->
    receive
        {line, Line} ->
            %%io:format("~p~n", [Line]),
            Splits = re:split(Line, " "),
            NewTree = build_tree(Splits, Tree),
            loop(Size, NewTree);
        {char, Char} ->
            io:format("~p", [Char])
            %%loop(Size, )
    end.

%% puts the data into a Tree, printing each key/count pair as it goes
build_tree([], Tree) ->
    Tree;
build_tree([W | R], Tree) ->
    case gb_trees:is_defined(W, Tree) of
        true ->
            I = gb_trees:get(W, Tree),
            NewTree = gb_trees:update(W, I + 1, Tree),
            %% io:format(I) crashes: the first argument must be a format
            %% string, so print the key/count pair instead.
            io:format("~p: ~p~n", [W, I + 1]),
            build_tree(R, NewTree);
        false ->
            NewTree = gb_trees:insert(W, 1, Tree),
            io:format("~p: 1~n", [W]),
            build_tree(R, NewTree)
    end.
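For step 4 (sending the finished tree back to the master), the worker's loop could handle one more message. `{get_tree, From}` is a name I am inventing for illustration, not anything from the original code:

```erlang
%% Hypothetical variant of loop/2: on {get_tree, From}, reply with
%% the current tree so the master can print it, then keep looping.
loop(Size, Tree) ->
    receive
        {line, Line} ->
            NewTree = build_tree(re:split(Line, " "), Tree),
            loop(Size, NewTree);
        {get_tree, From} ->
            From ! {tree, self(), Tree},
            loop(Size, Tree)
    end.
```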
Thanks for your help.
I see a lot of things wrong with your program, but to answer your question, you need to learn how to use receive in order to have your processes talk to each other. I'd recommend reading the first few chapters here - specifically this one. That being said, here are my comments about the code: Size is never used, so all processes are the same.
I'm not sure if I'm quite following, but I guess you are trying to get the word count.
You can add a new message to your loop that looks like

    {print} ->
        io:format("~p~n", [gb_trees:to_list(Tree)]),
        io:format("Words ~p~n", [gb_trees:size(Tree)]),
        loop(Size, Tree);

(note that the format arguments must be wrapped in a list, and that NewTree is not bound in this clause, so it loops with Tree).
Also, I do not understand why you send each line to all processes. Is this just for practice, or are you trying to build something that can count words fast? This will not.
A few suggestions:
You can also use string:tokens/2 to split words.
http://www.erlang.org/doc/man/string.html
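For example, in the shell (keep in mind string:tokens/2 works on lists/strings, while re:split/2 as used in the question returns binaries):

```erlang
1> string:tokens("the quick brown fox", " ").
["the","quick","brown","fox"]
```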
I would use the word process instead of thread, since Erlang claims it creates lightweight processes rather than threads:
http://www.erlang.org/doc/efficiency_guide/processes.html
The biggest question is: what is the purpose of this exercise? Is it for real work or only some sort of learning? If it is supposed to be a real product, you are not starting it well. Your first question should be where the data will come from, and then you should try to identify where the bottleneck will be. In your case you are talking about 600MB. That is not so much data that you cannot keep it in memory. If you have to read it from some storage, your solution will strongly depend on that storage, because from the CPU's point of view 600MB of words can be about 60Mwords if you expect an average word size of 10B, or 100Mwords if you expect an average of 6B.

How fast can you count 100Mwords? If you use a k/v solution based on something really fast, you can do it in 1s on one commodity CPU core. Sounds crazy? No, it is easier than it looks at first sight. If you base your solution on something like a Judy array or your own handcrafted trie implementation, you can achieve a result like this relatively easily if you are not scared of coding in C.

So how fast can you feed this beast from data storage? If you use commodity disks, you can read the data in 6s! If you tune things you can do better, but doing it in a time comparable to the processing will be a challenge.

Why are you trying to make it parallel? The above is true only if you really have to read the data from storage. It is not true, for example, when you are reading data that was stored by some process just before. In that case you can access the data much faster because it will very likely be in caches, and everything behaves in a very different way: for example, you can access the data randomly. The reading timings above can be achieved only by sequential reading from classical disks. SSDs behave differently, but the overall data throughput is not an order of magnitude better. So your effort at parallelism is only worth it if you can feed your CPUs work fast enough, and then the biggest issue will be how to feed in the data and partition it.
For inspiration you can look at Tim Bray's Wide Finder Project. But keep in mind that the first Wide Finder project was not measured with flushed IO caches, so it was a case where data could be accessed randomly from caches. Even in this case you can use Erlang, but don't forget that one of its strengths is interfacing with other languages, especially C, and I will point you at NIFs.
If the purpose is only practicing Erlang, my question is: why trees? Why not ets:update_counter/3? If you think ETS is not fast enough, try the process dictionary instead of gb_trees. It is not as dirty when you use it in a process under your own control. Consider reading the whole file into memory, but work with binaries, use the binary module intensively, and so on...
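A minimal sketch of the ets:update_counter/3 idea; the table name and the count_word/2 helper are hypothetical names of mine, not anything from the question:

```erlang
%% Count words in an ETS set instead of a gb_tree.
start() ->
    Tab = ets:new(words, [set]),
    count_word(Tab, <<"hello">>),
    count_word(Tab, <<"hello">>),
    io:format("~p~n", [ets:tab2list(Tab)]).  %% prints [{<<"hello">>,2}]

%% Insert the word with count 1 the first time, bump the counter after.
count_word(Tab, Word) ->
    case ets:member(Tab, Word) of
        true  -> ets:update_counter(Tab, Word, 1);
        false -> ets:insert(Tab, {Word, 1})
    end.
```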