Erlang:这个 trie 实现最错误的地方是什么?

发布于 2024-08-16 06:35:06 字数 960 浏览 6 评论 0 原文

假期里,我的家人喜欢玩Boggle。问题是,我的Boggle 很糟糕。所以我做了任何优秀程序员都会做的事情:编写一个程序来给我玩。

该算法的核心是一个简单的 前缀 trie,其中每个节点都是一个 dict 对下一个字母的引用。

这是 trie:add 实现:

add([], Trie) ->
    dict:store(stop, true, Trie);

add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val }
    %         error -> { dict:new(), Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).

您可以在此处看到其余部分,以及如何使用它的示例(在底部):

http://gist.github.com/263513

现在,这是我在 Erlang 中的第一个严肃的程序,我我知道它可能有很多问题……但我最关心的是它使用800 MB RAM

那么,我做的最错的是什么?我怎样才能让它少一点错误呢?

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.

At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.

This is the trie:add implementation:

add([], Trie) ->
    dict:store(stop, true, Trie);

add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val }
    %         error -> { dict:new(), Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).

And you can see the rest, along with an example of how it's used (at the bottom), here:

http://gist.github.com/263513

Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.

So, what am I doing most-wrong? And how might I make it a bit less-wrong?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

新一帅帅 2024-08-23 06:35:06

您可以通过简单地将单词存储在 ets 表中来实现此功能:

% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).  
> ets:insert(words, [{"zebra"}]).    

% check if word exists
> ets:lookup(words, "zed").          
[{"zed"}]

% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]

如果 trie 是必须的,但您可以接受非功能性方法,那么您可以尝试 有向图,正如 Paul 已经建议的那样。

如果您想保持功能,您可以通过使用使用较少内存的结构来节省一些字节的内存,例如 proplist 或记录,例如 -record(node, {a,b,....,x,y,z})

You could implement this functionality by simply storing the words in an ets table:

% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).  
> ets:insert(words, [{"zebra"}]).    

% check if word exists
> ets:lookup(words, "zed").          
[{"zed"}]

% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]

If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.

If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).

若相惜即相离 2024-08-23 06:35:06

我不记得一个字典需要多少内存,但让我们估计一下。您有 2.5e6 个字符和 2e5 个单词。如果你的字典树根本没有共享,那么字典中将需要 2.7e6 个关联(每个字符和每个“停止”符号一个)。一个简单的纯功能字典表示可能每个关联 4 个单词——它可能会更少,但我正在尝试获得一个上限。在 64 位计算机上,这需要 8*4*270 万字节,即 86 兆字节。这只是 800M 的十分之一,所以这里肯定有问题。

更新: dict.erl 表示带有哈希表的字典;当你有很多非常小的命令时,这意味着大量的开销,就像你所做的那样。我会尝试更改您的代码以使用 proplists 模块,它应该与我上面的计算相匹配。

I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.

Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.

沩ん囻菔务 2024-08-23 06:35:06

解决该问题的另一种方法是浏览单词列表,看看是否可以从骰子中构造出该单词。这样你只需要很少的内存,而且编码可能会更有趣。 (优化和并发)

An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)

×眷恋的温暖 2024-08-23 06:35:06

查看DAWG。它们比尝试更紧凑。

Look into DAWGs. They're much more compact than tries.

美人骨 2024-08-23 06:35:06

我不知道你的算法,但如果你存储那么多数据,也许你应该考虑使用 Erlang 的内置有向图库来表示你的 trie,而不是这么多字典。
http://www.erlang.org/doc/man/digraph.html

I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html

梦归所梦 2024-08-23 06:35:06

如果所有单词都是英文,并且不区分大小写,则所有字符都可以用 1 到 26 之间的数字进行编码(事实上,在 Erlang 中,它们是 97 到 122 之间的数字),保留0 为停止。因此,您也可以使用 array 模块

If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文