有效地存储和更新巨大(和稀疏?)多维数组以计算条件概率
只是为了好玩,我想计算一个单词(来自自然语言)出现在文本中的条件概率,具体取决于最后一个单词和倒数第二个单词。即我会采取大量例如英语文本并计算每个组合 n(i|jk) 和 n(jk) 出现的频率(其中 j, k,i 是连续的单词)。
最简单的方法是使用 3 维数组(对于 n(i|jk)),使用单词到 3 维位置的映射。使用 trie 可以有效地完成位置查找(至少这是我最好的猜测),但对于 O(1000) 个单词我会遇到内存限制。但我猜想这个数组只会稀疏地填充,大多数条目为零,因此我会浪费大量内存。所以没有 3-D 数组。
什么数据结构更适合这样的用例,并且仍然可以有效地进行大量小更新,就像我在计算单词出现次数时所做的那样? (也许有一种完全不同的方法来做到这一点?)
(当然我还需要计算n(jk)
,但这很简单,因为它只是二维的:) 我猜选择的语言是 C++。
Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and next to last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk)
and n(jk)
appears (where j,k,i
are sucsessive words).
The naive approach would be to use a 3-D array (for n(i|jk)
), using a mapping of words to position in 3 dimensions. The position look-up could be done efficiently using trie
s (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.
What data structure would be suited better for such a use case and still be efficient to do a lot of small updates like I do them when counting the appearances of the words? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk)
, but that's easy, because it's only 2-D :)
The language of choice is C++ I guess.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
C++ 代码:
字典可以是所有找到的单词的向量,例如:
但为了更好地查找单词->索引,它可以是地图:
当您阅读新单词时。您将其添加到字典中并获取其索引
k
,您已经有了前两个单词的i
和j
索引,因此您只需执行以下操作即可:为了获得更好的性能,您可以只搜索一次二元组:
可以理解吗?您需要更多详细信息吗?
C++ code:
The dictionary could be a vector of all found words like:
but for better lookup word->index it could be a map:
When you read a new word. You add it to the dictionary and get its index
k
, you already havei
andj
indexes of the previous two words so then you just do:For better performance you may search for bigram only once:
Is it understandable? Do you need more details?