Efficiently storing and updating a huge (and sparse?) multidimensional array to compute conditional probabilities

Posted 2024-10-07 05:58:50

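Just for fun, I would like to compute the conditional probability that a word (from a natural language) appears in a text, given the last and the next-to-last word. That is, I would take a large amount of, say, English text and count how often each combination n(i|jk) and n(jk) occurs (where j, k, i are consecutive words).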

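The naive approach would be a 3-D array (for n(i|jk)), with a mapping from words to positions along each dimension. The position lookup could be done efficiently with a trie (at least that is my best guess), but already for O(1000) words I would run into memory limits. I also expect the array to be only sparsely filled, with most entries zero, so I would waste a lot of memory. So no 3-D array.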
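To put a rough number on that memory concern, here is a small sketch. The helper name `dense_bytes` and the 4-byte counter size are assumptions made for illustration; the question does not state a counter type.

```cpp
#include <cstdint>

// Bytes needed for a dense V x V x V array of counters.
// Assumption: each counter is a 4-byte integer unless stated otherwise.
constexpr std::uint64_t dense_bytes(std::uint64_t vocab_size,
                                    std::uint64_t bytes_per_count = 4) {
    return vocab_size * vocab_size * vocab_size * bytes_per_count;
}
```

Under these assumptions `dense_bytes(1000)` is 4,000,000,000 bytes, about 4 GB, for a vocabulary of only 1000 words. A sparse structure, by contrast, needs at most one entry per trigram that actually occurs, and a text of N words contains at most N - 2 trigrams.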

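What data structure would be better suited for this use case while still allowing many small updates to be done efficiently, as happens when counting word occurrences? (Or is there perhaps a completely different way to do this?)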

(Of course I also need to count n(jk), but that is easy, since it is only 2-D :) The language of choice is C++, I guess.

大姐，你呐 2024-10-14 05:58:50

C++ code:

#include <map>

struct bigram_key{
    int i, j; // indexes of the two words in the dictionary

    // constructor, so a key can be built in one expression
    bigram_key(int a_i, int a_j):i(a_i), j(a_j){}

    // keys must be ordered to be used in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};

struct bigram_data{
    int count = 0; // n(ij)
    std::map<int, int> trigram_counts; // n(k|ij) = trigram_counts[k]
};

std::map<bigram_key, bigram_data> trigrams;

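The dictionary could be a vector of all the words found so far, e.g.:

std::vector<std::string> dictionary;

but for faster word -> index lookup it could be a map:

std::map<std::string, int> dictionary;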
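The two dictionary variants can also be kept side by side, so that both directions of the lookup are available. The following is only a sketch: the `Dictionary` type and its `intern` member are hypothetical names, not part of the answer above.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper type: keeps both directions of the word <-> index mapping.
struct Dictionary {
    std::vector<std::string> words;       // index -> word
    std::map<std::string, int> index_of;  // word -> index

    // Return the index of `word`, adding it first if it has not been seen yet.
    int intern(const std::string& word) {
        auto it = index_of.find(word);
        if (it != index_of.end()) return it->second;
        int id = static_cast<int>(words.size());
        words.push_back(word);
        index_of.emplace(word, id);
        return id;
    }
};
```

Calling `intern` on every token as it is read yields the integer indexes that the bigram and trigram counters work with, and `words[id]` turns an index back into its word when results are printed.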

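When you read a new word, you look it up in the dictionary (adding it if it is not there yet) and get its index k. You already have the indexes i and j of the previous two words, so you just do: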

trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;

For better performance you can look up the bigram only once:

bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
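Putting the pieces together, the counting loop and the final probability estimate might look roughly like this. It is a sketch under a few assumptions: the text has already been converted to a vector of word indexes, and the names `trigram_table`, `count_trigrams` and `conditional_probability` are made up for illustration. The two structs are repeated (without the constructor) so the block compiles on its own.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct bigram_key {
    int i, j;  // indexes of the two preceding words
    bool operator<(const bigram_key& o) const {
        return i < o.i || (i == o.i && j < o.j);
    }
};

struct bigram_data {
    int count = 0;                      // n(ij)
    std::map<int, int> trigram_counts;  // n(k|ij) = trigram_counts[k]
};

using trigram_table = std::map<bigram_key, bigram_data>;

// Count every consecutive triple (i, j, k) in a sequence of word indexes.
void count_trigrams(const std::vector<int>& ids, trigram_table& table) {
    for (std::size_t t = 2; t < ids.size(); ++t) {
        bigram_data& bigram = table[bigram_key{ids[t - 2], ids[t - 1]}];
        bigram.count++;
        bigram.trigram_counts[ids[t]]++;
    }
}

// Estimate P(k | i, j) = n(k|ij) / n(ij); returns 0 for anything unseen.
double conditional_probability(const trigram_table& table, int i, int j, int k) {
    auto ctx = table.find(bigram_key{i, j});
    if (ctx == table.end()) return 0.0;
    auto hit = ctx->second.trigram_counts.find(k);
    if (hit == ctx->second.trigram_counts.end()) return 0.0;
    return static_cast<double>(hit->second) / ctx->second.count;
}
```

Note that in this sketch n(ij) only counts bigrams that are followed by a third word, so the very last bigram of the text is not counted. That keeps the probabilities for each context summing to one.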

Does that make sense? Do you need more details?
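As a possible variation, and not what the answer above proposes, the ordered `std::map` (logarithmic lookup) could be swapped for a hash table (`std::unordered_map`, constant-time lookup on average) keyed by a single packed integer. The sketch below assumes every word index fits in 21 bits; `pack_trigram` and `hashed_counts` are hypothetical names.

```cpp
#include <cstdint>
#include <unordered_map>

// Pack three word indexes into one 64-bit key.
// Assumption: every index fits in 21 bits (vocabulary below 2,097,152 words).
inline std::uint64_t pack_trigram(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
    return (static_cast<std::uint64_t>(i) << 42) |
           (static_cast<std::uint64_t>(j) << 21) |
           static_cast<std::uint64_t>(k);
}

struct hashed_counts {
    std::unordered_map<std::uint64_t, std::uint32_t> trigram;  // n(k|ij)
    std::unordered_map<std::uint64_t, std::uint32_t> bigram;   // n(ij)

    // Record one occurrence of word k following the words i, j.
    void add(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
        ++trigram[pack_trigram(i, j, k)];
        ++bigram[pack_trigram(0, i, j)];  // bigram key: i and j in the low 42 bits
    }
};
```

The trade-off is that a hash table has no ordering, so listing all continuations of a given bigram means scanning the whole table, whereas the nested map keeps them grouped together.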

About the author: 初吻给了烟
