有效地存储和更新巨大(和稀疏?)多维数组以计算条件概率

发布于 2024-10-07 05:58:50 字数 428 浏览 0 评论 0原文

只是为了好玩,我想计算一个单词(来自自然语言)出现在文本中的条件概率,具体取决于最后一个单词和倒数第二个单词。即我会采取大量例如英语文本并计算每个组合 n(i|jk) 和 n(jk) 出现的频率(其中 j, k,i 是连续的单词)。

最简单的方法是使用 3 维数组(对于 n(i|jk)),使用单词到 3 维位置的映射。使用 trie 可以有效地完成位置查找(至少这是我最好的猜测),但对于 O(1000) 个单词我会遇到内存限制。但我猜想这个数组只会稀疏地填充,大多数条目为零,因此我会浪费大量内存。所以没有 3-D 数组。

什么数据结构更适合这样的用例,并且仍然可以有效地进行大量小更新,就像我在计算单词出现次数时所做的那样? (也许有一种完全不同的方法来做到这一点?)

(当然我还需要计算n(jk),但这很简单,因为它只是二维的:) 我猜选择的语言是 C++。

Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and next to last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and n(jk) appears (where j,k,i are sucsessive words).

The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to position in 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.

What data structure would be suited better for such a use case and still be efficient to do a lot of small updates like I do them when counting the appearances of the words? (Maybe there is a completely different way of doing this?)

(Of course I also need to count n(jk), but that's easy, because it's only 2-D :)
The language of choice is C++ I guess.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

大姐,你呐 2024-10-14 05:58:50

C++ 代码:

struct bigram_key{
    int i, j;// words - indexes of the words in a dictionary

    // a constructor to be easily constructible
    bigram_key(int a_i, int a_j):i(a_i), j(a_j){}

    // you need to sort keys to be used in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};

struct bigram_data{
    int count;// n(ij)
    map<int, int> trigram_counts;// n(k|ij) = trigram_counts[k]
}

map<bigram_key, bigram_data> trigrams;

字典可以是所有找到的单词的向量,例如:

vector<string> dictionary;

但为了更好地查找单词->索引,它可以是地图:

map<string, int> dictionary;

当您阅读新单词时。您将其添加到字典中并获取其索引 k,您已经有了前两个单词的 ij 索引,因此您只需执行以下操作即可:

trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;

为了获得更好的性能,您可以只搜索一次二元组:

bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;

可以理解吗?您需要更多详细信息吗?

C++ code:

struct bigram_key{
    int i, j;// words - indexes of the words in a dictionary

    // a constructor to be easily constructible
    bigram_key(int a_i, int a_j):i(a_i), j(a_j){}

    // you need to sort keys to be used in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};

struct bigram_data{
    int count;// n(ij)
    map<int, int> trigram_counts;// n(k|ij) = trigram_counts[k]
}

map<bigram_key, bigram_data> trigrams;

The dictionary could be a vector of all found words like:

vector<string> dictionary;

but for better lookup word->index it could be a map:

map<string, int> dictionary;

When you read a new word. You add it to the dictionary and get its index k, you already have i and j indexes of the previous two words so then you just do:

trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;

For better performance you may search for bigram only once:

bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;

Is it understandable? Do you need more details?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文