Calculating pointwise mutual information (PMI) scores for n-grams in Python
I have a large corpus of n-grams and several external n-grams. I want to calculate the PMI score of each external n-gram based on this corpus (the counts).
Are there any tools to do this or can someone provide me with a piece of code in Python that can do this?
The problem is that my n-grams are 2-grams, 3-grams, 4-grams, and 5-grams. So calculating probabilities for 3-grams and longer is really time-consuming.
1 Answer
If I'm understanding your problem correctly, you want to compute things like log { P("x1 x2 x3 x4 x5") / P("x1") P("x2") ... P("x5") } where P measures the probability that any given 5-gram or 1-gram is a given thing (and is basically a ratio of counts, perhaps with Laplace-style offsets). So, make a single pass through your corpus and store counts of (1) each 1-gram, (2) each n-gram (use a dict for the latter), and then for each external n-gram you do a few dict lookups, a bit of arithmetic, and you're done. One pass through the corpus at the start, then a fixed amount of work per external n-gram.
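Here's a minimal sketch of that single-pass approach, assuming the corpus is a plain text file of whitespace-separated tokens. The file name, the function names (`count_corpus`, `pmi`) and the simple add-alpha offset standing in for the "Laplace-style" smoothing are all illustrative choices, not anything fixed above:

```python
import math
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def count_corpus(path, max_n=5):
    """One pass over the corpus: counts for every 1-gram up to max_n-gram."""
    counts = Counter()
    totals = Counter()  # number of n-gram occurrences seen, per order n
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    counts[gram] += 1
                    totals[n] += 1
    return counts, totals

def pmi(gram, counts, totals, alpha=1.0):
    """log P(w1..wn) / (P(w1)...P(wn)), with a crude add-alpha offset
    so unseen grams don't produce log(0)."""
    n = len(gram)
    log_p_joint = math.log((counts[gram] + alpha) / (totals[n] + alpha))
    log_p_indep = sum(
        math.log((counts[(w,)] + alpha) / (totals[1] + alpha)) for w in gram
    )
    return log_p_joint - log_p_indep

# counts, totals = count_corpus("corpus.txt")          # hypothetical file name
# print(pmi(("new", "york", "city"), counts, totals))  # an external 3-gram
```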
(Note: Actually I'm not sure how one defines PMI for more than two random variables; perhaps it's something like log P(a)P(b)P(c)P(abc) / P(ab)P(bc)P(ac). But if it's anything at all along those lines, you can do it the same way: iterate through your corpus counting lots of things, and then all the probabilities you need are simply ratios of the counts, perhaps with Laplace-ish corrections.)
If your corpus is so big that you can't fit the n-gram dict in memory, then divide it into roughly memory-sized chunks, compute an n-gram dict for each chunk and store them on disc in a form that lets you get at any given n-gram's entry reasonably efficiently; then, for each external n-gram, go through the chunks and add up the counts.
What form? Up to you. One simple option: in lexicographic order of the n-gram (note: if you're working with words rather than letters, you may want to begin by turning words into numbers; you'll want a single preliminary pass over your corpus to do this); then finding the n-gram you want is a binary search or something of the kind, which with chunks 1GB in size would mean somewhere on the order of 15-20 seeks per chunk; you could add some extra indexing to reduce this. Or: use a hash table on disc, with Berkeley DB or something; in that case you can forgo the chunking. Or, if the alphabet is small (e.g., these are letter n-grams rather than word n-grams and you're processing plain English text), just store them in a big array, with direct lookup -- but in that case, you can probably fit the whole thing in memory anyway.
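To make the chunked variant concrete, here's a rough sketch that uses the standard-library shelve module as a stand-in for the "hash table on disc, with Berkeley DB or something" option, so no sorting or extra indexing is needed. The chunk size, the ngram_chunk_* file names and the space-joined string keys are assumptions of the sketch, not anything prescribed above:

```python
import shelve
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def build_chunk_db(lines, db_path, max_n=5):
    """Count every 1..max_n-gram in one memory-sized chunk and persist it."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                counts[" ".join(gram)] += 1      # space-joined string as key
    with shelve.open(db_path, flag="n") as db:   # "n": always create a fresh DB
        for key, c in counts.items():
            db[key] = c

def build_all_chunks(corpus_path, lines_per_chunk=1_000_000, max_n=5):
    """Split the corpus into chunks of lines and build one on-disc DB per chunk."""
    db_paths, chunk = [], []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                db_paths.append(f"ngram_chunk_{len(db_paths)}")
                build_chunk_db(chunk, db_paths[-1], max_n)
                chunk = []
    if chunk:
        db_paths.append(f"ngram_chunk_{len(db_paths)}")
        build_chunk_db(chunk, db_paths[-1], max_n)
    return db_paths

def lookup_count(gram, db_paths):
    """Sum one n-gram's count over all chunk databases."""
    key = " ".join(gram)
    total = 0
    for path in db_paths:
        with shelve.open(path, flag="r") as db:  # in practice, keep these open
            total += db.get(key, 0)
    return total
```

With the summed counts in hand, the PMI arithmetic is the same ratio-of-counts calculation as in the in-memory version; only the count lookup changes.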