Calculating pointwise mutual information (PMI) scores for n-grams in Python
I have a large corpus of n-grams and several external n-grams. I want to calculate the PMI score of each external n-gram based on this corpus (the counts).
Are there any tools to do this or can someone provide me with a piece of code in Python that can do this?
The problem is that my n-grams are 2-grams, 3-grams, 4-grams, and 5-grams. So calculating probabilities for 3-grams and longer is really time-consuming.
1 Answer
If I'm understanding your problem correctly, you want to compute things like log { P("x1 x2 x3 x4 x5") / P("x1") P("x2") ... P("x5") } where P measures the probability that any given 5-gram or 1-gram is a given thing (and is basically a ratio of counts, perhaps with Laplace-style offsets). So, make a single pass through your corpus and store counts of (1) each 1-gram, (2) each n-gram (use a dict for the latter), and then for each external n-gram you do a few dict lookups, a bit of arithmetic, and you're done. One pass through the corpus at the start, then a fixed amount of work per external n-gram.
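Here's a minimal sketch of that single-pass approach, assuming the corpus is a plain text file of whitespace-separated tokens. The file name, the function names (`count_corpus`, `pmi`) and the simple add-alpha offset standing in for the "Laplace-style" smoothing are all illustrative choices, not anything fixed above:

```python
import math
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def count_corpus(path, max_n=5):
    """One pass over the corpus: counts for every 1-gram up to max_n-gram."""
    counts = Counter()
    totals = Counter()  # number of n-gram occurrences seen, per order n
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    counts[gram] += 1
                    totals[n] += 1
    return counts, totals

def pmi(gram, counts, totals, alpha=1.0):
    """log P(w1..wn) / (P(w1)...P(wn)), with a crude add-alpha offset
    so unseen grams don't produce log(0)."""
    n = len(gram)
    log_p_joint = math.log((counts[gram] + alpha) / (totals[n] + alpha))
    log_p_indep = sum(
        math.log((counts[(w,)] + alpha) / (totals[1] + alpha)) for w in gram
    )
    return log_p_joint - log_p_indep

# counts, totals = count_corpus("corpus.txt")          # hypothetical file name
# print(pmi(("new", "york", "city"), counts, totals))  # an external 3-gram
```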
(Note: Actually I'm not sure how one defines PMI for more than two random variables; perhaps it's something like log P(a)P(b)P(c)P(abc) / P(ab)P(bc)P(ac). But if it's anything at all along those lines, you can do it the same way: iterate through your corpus counting lots of things, and then all the probabilities you need are simply ratios of the counts, perhaps with Laplace-ish corrections.)
If your corpus is so big that you can't fit the n-gram dict in memory, then divide it into roughly memory-sized chunks, compute an n-gram dict for each chunk and store them on disc in a form that lets you get at any given n-gram's entry reasonably efficiently; then, for each external n-gram, go through the chunks and add up the counts.
What form? Up to you. One simple option: in lexicographic order of the n-gram (note: if you're working with words rather than letters, you may want to begin by turning words into numbers; you'll want a single preliminary pass over your corpus to do this); then finding the n-gram you want is a binary search or something of the kind, which with chunks 1GB in size would mean somewhere on the order of 15-20 seeks per chunk; you could add some extra indexing to reduce this. Or: use a hash table on disc, with Berkeley DB or something; in that case you can forgo the chunking. Or, if the alphabet is small (e.g., these are letter n-grams rather than word n-grams and you're processing plain English text), just store them in a big array, with direct lookup -- but in that case, you can probably fit the whole thing in memory anyway.
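To make the chunked variant concrete, here's a rough sketch that uses the standard-library shelve module as a stand-in for the "hash table on disc, with Berkeley DB or something" option, so no sorting or extra indexing is needed. The chunk size, the ngram_chunk_* file names and the space-joined string keys are assumptions of the sketch, not anything prescribed above:

```python
import shelve
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def build_chunk_db(lines, db_path, max_n=5):
    """Count every 1..max_n-gram in one memory-sized chunk and persist it."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                counts[" ".join(gram)] += 1      # space-joined string as key
    with shelve.open(db_path, flag="n") as db:   # "n": always create a fresh DB
        for key, c in counts.items():
            db[key] = c

def build_all_chunks(corpus_path, lines_per_chunk=1_000_000, max_n=5):
    """Split the corpus into chunks of lines and build one on-disc DB per chunk."""
    db_paths, chunk = [], []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                db_paths.append(f"ngram_chunk_{len(db_paths)}")
                build_chunk_db(chunk, db_paths[-1], max_n)
                chunk = []
    if chunk:
        db_paths.append(f"ngram_chunk_{len(db_paths)}")
        build_chunk_db(chunk, db_paths[-1], max_n)
    return db_paths

def lookup_count(gram, db_paths):
    """Sum one n-gram's count over all chunk databases."""
    key = " ".join(gram)
    total = 0
    for path in db_paths:
        with shelve.open(path, flag="r") as db:  # in practice, keep these open
            total += db.get(key, 0)
    return total
```

With the summed counts in hand, the PMI arithmetic is the same ratio-of-counts calculation as in the in-memory version; only the count lookup changes.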