Calculating a probability distribution
I have a simple (maybe stupid) question. I want to calculate the Kullback-Leibler divergence between two documents. It requires a probability distribution for each document.
I do not know how to calculate the probabilities for each document. Any simple answer with a layman's example would be much appreciated.
Let's say we have the following two documents:
1 - cross validated answers are good
2 - simply validated answers are nice
(wording of the documents is just bla bla to give you an example)
How do we calculate probabilities for these documents?
Let's say we add one more document:
3 - simply cross is not good answer
If we add another document, how would it impact the probability distribution?
Thanks
If you add a document to a collection of documents, then unless that document is exactly the same as the collection, the distribution of words or terms in your collection is going to change to accommodate the newly added words. The question arises: "Is that really what you want to do with the third document?"
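To see that shift concretely, here is a minimal sketch. It assumes plain whitespace tokenization and treats the whole collection as one pooled bag of words; `word_distribution` is just a helper name made up for this illustration.

```python
from collections import Counter

def word_distribution(docs):
    """Relative frequency of each word across the given documents (pooled bag of words)."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

docs = [
    "cross validated answers are good",
    "simply validated answers are nice",
]
new_doc = "simply cross is not good answer"

before = word_distribution(docs)             # collection of doc 1 and doc 2
after = word_distribution(docs + [new_doc])  # same collection plus doc 3

# Words that only appear in doc 3 get probability 0.0 in the "before" column.
for word in sorted(after):
    print(f"{word:10s} {before.get(word, 0.0):.3f} -> {after[word]:.3f}")
```

Every existing probability shrinks or grows because the total word count goes from 10 to 16, and new entries appear for "is", "not" and "answer".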
Kullback-Leibler divergence is a measure of divergence between two distributions. What are your two distributions?
If your distribution is the probability of a certain word being selected at random from a document, then the space over which you have probability values is the set of words that make up your documents. For your first two documents (I assume this is your entire collection), you can build a word space of 7 terms. The probabilities of a word being selected at random from each document, treated as a bag of words, are:
[Each value is calculated as the term frequency divided by the document length. Notice that the new document has word forms that aren't the same as the words in doc 1 and doc 2. The (lem) column would be the probabilities if you stemmed or lemmatized the pairs (are/is) and (answer/answers) to the same term.]
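The calculation described above (term frequency divided by document length, over the 7-term word space) can be sketched roughly as follows. This assumes whitespace tokenization, and the `LEMMAS` dictionary is a hand-written stand-in for a real stemmer or lemmatizer, not anything from a library. How out-of-vocabulary words such as "not" should be handled is also my assumption here: they count toward the document length but get no entry of their own.

```python
from collections import Counter

doc1 = "cross validated answers are good"
doc2 = "simply validated answers are nice"
doc3 = "simply cross is not good answer"

# Word space built from the first two documents only (7 terms).
vocab = sorted(set(doc1.split()) | set(doc2.split()))

# Hypothetical hand-written lemma map standing in for a real stemmer/lemmatizer.
LEMMAS = {"is": "are", "answer": "answers"}

def term_probabilities(doc, vocab, lemmas=None):
    """Term frequency divided by document length, for each term in vocab."""
    tokens = doc.split()
    if lemmas:
        tokens = [lemmas.get(t, t) for t in tokens]
    counts = Counter(tokens)
    return {term: counts[term] / len(tokens) for term in vocab}

p1 = term_probabilities(doc1, vocab)
p2 = term_probabilities(doc2, vocab)
p3 = term_probabilities(doc3, vocab)               # raw word forms
p3_lem = term_probabilities(doc3, vocab, LEMMAS)   # after mapping is->are, answer->answers

print(f"{'term':10s} {'doc 1':>6s} {'doc 2':>6s} {'doc 3':>6s} {'(lem)':>6s}")
for term in vocab:
    print(f"{term:10s} {p1[term]:6.3f} {p2[term]:6.3f} {p3[term]:6.3f} {p3_lem[term]:6.3f}")
```

Each word in doc 1 and doc 2 gets probability 0.2 in its own document and 0 in the other where it is absent. The doc 3 columns do not sum to 1 over this word space, because "not" (and, without lemmatization, "is" and "answer") fall outside the 7 terms.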
Introducing the third document into the scenario, a typical thing you might want to do with Kullback-Leibler divergence is to compare a new document or collection of documents with already-known documents or collections of documents.
Computing the Kullback-Leibler divergence D(P||Q) produces a value signifying how well the true distribution P is captured by using the substitute distribution Q. So Q1 could be the distribution of words in doc 1, and Q2 could be the distribution of words in doc 2. Computing the KL divergence with P being the distribution of words in the new document (doc 3), you get measures of how divergent the new document is from doc 1 and how divergent it is from doc 2. Using this information, you can say how similar the new document is to your known documents/collections.
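As a rough illustration of that comparison, here is a sketch that computes D(P||Q1) and D(P||Q2) for the three example documents. One practical wrinkle: D(P||Q) is infinite whenever Q assigns zero probability to a word that P uses, which happens here since doc 3 contains words missing from doc 1 and doc 2. The sketch works around that with additive smoothing over a shared vocabulary; the smoothing and the value `alpha=0.5` are my assumptions, not something the approach above prescribes, and the function names are made up for the example.

```python
import math
from collections import Counter

doc1 = "cross validated answers are good"
doc2 = "simply validated answers are nice"
doc3 = "simply cross is not good answer"

# Shared word space across all three documents so P and Q cover the same terms.
vocab = sorted(set(doc1.split()) | set(doc2.split()) | set(doc3.split()))

def smoothed_distribution(doc, vocab, alpha=0.5):
    """Word probabilities with additive smoothing (alpha is an assumed choice)."""
    counts = Counter(doc.split())
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return [(counts[t] + alpha) / total for t in vocab]

def kl_divergence(p, q):
    """D(P||Q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = smoothed_distribution(doc3, vocab)    # new document
Q1 = smoothed_distribution(doc1, vocab)   # known document 1
Q2 = smoothed_distribution(doc2, vocab)   # known document 2

d1 = kl_divergence(P, Q1)
d2 = kl_divergence(P, Q2)
print(f"D(P||Q1) = {d1:.3f}")
print(f"D(P||Q2) = {d2:.3f}")
```

With these settings doc 3 comes out less divergent from doc 1 than from doc 2, which matches intuition: it shares two words with doc 1 ("cross", "good") and only one with doc 2 ("simply"). Keep in mind that KL divergence is not symmetric, so D(P||Q) and D(Q||P) generally differ, and the absolute numbers depend on the smoothing you choose.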