What does the Bayesian classifier score represent?
I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
2 Answers
It's the logarithm of a probability. With a large training set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity.

10**score * 100.0

will give you the actual probability, which indeed has a maximum difference of 100.
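As a minimal sketch of that conversion with the classifier gem (the category names and training sentences here are made up for illustration, and the base-10 conversion follows the formula above rather than anything the gem documents):

    require 'classifier'

    classifier = Classifier::Bayes.new 'Spam', 'Ham'
    classifier.train_spam 'buy cheap pills now'
    classifier.train_ham  'meeting notes for tomorrow'

    # classifications returns a Hash of category => log score
    classifier.classifications('cheap meeting').each do |category, score|
      # Undo the logarithm to recover a probability, scaled to 0..100
      puts format('%s: %.6f%%', category, 10**score * 100.0)
    end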
Actually, to calculate the probability from a typical naive Bayes classifier, where b is the base of the logarithm, it is b^score / (1 + b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to compute the scores on a holdout set and do a logistic regression of correctness (1 or 0) on score to get a better feel for the relationship between score and probability.
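A quick sketch of that inverse-logit conversion (treating score as a log-odds value; the base is an assumption and should match whatever base the scores were logged in):

    # Inverse logit: map a log-odds score back to a probability in (0, 1).
    def score_to_probability(score, base = 10.0)
      base**score / (1.0 + base**score)
    end

    puts score_to_probability(0.0)   # => 0.5 (equal evidence for both classes)
    puts score_to_probability(2.0)   # => ~0.9901
    puts score_to_probability(-2.0)  # => ~0.0099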
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident

Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication. To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.
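To make the holdout-set calibration suggested above concrete, here is a minimal sketch that fits a logistic regression of correctness (1 or 0) on score with plain gradient descent; the scores and labels are hypothetical illustration data, not output from a real run:

    # Holdout scores and whether the top classification was correct (1) or not (0).
    scores  = [-12.0, -9.5, -8.0, -6.2, -5.0, -3.1, -2.0, -0.5]
    correct = [0, 0, 1, 0, 1, 1, 1, 1]

    a = b = 0.0
    rate = 0.01
    5000.times do
      grad_a = grad_b = 0.0
      scores.each_with_index do |s, i|
        p = 1.0 / (1.0 + Math.exp(-(a + b * s)))
        grad_a += p - correct[i]
        grad_b += (p - correct[i]) * s
      end
      a -= rate * grad_a
      b -= rate * grad_b
    end

    # Estimated probability that a classification with a given score is correct.
    calibrated = ->(s) { 1.0 / (1.0 + Math.exp(-(a + b * s))) }
    puts calibrated.call(-4.0)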