Implementation details of a Bayesian classifier

Posted on 2024-12-13 17:14:09

I've implemented a simple Bayesian classifier, but I'm running into some overflow problems when using it on non-trivial amounts of data.

One strategy I tried in order to keep the numbers small, but still exact, was to keep reducing the numerator and denominator by their greatest common divisor for every part of the equation. This, however, only works when they have a common divisor...
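
For concreteness, a minimal sketch of that strategy in Python (mul_reduced is a hypothetical helper; Python's integers happen to be arbitrary precision, so picture these products in a fixed-width integer type):

    from math import gcd

    # Keep each probability as an exact numerator/denominator pair and cancel
    # common factors after every multiplication.
    def mul_reduced(a_num, a_den, b_num, b_den):
        num, den = a_num * b_num, a_den * b_den
        g = gcd(num, den)
        return num // g, den // g

    # Cancellation only helps when common factors exist: multiplying coprime
    # fractions like 3/7 and 5/11 reduces nothing, so with fixed-width
    # integers the numerator and denominator still grow until they overflow.
    print(mul_reduced(3, 7, 5, 11))  # (15, 77) -- no reduction possible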

Note that the problem goes both ways: when I keep the denominators and numerators separate for most of the calculation, I struggle with integer overflow; when I do most calculations on the fly using double arithmetic, I'm met with the various problems/limits that really small double values have (as defined by IEEE 754).
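
To illustrate the double-precision side of this (a toy demonstration, not tied to any particular classifier):

    # A product of many modest probabilities underflows long before any single
    # factor is unrepresentable; the smallest positive double is ~4.9e-324.
    p = 1e-5
    print(p ** 60)   # 1e-300 -- still representable
    print(p ** 100)  # 0.0    -- the exact value 1e-500 underflows to zero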

As I'm sure some of you here have implemented this algorithm before, how did you deal with these issues? I'd prefer not to pull in arbitrary-precision types, as they cost too much, and I'm sure there's a solution that doesn't require them.

Thanks.

Comments (2)

我早已燃尽 2024-12-20 17:14:09

Usually the way you handle this is by taking logs and using adds, and then doing an exp if you want to get back into probability space.

p1 * p2 * p3 * ... * pn = exp(log(p1) + log(p2) + log(p3) + ... + log(pn))

You avoid underflows by working in log space.
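
A minimal sketch of this in Python (log_product and the probs list are just illustrative names):

    import math

    # Sum log-probabilities instead of multiplying probabilities; the sum
    # stays in a comfortable range even when the true product would underflow.
    def log_product(probs):
        return sum(math.log(p) for p in probs)

    probs = [0.01] * 1000
    print(log_product(probs))  # about -4605.17, perfectly representable
    print(math.prod(probs))    # 0.0 -- the direct product underflows
    # Only call math.exp at the very end, and only if you truly need the
    # probability itself rather than a comparison between log-scores.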

梦里°也失望 2024-12-20 17:14:09

If you're classifying between two categories you can introduce the log ratio of probabilities for each category. So if:

log(Pr(cat1) / Pr(cat2)) <=> 0 # positive would favor cat1 and negative cat2

That is equal to:

log(Pr(cat1)) - log(Pr(cat2)) <=> 0

And if (as in Bayesian classifiers) the category probabilities are themselves products of probabilities given conditions:

log(Pr(cat1|cond1)) + ... <=> log(Pr(cat2|cond1)) + ...

Thus you are dealing with summation rather than multiplication, and you would need a massively larger data set to run into the same problem.
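
A sketch of that decision rule in Python (the priors and per-condition probabilities here are hypothetical inputs; a real classifier would estimate them from counts):

    import math

    # Compare the two categories entirely in log space: a positive score
    # favors cat1, a negative score favors cat2, exactly as in the log ratio.
    def log_ratio_score(prior1, cond_probs1, prior2, cond_probs2):
        score = math.log(prior1) - math.log(prior2)
        score += sum(math.log(p) for p in cond_probs1)
        score -= sum(math.log(p) for p in cond_probs2)
        return score

    score = log_ratio_score(0.5, [0.9, 0.8, 0.1], 0.5, [0.2, 0.3, 0.4])
    print("cat1" if score > 0 else "cat2")  # cat1 here: its evidence dominates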
