What is a good language for implementing a naive Bayes classifier from scratch?

Posted 2024-08-30 12:55:07

Comments (2)

不乱于心 2024-09-06 12:55:07

I would do it in C#, but only because it's the language I'm most familiar with at the moment, and because I know it has strong string handling. It can also be done in C++ with std::string, or in Ruby, Java, etc.

If I were building a naive Bayes classifier, I'd start with a simple example, like the one in Russell & Norvig's book (the one I learned from way back when, in the second edition) or the one in Mitchell's book (I used his because he taught the class). Make your learner generate rules in a general fashion; that is, given input data, produce output rules, and let the input data be something generalizable (it could be a block of text for spam detection, or a weather report used to predict whether someone is going to play tennis).

If you're trying to learn Bayes classifiers, a simple example like this is a better starting point than a full-blown spam filter. Language parsing is hard in and of itself, and determining whether a text contains garbage language is also difficult. Better to have a simple, small dataset, one where you can work out by hand what your learner should learn and check that your program does what you want it to do. Then you can grow your dataset, or modify your program to incorporate things like language parsing.
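To make the "small dataset you can check by hand" idea concrete, here is a minimal sketch in Python. The six weather rows are made up for illustration (they are not taken from either book), the function names are hypothetical, and add-one smoothing is an assumption on my part rather than something the answer prescribes.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration: (outlook, windy) -> play?
DATA = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("overcast", "no"), "yes"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
]

def train(rows):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    values = defaultdict(set)              # feature index -> values seen in training
    for features, label in rows:
        class_counts[label] += 1
        for i, v in enumerate(features):
            feature_counts[(label, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def posterior(model, features):
    """Return normalized P(label | features), using add-one smoothing."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total  # prior P(label)
        for i, v in enumerate(features):
            # smoothed estimate of P(feature i = v | label)
            p *= (feature_counts[(label, i)][v] + 1) / (n + len(values[i]))
        scores[label] = p
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}

def classify(model, features):
    """Pick the label with the highest posterior probability."""
    post = posterior(model, features)
    return max(post, key=post.get)

model = train(DATA)
print(classify(model, ("sunny", "yes")))
print(posterior(model, ("sunny", "yes")))
```

With data this small you can verify the result on paper: for a sunny, windy day the unnormalized scores are 4/6 · 3/7 · 1/6 = 1/21 for "yes" and 2/6 · 2/5 · 3/4 = 1/10 for "no", so the program should pick "no" with posterior 21/31.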

许你一世情深 2024-09-06 12:55:07

Moving from Bayesian classifiers to programming languages, I'll leave out "something else" as being too broad, and having no patently superior candidates. Of the four you list, I'd avoid C and C++ because who wants to deal with memory management, especially when you're learning? Normally I'd be tempted toward Java because of the static type system, and if you're a beginner I think that's still the safest bet. But Ruby is also a sensible choice because you can prototype new ideas and new examples very rapidly.

I have worked on and maintain a version of a rather powerful Bayesian classifier for reading email. It is written in a mixture of Lua and C. It's highly performant, but one of the things I really regret about the design is that there is very little abstraction built into the code. I definitely recommend building abstractions into the code, like

  • Feature extraction

  • Frequency counting

  • The representation of probability

Java makes it really easy to enforce these kinds of abstraction barriers, although Ruby can do it too.
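As a rough illustration of those three abstraction barriers, here is a sketch in Python (chosen only for brevity; the same split works in Java or Ruby). Every name here (`extract_features`, `FrequencyTable`, `LogProb`, `likelihood`) is hypothetical, and nothing is taken from the Lua/C classifier mentioned above. The point is only that each concern sits behind its own small interface, so it can be swapped out without touching the others.

```python
import math
from collections import Counter

def extract_features(text):
    """Feature extraction: here, just lowercased whitespace-separated tokens."""
    return text.lower().split()

class FrequencyTable:
    """Frequency counting: how often each feature was seen under each label."""
    def __init__(self):
        self.counts = {}         # label -> Counter of features
        self.vocabulary = set()  # every feature seen under any label

    def add(self, label, features):
        self.counts.setdefault(label, Counter()).update(features)
        self.vocabulary.update(features)

    def count(self, label, feature):
        return self.counts.get(label, Counter())[feature]

    def total(self, label):
        return sum(self.counts.get(label, Counter()).values())

class LogProb:
    """Representation of probability: stored internally as a natural log."""
    def __init__(self, p):
        self.log = math.log(p)

    def __mul__(self, other):
        # Multiplying probabilities means adding their logarithms.
        result = LogProb(1.0)
        result.log = self.log + other.log
        return result

    def value(self):
        return math.exp(self.log)

def likelihood(table, label, feature):
    """Add-one smoothed P(feature | label), returned as a LogProb."""
    num = table.count(label, feature) + 1
    den = table.total(label) + len(table.vocabulary)
    return LogProb(num / den)

table = FrequencyTable()
table.add("spam", extract_features("Buy cheap pills"))
table.add("ham", extract_features("Lunch at noon"))
print(likelihood(table, "spam", "cheap").value())
```

Callers of `likelihood` never see whether probabilities are plain floats or logarithms, so the representation can change later without rewriting the counting or the feature extraction.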

One of the things my colleague Fidelis Assis found is that standard floating-point numbers are not good for representing very small probabilities. We do a fair amount of our computation with logarithms of probabilities (where probabilities multiply, their logarithms add).
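A quick way to see the problem, sketched in Python with made-up numbers: multiply one hundred likelihoods of 1e-5 each (a hypothetical figure, chosen only to force the issue) and compare that with summing their logarithms. The helper names are illustrative.

```python
import math

def product(probs):
    """Multiply probabilities directly as ordinary floats."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_sum(probs):
    """Sum the natural logs instead; equals the log of the product."""
    return sum(math.log(p) for p in probs)

# Hypothetical input: 100 features, each with likelihood 1e-5.
probs = [1e-5] * 100

direct = product(probs)  # true value is 1e-500, too small for a 64-bit float
logged = log_sum(probs)  # 100 * ln(1e-5), an ordinary-sized negative number

print(direct)
print(logged)
```

The direct product underflows to 0.0, so every class would end up with the same score and could no longer be ranked. The log-space score stays around -1151 and can still be compared between classes, because the logarithm is monotonic and the largest log score belongs to the most probable class.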
