Suitability of the Naive Bayes classifier in Mahout for classifying websites

Posted on 2024-12-13 01:35:27

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.

In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5-minute job, so I'm doing plenty of research.

From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.

I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroups test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.

Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.

Alternatively, if I'm barking up entirely the wrong tree, please let me know!

Comments (1)

浮华 2024-12-20 01:35:27

You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
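To make "inject new features" concrete, here is a minimal sketch in plain Python (not Mahout's API). It assumes you already have the URL, title and body text of a page. The `page_features` helper and the `body:` / `bigram:` / `title:` / `url:` prefixes are made up for illustration. The idea is only that each extra signal becomes one more named feature in the same bag the classifier already consumes.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def page_features(url, title, body):
    """Build a bag of namespaced features for one page (hypothetical helper)."""
    feats = Counter()
    body_tokens = tokenize(body)

    # Ordinary body words, as a plain bag-of-words model would use.
    for tok in body_tokens:
        feats["body:" + tok] += 1

    # Adjacent word pairs (bigrams) from the body.
    for first, second in zip(body_tokens, body_tokens[1:]):
        feats["bigram:" + first + "_" + second] += 1

    # Title words get their own namespace so they can be weighted separately.
    for tok in tokenize(title):
        feats["title:" + tok] += 1

    # Tokens taken from the host name and path of the URL.
    parsed = urlparse(url)
    for tok in tokenize(parsed.netloc + " " + parsed.path):
        feats["url:" + tok] += 1

    return feats


features = page_features(
    "http://www.cnn.com/world/news",
    "Breaking News",
    "Latest world news and breaking news headlines",
)
```

Because the prefixes keep the sources apart, "news" in the title and "news" in the body are different features, so the classifier can learn different weights for them.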

How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
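As one illustration of "converting them to numbers" (one option among many, and not necessarily what Mahout or the book does), the sketch below hashes named features into a fixed-size vector and lets you scale each namespace by a weight. The `hash_features` function, the vector size and the weight values are all assumptions chosen for the example.

```python
import hashlib


def hash_features(feats, size=1024, weights=None):
    """Map {feature name: count} to a fixed-length list of floats (hashing trick)."""
    weights = weights or {}
    vec = [0.0] * size
    for name, count in feats.items():
        # The namespace is the part before the first colon, e.g. "title".
        namespace = name.split(":", 1)[0]
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        # Colliding features simply add up in the same slot.
        vec[index] += count * weights.get(namespace, 1.0)
    return vec


example = {"body:news": 2, "title:news": 1}
# Assumed weighting: treat a title word as worth three body words.
vector = hash_features(example, size=1024, weights={"title": 3.0})
```

Keeping the weights non-negative matters if the vector feeds a multinomial Naive Bayes model, since that model treats the values as counts. The right weights are something to tune against held-out data rather than guess.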
