classifier4J with compound words

Posted on 2024-09-10 23:40:25

I'm using the BayesianClassifier class to classify spam. The problem is that compound words aren't being recognized.

For instance, if I add "led zeppelin" as a match, a sentence containing it won't be recognized as a match, even though it should be.

To add a match, I'm using addMatch() of SimpleWordsDataSource.

To check for a match, I'm using isMatch() of BayesianClassifier.

Any ideas on how to fix this?


Ok, thanks for the insight. I'm attaching more source code.

SimpleWordsDataSource wds = new SimpleWordsDataSource();
BayesianClassifier classifier = new BayesianClassifier(wds);

wds.addMatch("queen");
wds.addMatch("led zeppelin");
wds.addMatch("the beatles");

classifier.isMatch("i listen to queen");        // recognized as a match
classifier.isMatch("i listen to led zeppelin"); // NOT recognized as a match
classifier.isMatch("i listen to the beatles");  // NOT recognized as a match
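
One explanation consistent with these three results, offered as an assumption and not something confirmed in this thread, is that the classifier splits the input into single words and looks each one up separately, so an entry stored under the two-word key "led zeppelin" is never queried. The sketch below is self-contained and does not use classifier4J; TokenLookupSketch and anyTokenKnown are hypothetical names used only to illustrate that idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenLookupSketch {
    // Hypothetical stand-in for a word store: every entry is looked up as ONE key.
    public static boolean anyTokenKnown(Set<String> storedKeys, String sentence) {
        // Assumption: the input is split into single words before any lookup.
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (storedKeys.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> keys = new HashSet<>(
                Arrays.asList("queen", "led zeppelin", "the beatles"));
        System.out.println(anyTokenKnown(keys, "i listen to queen"));        // true
        System.out.println(anyTokenKnown(keys, "i listen to led zeppelin")); // false
        System.out.println(anyTokenKnown(keys, "i listen to the beatles"));  // false
    }
}
```

If that assumption holds, a key containing a space can never equal a single token, which would explain why only "queen" is found.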

Now I'm using the teachMatch method of BayesianClassifier and I get different results.
A sentence containing "led zeppelin" is classified as a match, which is OK. But a sentence containing only "led" is also classified as a match, which is wrong.

Here's the relevant code:

BayesianClassifier classifier = new BayesianClassifier();
classifier.teachMatch("led zeppelin");
classifier.isMatch("I listen to led zeppelin"); // true
classifier.isMatch("I listen to led");          // true

Comments (1)

赏烟花じ飞满天 2024-09-17 23:40:25

(I wrote classifier4j)

You need to train it with more data.

Bayesian classifiers work by creating statistical models of what is considered a match and what isn't.

If you give it enough data, it will learn that "led" and "zeppelin" together are a match, but "led" by itself isn't.
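
To illustrate the point about training data, here is a minimal self-contained sketch of a word-level Bayesian classifier. It is not classifier4j's code: the class NaiveBayesSketch, the 0.9 cutoff, the 0.5 probability for unseen words, the 0.01/0.99 clamping, and the training sentences are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveBayesSketch {
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();

    private static String[] tokens(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    public void teachMatch(String text) {
        for (String w : tokens(text)) matchCounts.merge(w, 1, Integer::sum);
    }

    public void teachNonMatch(String text) {
        for (String w : tokens(text)) nonMatchCounts.merge(w, 1, Integer::sum);
    }

    // Per-word probability of "match"; unseen words are neutral (assumed 0.5).
    private double wordProbability(String w) {
        int m = matchCounts.getOrDefault(w, 0);
        int n = nonMatchCounts.getOrDefault(w, 0);
        if (m + n == 0) return 0.5;
        double p = (double) m / (m + n);
        return Math.max(0.01, Math.min(0.99, p)); // clamp away from 0 and 1
    }

    // Combine the per-word probabilities into one score in (0, 1).
    public double classify(String text) {
        double yes = 1.0, no = 1.0;
        for (String w : tokens(text)) {
            double p = wordProbability(w);
            yes *= p;
            no *= 1.0 - p;
        }
        return yes / (yes + no);
    }

    public boolean isMatch(String text) {
        return classify(text) >= 0.9; // assumed cutoff
    }

    public static void main(String[] args) {
        // Trained on a single phrase: "led" alone looks like a perfect match.
        NaiveBayesSketch sparse = new NaiveBayesSketch();
        sparse.teachMatch("led zeppelin");
        System.out.println(sparse.isMatch("I listen to led")); // true

        // With non-match examples too, "led" becomes ambiguous on its own.
        NaiveBayesSketch trained = new NaiveBayesSketch();
        trained.teachMatch("led zeppelin");
        trained.teachMatch("i listen to led zeppelin");
        trained.teachMatch("led zeppelin album");
        trained.teachNonMatch("led lamp");
        trained.teachNonMatch("led light bulb");
        trained.teachNonMatch("i listen to the radio");
        trained.teachNonMatch("cheap led display");
        System.out.println(trained.isMatch("I listen to led zeppelin")); // true
        System.out.println(trained.isMatch("I listen to led"));          // false
    }
}
```

In this sketch, "led" appears equally often in match and non-match examples, so it carries no evidence by itself, while "zeppelin" appears only in matches and tips the score over the cutoff.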
