Classifier4J with compound words
I'm using the BayesianClassifier class to classify spam. The problem is that compound words aren't being recognized.
For instance, if I add "led zeppelin" as a match, a sentence containing it won't be recognized as a match even though it should be.
To add a match I'm using addMatch() of SimpleWordsDataSource,
and to ask for a match I'm using isMatch() of BayesianClassifier.
Any ideas on how to fix this?
Ok, thanks for the insight. I'm attaching more source code.
SimpleWordsDataSource wds = new SimpleWordsDataSource();
BayesianClassifier classifier = new BayesianClassifier(wds);
wds.addMatch("queen");
wds.addMatch("led zeppelin");
wds.addMatch("the beatles");
classifier.isMatch("i listen to queen");// it is recognized as a match
classifier.isMatch("i listen to led zeppelin");// it is NOT recognized as a match
classifier.isMatch("i listen to the beatles");// it is NOT recognized as a match
Now I'm using the teachMatch method of BayesianClassifier and I get different results.
A sentence containing "led zeppelin" is classified as a match, which is OK. But a sentence containing only "led" is also classified as a match, which is wrong.
Here's the relevant code:
BayesianClassifier classifier = new BayesianClassifier();
classifier.teachMatch("led zeppelin");
classifier.isMatch("I listen to led zeppelin");//true
classifier.isMatch("I listen to led");//true
Comments (1)
(I wrote classifier4j)
You need to train it with more data.
Bayesian classifiers work by creating statistical models of what is considered a match and what isn't.
If you give it enough data, it will learn that "led" and "zeppelin" together are a match, but "led" by itself isn't.
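To make the point about training data concrete, here is a minimal, self-contained sketch of a word-level Bayesian classifier. It is not Classifier4J's actual implementation: the class name TinyBayesSketch, the 0.9 cutoff, the 0.01/0.99 clamping, the neutral 0.5 score for unknown words, and the example non-match sentences are all assumptions chosen for illustration. The method names teachMatch, teachNonMatch and isMatch only mirror the calls used in the question. It shows why "led" alone is a match after a single teachMatch("led zeppelin"), and how a few non-match examples change that.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; NOT the Classifier4J implementation.
public class TinyBayesSketch {
    // word -> {times seen in matches, times seen in non-matches}
    private final Map<String, int[]> counts = new HashMap<>();
    private static final double CUTOFF = 0.9; // assumed threshold

    // Lowercase the text and split it into individual words.
    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    private void teach(String text, int slot) {
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            counts.computeIfAbsent(w, k -> new int[2])[slot]++;
        }
    }

    public void teachMatch(String text)    { teach(text, 0); }
    public void teachNonMatch(String text) { teach(text, 1); }

    // Per-word match probability; unknown words are neutral (0.5).
    double wordProbability(String w) {
        int[] c = counts.get(w);
        if (c == null) return 0.5;
        double p = (double) c[0] / (c[0] + c[1]);
        return Math.max(0.01, Math.min(0.99, p)); // avoid hard 0 and 1
    }

    // Combine the per-word probabilities into one score for the text.
    public double classify(String text) {
        double match = 1.0, nonMatch = 1.0;
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            double p = wordProbability(w);
            match *= p;
            nonMatch *= 1.0 - p;
        }
        return match / (match + nonMatch);
    }

    public boolean isMatch(String text) { return classify(text) >= CUTOFF; }

    public static void main(String[] args) {
        TinyBayesSketch c = new TinyBayesSketch();
        c.teachMatch("led zeppelin");
        // "led" has only ever been seen in a match, so it matches alone.
        System.out.println(c.isMatch("I listen to led"));          // true

        // Teach it that "led" also appears in unrelated text.
        c.teachNonMatch("the led lamp is broken");
        c.teachNonMatch("cheap led lights");
        System.out.println(c.isMatch("I listen to led"));          // false
        System.out.println(c.isMatch("I listen to led zeppelin")); // true
    }
}
```

In this sketch the words are scored independently, so the classifier never learns the phrase "led zeppelin" as a unit. It only learns that "zeppelin" is strong evidence of a match while "led" by itself is weak evidence, which is the effect of training with more data described above.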