How to incrementally train an nltk classifier
I am working on a project to classify snippets of text using the Python NLTK module and its Naive Bayes classifier. I am able to train on corpus data and classify another set of data, but I would like to feed additional training information into the classifier after the initial training.
If I'm not mistaken, there doesn't appear to be a way to do this, since the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset?
I'm open to suggestions, including other classifiers that can accept new training data over time.
Comments (3)
There are two options that I know of:
1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours, retrain & reload the classifier. This is probably the simplest solution.
2) Externalize the internal model, then update it manually. The `NaiveBayesClassifier` can be created directly by giving it a `label_probdist` and a `feature_probdist`. You could create these separately, pass them in to a `NaiveBayesClassifier`, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the `train` method for details on how to update the probability distributions.
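Option 1 above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the `word_features` extractor, the small in-memory `corpus` list, and the `retrain` helper are hypothetical names made up for the example; only `NaiveBayesClassifier.train` comes from NLTK. In a real setup the corpus would probably live on disk and the retraining would run on a schedule.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(text):
    """Hypothetical feature extractor: one presence feature per word."""
    return {word: True for word in text.lower().split()}


# The accumulated corpus; it always keeps the original training data.
corpus = [
    (word_features("great fun film"), "pos"),
    (word_features("boring slow film"), "neg"),
]
classifier = NaiveBayesClassifier.train(corpus)


def retrain(corpus, new_examples):
    """Append the new labeled featuresets, then train from scratch on everything."""
    corpus.extend(new_examples)
    return NaiveBayesClassifier.train(corpus)


# Later, when new labeled data has arrived, swap in a freshly trained classifier.
new_examples = [
    (word_features("awful dull plot"), "neg"),
    (word_features("wonderful touching plot"), "pos"),
]
classifier = retrain(corpus, new_examples)
print(classifier.classify(word_features("wonderful")))
```

The cost of each retraining grows with the size of the whole corpus, which is the trade-off for the simplicity of this approach.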
I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.
There is an `update()` method on the `NaiveBayesClassifier` instance, which appears to add to the training data.
[code sample and its printed output not preserved]
As Jacob said, the second method is the right way to go.
Hopefully someone will write the code for it.
Take a look at:
https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/
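For what it's worth, here is a minimal sketch of the second method. It is not taken from the linked post. The `IncrementalNaiveBayes` class and its `update` and `build` methods are hypothetical names invented for this example. The counting logic is meant to mirror what `NaiveBayesClassifier.train` does (including the `None` value for features missing from a sample) and it uses `ELEProbDist`, the default estimator of `train`. The idea is to keep the raw counts around, fold new data into them, and rebuild the probability distributions without ever revisiting the original featuresets.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.probability import ELEProbDist, FreqDist


class IncrementalNaiveBayes:
    """Hypothetical wrapper that stores raw counts instead of training data."""

    def __init__(self):
        self.label_freqdist = FreqDist()   # label -> number of samples
        self.feature_freqdist = {}         # (label, fname) -> FreqDist of values
        self.feature_values = {}           # fname -> set of values seen

    def update(self, labeled_featuresets):
        """Fold new (featureset, label) pairs into the stored counts."""
        for featureset, label in labeled_featuresets:
            self.label_freqdist[label] += 1
            for fname, fval in featureset.items():
                counts = self.feature_freqdist.setdefault((label, fname), FreqDist())
                counts[fval] += 1
                self.feature_values.setdefault(fname, set()).add(fval)

    def build(self):
        """Create a NaiveBayesClassifier from the current counts."""
        values = {fname: set(vals) for fname, vals in self.feature_values.items()}
        freqdists = {}
        for label, num_samples in self.label_freqdist.items():
            for fname in values:
                # Work on a copy so the stored counts never contain the
                # derived None entries (they change as more data arrives).
                counts = FreqDist(self.feature_freqdist.get((label, fname), {}))
                missing = num_samples - counts.N()
                if missing > 0:
                    counts[None] += missing
                    values[fname].add(None)
                freqdists[label, fname] = counts
        # Copy the label counts too, since the estimator keeps a reference.
        label_probdist = ELEProbDist(FreqDist(self.label_freqdist))
        feature_probdist = {
            (label, fname): ELEProbDist(counts, bins=len(values[fname]))
            for (label, fname), counts in freqdists.items()
        }
        return NaiveBayesClassifier(label_probdist, feature_probdist)


first_batch = [
    ({"great": True, "film": True}, "pos"),
    ({"boring": True, "film": True}, "neg"),
]
second_batch = [
    ({"wonderful": True, "plot": True}, "pos"),
    ({"dull": True, "plot": True}, "neg"),
]

model = IncrementalNaiveBayes()
model.update(first_batch)
classifier = model.build()

# New data arrives later; only the counts are updated, then the classifier is rebuilt.
model.update(second_batch)
classifier = model.build()
print(classifier.classify({"wonderful": True}))
```

Rebuilding costs time proportional to the number of (label, feature) pairs rather than to the amount of training data seen so far, and the counts can be pickled between runs.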