How to incrementally train an nltk classifier

Posted on 2024-10-16 03:27:31

I am working on a project to classify snippets of text using the Python nltk module and its NaiveBayesClassifier. I am able to train on corpus data and classify another set of data, but I would like to feed additional training information into the classifier after the initial training.

If I'm not mistaken, there doesn't appear to be a way to do this, since the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset?

I'm open to suggestions, including other classifiers that can accept new training data over time.


Comments (3)

小瓶盖 2024-10-23 03:27:31

There are two options that I know of:

1) Periodically retrain the classifier on the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours retrain and reload the classifier. This is probably the simplest solution.

2) Externalize the internal model, then update it manually. A NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the train method for details on how to update the probability distributions.
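For what it's worth, option 2 might look roughly like the sketch below. It keeps the raw counts outside the classifier and rebuilds the two probability distributions whenever a fresh model is needed, mirroring what I understand NaiveBayesClassifier.train to do internally (including counting a missing feature as the value None). The names IncrementalNaiveBayes, update, build and word_features are my own, not part of NLTK, and this version rebuilds the distributions from the counts rather than mutating them in place.

```python
from collections import defaultdict

from nltk.classify import NaiveBayesClassifier
from nltk.probability import ELEProbDist, FreqDist


class IncrementalNaiveBayes:
    """Hypothetical helper: stores raw counts, rebuilds the NLTK model on demand."""

    def __init__(self):
        self.label_freqdist = FreqDist()               # count(label)
        self.feature_freqdist = defaultdict(FreqDist)  # count(fval) per (label, fname)
        self.feature_values = defaultdict(set)         # values seen per fname

    def update(self, labeled_featuresets):
        """Add new (featureset, label) pairs to the stored counts."""
        for featureset, label in labeled_featuresets:
            self.label_freqdist[label] += 1
            for fname, fval in featureset.items():
                self.feature_freqdist[label, fname][fval] += 1
                self.feature_values[fname].add(fval)

    def build(self):
        """Turn the current counts into a fresh NaiveBayesClassifier."""
        # Work on copies so the stored counts are never polluted with None.
        values = {fname: set(vals) for fname, vals in self.feature_values.items()}
        freqdists = {}
        for label, num_samples in self.label_freqdist.items():
            for fname in values:
                fd = FreqDist(self.feature_freqdist.get((label, fname), {}))
                # Samples of this label that lacked the feature count as None.
                missing = num_samples - fd.N()
                if missing > 0:
                    fd[None] += missing
                    values[fname].add(None)
                freqdists[label, fname] = fd
        label_probdist = ELEProbDist(FreqDist(self.label_freqdist))
        feature_probdist = {
            (label, fname): ELEProbDist(fd, bins=len(values[fname]))
            for (label, fname), fd in freqdists.items()
        }
        return NaiveBayesClassifier(label_probdist, feature_probdist)


def word_features(text):
    """Hypothetical feature extractor: mark each word as present."""
    return {word: True for word in text.split()}


model = IncrementalNaiveBayes()
model.update([(word_features("training test totally tubular"), "t")])
# Later, new data arrives; only the new examples are needed.
model.update([(word_features("super speeding special sport"), "s")])
classifier = model.build()
print(classifier.classify(word_features("tubular test")))
```

The counts are small enough to pickle between runs, so the original featuresets don't have to be kept around. Rebuilding costs time proportional to the number of (label, feature) pairs, not to the number of training examples.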

困倦 2024-10-23 03:27:31

I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.

The NaiveBayesClassifier instance has an update() method, which appears to add to the training data. (Note that the class in this example is imported from TextBlob's textblob.classifiers module, which wraps NLTK's classifier, rather than from NLTK itself.)

from textblob.classifiers import NaiveBayesClassifier

# Initial training set: (text, label) pairs
train = [
    ('training test totally tubular', 't'),
]

cl = NaiveBayesClassifier(train)
# Add a new labeled example after the initial training
cl.update([('super speeding special sport', 's')])

# Print the expected label next to the predicted one
print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))

This prints out:

t t
s s
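As far as I can tell, TextBlob's update() appends the new examples to its stored training set and retrains, which is essentially the periodic-retraining option from the first answer. A minimal sketch of the same idea with plain NLTK could look like this; RetrainingClassifier and word_features are hypothetical names of mine, not library APIs.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(text):
    """Hypothetical feature extractor: mark each word in the text as present."""
    return {word: True for word in text.lower().split()}


class RetrainingClassifier:
    """Keeps every labeled example and retrains from scratch on each update."""

    def __init__(self, labeled_texts):
        self.examples = []      # accumulated (featureset, label) pairs
        self.classifier = None
        self.update(labeled_texts)

    def update(self, labeled_texts):
        # Store the new examples alongside the old ones, then retrain on all.
        self.examples.extend(
            (word_features(text), label) for text, label in labeled_texts
        )
        self.classifier = NaiveBayesClassifier.train(self.examples)

    def classify(self, text):
        return self.classifier.classify(word_features(text))


cl = RetrainingClassifier([("training test totally tubular", "t")])
cl.update([("super speeding special sport", "s")])
print("t", cl.classify("tubular test"))
print("s", cl.classify("super special"))
```

This is simple, but every update costs a full pass over all stored examples, so it does not avoid keeping the original featuresets.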

勿忘初心 2024-10-23 03:27:31

As Jacob said, the second method is the right way, and hopefully someone will write the code for it.

See:

https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/
