Categorizing documents

Posted on 2024-09-07 02:42:21


I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.

I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).

My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).

Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?

Here is my test class http://gist.github.com/451880
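(The gist itself isn't reproduced here, but a minimal sketch of an NLTK Naive Bayes training setup along these lines might look as follows; `fetch_labeled_docs` is a hypothetical helper that would pull `(text, category)` rows from the Postgres table.)

```python
import nltk

def word_features(text):
    # Bag-of-words presence features, the form NaiveBayesClassifier expects.
    return {word: True for word in text.lower().split()}

def train_classifier(labeled_docs):
    # labeled_docs: iterable of (document_text, category) pairs.
    train_set = [(word_features(text), category) for text, category in labeled_docs]
    return nltk.NaiveBayesClassifier.train(train_set)

# classifier = train_classifier(fetch_labeled_docs())          # hypothetical data source
# print(classifier.classify(word_features("some uncategorized document text")))
```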


3 Answers

芯好空 2024-09-14 02:42:35


Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?

You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
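A rough sketch of how that sampling could be wired up, assuming a hypothetical `docs_by_category` dict that maps each category name to its list of document texts:

```python
import random

def build_training_set(docs_by_category, target_categories, per_category=2000):
    labeled = []
    # The 4 (or 5) categories we actually want to recognize this round.
    for cat in target_categories:
        labeled += [(doc, cat) for doc in docs_by_category[cat][:per_category]]

    # "None of the above": sample roughly evenly from every other category
    # (the stratified-sampling variant), ~2000 docs in total.
    other_cats = [c for c in docs_by_category if c not in target_categories]
    per_other = max(1, per_category // len(other_cats))   # ~13-14 docs each for 146 categories
    for cat in other_cats:
        sample_size = min(per_other, len(docs_by_category[cat]))
        labeled += [(doc, "none_of_the_above")
                    for doc in random.sample(docs_by_category[cat], sample_size)]
    random.shuffle(labeled)
    return labeled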

Still feels like a bit of a kludge, and you might be better off with a completely different approach -- find a multi-dimensional doc measure that partitions your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other, yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something... ;-)
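One way to approximate this outside NLTK is to treat the 150 existing categories as the clusters and assign each untagged doc to the nearest class centroid in TF-IDF space; a minimal scikit-learn sketch on toy stand-in data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid

# Toy stand-ins for the 300k labeled docs and the 150k unlabeled ones.
train_texts = ["the cat sat on the mat", "dogs bark at the postman", "stocks fell sharply today"]
train_labels = ["pets", "pets", "finance"]
new_texts = ["stock prices fell again today"]

vectorizer = TfidfVectorizer(sublinear_tf=True)     # sublinear_tf replaces tf with 1 + log(tf)
X_train = vectorizer.fit_transform(train_texts)     # sparse doc-term matrix

clf = NearestCentroid().fit(X_train, train_labels)  # one centroid per existing category
predictions = clf.predict(vectorizer.transform(new_texts))  # nearest-centroid label per new doc
```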

£烟消云散 2024-09-14 02:42:34


You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.
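A small sketch of that dict-based representation, under one reading of the TF-log(1 + IDF) weighting (tf multiplied by log(1 + N/df)):

```python
import math
from collections import Counter

def term_frequencies(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def tf_idf_vectors(texts):
    tfs = [term_frequencies(t) for t in texts]
    df = Counter(term for tf in tfs for term in tf)   # docs containing each term
    n_docs = len(texts)
    return [{term: tf_val * math.log(1 + n_docs / df[term])
             for term, tf_val in tf.items()}
            for tf in tfs]

vectors = tf_idf_vectors(["the cat sat on the mat", "dogs chase the cat"])
```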

Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a Python dict.
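And a sketch of the hashed scipy.sparse variant (the dimensionality and the use of the built-in hash are illustrative choices):

```python
import scipy.sparse as sp
from collections import Counter

N_FEATURES = 2 ** 20  # fixed vector dimensionality (arbitrary choice here)

def hashed_vector(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    vec = sp.dok_matrix((1, N_FEATURES))
    for term, count in counts.items():
        # Note: Python's built-in hash() for str is salted per process;
        # a stable hash (e.g. from hashlib) is preferable in practice.
        vec[0, abs(hash(term)) % N_FEATURES] += count / total
    return vec.tocsr()  # CSR is convenient for dot products / cosine similarity

v = hashed_vector("dogs chase the cat")
```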

Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for each new document to label, compute the cosine similarity between the document vector and each category vector and choose the most similar category as the label for your document.
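Continuing the dict-based sketch above, the centroid-plus-cosine step might look like this:

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def category_centroids(vectors, labels):
    # vectors: list of term -> weight dicts (e.g. from the previous sketch).
    sums, counts = defaultdict(dict), Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, val in vec.items():
            sums[label][term] = sums[label].get(term, 0.0) + val
    return {label: {t: v / counts[label] for t, v in terms.items()}
            for label, terms in sums.items()}

def best_category(doc_vector, centroids):
    return max(centroids, key=lambda c: cosine(doc_vector, centroids[c]))
```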

If this is not good enough, then you should try to train a logistic regression model with an L1 penalty, as in this example from scikit-learn (which is a wrapper for liblinear, as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1+IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.
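A hedged sketch of that scikit-learn route on toy stand-in data (the regularization strength C is arbitrary here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Stand-ins; in practice these would be the 300k labeled docs from Postgres.
texts = ["the cat sat on the mat", "dogs bark at night",
         "stocks fell sharply", "markets rallied today"]
labels = ["pets", "pets", "finance", "finance"]

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),  # liblinear backend
)
model.fit(texts, labels)
print(classification_report(labels, model.predict(texts)))  # per-class precision/recall/F1
```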

For larger datasets: you should try vowpal wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (though it doesn't have easy-to-use Python wrappers, AFAIK).
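Even without a wrapper, producing vowpal wabbit's plain-text input format from Python is straightforward; with one-against-all multiclass training (e.g. `vw --oaa 150`), labels are just integers from 1 to 150. A small illustrative converter:

```python
def to_vw_line(label_index, term_weights):
    # term_weights: dict of term -> weight; VW feature names must avoid ':', '|' and spaces.
    feats = " ".join(
        f"{term.replace(':', '_').replace('|', '_').replace(' ', '_')}:{weight:g}"
        for term, weight in term_weights.items()
    )
    return f"{label_index} | {feats}"

print(to_vw_line(3, {"cat": 0.5, "mat": 0.25}))
# -> 3 | cat:0.5 mat:0.25
```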

忆梦 2024-09-14 02:42:34


How big (number of words) are your documents? Memory consumption at 150K training docs should not be an issue.

Naive Bayes is a good choice, especially when you have many categories with only a few training examples or very noisy training data. But in general, linear Support Vector Machines perform much better.

Is your problem multiclass (each document belongs to exactly one category) or multilabel (a document belongs to one or more categories)?

Accuracy is a poor choice for judging classifier performance. You should rather use precision vs. recall, the precision-recall breakeven point (PRBP), F1, or AUC, and look at the precision vs. recall curve, where recall (x) is plotted against precision (y) based on the value of your confidence threshold (whether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs. all other training examples which don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single per-category measures into a global performance measure, you'll have to micro-average (sum up all true positives, false positives, false negatives, and true negatives and calculate combined scores) or macro-average (calculate the score per category and then average those scores over all categories).
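For illustration, sklearn.metrics can compute both averaging schemes directly; the labels and scores below are toy values:

```python
from sklearn.metrics import f1_score, precision_recall_curve

y_true = ["pets", "pets", "finance", "sports", "sports"]
y_pred = ["pets", "finance", "finance", "sports", "pets"]

print(f1_score(y_true, y_pred, average="micro"))  # pools TP/FP/FN over all categories
print(f1_score(y_true, y_pred, average="macro"))  # averages the per-category F1 scores

# For one binary (one-vs-rest) classifier, precision_recall_curve takes the
# true binary labels and the classifier's confidence scores:
y_true_bin = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
precision, recall, thresholds = precision_recall_curve(y_true_bin, scores)
```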

We have a corpus of tens of millions of documents, millions of training examples, and thousands of categories (multilabel). Since we face serious training time problems (the number of documents that are new, updated, or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems, using one of the Python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.
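As a sketch of the scikit-learn wrapper route: LinearSVC is backed by liblinear, and OneVsRestClassifier gives the one-binary-classifier-per-category setup mentioned above (toy multilabel data shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = ["the cat sat on the mat", "stocks fell as dogs barked", "markets rallied today"]
labels = [["pets"], ["pets", "finance"], ["finance"]]   # toy multilabel tags

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # one binary column per category

X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)   # one liblinear-backed SVM per category
predicted = mlb.inverse_transform(clf.predict(X))  # back to lists of category names
```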
