NLTK/NLP: building a many-to-many / multi-label subject classifier

Posted on 2024-12-09 18:24:33

I have a human-tagged corpus of over 5,000 subject-indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes, ranging from short articles to manuscripts. They have all been subject-indexed down to the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I have only just begun: so far I have read only the freely available NLTK book and streamhacker, and skimmed Jacob's(?) NLTK cookbook. I would like to experiment with some ideas.

It was suggested to me that I could take bigrams and use naive Bayes classification to tag new documents. I feel this is the wrong approach. Naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag, and there are nearly 1,000 of them. I have the memory and processing power to undertake such a task, but I am skeptical of the results. However, I will try this approach first, to appease someone's request. I should have it done in the next day or two, but I predict the accuracy will be low.
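For reference, the bigram features mentioned here could be extracted along the following lines. This is only a minimal sketch: the whitespace tokenizer and the function name `bigram_features` are assumptions for illustration, and NLTK's own `nltk.bigrams` helper produces the same adjacent pairs from a token list.

```python
from collections import Counter


def bigram_features(text):
    """Return a bag of word bigrams (adjacent token pairs) with their counts."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokens = text.lower().split()
    # zip(tokens, tokens[1:]) pairs every token with its successor.
    return Counter(zip(tokens, tokens[1:]))


features = bigram_features("the naive Bayes classifier and the naive baseline")
print(features.most_common(1))
```

Each document becomes a sparse count dictionary keyed by word pair, which can be fed to a classifier in the same way as single-word counts.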

So my question is a bit open-ended. Largely because of the nature of the discipline and the general unfamiliarity with my data, it will likely be hard to give an exact answer.

  1. What sort of classifier would be appropriate for this task? Was I wrong: can naive Bayes be used for more than a true/false sort of operation?

  2. What feature extraction should I pursue for such a task? I am not expecting much from the bigrams.

Each document also contains some citation information, including author(s), author gender (m, f, mix (m & f), or other, e.g. government institutions), document type, publication date (16th century to the present), human analyst, and a few other general elements. I would also appreciate suggestions for useful descriptive tasks to help investigate this data for gender bias, analyst bias, etc., but I realize that is a bit beyond the scope of this question.

Comments (2)

怼怹恏 2024-12-16 18:24:33

What sort of classifier would be appropriate for this task? Was I wrong: can naive Bayes be used for more than a true/false sort of operation?

You can easily build a multilabel classifier by building a separate binary classifier for each class, one that distinguishes between that class and all others. The combined classifier's output is the set of classes whose binary classifier yields a positive value. You can use naive Bayes for this, or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only the ranking they produce is what makes them valuable.)
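The one-classifier-per-class scheme described above can be sketched in plain Python as follows. This is an illustrative sketch, not a tuned implementation: the class `BinaryNB`, the helpers `train_one_vs_rest` and `predict_labels`, and the toy documents and labels are all made up for the example, and the model is a simple multinomial naive Bayes with add-one smoothing.

```python
import math
from collections import Counter


class BinaryNB:
    """Multinomial naive Bayes for a single label: 'has label' vs 'does not'."""

    def fit(self, docs, targets):
        # docs: list of token lists; targets: list of bools, one per document.
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter(targets)
        for tokens, y in zip(docs, targets):
            self.counts[y].update(tokens)
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_score(self, tokens, y):
        # log P(y) + sum of log P(token | y), both with add-one smoothing.
        total = sum(self.counts[y].values()) + len(self.vocab)
        score = math.log((self.n_docs[y] + 1) / (sum(self.n_docs.values()) + 2))
        for t in tokens:
            if t in self.vocab:  # tokens never seen in training are ignored
                score += math.log((self.counts[y][t] + 1) / total)
        return score

    def predict(self, tokens):
        return self.log_score(tokens, True) > self.log_score(tokens, False)


def train_one_vs_rest(docs, label_sets):
    """Train one binary model per label found in the training data."""
    labels = sorted(set().union(*label_sets))
    return {lab: BinaryNB().fit(docs, [lab in s for s in label_sets])
            for lab in labels}


def predict_labels(models, tokens):
    """Return every label whose binary model votes 'yes'."""
    return {lab for lab, model in models.items() if model.predict(tokens)}


# Hypothetical toy corpus: each document carries a set of subject tags.
docs = [
    "trade ships port cargo".split(),
    "war army siege".split(),
    "war ships navy fleet".split(),
    "cargo trade market price".split(),
]
label_sets = [{"commerce", "maritime"}, {"military"},
              {"military", "maritime"}, {"commerce"}]

models = train_one_vs_rest(docs, label_sets)
print(predict_labels(models, "navy fleet ships".split()))
```

With roughly 1,000 tags this means roughly 1,000 small models, but each one is independent, so training and prediction parallelize trivially. Libraries such as scikit-learn package the same idea as `OneVsRestClassifier`.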

What feature extraction should I pursue for such a task?

For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
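As a rough illustration of what a tf-idf vector is, here is a minimal sketch using one common variant (term frequency normalized by document length, times the log of inverse document frequency). The function name `tfidf_vectors` and the toy documents are assumptions for the example; library implementations such as scikit-learn's `TfidfVectorizer` apply somewhat different smoothing and normalization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


# Hypothetical toy documents.
docs = [["war", "ships", "war"], ["trade", "ships"], ["trade", "market"]]
vecs = tfidf_vectors(docs)
print(vecs[0])
```

Terms that are frequent in one document but rare across the corpus get the largest weights, while a term that occurs in every document gets weight zero under this variant.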

七色彩虹 2024-12-16 18:24:33

I understand that you have two tasks to solve here. The first is that you want to tag an article based on its topic(?). Since an article can belong to more than one category/class, this is a multi-label classification problem. Several algorithms have been proposed for multi-label classification; please check the literature. I found this paper quite helpful when I was dealing with a similar problem: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9401

The second problem you want to solve is to tag the paper with its authors, gender, and document type. This is a multi-class problem: each of these attributes has more than two potential values, but every document has some value for each of them.

I think as a first step it is important to understand the differences between multi-class and multi-label classification.
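One way to see the difference is in the shape of the target each document gets. The sketch below is only illustrative: the helper names `one_hot` and `multi_hot` and the topic list are made up, while the gender values follow the ones listed in the question.

```python
def one_hot(value, classes):
    """Multi-class target: exactly one class applies to the document."""
    return [1 if c == value else 0 for c in classes]


def multi_hot(labels, classes):
    """Multi-label target: any subset of the classes may apply."""
    return [1 if c in labels else 0 for c in classes]


genders = ["f", "m", "mix", "other"]            # values from the question
topics = ["commerce", "maritime", "military"]   # hypothetical subject tags

gender_target = one_hot("mix", genders)
topic_target = multi_hot({"maritime", "military"}, topics)

print(gender_target)  # a single 1: the classes are mutually exclusive
print(topic_target)   # possibly several 1s: the labels can co-occur
```

A multi-class target always sums to one, so a single classifier that picks the best class is enough. A multi-label target can contain any number of ones, which is why it needs per-label decisions or a dedicated multi-label method.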
