Automatically linking categories to each other when categorizing text

Posted on 2024-11-30 02:39:29

I've been working on a project that data-mines a large number of short texts and categorizes them against a large, pre-existing list of category names. To do this, I first had to build a good text corpus from the data so that I'd have reference documents for the categorization, and then get the quality of the categorization up to an acceptable level. That part is done (luckily, text categorization is something a lot of people have researched thoroughly).

My next problem is figuring out a good way to link the various categories to each other computationally, that is, to recognize that "cars" and "chevrolet" are related in some way. So far I've tried the N-gram categorization method described by, among others, Cavnar and Trenkle to compare the reference documents I've created for each category. Unfortunately, the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and that is for the best relations; overall it's around 30-35%, which is miserably low.
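For reference, the rank-order comparison in the Cavnar and Trenkle style can be sketched roughly as below. This is only an illustration, not my actual code: the function names, the `_` padding character, and the `n_max` and `top_k` defaults are assumptions chosen for the example.

```python
from collections import Counter

def ngram_profile(text, n_max=5, top_k=300):
    """Rank the most frequent character n-grams (lengths 1..n_max) of a text."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # assumed padding to mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(profile_a, profile_b):
    """Sum of rank differences; n-grams missing from profile_b get a maximum penalty."""
    max_penalty = len(profile_b)
    return sum(
        abs(rank - profile_b[gram]) if gram in profile_b else max_penalty
        for gram, rank in profile_a.items()
    )
```

A lower `out_of_place` value means the two profiles are more alike. Note that the measure is not symmetric when the two profiles differ in size.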

I've tried a couple of other approaches as well, but I haven't been able to get much higher than 40% relevant links. An example of a non-relevant relation would be the category "trucks" being strongly related to the category "makeup" or the category "diapers" while being only weakly (or not at all) related to "chevy".

I've looked for better methods for doing this, but I can't seem to find any (yet I know others have done better than I have). Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either don't produce enough relations or contain far too high a percentage of junk relations.

Comments (1)

吾家有女初长成 2024-12-07 02:39:29

Obviously, the best way of doing that matching is highly dependent on your taxonomy, the nature of your "reference documents", and the expected relationships you'd like created.

However, based on the information provided, I'd suggest the following:

  1. Start by building a word-based (rather than letter-based) unigram or bigram model for each of your categories, based on the reference documents. If there are only a few of these for each category (it seems you might have only one), you could use a semi-supervised approach and also throw in the automatically categorized documents for each category. A relatively simple tool for building the model might be the CMU SLM toolkit.
  2. Calculate the mutual information (infogain) of each term or phrase in your model with respect to the other categories. If your categories are similar, you might need to use only neighboring categories to get meaningful results. This step gives the best separating terms higher scores.
  3. Correlate the categories with each other based on the top-infogain terms or phrases. This could be done either by using Euclidean or cosine distance between the category models, or by using somewhat more elaborate techniques, like graph-based algorithms or hierarchical clustering.
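
A minimal sketch of the three steps above, using only the Python standard library. Several things here are assumptions made for illustration: the function names, the toy category texts, the `top_k` cutoff, and in particular the scoring, which uses positive pointwise mutual information (PPMI) between a term and a category as a simple stand-in for a full mutual-information or infogain computation.

```python
import math
from collections import Counter

def term_counts(text):
    """Step 1: count word unigrams and bigrams in one category's reference text."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(words + bigrams)

def ppmi_vectors(category_texts, top_k=50):
    """Step 2: weight terms by positive PMI with their category, keep top_k each."""
    counts = {cat: term_counts(text) for cat, text in category_texts.items()}
    total = sum(sum(c.values()) for c in counts.values())
    term_totals = Counter()
    for c in counts.values():
        term_totals.update(c)
    vectors = {}
    for cat, c in counts.items():
        cat_total = sum(c.values())
        weights = {}
        for term, n in c.items():
            # PMI = log( P(term | category) / P(term) )
            pmi = math.log((n / cat_total) / (term_totals[term] / total))
            if pmi > 0:
                weights[term] = pmi
        top = sorted(weights, key=lambda t: (-weights[t], t))[:top_k]
        vectors[cat] = {t: weights[t] for t in top}
    return vectors

def cosine(u, v):
    """Step 3: cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy reference texts, one per category.
texts = {
    "cars": "engine wheels sedan engine dealer",
    "chevrolet": "engine dealer pickup wheels chevy",
    "makeup": "lipstick mascara foundation lipstick blush",
}
vecs = ppmi_vectors(texts)
print(cosine(vecs["cars"], vecs["chevrolet"]), cosine(vecs["cars"], vecs["makeup"]))
```

On this toy data "cars" and "chevrolet" share weighted terms and get a non-zero similarity, while "cars" and "makeup" share none. On real data, terms that occur in nearly every category score close to zero and drop out, which is the point of step 2; restricting the comparison to neighboring categories, as suggested above, would change which terms survive.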