Automatically linking categories to each other when categorizing text
I've been working on a project to data-mine a large number of short texts and categorize them against a large pre-existing list of category names. To do this I first had to figure out how to create a good text corpus from the data, so that I would have reference documents for the categorization, and then get the quality of the categorization up to an acceptable level. I am finished with this part (luckily, categorizing text is something a lot of people have done a lot of research into).
My next problem is finding a good way to link the various categories to each other computationally, that is, to recognize that "cars" and "chevrolet" are related in some way. So far I've tried the N-gram categorization method described by, among others, Cavnar and Trenkle to compare the reference documents I've created for each category. Unfortunately, the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and that is only for the best relations; overall it's around 30-35%, which is miserably low.
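For readers unfamiliar with it, the kind of N-gram profile comparison referred to here can be sketched roughly as follows. This is a simplified illustration of the rank-based "out-of-place" measure from Cavnar and Trenkle, not the code actually used in this project: the function names, the `top_k` cutoff, the single-underscore padding, and the toy reference documents are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, n_min=1, n_max=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        # Pad each token so word boundaries show up in the n-grams
        # (a simplification of the padding used in the original paper).
        padded = f"_{token}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(profile_a, profile_b):
    """Sum of rank differences between two profiles (lower means more similar).

    An n-gram from profile_a that is missing from profile_b is charged a
    fixed maximum penalty equal to the length of profile_b.
    """
    rank_b = {gram: rank for rank, gram in enumerate(profile_b)}
    max_penalty = len(profile_b)
    return sum(
        abs(rank - rank_b[gram]) if gram in rank_b else max_penalty
        for rank, gram in enumerate(profile_a)
    )


# Hypothetical reference documents, one per category.
reference_docs = {
    "cars": "sedan engine chevrolet ford dealership horsepower",
    "chevrolet": "chevrolet silverado engine dealership truck",
    "makeup": "lipstick foundation mascara blush eyeliner",
}
profiles = {name: ngram_profile(doc) for name, doc in reference_docs.items()}
for other in ("chevrolet", "makeup"):
    print("cars vs", other, out_of_place(profiles["cars"], profiles[other]))
```

Note that the measure is not symmetric, and that it compares surface character statistics rather than meaning, so two categories can score as close simply because their reference documents share common spelling patterns.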
I've tried a couple of other approaches as well, but I've been unable to get much higher than 40% relevant links. An example of a non-relevant relation would be the category "trucks" being strongly related to the category "makeup" or the category "diapers" while being only weakly (or not at all) related to "chevy".
I've looked for better methods for doing this, but I can't seem to find any, even though I know others have done better than I have. Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either don't produce enough relations or contain far too high a percentage of junk relations.
Comments (1)
Obviously, the best way of doing that matching is highly dependent on your taxonomy, the nature of your "reference documents", and the expected relationships you'd like created.
However, based on the information provided, I'd suggest the following: