Thesaurus-style text lookup and parsing
We have a client who is looking for a means to import and categorize a large amount of textual data. It's been suggested that the easiest way to do this would be to look at the description field of each record and try to match the words held there, to see whether a category can be derived for that particular record.
The thinking was that the best approach would be to match the words against keywords held against each category and, if that was unsuccessful, to fall back on some kind of synonym lookup. So for example, if a particular record had the word "automobile" in it, a synonym lookup could match that word to the word "car", which would be held against the category "vehicle".
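Roughly, the idea could be sketched as below. This is only an illustration of the two-pass matching, not working code from the project: the category keywords, the `SYNONYMS` table, and the function names are all made-up placeholders, and the hand-written table stands in for whatever real synonym source ends up being used.

```python
import re

# Hypothetical keyword lists per category (placeholders, not real client data).
CATEGORY_KEYWORDS = {
    "vehicle": {"car", "truck", "bus"},
    "furniture": {"chair", "table", "desk"},
}

# Hypothetical stand-in for a real synonym source (dictionary, web service, etc.).
SYNONYMS = {
    "automobile": {"car"},
    "lorry": {"truck"},
    "stool": {"chair"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lookup_synonyms(word):
    """Return the known synonyms of a word, or an empty set if there are none."""
    return SYNONYMS.get(word, set())


def category_for_word(word):
    """Return the category whose keyword list contains the word, else None."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if word in keywords:
            return category
    return None


def categorize(description):
    """Try a direct keyword match first, then fall back to synonyms."""
    words = tokenize(description)

    # Pass 1: direct match of description words against category keywords.
    for word in words:
        category = category_for_word(word)
        if category is not None:
            return category

    # Pass 2: no direct hit, so try the synonyms of each word instead.
    for word in words:
        for synonym in lookup_synonyms(word):
            category = category_for_word(synonym)
            if category is not None:
                return category

    # Nothing matched; the record would need manual review.
    return None


print(categorize("Red automobile, low mileage"))
```

Only `lookup_synonyms` would need to change once a real synonym source is in place.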
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.
Any other suggestions for getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't find any real-world examples of it.
Comments (3)
The first thing that comes to mind is WordNet. WordNet is a human-generated database of words and related words, including synonyms. The Wikipedia entry on WordNet lists several interfaces to it. I believe some of them are web services.
You can also roll your own. Chapter 5 of Manning and Schütze (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? A tag cloud? See Clay Shirky's "Ontology is Overrated" for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
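To illustrate what I mean by classifying on sets of words, here is a minimal Naive Bayes sketch in plain Python. Treat it as a toy under stated assumptions rather than a production classifier: the training descriptions, the category names, and the `NaiveBayes` class are all invented for illustration, and a real system would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category in self.doc_counts:
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training records, invented for this example.
classifier = NaiveBayes()
classifier.train("Used car with new tyres and a diesel engine", "vehicle")
classifier.train("Delivery truck, diesel engine, low mileage", "vehicle")
classifier.train("Oak dining table with four chairs", "furniture")
classifier.train("Leather sofa and matching chair", "furniture")

print(classifier.classify("Automobile with diesel engine and low mileage"))
```

Note that "automobile" never appears in the training data, yet the description is still assigned to "vehicle" because of the other words around it. That is why the synonym lookup matters less with this kind of approach.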
You should look at using WordNet. You can visit the website at http://wordnet.princeton.edu/ for more information, and libraries for integrating it are available in many languages.
Try the online tool to see it in action: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word and then click on the "S" next to each definition, you'll get a list of words that are semantically related to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.