Detecting proper nouns using WordNet?
I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.
To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name it could also refer to a point. However, "Africa" is not.
Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.
Updated:
Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double-check WordNet to see if the word can be used as a noun. That would eliminate false positives on capitalized words that are not nouns and still catch the capitalized nouns, but it would still give you false positives on capitalized words that could be either proper or common nouns, because without context you don't know.
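The capitalization-plus-noun check described above might be sketched as follows. This is only an illustration, not the answerer's code: `CapitalizedNounCheck`, `isLikelyProperNoun` and `demoNounLookup` are made-up names, and the noun lookup is a small in-memory stand-in. With JAWS, the lookup could presumably be backed by `WordNetDatabase.getSynsets(word, SynsetType.NOUN)` returning a non-empty array.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Sketch: capitalization as the primary signal, a noun lookup as the double check. */
class CapitalizedNounCheck {

    /**
     * @param word   the isolated word to test
     * @param isNoun hypothetical lookup answering "is this lowercase word listed as a noun?"
     */
    static boolean isLikelyProperNoun(String word, Predicate<String> isNoun) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        boolean capitalized = Character.isUpperCase(word.charAt(0));
        return capitalized && isNoun.test(word.toLowerCase(Locale.ROOT));
    }

    /** Tiny in-memory stand-in for a WordNet noun index (illustrative data only). */
    static Predicate<String> demoNounLookup() {
        Set<String> nouns = new HashSet<>(Arrays.asList("africa", "mark", "dog"));
        return nouns::contains;
    }

    public static void main(String[] args) {
        Predicate<String> isNoun = demoNounLookup();
        System.out.println(isLikelyProperNoun("Africa", isNoun));  // true
        System.out.println(isLikelyProperNoun("Quickly", isNoun)); // false: not a noun
        System.out.println(isLikelyProperNoun("dog", isNoun));     // false: not capitalized
        System.out.println(isLikelyProperNoun("Mark", isNoun));    // true, although "mark" is also a common noun
    }
}
```

The last call shows the remaining weakness: a capitalized word that is also a common noun is still flagged, because capitalization alone cannot settle which reading is meant.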
If you wanted to get really tricky, you could follow up the hypernym tree on any noun to see if you reached something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives (and without improving the false positives I mentioned above because those are completely context dependent).
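A rough sketch of that hypernym walk, under stated assumptions: the graph below is a hand-written toy map rather than real WordNet data, and `HypernymWalk`, `reachesMarker` and `demoHypernyms` are hypothetical names. With JAWS, the edges would presumably come from `NounSynset.getHypernyms()` (and `getInstanceHypernyms()` for named instances).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: climb the hypernym links of a noun and look for an "obvious" ancestor. */
class HypernymWalk {

    /** Ancestors that suggest a named entity, taken from the examples in the text. */
    static final Set<String> MARKERS = new HashSet<>(Arrays.asList("country", "company"));

    /** Returns true if any ancestor of {@code start} (not the word itself) is a marker. */
    static boolean reachesMarker(String start, Map<String, List<String>> hypernyms) {
        Deque<String> toVisit = new ArrayDeque<>(hypernyms.getOrDefault(start, Collections.emptyList()));
        Set<String> seen = new HashSet<>();
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            if (!seen.add(current)) {
                continue; // already visited, guards against cycles
            }
            if (MARKERS.contains(current)) {
                return true;
            }
            for (String parent : hypernyms.getOrDefault(current, Collections.emptyList())) {
                toVisit.push(parent);
            }
        }
        return false;
    }

    /** Toy hypernym graph (illustrative, not copied from WordNet). */
    static Map<String, List<String>> demoHypernyms() {
        Map<String, List<String>> h = new HashMap<>();
        h.put("france", Arrays.asList("country"));
        h.put("country", Arrays.asList("administrative district"));
        h.put("fly", Arrays.asList("insect"));
        h.put("insect", Arrays.asList("animal"));
        return h;
    }

    public static void main(String[] args) {
        Map<String, List<String>> h = demoHypernyms();
        System.out.println(reachesMarker("france", h));  // true
        System.out.println(reachesMarker("fly", h));     // false
        System.out.println(reachesMarker("country", h)); // false: the marker itself is a common noun
    }
}
```

The walk starts from the word's parents rather than the word itself, so that a common noun such as "country" is not flagged just because it is one of the markers.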
If you use WordNet from the Linux command line, you can use 'wn <word> -synsn' to get all the noun synsets of a word. The proper nouns will be capitalized in the output.
But, seriously, please don't rely only on WordNet for this. There are potentially gazillions of proper nouns for which WordNet will not fetch you any information. Try the name Henrik, for example!
You can, however, build a context for your word w from datasets like the Google n-gram corpus, and use such contexts to build a classifier that returns a confidence score (i.e., the classifier can say w is a proper noun with 0 <= c <= 1 confidence.)
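As a very reduced stand-in for such a classifier, one could score a word by how often it appears capitalized in a corpus. The sketch below assumes you have already extracted two counts for the word (ideally excluding sentence-initial occurrences); `CaseRatioScore` and `properNounConfidence` are hypothetical names, and a simple ratio is used here in place of a trained model.

```java
/** Sketch: turn corpus case counts for a word into a confidence score in [0, 1]. */
class CaseRatioScore {

    /**
     * @param capitalizedCount occurrences of the word written with an initial capital
     * @param lowercaseCount   occurrences of the word written in lowercase
     * @return share of capitalized occurrences, or 0.5 when there is no evidence at all
     */
    static double properNounConfidence(long capitalizedCount, long lowercaseCount) {
        if (capitalizedCount < 0 || lowercaseCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        long total = capitalizedCount + lowercaseCount;
        if (total == 0) {
            return 0.5; // unseen word: no evidence either way
        }
        return (double) capitalizedCount / total;
    }

    public static void main(String[] args) {
        // The counts are invented for illustration, not taken from any corpus.
        System.out.println(properNounConfidence(90, 10)); // mostly capitalized
        System.out.println(properNounConfidence(5, 95));  // mostly lowercase
        System.out.println(properNounConfidence(0, 0));   // never seen
    }
}
```

A real classifier would use richer context features than case alone, but the interface is the same: a word goes in and a confidence c with 0 <= c <= 1 comes out.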
Let me run this past you. You might have to do a run through some more books on English to gain insight into the fact that one cannot determine a word's part of speech out of context.
The best you could do is test for exclusion ... determining that WordNet knows of no usage in a given part of speech. In some cases you might find that only one part of speech is listed in WordNet. For example I know of no usage of "car" other than as a noun.
Distinguishing proper nouns from common ones is even more difficult. Certainly you can use the heuristic ... a noun which is not the initial word of a sentence and is capitalized but not in ALLCAPS is probably a proper noun.
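For running text, that heuristic could be sketched roughly like this. The class and method names are made up for illustration, tokenization is a naive whitespace split, and the sketch does not check that a token is actually a noun, which would need a separate lookup or tagger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: flag tokens that are capitalized, not sentence-initial, and not ALLCAPS. */
class MidSentenceCapitalHeuristic {

    static List<String> candidates(String sentence) {
        List<String> result = new ArrayList<>();
        String[] tokens = sentence.trim().split("\\s+");
        // Start at index 1: the first word of a sentence is capitalized regardless.
        for (int i = 1; i < tokens.length; i++) {
            // Strip punctuation from both ends of the token.
            String token = tokens[i].replaceAll("^\\W+|\\W+$", "");
            if (token.isEmpty()) {
                continue;
            }
            boolean capitalized = Character.isUpperCase(token.charAt(0));
            // Treats acronyms as ALLCAPS; this also skips single capitals such as the pronoun "I".
            boolean allCaps = token.equals(token.toUpperCase(Locale.ROOT));
            if (capitalized && !allCaps) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("Yesterday I flew to France with NASA engineers."));
        // prints [France]
    }
}
```

Note that this only works when the word appears inside a sentence; for isolated words there is no position to exploit, which is the point made above.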
Ultimately, the distinction is one of semantics rather than lexical analysis. I doubt you'll find a reasonably robust solution based on looking up words in WordNet. I think you'll need to do natural language grammatical parsing before you'll be able to reliably extract nouns, much less detect proper nouns in prose.
That information doesn't seem to be specially stored in WordNet. You can, however, look at the first word form of a noun synset to see if it's capitalized. I'm not sure how official that is, but it seems to work: it tells you that fly is not a proper noun and France is.
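A minimal sketch of that check, extended (as an assumption, to match the question's rule that "mark" is acceptable) so that a word is rejected only when every one of its noun synsets starts with a capitalized word form. The synset data is a hand-written stand-in and the names are hypothetical; with JAWS, the word forms would presumably come from `Synset.getWordForms()`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: decide from the first word form of each noun synset whether a word is only a proper noun. */
class SynsetCapitalizationCheck {

    /** True only if the word has noun synsets and each one lists a capitalized first word form. */
    static boolean isOnlyProperNoun(String word, Map<String, List<List<String>>> nounSynsets) {
        List<List<String>> synsets =
                nounSynsets.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
        if (synsets.isEmpty()) {
            return false; // unknown word: no evidence from the lexicon
        }
        for (List<String> wordForms : synsets) {
            if (wordForms.isEmpty() || wordForms.get(0).isEmpty()
                    || !Character.isUpperCase(wordForms.get(0).charAt(0))) {
                return false; // at least one common-noun sense exists
            }
        }
        return true;
    }

    /** Illustrative stand-in for noun synsets keyed by lowercase word. */
    static Map<String, List<List<String>>> demoNounSynsets() {
        Map<String, List<List<String>>> m = new HashMap<>();
        m.put("france", Arrays.asList(
                Arrays.asList("France", "French Republic"),
                Arrays.asList("France", "Anatole France")));
        m.put("fly", Arrays.asList(Arrays.asList("fly")));
        m.put("mark", Arrays.asList(
                Arrays.asList("mark", "grade", "score"),
                Arrays.asList("Mark", "Saint Mark")));
        return m;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> synsets = demoNounSynsets();
        System.out.println(isOnlyProperNoun("France", synsets)); // true
        System.out.println(isOnlyProperNoun("fly", synsets));    // false
        System.out.println(isOnlyProperNoun("mark", synsets));   // false: has a common-noun sense
    }
}
```

Unlike a check on the input's own capitalization, this relies on how the lexicon spells the word, so it still works when the isolated word arrives in lowercase.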