Detecting proper nouns with WordNet?

Posted 2024-08-15 23:25:35, 312 words, 10 views, 0 comments

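I'm using JAWS to access WordNet. Given a word, is there any way to detect whether it is a proper noun? The lexical categories of the synsets look fairly coarse.

To clarify, the words have no context; they are presented on their own. If a word can be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name, it can also refer to a point. "Africa", however, is not.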

时光暖心i 2024-08-22 23:25:35

Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page on the topic links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.

Updated:

Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double check WordNet to see if the word can be used as a noun. Perhaps something like this:

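import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;

String word = "foo"; // the word to test
boolean isProperNoun = false;
// Capitalization is the primary indicator; the isEmpty check avoids charAt(0) failing on "".
if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
    WordNetDatabase database = WordNetDatabase.getFileInstance();
    // Double check that WordNet lists the word as a noun at all.
    Synset[] synsets = database.getSynsets(word, SynsetType.NOUN);
    isProperNoun = synsets.length > 0;
}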

That would eliminate false positives like this:

If you build it...
As you wish...
Oh Romeo, Romeo...

And still catch just the capitalized nouns in

In the Book of Mark it says...
Have you heard The Roots or The Who recently?

but still give you false positives on

Mark the first instance...
Book 'em, Danno.

because those words could be proper nouns, but without context you cannot know.

If you wanted to get really tricky, you could walk up the hypernym tree of any noun to see whether you reach something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives. It would also do nothing for the false positives I mentioned above, because those are completely context dependent.
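The hypernym walk can be sketched without any WordNet dependency. The example below is only an illustration under stated assumptions: the class `HypernymWalk`, the method `reachesMarker`, and the tiny child-to-parent map are all hypothetical, and the map is toy data, not taken from WordNet. Real synsets can have several hypernyms, so an actual implementation would need to follow every branch rather than a single parent link.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class HypernymWalk {
    /**
     * Follows child -> parent links from start and reports whether the start
     * node or any of its ancestors is one of the marker categories.
     */
    static boolean reachesMarker(String start, Map<String, String> parentOf, Set<String> markers) {
        Set<String> seen = new HashSet<>();
        String current = start;
        // The seen set stops the walk if the data happens to contain a cycle.
        while (current != null && seen.add(current)) {
            if (markers.contains(current)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy hierarchy (assumption): invented for illustration only.
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Kenya", "country");
        parentOf.put("country", "region");
        parentOf.put("mark", "evaluation");

        Set<String> markers = new HashSet<>();
        markers.add("company");
        markers.add("country");

        System.out.println(reachesMarker("Kenya", parentOf, markers));
        System.out.println(reachesMarker("mark", parentOf, markers));
    }
}
```

As the answer warns, the quality of such a check depends entirely on how consistent the underlying hypernym links are.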


橙味迷妹 2024-08-22 23:25:35

If you use WordNet from the Linux command line, you can use 'wn -synsn' to get all the synsets of a word. The proper nouns will be capitalized. E.g.,

$: wn mark -synsn

   Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun mark
   15 senses of mark                                                       

   Sense 1
   mark, grade, score
         => evaluation, valuation, rating
   .
   .
   .
   Sense 8
   Mark, Saint Mark, St. Mark
         INSTANCE OF=> Apostle, Apostelic Father
         INSTANCE OF=> Evangelist
         INSTANCE OF=> saint
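Applied to the original question, this output can be scanned for a sense that lists the word in lowercase. The sketch below assumes the text layout shown above (a "Sense N" header followed by a comma-separated line of synset members); the class `WnSenseScan` and the method `hasCommonNounSense` are hypothetical names, and other `wn` options or versions may format the output differently.

```java
import java.util.Locale;

final class WnSenseScan {
    /**
     * Returns true when some sense in the wn output lists the word itself
     * in lowercase, i.e. the word has at least one common-noun use.
     */
    static boolean hasCommonNounSense(String wnOutput, String word) {
        String[] lines = wnOutput.split("\\R");
        for (int i = 0; i + 1 < lines.length; i++) {
            // A "Sense N" header is followed by the members of that synset.
            if (!lines[i].trim().matches("Sense \\d+")) {
                continue;
            }
            for (String member : lines[i + 1].split(",")) {
                String lemma = member.trim();
                boolean sameWord = lemma.equalsIgnoreCase(word);
                boolean lowercase = lemma.equals(lemma.toLowerCase(Locale.ROOT));
                if (sameWord && lowercase) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Excerpt of the output shown above, pasted in as a string.
        String sample = String.join("\n",
            "Sense 1",
            "mark, grade, score",
            "      => evaluation, valuation, rating",
            "Sense 8",
            "Mark, Saint Mark, St. Mark",
            "      INSTANCE OF=> saint");
        System.out.println(hasCommonNounSense(sample, "mark"));
    }
}
```

A word for which this returns false but which still has senses would be one that WordNet lists only in capitalized form.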

But, seriously, please don't rely only on WordNet for this. There are potentially gazillions of proper nouns for which WordNet will not return any information. Try the name Henrik, for example!

You can, however, build a context for your word w from datasets like the Google n-gram corpus, and use such contexts to build a classifier that returns a confidence score (i.e., the classifier can say w is a proper noun with 0 <= c <= 1 confidence.)
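As a much simpler stand-in for a trained classifier, the confidence c could start out as the share of a word's corpus occurrences that are capitalized. This is only a sketch: the class `CapitalizationConfidence`, the method `confidence`, and the counts in `main` are hypothetical, and the counts are assumed to have been collected from mid-sentence positions so that sentence-initial capitals do not inflate them.

```java
final class CapitalizationConfidence {
    /**
     * Returns the fraction of occurrences that were capitalized, a value
     * between 0 and 1. Returns 0.5 when the word was never observed,
     * because the counts then carry no evidence either way.
     */
    static double confidence(long capitalizedCount, long lowercaseCount) {
        if (capitalizedCount < 0 || lowercaseCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        long total = capitalizedCount + lowercaseCount;
        if (total == 0) {
            return 0.5;
        }
        return (double) capitalizedCount / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts, not real n-gram data.
        long capitalized = 90;
        long lowercase = 10;
        System.out.println(confidence(capitalized, lowercase));
    }
}
```

A real classifier would use the surrounding words as features as well; the ratio above only captures the capitalization signal.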
