What software and techniques are there for extracting proper names from text?
I have a large corpus of text-based documents (100,000+) from which I want to extract proper names (e.g. a person's name).
Could anyone recommend techniques and/or software that would be useful in accomplishing this goal? I'm not particularly interested in low-level text parsing so much as in higher-level things such as recognizing and/or ranking names.
Comments (4)
Are you looking for Named Entity Recognition? Take a look at the Wikipedia article.
The Stanford NLP group has a decent ready-to-use package here, with both GPL and commercial licenses available.
Something like this cannot be done reliably without some form of Natural Language Processing. A few common issues:
Names that are also common words:
John Black
Multiple languages and various forms of the same word.
Names that refer to different things.
Lily
could be the name of a person, a place, a cat, or just the flower. NLP can use the surrounding grammatical constructs to tell some of these cases apart.
That said, a simple (and naive) technique that you could try would be to use the capitalisation of the words. If you see a capital starting letter in the middle of a sentence, it is usually a name of some sort.
You might be able to reasonably assume that any such word refers to the same thing within the same document. Two such words in a sequence are probably a name/surname combination etc.
If the capitalisation in the documents cannot be trusted, you might instead be able to trust the capitalisation in a proper word list, and use that to build a list of proper names for the applicable languages.
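The capitalisation heuristic above could be sketched roughly as follows. This is only a minimal illustration under some assumptions: the text is ASCII English, sentences end in `.`, `!` or `?`, and the function name `candidate_names` and the sample text are made up for the example. It skips the first word of each sentence, groups consecutive capitalised words into one candidate, and counts how often each candidate occurs, which gives a crude ranking.

```python
import re
from collections import Counter

def candidate_names(text):
    """Count runs of capitalised words that do not start a sentence."""
    counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Words and single punctuation marks; punctuation breaks a run.
        tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", sentence)
        run = []
        # Skip the sentence-initial token; the trailing "" flushes the last run.
        for token in tokens[1:] + [""]:
            if token[:1].isupper():
                run.append(token)
            else:
                if run:
                    counts[" ".join(run)] += 1
                run = []
    return counts

sample = "Yesterday John Black met Lily in Paris. She said John Black was late."
print(candidate_names(sample).most_common())
```

Being naive, this misses names at the start of a sentence and picks up other mid-sentence capitals such as the pronoun "I", so the output is a list of candidates rather than confirmed names.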
Probably your best bet is to compare each word against a dictionary of proper names.
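A dictionary lookup of this kind might look like the sketch below. The tiny `KNOWN_NAMES` set is a hypothetical stand-in for a real dictionary of proper names, and `names_in` is an illustrative helper, not part of any library.

```python
import re

# Hypothetical stand-in for a real dictionary of proper names (lower-cased).
KNOWN_NAMES = {"john", "lily", "black", "maria"}

def names_in(text, known_names=KNOWN_NAMES):
    """Return the words of `text` found in the name dictionary, in order."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() in known_names]

print(names_in("John Black gave Lily a black umbrella."))
# prints ['John', 'Black', 'Lily', 'black']
```

Note that the adjective "black" is matched as well, which is the "names that are also common words" problem mentioned in another comment.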
What if you made a list of all of the unique words, then removed all of the words that are in a dictionary?
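As a rough sketch of that idea, assuming a small made-up `COMMON_WORDS` set in place of a real dictionary and a hypothetical helper `unknown_words`:

```python
import re

# Hypothetical stand-in for an ordinary dictionary (lower-cased words).
COMMON_WORDS = {"a", "black", "gave", "the", "to", "umbrella", "went", "yesterday"}

def unknown_words(documents, dictionary=COMMON_WORDS):
    """Collect every unique word in the corpus, then drop dictionary words."""
    unique = set()
    for doc in documents:
        unique.update(w.lower() for w in re.findall(r"[A-Za-z]+", doc))
    return unique - set(dictionary)

docs = ["John Black gave Lily a black umbrella.", "Yesterday Lily went to Paris."]
print(sorted(unknown_words(docs)))
# prints ['john', 'lily', 'paris']
```

Whatever is left is only a list of candidates: the surname "Black" is lost because it is also a dictionary word, and in a real corpus typos and jargon would survive the filter alongside the names.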