Named Entity Recognition with NLTK: relevance of extracted keywords
I was checking out the Named Entity Recognition feature of NLTK. Is it possible to find out which of the extracted keywords is most relevant to the original text? Also, is it possible to know the type (Person / Organization) of the extracted keywords?
Comments (1)
If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.
The tagged text should be presented as a list of (token, tag) pairs.
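As a rough illustration of that list format, here is a hand-tagged sentence. The sentence and its Penn Treebank tags are made up for this example; in practice the pairs would come from your own tagger.

```python
# Each item pairs a token with its part-of-speech tag.
# The tags below are written by hand for illustration (an assumption),
# not produced by a real tagger.
sentence = "The U.N. met in Geneva"
tagged_sentence = [
    ("The", "DT"),
    ("U.N.", "NNP"),
    ("met", "VBD"),
    ("in", "IN"),
    ("Geneva", "NNP"),
]

for token, tag in tagged_sentence:
    print(token, tag)
```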
Then pass that list to the NE classifier, `nltk.ne_chunk`.
It returns a Tree. The classified words appear as Tree nodes inside the main structure.
Each entity node is labelled with its type, for example PERSON, ORGANIZATION or GPE (geo-political entity).
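A minimal sketch of calling the classifier and reading the labels back out of the tree. `classify` and `extract_entities` are hypothetical helper names, and the example tree is built by hand in the shape `nltk.ne_chunk` returns, so the snippet runs without downloading the chunker models.

```python
import nltk
from nltk.tree import Tree


def classify(tagged_sentence):
    """Run NLTK's bundled NE chunker on a list of (token, POS tag) pairs.

    Requires the chunker models, fetched once with nltk.download();
    the resource names depend on the installed NLTK version.
    """
    return nltk.ne_chunk(tagged_sentence)


def extract_entities(tree):
    """Return (entity text, label) pairs for every labelled subtree."""
    entities = []
    for node in tree:
        # Plain (token, tag) tuples are words outside any entity;
        # entities are nested Tree objects labelled PERSON, ORGANIZATION, GPE, ...
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built tree (an assumption for illustration), not real chunker output.
example_tree = Tree("S", [
    Tree("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("visited", "VBD"),
    Tree("ORGANIZATION", [("Google", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("California", "NNP")]),
])

entities = extract_entities(example_tree)
print(entities)
```

With the models installed, `extract_entities(classify(tagged_sentence))` should give the same kind of list for real text.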
To find out the most relevant terms, you have to define a measure of "relevance". Usually tf-idf is used, but if you are considering only one document, raw frequency could be enough.
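For reference, one common unsmoothed variant of tf-idf can be sketched with the standard library alone. The `tf_idf` helper and the three toy documents are assumptions made for this example, and other weighting schemes exist.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, all_docs):
    """tf-idf of `term` in one document, relative to a list of documents."""
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency: terms found in fewer documents score higher.
    docs_with_term = sum(1 for doc in all_docs if term in doc)
    idf = math.log(len(all_docs) / docs_with_term) if docs_with_term else 0.0
    return tf * idf


# Toy corpus, invented for illustration.
docs = [
    "obama met google executives".split(),
    "google opened a new office".split(),
    "the weather was mild".split(),
]

score_obama = tf_idf("obama", docs[0], docs)    # appears in one document
score_google = tf_idf("google", docs[0], docs)  # appears in two documents
print(score_obama, score_google)
```

Both terms occur once in the first document, but "obama" scores higher because it is rarer across the collection.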
Computing the frequency of each word within a document is easy with NLTK. First load your corpus; once it is loaded and you have a Text object, build an `nltk.FreqDist` from it and take the words in descending order of frequency as `relevant_terms_sorted_by_freq`.
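A small sketch of that frequency count. The sample text is invented, and a plain whitespace split stands in for a real tokenizer or Text object so that no extra NLTK data is needed. In NLTK 3, as far as I know, `FreqDist` is a `Counter` subclass, so `most_common()` is the way to get terms ordered by count.

```python
import nltk

# Invented sample text; whitespace splitting is a simplification.
text = "Google hired Obama . Obama praised Google . Obama left"
tokens = text.split()

freq = nltk.FreqDist(tokens)

# most_common() yields (word, count) pairs, highest count first.
relevant_terms_sorted_by_freq = [word for word, _count in freq.most_common()]
print(relevant_terms_sorted_by_freq)
```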
Finally, you could filter relevant_terms_sorted_by_freq, dropping every word that is not in the list of NE words.
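The filtering step could look roughly like this. Both inputs are hard-coded stand-ins for the frequency-sorted terms and the words found by the NE classifier, and `ne_words` is a hypothetical name.

```python
# Stand-in inputs (assumptions): terms already sorted by frequency,
# and the set of words that appeared inside named entities.
relevant_terms_sorted_by_freq = ["Obama", "the", "Google", "said", "California"]
ne_words = {"Obama", "Google", "California"}

# Keep only named-entity words, preserving the frequency order.
relevant_entities = [
    term for term in relevant_terms_sorted_by_freq if term in ne_words
]
print(relevant_entities)
```

Since the frequency list is per token, a multi-word entity such as a full name would need to be split into its words, or counted as a phrase, before this comparison.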
NLTK offers an online version of a complete book, which I find a good place to start.