
Named entity recognition with NLTK: relevance of extracted keywords

Posted 2024-11-01 17:37:13 · 77 characters · 9 views · 0 comments

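I am exploring NLTK's named entity recognition feature. Is it possible to find out which of the extracted keywords are most relevant to the original text? Also, is it possible to know the type (person/organization) of each extracted keyword?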


抚笙 2024-11-08 17:37:13

If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.

The tagged text should be a list of (word, tag) tuples:

sentence = 'The U.N.'
tagged_sentence = [('The','DT'), ('U.N.', 'NNP')]

Then call the NE classifier like this:

import nltk
nltk.ne_chunk(tagged_sentence)

It returns a Tree. Classified words appear as subtrees inside the main structure.
Each subtree's label tells you the entity type, such as PERSON, ORGANIZATION or GPE.
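To pull the entities and their types out of that Tree, something like the following may help. It is only a sketch: `extract_entities` is a name I made up, and it assumes the usual ne_chunk layout where plain tokens are (word, tag) tuples and each entity is a flat subtree with a `label()`.

```python
def extract_entities(tree):
    """Return (entity_text, entity_type) pairs from an ne_chunk-style tree.

    Hypothetical helper: assumes entity subtrees expose label() and leaves(),
    as nltk.Tree does, while ordinary tokens are plain (word, tag) tuples.
    """
    entities = []
    for node in tree:
        # Tuples have no label attribute, so only entity subtrees pass this check.
        if hasattr(node, "label"):
            words = [word for word, _tag in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities
```

You would call it as `extract_entities(nltk.ne_chunk(tagged_sentence))`; which entities come back depends on the chunker model you have installed.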

To find the most relevant terms, you first have to define a measure of "relevance". tf-idf is the usual choice, but if you are considering only one document, raw frequency may be enough.
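If you do have several documents, here is a rough idea of what a tf-idf score looks like. This is a minimal sketch of one common variant (term frequency times log(N / document frequency)), written with the standard library only; the function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each term of each tokenized document with tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample data: three tiny, already tokenized documents.
documents = [
    ["U.N.", "met", "in", "Geneva"],
    ["U.N.", "voted", "in", "favor"],
    ["Geneva", "hosted", "the", "U.N."],
]
scores = tf_idf(documents)
# Terms of the first document, highest score first.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)
print(top_terms)
```

A term that occurs in every document (here "U.N.") scores zero, which is the point of the idf factor: it pushes down words that do not distinguish one document from another.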

Computing the frequency of each word within a document is easy with NLTK. First load your corpus; once you have loaded it and have a Text object, simply call:

# In NLTK 3, FreqDist is a Counter, so keys() is not sorted by frequency; most_common() is
relevant_terms_sorted_by_freq = [word for word, _ in nltk.probability.FreqDist(corpus).most_common()]

Finally, you could filter out every word in relevant_terms_sorted_by_freq that does not appear in your list of NE words.
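The filtering step could look roughly like this. It is a sketch under assumptions: `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it in NLTK 3), and `rank_entities_by_frequency`, `tokens` and `entity_words` are names and sample data I invented.

```python
from collections import Counter


def rank_entities_by_frequency(tokens, entity_words):
    """Return (word, count) pairs for entity words only, most frequent first."""
    counts = Counter(tokens)      # nltk.FreqDist(tokens) behaves the same way here
    wanted = set(entity_words)    # set membership keeps the filter fast
    return [(word, n) for word, n in counts.most_common() if word in wanted]


# Made-up sample data: a tokenized text and the NE words found in it.
tokens = ["The", "U.N.", "met", "in", "Geneva", ".", "The", "U.N.", "voted", "."]
entity_words = ["U.N.", "Geneva"]
ranked = rank_entities_by_frequency(tokens, entity_words)
print(ranked)  # [('U.N.', 2), ('Geneva', 1)]
```

Note that this counts single tokens, so a multi-word entity would need to be counted as a phrase instead.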

NLTK offers an online version of a complete book, which I find a good place to start.

About the author

热血少△年
