
Extracting keywords with the nltk library

Posted on 2024-11-14 13:47:57

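I am developing an application that requires me to extract keywords from a stream of conversations (and eventually generate a tag cloud from those words). I am considering the following steps:

Tokenize each raw conversation (output stored as a list of lists of strings)
Remove stop words
Apply a stemmer (the Porter stemming algorithm)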
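
The three steps listed above could be sketched roughly as follows. This is only a minimal illustration, not the asker's actual code: the function name `preprocess`, the sample conversations, and the tiny hard-coded `STOP_WORDS` set are assumptions made for the example. It uses `wordpunct_tokenize` and `PorterStemmer` from nltk, neither of which needs a corpus download.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical, deliberately tiny stop word set, for illustration only.
# nltk.corpus.stopwords.words("english") gives a fuller list once
# nltk.download("stopwords") has been run.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "we"}


def preprocess(conversations):
    """Tokenize, drop stop words, and stem each raw conversation.

    Returns a list of lists of strings, one inner list per conversation.
    """
    stemmer = PorterStemmer()
    stemmed_sentences = []
    for text in conversations:
        # Step 1: split the lowercased text into word and punctuation tokens.
        tokens = wordpunct_tokenize(text.lower())
        # Step 2: keep alphabetic tokens that are not stop words.
        words = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
        # Step 3: reduce each remaining word to its Porter stem.
        stemmed_sentences.append([stemmer.stem(w) for w in words])
    return stemmed_sentences


# Made-up sample input.
conversations = ["The cats are running.", "We extracted keywords."]
stemmed_sentences = preprocess(conversations)
print(stemmed_sentences)
```

The result keeps the list-of-lists shape described in the first step, so it can be passed straight to whatever ranking step comes next.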

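So far, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which nltk tools I could use for this?

Thanks, Nihit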


差↓一点笑了 2024-11-21 13:47:57

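I guess it depends on your definition of "important".
If you mean frequency, you can build a dictionary with the words (or stems) as keys and their counts as values. Afterwards, you can sort the keys by their counts.

Something like this (not tested):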

from collections import defaultdict

# Collect word statistics.
# stemmed_sentences is the list of lists of stems from the preprocessing steps.
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

# Drop all stems with a count below 3: they are unlikely to be relevant,
# and sorting fewer pairs is faster.
pairs = [(stem, count) for stem, count in counts.items() if count >= 3]

# Sort (stem, count) pairs by count, most frequent first.
sorted_stems = sorted(pairs, key=lambda pair: pair[1], reverse=True)
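
The same frequency ranking can also be written with `collections.Counter`, whose `most_common` method returns (stem, count) pairs already sorted from most to least frequent. The sketch below is one possible variant, not part of the original answer: the helper name `top_stems`, its `n` and `min_count` parameters, and the sample data are assumptions made for illustration. Since `nltk.FreqDist` is a subclass of `Counter`, it could be swapped in here if staying within nltk is preferred.

```python
from collections import Counter


def top_stems(stemmed_sentences, n=10, min_count=1):
    """Return up to n (stem, count) pairs, most frequent first.

    Stems that occur fewer than min_count times are dropped.
    """
    # Flatten the list of lists and count every stem in one pass.
    counts = Counter(stem for sent in stemmed_sentences for stem in sent)
    # most_common() sorts by descending count; filter, then truncate to n.
    ranked = [(stem, c) for stem, c in counts.most_common() if c >= min_count]
    return ranked[:n]


# Made-up sample data in the list-of-lists shape described in the question.
sample = [
    ["keyword", "extract", "tag"],
    ["tag", "cloud", "keyword"],
    ["keyword", "rank"],
]
print(top_stems(sample, n=2))  # [('keyword', 3), ('tag', 2)]
```

For a tag cloud, the returned counts could serve directly as font-size weights.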

