Extracting keywords using the nltk library

Posted on 2024-11-14 13:47:57

I am working on an application that requires me to extract keywords from a stream of conversations (and eventually generate a tag cloud from those words). I am considering the following steps:

  1. Tokenize each raw conversation (output stored as a list of lists of strings)
  2. Remove stop words
  3. Apply a stemmer (Porter stemming algorithm)
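
For reference, the three steps above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: `preprocess`, `STOP_WORDS` and the sample sentences are made-up names and data, and the tiny stop list is a stand-in so that the sketch needs no corpus download (`nltk.corpus.stopwords` provides a fuller list once the "stopwords" corpus is installed). `RegexpTokenizer` is used here instead of `word_tokenize` for the same reason.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Assumption: a tiny hand-written stop list, used only to keep this sketch
# free of corpus downloads; it is not nltk's own stop word list.
STOP_WORDS = {"a", "an", "and", "are", "i", "is", "it", "of", "the", "to"}

tokenizer = RegexpTokenizer(r"[A-Za-z]+")  # keeps runs of ASCII letters only
stemmer = PorterStemmer()


def preprocess(conversations):
    """Return one list of stems per raw conversation string."""
    stemmed_sentences = []
    for text in conversations:
        # Step 1: tokenize (lowercased so stop word matching is case-insensitive)
        tokens = tokenizer.tokenize(text.lower())
        # Step 2: remove stop words
        content = [tok for tok in tokens if tok not in STOP_WORDS]
        # Step 3: apply the Porter stemmer
        stemmed_sentences.append([stemmer.stem(tok) for tok in content])
    return stemmed_sentences


# Hypothetical sample input
conversations = [
    "The cats are running to the garden.",
    "I like the garden and the cats.",
]
stemmed_sentences = preprocess(conversations)
```

The result keeps the list-of-lists shape from step 1, so it can be passed directly to whatever ranking step comes next.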

Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and pick out the most important ones. Can anyone suggest which tools from nltk might be used for this?

Thanks
Nihit

Comments (1)

差↓一点笑了 2024-11-21 13:47:57

I guess it depends on your definition of "important".
If you mean frequency, you can build a dictionary with the words (or stems) as keys and their counts as values. Afterwards, you can sort the keys in the dictionary by their count.

Something like this (not tested):

from collections import defaultdict

# Collect word statistics: stem -> number of occurrences
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

# Drop all stems with count < 3:
# they are unlikely to be relevant, and sorting will be faster
pairs = [(stem, count) for stem, count in counts.items() if count >= 3]

# Sort (stem, count) pairs by count, most frequent first
sorted_stems = sorted(pairs, key=lambda pair: pair[1], reverse=True)
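
The same frequency ranking could also be written more compactly with `collections.Counter` from the standard library. The sketch below is only an illustration: `top_stems`, `min_count` and the sample data are hypothetical names, not part of the answer above. As far as I know, `nltk.FreqDist` in NLTK 3 is a `Counter` subclass and offers the same `most_common` method, so it should work as a drop-in replacement here.

```python
from collections import Counter

# Hypothetical input: one list of stems per conversation
stemmed_sentences = [
    ["cat", "run", "garden"],
    ["cat", "garden", "cat"],
    ["garden", "run", "cat"],
]


def top_stems(sentences, min_count=3):
    """Return (stem, count) pairs with count >= min_count, most frequent first."""
    # Flatten the list of lists and count every stem in one pass
    counts = Counter(stem for sent in sentences for stem in sent)
    # most_common() with no argument returns all pairs sorted by descending count
    return [(stem, n) for stem, n in counts.most_common() if n >= min_count]


ranked = top_stems(stemmed_sentences)
```

The `min_count` threshold plays the same role as the `count >= 3` filter above; for a tag cloud, the counts in `ranked` could be mapped to font sizes.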