Extracting keywords using the nltk library
I am working on an application that requires me to extract keywords (and finally generate a tag cloud of these words) from a stream of conversations. I am considering the following steps:
- Tokenize each raw conversation (output stored as a list of lists of strings)
- Remove stop words
- Use stemmer (Porter stemming algorithm)
Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which tools from nltk might be used for this?
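The three preprocessing steps above can be sketched as follows. This is a minimal sketch, assuming nltk is installed; the sample sentences and the small inline stop-word set are illustrative only (nltk.corpus.stopwords provides a full list, but it requires a one-time nltk.download("stopwords")):

```python
from nltk.stem import PorterStemmer

# Illustrative input: a stream of raw conversations
conversations = [
    "We are tagging the conversations",
    "Conversations generate tagged keywords",
]

stop_words = {"we", "are", "the"}  # illustrative subset of a stop-word list
stemmer = PorterStemmer()

# Step 1: tokenize each raw conversation -> list of lists of strings
tokenized = [conv.lower().split() for conv in conversations]

# Steps 2 and 3: remove stop words, then apply the Porter stemmer
processed = [
    [stemmer.stem(tok) for tok in conv if tok not in stop_words]
    for conv in tokenized
]
print(processed)
```

For real text, nltk.tokenize.word_tokenize is a better tokenizer than str.split, since it also separates punctuation from words.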
Thanks
Nihit
1 Answer
I guess it depends on your definition of "important".
If you are talking about frequency, then you can just build a dictionary using words (or stems) as keys, and then counts as values. Afterwards, you can sort the keys in the dictionary based on their count.
Something like (not tested):
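A minimal sketch of that dictionary-and-sort approach (the token list here is illustrative; in practice it would be the flattened output of the tokenize/stop-word/stem steps):

```python
# Illustrative stemmed tokens after stop-word removal
tokens = ["tag", "cloud", "tag", "word", "tag", "word"]

# Build a dictionary: word -> count
counts = {}
for word in tokens:
    counts[word] = counts.get(word, 0) + 1

# Sort the keys by their counts, most frequent first
ranked = sorted(counts, key=counts.get, reverse=True)
print(ranked)  # ['tag', 'word', 'cloud']
```

The standard library also offers a shortcut: collections.Counter(tokens).most_common() returns (word, count) pairs already sorted by frequency.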