How do I get a list of the most common words in various languages?

Posted on 2024-09-17 07:39:35


Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.

How do I get such a list of the most common English words? Or most common words in other languages? Is this something I can just get off the Google website?
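
For illustration, here is a minimal sketch of the mechanism described above: load a list of the most common words, drop them from the question title, and use whatever remains as the fulltext query. The word-list file name and the cleanup rules are assumptions, not details of Stack Overflow's actual implementation.

import string

# Hypothetical file with one common word per line, e.g. a "top 10,000" list.
with open('common_words.txt') as f:
    common_words = set(line.strip().lower() for line in f)

def related_question_terms(title):
    """Remove the most common words from a title; the rest become search terms."""
    words = (w.strip(string.punctuation).lower() for w in title.split())
    return [w for w in words if w and w not in common_words]

title = "How do I get a list of the most common English words?"
print(' '.join(related_question_terms(title)))  # submitted as a fulltext search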


Comments (1)

婉儿 2024-09-24 07:39:35


A word frequency list is what you want. You can also make your own, or customize one for use within a particular domain, and it is a nice way to become familiar with some good libraries. Start with some text such as discussed in this question, then try out some variants of this back-of-the-envelope script:

from nltk.stem.porter import PorterStemmer
import os
import string
from collections import defaultdict

ps = PorterStemmer()
word_count = defaultdict(int)

source_directory = '/some/dir/full/of/text'

# Walk the directory tree and count each stemmed, lower-cased word.
for root, dirs, files in os.walk(source_directory):
    for item in files:
        current_text = os.path.join(root, item)
        with open(current_text, 'r') as f:
            words = f.read().split()
        for word in words:
            entry = ps.stem(word.strip(string.punctuation).lower())
            word_count[entry] += 1

# Pair each stem with its count; sorting puts the most frequent words last.
results = [[word_count[w], w] for w in word_count]

print(sorted(results))

Running this over a couple of downloaded books, the tail of the sorted output (i.e. the most common words) looks like:

[..., [2955, 'that'], [4201, 'in'], [4658, 'to'], [4689, 'a'], [6441, 'and'], [6705, 'of'], [14508, 'the']]

See what happens when you filter the most common x, y, or z words out of your queries, or leave them out of your text search index entirely. You might also get some interesting results if you include real-world data -- for example, "community" and "wiki" are unlikely to be common words on a generic list, but on SO that obviously wouldn't be the case, so you might want to exclude them.
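
As a rough continuation of the script above (reusing word_count, ps, and string, and assuming it has already run), one way to do that filtering is to treat the N most frequent stems as a stop list and drop them from incoming queries; N = 100 here is an arbitrary choice.

# Take the N most frequent stems as a stop list.
N = 100
by_frequency = sorted(((count, w) for w, count in word_count.items()), reverse=True)
stop_stems = {w for count, w in by_frequency[:N]}

def filtered_query(query):
    """Stem the query the same way as the index, then drop the stop stems."""
    terms = (ps.stem(t.strip(string.punctuation).lower()) for t in query.split())
    return [t for t in terms if t and t not in stop_stems]

print(filtered_query("What is the most common word in the English language?"))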
