How do I get a list of the most common words in various languages?

Posted on 2024-09-17 07:39:35


Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.

How do I get such a list of the most common English words? Or most common words in other languages? Is this something I can just get off the Google website?
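
For illustration, here is a minimal sketch of the mechanism described above: load a list of the most common words, drop them from the question title, and use whatever remains as the fulltext query. The word-list file name and the cleanup rules are assumptions, not details of Stack Overflow's actual implementation.

import string

# Hypothetical file with one common word per line, e.g. a "top 10,000" list.
with open('common_words.txt') as f:
    common_words = set(line.strip().lower() for line in f)

def related_question_terms(title):
    """Remove the most common words from a title; the rest become search terms."""
    words = (w.strip(string.punctuation).lower() for w in title.split())
    return [w for w in words if w and w not in common_words]

title = "How do I get a list of the most common English words?"
print(' '.join(related_question_terms(title)))  # submitted as a fulltext search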


Comments (1)

婉儿 2024-09-24 07:39:35


A word frequency list is what you want. You can also make your own, or customize one for use within a particular domain, and it is a nice way to become familiar with some good libraries. Start with some text such as discussed in this question, then try out some variants of this back-of-the-envelope script:

from nltk.stem.porter import PorterStemmer
import os
import string
from collections import defaultdict

ps = PorterStemmer()
word_count = defaultdict(int)

source_directory = '/some/dir/full/of/text'

# Walk the directory tree and count each stemmed, lower-cased word.
for root, dirs, files in os.walk(source_directory):
    for item in files:
        current_text = os.path.join(root, item)
        with open(current_text, 'r') as f:
            words = f.read().split()
        for word in words:
            entry = ps.stem(word.strip(string.punctuation).lower())
            word_count[entry] += 1

# Pair each stem with its count; sorting puts the most frequent words last.
results = [[word_count[w], w] for w in word_count]

print(sorted(results))

Running this over a couple of downloaded books, the tail of the sorted output (i.e. the most common words) looks like:

[..., [2955, 'that'], [4201, 'in'], [4658, 'to'], [4689, 'a'], [6441, 'and'], [6705, 'of'], [14508, 'the']]

See what happens when you filter the most common x, y, or z words out of your queries, or leave them out of your text search index entirely. You might also get some interesting results if you include real-world data -- for example, "community" and "wiki" are unlikely to be common words on a generic list, but on SO that obviously wouldn't be the case, so you might want to exclude them.
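
As a rough continuation of the script above (reusing word_count, ps, and string, and assuming it has already run), one way to do that filtering is to treat the N most frequent stems as a stop list and drop them from incoming queries; N = 100 here is an arbitrary choice.

# Take the N most frequent stems as a stop list.
N = 100
by_frequency = sorted(((count, w) for w, count in word_count.items()), reverse=True)
stop_stems = {w for count, w in by_frequency[:N]}

def filtered_query(query):
    """Stem the query the same way as the index, then drop the stop stems."""
    terms = (ps.stem(t.strip(string.punctuation).lower()) for t in query.split())
    return [t for t in terms if t and t not in stop_stems]

print(filtered_query("What is the most common word in the English language?"))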
