How can I get a list of the most common words in various languages?
Stack Overflow implemented its "Related Questions" feature by taking the title of the question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a full-text search to find related questions.
How do I get such a list of the most common English words? Or most common words in other languages? Is this something I can just get off the Google website?
Comments (1)
A word frequency list is what you want. You can also make your own, or customize one for use within a particular domain, which is a nice way to become familiar with some good libraries. Start with some text, such as the sources discussed in this question, then try out some variants of this back-of-the-envelope script:
Run on a couple of downloaded books, it gives the following for the most common words:
See what happens when you filter the x, y, or z most common words out of your queries, or leave them out of your text search index entirely. You might also get some interesting results if you include real-world data: for example, "community" and "wiki" are not likely to be common words on a generic list, but on SO that obviously wouldn't be the case, and you might want to exclude them.
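The counting and filtering steps described above could be sketched roughly as follows. This is not the script from the original answer, just a minimal illustration using only the Python standard library. The function names (`word_frequencies`, `most_common_words`, `strip_common_words`), the toy corpus string, and the `domain_words` set are all assumptions made for the example; in practice you would feed in a much larger body of text.

```python
import re
from collections import Counter

# Matches lowercase words, allowing one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(WORD_RE.findall(text.lower()))


def most_common_words(text, n):
    """Return the set of the n most frequent words in text."""
    return {word for word, _ in word_frequencies(text).most_common(n)}


def strip_common_words(query, common):
    """Return the words of query, in order, that are not in the common set."""
    return [w for w in WORD_RE.findall(query.lower()) if w not in common]


# Toy corpus (assumption): stands in for a few downloaded books.
corpus = "the cat sat on the mat and the dog sat on the log"
common = most_common_words(corpus, 3)

# Hypothetical site-specific words to exclude in addition to the generic list.
domain_words = {"community", "wiki"}

remaining = strip_common_words("The community dog sat on the mat", common | domain_words)
print(remaining)
```

The words left in `remaining` are what you would pass to a full-text search. Varying `n` and the contents of `domain_words` is an easy way to see how the cutoff affects the results.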