Where can I get a frequency-sorted dictionary for use in free software?

Posted 2024-12-10 07:11:02

I need a frequency-sorted dictionary for a compression program, under a permissive or GPLv3-compatible license, but I haven't the slightest clue where to get one under such a license (every list I've found had a missing or bad copyright notice). Does anyone have recommendations on where to find one? I've looked for a while, but my only remaining option seems to be building my own from e-books, and I doubt the quality of that approach (it would not be wholly representative of all English, much less modern English, which is my target).

PS: somewhere around 50,000–200,000 words is a good target. Huge files are not a good idea.

Comments (2)

北座城市 2024-12-17 07:11:02


What you want is a unigram distribution built over a large quantity of representative English text. A 'unigram distribution' is the formal term for what you're calling a 'dictionary with frequencies'.

Google published a giant collection of ngrams under a permissive license.

See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.

Or http://books.google.com/ngrams/datasets.

If you don't need all those obscure words, then just chop the distribution to what you want.
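As a rough Python sketch of that aggregation-and-chopping step (not from the answer itself): it assumes the 1-gram files are tab-separated with the word in the first column and a raw count in a later one; the exact column layout differs between the Web 1T and the Books releases, so check the dataset's own documentation before relying on the `count_column` default used here.

```python
# Sketch only: collapse a downloaded 1-gram file into a frequency-sorted word list.
# Assumed line format (Books-style): word <TAB> year <TAB> match_count <TAB> volume_count.
# Adjust count_column if your release lays the fields out differently.
from collections import Counter

def aggregate_unigrams(path, count_column=2, top_n=200_000):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= count_column:
                continue                      # skip malformed lines
            counts[fields[0].lower()] += int(fields[count_column])
    return counts.most_common(top_n)          # [(word, total_count), ...], most frequent first
```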

As for licensing, even the FSF says that the GPL is inapplicable to dictionaries. They aren't 'source'. So the CC license here works perfectly fine for incorporating in whatever.

If you don't care about having entirely representative data, then download the Wikipedia dumps and the Ruby tool for extracting text, and build your own unigram distribution.
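If you go that route, the counting step itself is small. A minimal Python sketch, assuming you already have the extracted text in a plain-text file; the `extracted_wikipedia.txt` name and the crude tokenizer are placeholders, not part of any particular extraction tool:

```python
# Sketch: build a unigram distribution from plain text you extracted yourself.
import re
from collections import Counter

def build_unigram_distribution(text_path, top_n=200_000):
    counts = Counter()
    token = re.compile(r"[a-z']+")            # crude word tokenizer; adjust to taste
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            counts.update(token.findall(line.lower()))
    return counts.most_common(top_n)

if __name__ == "__main__":
    for word, count in build_unigram_distribution("extracted_wikipedia.txt")[:10]:
        print(word, count)
```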

Whatever you choose, you'll be working with a lot of data if you want useful results.

放手` 2024-12-17 07:11:02


Have a look here: http://norvig.com/ngrams/

Contains this, which might be what you need:

  1. 4.9 MB count_1w.txt - The 1/3 million most frequent words, all lowercase, with counts. (Called vocab_common in the chapter, but I changed file names here.)
  2. 5.6 MB count_2w.txt - The 1/4 million most frequent two-word (lowercase) bigrams, with counts.
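If that file fits your needs, turning it into an in-memory frequency-sorted list is straightforward. A minimal Python sketch, assuming count_1w.txt holds one tab-separated word/count pair per line (which is how the files on that page appear to be laid out):

```python
# Sketch: load count_1w.txt into a frequency-sorted list of (word, count) pairs.
def load_count_file(path, top_n=None):
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue                       # skip blank lines
            word, count = line.split("\t")
            entries.append((word, int(count)))
    entries.sort(key=lambda pair: pair[1], reverse=True)   # should already be sorted, but be safe
    return entries[:top_n] if top_n else entries

# e.g. keep the 50,000 most frequent entries for the compressor's table
common_words = load_count_file("count_1w.txt", top_n=50_000)
```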