Where can I get a frequency-sorted dictionary usable in free software?
I need a frequency-sorted dictionary for a compression program, under a permissive or GPLv3-compatible license, but I haven't the slightest clue where to get one under such a license (everything I've found has missing or bad copyright notices). Would anyone have recommendations on where to get one? I've looked for a while, but my only option seems to be building my own from e-books, and I doubt the quality of that approach (it would not be wholly representative of all English, much less modern English, which is my target).
PS: about 50,000-200,000 words is a good target; huge files are not a good idea.
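To clarify what I'm after, here's a rough sketch of how I'd use such a list: map each word to its rank in the frequency-sorted dictionary, then emit short variable-length codes, so frequent words cost a single byte. Everything here (the file name, the whitespace tokenizer, the single escape rank for unknown words) is a placeholder, not a finished design:

```python
# Rough sketch only: compress a token stream by replacing each word with its
# rank in a frequency-sorted word list, then varint-encoding the ranks so
# the most frequent words cost one byte each.

def load_ranks(path):
    """Map word -> rank (0 = most frequent); one word per line, sorted by frequency."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0]: rank
                for rank, line in enumerate(f) if line.strip()}

def encode_varint(n):
    """Little-endian base-128 varint for a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

ranks = load_ranks("words_by_frequency.txt")   # placeholder file name
tokens = "the cat sat on the mat".split()      # real code needs a proper tokenizer
# unknown words collapse to one escape rank here; a real codec would need
# a literal-escape mechanism to recover the original spelling
payload = b"".join(encode_varint(ranks.get(t, len(ranks))) for t in tokens)
print(f"{len(payload)} bytes for {len(tokens)} tokens")
```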
2 Answers
What you want is a unigram distribution built over a large quantity of representative English text. A 'unigram distribution' is the formal term for what you're calling a 'frequency-sorted dictionary'.
Google published a giant collection of ngrams under a permissive license.
See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.
Or http://books.google.com/ngrams/datasets.
If you don't need all those obscure words, then just chop the distribution to what you want.
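For example, if the file is a flat word<TAB>count list (the layout of the Web 1T unigram counts; the Books Ngram files add year/volume columns you'd have to merge first), a sketch of chopping it to the top N could look like this:

```python
# Sketch: trim a unigram distribution to the top N entries without loading
# the whole file into memory. "unigrams.txt" is a placeholder name, and the
# word<TAB>count layout is an assumption about the input.
import heapq

def top_n(path, n=50_000):
    def entries():
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) >= 2:
                    # yield (count, word) so the heap orders by frequency
                    yield int(parts[1]), parts[0]
    return [(word, count) for count, word in heapq.nlargest(n, entries())]

if __name__ == "__main__":
    for word, count in top_n("unigrams.txt")[:10]:
        print(word, count)
```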
As for licensing, even the FSF says that the GPL is inapplicable to dictionaries; they aren't 'source'. So the Creative Commons license on the Google data works fine for incorporation into just about anything.
If you don't care about having entirely representative data, then download the Wikipedia dumps and the Ruby tool for extracting text, and build your own unigram distribution.
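A minimal sketch of that last step, assuming you've already extracted the dump to plain-text files (the crude regex tokenizer and the lowercasing are my assumptions; tune them to your data):

```python
# Sketch: count words in plain-text files and emit a frequency-sorted
# dictionary, most frequent first, in word<TAB>count layout.
import re
import sys
from collections import Counter

counts = Counter()
for path in sys.argv[1:]:
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            # crude tokenizer: lowercase runs of letters and apostrophes
            counts.update(re.findall(r"[a-z']+", line.lower()))

for word, count in counts.most_common():
    print(f"{word}\t{count}")
```

Run it as `python unigrams.py extracted/*.txt > unigrams.txt` and you get a frequency-sorted dictionary in the same word<TAB>count layout as above.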
Whatever you choose, you'll be working with a lot of data if you want useful results.
Have a look here: http://norvig.com/ngrams/
Contains this, which might be what you need:
count_1w.txt: The 1/3 million most frequent words, all lowercase, with counts.
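That file is tab-separated word<TAB>count pairs, already sorted by descending count, so trimming it to the size you mentioned is just truncation (a sketch; the 50,000 cutoff is your stated target, not anything inherent to the file):

```python
# Sketch: load count_1w.txt (word<TAB>count, sorted by descending count)
# and keep only the first `limit` entries.
def load_count_1w(path, limit=50_000):
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, _, count = line.rstrip("\n").partition("\t")
            words.append((word, int(count)))
            if len(words) >= limit:
                break
    return words

top = load_count_1w("count_1w.txt")
print(top[:5])  # most frequent words first
```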