Looking for an n-gram database built from Wikipedia
I am effectively trying to solve the same problem as this question:
Finding related words (specifically physical objects) to a specific word
minus the requirement that the words represent physical objects. The answers and the edited question suggest that a good start would be to build an n-gram frequency list using Wikipedia text as a corpus. Before I start downloading the mammoth Wikipedia dump, does anyone know whether such a list already exists?
PS: If the original poster of the previous question sees this, I would love to know how you went about solving the problem, as your results look excellent :-)
2 Answers
Google has a publicly available terabyte-scale n-gram database (up to 5-grams).
You can order it on 6 DVDs or find a torrent that hosts it.
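If you go that route, each data file is, as far as I know, just gzipped lines of "n-gram<TAB>count". A minimal reader sketch under that assumption (the file name in the usage comment is made up, and the encoding/error handling is a guess, not anything documented here):
```python
import gzip
import sys

def read_ngrams(path, min_count=0):
    """Stream (tokens, count) pairs from one gzipped 'ngram<TAB>count' file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            ngram, _, count = line.rstrip("\n").rpartition("\t")
            if not ngram:
                continue  # skip lines without a tab-separated count
            count = int(count)
            if count >= min_count:
                yield ngram.split(" "), count

if __name__ == "__main__":
    # e.g. python read_ngrams.py 3gm-0042.gz   (hypothetical file name)
    for tokens, count in read_ngrams(sys.argv[1], min_count=1000):
        print(count, " ".join(tokens))
```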
You can find the June 2008 Wikipedia n-grams here. It also includes headwords and tagged sentences. I tried to create my own n-grams, but ran out of memory (32 GB) on the bigrams (the current English Wikipedia is massive). It also took about 8 hours to extract the XML, 5 hours for the unigrams, and 8 hours for the bigrams.
The linked n-grams also have the benefit of having been cleaned up somewhat, since MediaWiki markup and Wikipedia text contain a lot of junk in between the actual content.
Here's my Python code:
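(The original listing does not appear on this page; below is a minimal sketch of the approach described above: stream the dump with ElementTree.iterparse, strip markup with crude regexes, and count unigrams and bigrams with collections.Counter. The dump filename, the cleanup regexes, and the bz2 handling are illustrative assumptions, not the answer's actual code.)
```python
import bz2
import re
import sys
from collections import Counter
from xml.etree import ElementTree

TOKEN_RE = re.compile(r"[a-z']+")
# Very crude wikitext cleanup: drop templates, tables, <ref> blocks, other tags,
# and link/emphasis/heading markup. Real cleanup needs a proper wikitext parser.
MARKUP_RE = re.compile(
    r"\{\{.*?\}\}|\{\|.*?\|\}|<ref[^>/]*/>|<ref.*?</ref>|<[^>]+>|\[\[|\]\]|''+|==+",
    re.DOTALL,
)

def page_texts(dump_path):
    """Yield the raw wikitext of each page, streaming so memory stays bounded."""
    opener = bz2.open if dump_path.endswith(".bz2") else open
    with opener(dump_path, "rb") as fh:
        context = ElementTree.iterparse(fh, events=("start", "end"))
        _, root = next(context)          # hold on to the root element
        for event, elem in context:
            if event != "end":
                continue
            if elem.tag.endswith("}text") and elem.text:
                yield elem.text
            elif elem.tag.endswith("}page"):
                root.clear()             # drop finished pages from the tree

def tokenize(wikitext):
    """Lowercase word tokens after the rough markup strip."""
    return TOKEN_RE.findall(MARKUP_RE.sub(" ", wikitext).lower())

def count_ngrams(dump_path):
    unigrams, bigrams = Counter(), Counter()
    for raw in page_texts(dump_path):
        words = tokenize(raw)
        unigrams.update(words)
        # The bigram Counter is what exhausts RAM on a full English dump.
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

if __name__ == "__main__":
    # e.g. python wiki_ngrams.py enwiki-latest-pages-articles.xml.bz2
    unigrams, bigrams = count_ngrams(sys.argv[1])
    for word, count in unigrams.most_common(50):
        print(count, word)
```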