How do I get a subset of Wikipedia's pages?
How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.
I want to experiment with implementing a map-reduce algorithm.
Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.
Edit: Any that aren't torrents? I can't get those at work.
Comments (7)
The stackoverflow database is available for download.
Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
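A minimal sketch of that approach in Python (assuming the third-party requests library; the output file name, the one-second delay, and the 100MB cutoff are just placeholder choices):

    import time
    import requests

    RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
    TARGET_BYTES = 100 * 1024 * 1024  # roughly 100MB of page HTML

    seen_urls = set()
    collected = 0

    with open("wikipedia_sample.html", "wb") as out:
        while collected < TARGET_BYTES:
            # Special:Random redirects to a random article; requests follows the redirect.
            resp = requests.get(RANDOM_URL, timeout=30)
            if resp.url in seen_urls:
                continue  # discard duplicates
            seen_urls.add(resp.url)
            out.write(resp.content)
            collected += len(resp.content)
            time.sleep(1)  # be polite: limit the request rate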
One option is to download the entire Wikipedia dump, and then use only part of it. You can either decompress the entire thing and then use a simple script to split the file into smaller files (e.g. here), or if you are worried about disk space, you can write a script that decompresses and splits on the fly, so you can stop the decompression process at any stage you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
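As a rough sketch of that decompress-and-stop idea, the following Python reads the bzip2 dump in chunks and stops once about 100MB has been written; both file names are placeholders, and the output simply cuts off mid-XML at the limit:

    import bz2

    LIMIT = 100 * 1024 * 1024  # stop after roughly 100MB of decompressed XML
    written = 0

    # Placeholder file names; point these at the dump you actually downloaded.
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as src, \
            open("wikipedia-100mb-sample.xml", "wb") as dst:
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)
            written += len(chunk)
            if written >= LIMIT:
                break  # the tail will be a truncated <page> element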
If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
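For what it's worth, here is a small sketch of using the Export feature directly: English Wikipedia serves page XML at Special:Export/<Title>, so you can fetch a handful of named articles without the full dump (the titles below are arbitrary examples, and the requests library is assumed):

    import requests

    # Hypothetical sample titles; use whatever pages you care about.
    titles = ["MapReduce", "Apache_Hadoop", "Wikipedia"]

    for title in titles:
        url = "https://en.wikipedia.org/wiki/Special:Export/" + title
        resp = requests.get(url, timeout=30)
        # Each response is a standalone XML export document.
        with open(title + ".xml", "wb") as out:
            out.write(resp.content)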
If you wanted to get a copy of the stackoverflow database, you could do that from the creative commons data dump.
Out of curiosity, what are you using all this data for?
You could use a web crawler to scrape 100MB of data.
There are a lot of Wikipedia dumps available. Why do you want to choose the biggest one (the English wiki)? The Wikinews archives are much smaller.
One smaller subset of Wikipedia articles comprises the 'meta' wiki articles. This is in the same XML format as the entire article dataset, but smaller (around 400MB as of March 2019), so it can be used for software validation (for example testing GenSim scripts).
https://dumps.wikimedia.org/metawiki/latest/
You want to look for any files with the -articles.xml.bz2 suffix.
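As an illustration, a sketch of pointing GenSim at such a dump might look like the following; the file name is a placeholder for whichever -articles.xml.bz2 file you download, and passing dictionary={} (to skip the vocabulary-building pass) is an optional shortcut:

    from gensim.corpora.wikicorpus import WikiCorpus

    # Placeholder file name: any *-articles.xml.bz2 dump should work.
    wiki = WikiCorpus("metawiki-latest-pages-articles.xml.bz2", dictionary={})

    # get_texts() streams one tokenized article at a time.
    for i, tokens in enumerate(wiki.get_texts()):
        print(len(tokens), "tokens in article", i)
        if i >= 4:
            break  # just peek at the first few articles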