Options for importing Wikipedia's xml.bz2 dump
I'm considering writing a Java program that reads the XML and inserts it into a database. I've already extracted the compressed Wikipedia pages file, so I now have it as plain XML, not just as xml.bz2. I looked on Wikipedia's website for guidance but couldn't find anything. I imagine this isn't supposed to be a very hard process and should be fairly straightforward, which is why I'm asking you :)
The .bz2 suffix denotes bzip2 compression. If you're on Linux or another Unix-like OS, you probably already have a bzip2 decompressor installed; if you're on Windows, you will need to download one. Note that there are also Java libraries that let you read bzip2-compressed streams directly, without the need for an external decompressor.
Edit: Wait, I think I misread your question. It seems you've already managed to decompress the XML dump, and now you want to know what to do with it. In that case, you might want to take a look at mwdumper, a Java tool that converts MediaWiki XML dumps into SQL you can load into a database.
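If you'd rather write the Java program yourself, here is a minimal sketch of the reading side using the JDK's built-in StAX parser, which streams the file instead of loading it all into memory. The class and method names (WikiDumpReader, Page, forEachPage) are made up for illustration, and it assumes the usual dump layout of page elements containing a title and a revision/text. The handler callback is where your own database insert (for example a JDBC batch) would go; that part is not shown. To read the .bz2 file without extracting it, you could wrap the input stream in a bzip2-decompressing stream from a library such as Apache Commons Compress (my suggestion, not necessarily the library meant above).

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams <page> elements out of a MediaWiki XML dump.
class WikiDumpReader {

    /** Title and wikitext of one <page> element. */
    static final class Page {
        final String title;
        final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /**
     * Calls handler once per <page> and returns the number of pages seen.
     * If a page holds several revisions, the text of the last one is kept.
     * Parse errors are rethrown unchecked to keep the calling code short.
     */
    static int forEachPage(InputStream in, Consumer<Page> handler) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                String title = null;
                String text = null;
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if (name.equals("page")) {
                            // Start of a new page: reset the collected fields.
                            title = null;
                            text = null;
                        } else if (name.equals("title")) {
                            title = reader.getElementText();
                        } else if (name.equals("text")) {
                            text = reader.getElementText();
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && reader.getLocalName().equals("page")) {
                        handler.accept(new Page(title, text));
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse dump XML", e);
        }
        return count;
    }

    // Usage: java WikiDumpReader <path-to-extracted-dump.xml>
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            // Replace the println with your own database insert.
            int pages = forEachPage(in, page -> System.out.println(page.title));
            System.out.println(pages + " pages read");
        }
    }
}
```

Matching on the local element name means the sketch does not depend on which export schema version the dump declares. For a full Wikipedia dump you will want to batch the inserts rather than write one row at a time.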