Java: Splitting a large XML file using SAXParser
I am trying to split a large XML file into smaller files using Java's SAXParser (specifically the Wikipedia dump, which is about 28GB uncompressed).
I have a PageHandler class which extends DefaultHandler:
private class PageHandler extends DefaultHandler {
    private StringBuffer text;
    ...
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.append("<" + qName + ">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</" + qName + ">");
        if (qName.equals("page")) {
            text.append("\n");
            pageCount++;
            writePage();
        }
        if (pageCount >= maxPages) {
            rollFile();
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            text.append(chars[i]);
        }
    }
}
So I can write out element content with no problem. My problem is how to get the element tags and attributes - these characters do not seem to be reported. At best I will have to reconstruct them from what's passed as arguments to startElement, which seems a bit of a pain. Or is there an easier way?
All I want to do is loop through the file and write it out, rolling the output file every so often. How hard can this be :)
Thanks
Comments (2)
The problem here is that you're writing the XML elements out yourself. Have a look at the XMLWriter class of dom4j. While it's a little old, it makes it really easy to output XML documents by calling its startElement and endElement methods.

I'm not quite sure I totally understand what you are trying to do, but to get the qualified name as a string you can use qName directly (it is already a String, so qName.toString() is redundant), and to get an attribute's name you just call atts.getQName(int index).
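
For what it's worth, here is a minimal sketch of the "reconstruct it from the startElement arguments" route, using only the JDK's SAX classes. The class name TagRebuildingHandler and its escape, result and rebuild helpers are made up for illustration and are not from the original post. It assumes the default, non-namespace-aware parser, so qName carries the raw tag name. Because the parser hands back decoded text, the sketch re-escapes &, < and > (and double quotes inside attribute values) on the way out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the original post): rebuilds markup from SAX events.
class TagRebuildingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.append('<').append(qName);
        // SAX reports attributes separately, so write each one back as name="value".
        for (int i = 0; i < atts.getLength(); i++) {
            text.append(' ').append(atts.getQName(i)).append("=\"");
            escape(atts.getValue(i), true);
            text.append('"');
        }
        text.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // The parser has already decoded entities, so re-escape them here.
        escape(new String(chars, start, length), false);
    }

    // Appends s to the buffer with XML special characters escaped.
    private void escape(String s, boolean inAttribute) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': text.append("&amp;"); break;
                case '<': text.append("&lt;"); break;
                case '>': text.append("&gt;"); break;
                case '"':
                    if (inAttribute) text.append("&quot;"); else text.append(c);
                    break;
                default: text.append(c);
            }
        }
    }

    String result() {
        return text.toString();
    }

    // Parses an XML string and returns the markup rebuilt from the SAX events.
    static String rebuild(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TagRebuildingHandler handler = new TagRebuildingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.result();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling TagRebuildingHandler.rebuild(...) on a small snippet is an easy way to check the round trip before pointing the same handler logic at the full dump. A few caveats: empty elements such as <x/> come back as <x></x>, comments and the DOCTYPE are not reported through these three callbacks, and each split file would still need its own root element around the pages to be well-formed.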