Java: Gathering byte offsets of XML tags using XMLStreamReader

Posted on 2024-09-08 07:55:22

Is there a way to accurately gather the byte offsets of XML tags using the XMLStreamReader?

I have a large XML file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.

XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to a stream that tracks how many bytes have been read (the CountingInputStream provided by Apache Commons IO, for example).

For example:

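import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.input.CountingInputStream;

// xmlFile is the java.io.File to index
XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance();
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");

while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();

    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            // Bytes pulled from the stream so far, not necessarily the position of this tag
            System.out.println(xmlStreamReader.getLocalName() + " @" + countingReader.getByteCount());
            break;
    }
}
xmlStreamReader.close();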

Unfortunately, there must be some buffering going on, because the above code prints the same byte offset for several tags. Is there a more accurate way of tracking byte offsets in XML files (ideally without abandoning proper XML parsing)?

原来分手还会想你 2024-09-15 07:55:22

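You can use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you are using XMLEventReader), but I remember reading somewhere that it is unreliable and imprecise. It also looks like it gives the end point of the tag rather than its starting position.

I have a similar need to know precisely where tags are located in a file, and I am looking at other parsers to see whether any of them guarantees the necessary level of location precision.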

远昼 2024-09-15 07:55:22

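Could you use a wrapper input stream around the actual input stream, one that simply defers to the wrapped stream for the actual I/O operations but keeps an internal counting mechanism, with some code to retrieve the current offset?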
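A minimal sketch of such a wrapper, assuming nothing beyond `java.io`. The class name `OffsetTrackingInputStream` and its `getOffset()` method are hypothetical, chosen only for illustration. Note that it counts bytes as the consumer pulls them, so a parser that reads ahead in blocks will still make the count jump in buffer-sized steps; the sketch shows the wrapping mechanics, not a fix for that.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: delegates all I/O to the wrapped stream and counts the bytes handed out.
final class OffsetTrackingInputStream extends FilterInputStream {
    private long offset; // number of bytes consumed from the wrapped stream so far

    OffsetTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            offset++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            offset += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        offset += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would make the running count ambiguous
    }

    long getOffset() {
        return offset;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<a>text</a>".getBytes(StandardCharsets.UTF_8);
        try (OffsetTrackingInputStream in = new OffsetTrackingInputStream(new ByteArrayInputStream(doc))) {
            in.read(new byte[3]); // consume "<a>"
            System.out.println("offset after <a>: " + in.getOffset());
        }
    }
}
```

The wrapper can be passed to any API that accepts an `InputStream`, and `getOffset()` queried between reads.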

苍景流年 2024-09-15 07:55:22

Unfortunately, Aalto doesn't implement the LocationInfo interface.

The latest Java VTD-XML implementation from XimpleWare (currently 2.11, on SourceForge or on GitHub) provides some code that maintains a byte offset after each call to the getChar() method of its IReader implementations.

IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.

IReader implementations are provided for the following encodings:

ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258

Adding a getCharOffset() method to IReader should give you the start of a solution: implement it by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, and increment it on each getChar() and skipChar() call of each IReader implementation.
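To illustrate the idea of keeping a character count next to the byte offset, here is a standalone sketch. It is not VTD-XML code and does not use its API: the class `Utf8CountingReader` and its method names are hypothetical, it handles UTF-8 only, it assumes well-formed input, and it counts code points rather than UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone decoder, NOT part of VTD-XML: tracks byte offset and character count together.
final class Utf8CountingReader {
    private final byte[] doc;
    private int offset;     // byte offset of the next unread byte
    private long charCount; // code points returned so far

    Utf8CountingReader(byte[] doc) {
        this.doc = doc;
    }

    // Decodes and returns the next code point, or -1 at end of input (assumes well-formed UTF-8).
    int getChar() {
        if (offset >= doc.length) {
            return -1;
        }
        int lead = doc[offset] & 0xFF;
        int len = lead < 0x80 ? 1 : (lead & 0xE0) == 0xC0 ? 2 : (lead & 0xF0) == 0xE0 ? 3 : 4;
        // Keep only the payload bits of the lead byte, then append 6 bits per continuation byte.
        int cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (int i = 1; i < len; i++) {
            cp = (cp << 6) | (doc[offset + i] & 0x3F);
        }
        offset += len;
        charCount++; // incremented on every getChar(), as the answer suggests
        return cp;
    }

    int getByteOffset() {
        return offset;
    }

    long getCharOffset() {
        return charCount;
    }

    public static void main(String[] args) {
        Utf8CountingReader r = new Utf8CountingReader("a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8));
        while (r.getChar() != -1) {
            System.out.println("chars=" + r.getCharOffset() + " bytes=" + r.getByteOffset());
        }
    }
}
```

Both counters advance in the same method, so the pair (character offset, byte offset) is always consistent at a character boundary.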

孤星 2024-09-15 07:55:22

I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.

        switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            System.out.println(xmlStreamReader.getLocalName() + " end@" + xmlStreamReader.getLocation().getCharacterOffset());
        }

With this approach, the actual start position of each end tag still has to be calculated manually, but it has the advantage of not needing an external JAR file.

I was not able to track down some minor inconsistencies in the data management (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.

Hope this helps!
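For reference, a self-contained version of this approach might look like the sketch below. It uses only the StAX API bundled with the JDK; the class name `EndTagOffsets` and the sample document are made up for illustration. The offsets are whatever the active StAX implementation reports, so they may differ between parsers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class wrapping the switch block from the answer in a runnable program.
final class EndTagOffsets {

    // Returns one "name end@offset" entry per END_ELEMENT event, in document order.
    static List<String> collect(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.END_ELEMENT) {
                    // Character offset as reported by the parser's Location
                    int offset = reader.getLocation().getCharacterOffset();
                    result.add(reader.getLocalName() + " end@" + offset);
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : collect("<root><a>1</a><b>2</b></root>")) {
            System.out.println(line);
        }
    }
}
```

Comparing the printed offsets against the known positions in a small sample document is a quick way to check how a given parser behaves before relying on it for a large file.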

成熟的代价 2024-09-15 07:55:22

I recently worked out a solution for a similar question, How to find character offsets in big XML files using java?. I think it provides a good solution based on an ANTLR-generated XML parser.

深空失忆 2024-09-15 07:55:22

I just burned a ~~day~~ long weekend on this, and arrived at the solution partially thanks to some clues here. Remarkably I don't think this has gotten much easier in the 10 years since the OP posted this question.

~~TL;DR Use Woodstox and char offsets~~

The first problem to contend with is that most XMLStreamReader implementations seem to provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.

The second problem is the actual type of offset you use. Unfortunately, it seems that you have to use char offsets if you need to work with a multi-byte charset, which means random-access retrieval from the file is not going to be very efficient: you can't just set a pointer into the file at your offset and start reading; you have to read through until you reach the offset, then start extracting. ~~There may be a more efficient way to do this that I haven't thought of, but the performance is acceptable for my case. 500MB files are pretty snappy.~~
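The "read through until you get to the offset" step can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: the input is well-formed UTF-8 without a byte order mark, and the char offset counts UTF-16 code units (Java `char`s), so a 4-byte sequence counts as two. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: scans UTF-8 from the start to translate a char offset into a byte offset.
final class CharToByteOffset {

    static long byteOffsetOf(InputStream utf8, long charOffset) throws IOException {
        long bytes = 0;
        long chars = 0;
        while (chars < charOffset) {
            int lead = utf8.read();
            if (lead < 0) {
                throw new EOFException("char offset is beyond the end of the input");
            }
            // The lead byte tells us how many bytes this character occupies.
            int len = lead < 0x80 ? 1 : (lead & 0xE0) == 0xC0 ? 2 : (lead & 0xF0) == 0xE0 ? 3 : 4;
            for (int i = 1; i < len; i++) {
                if (utf8.read() < 0) {
                    throw new EOFException("truncated UTF-8 sequence");
                }
            }
            bytes += len;
            chars += (len == 4) ? 2 : 1; // 4-byte sequences decode to a surrogate pair
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<a>\u00e9\u20ac</a>".getBytes(StandardCharsets.UTF_8);
        // Five chars ("<a>" plus two non-ASCII characters) occupy 3 + 2 + 3 bytes.
        System.out.println(byteOffsetOf(new ByteArrayInputStream(doc), 5)); // prints 8
    }
}
```

For a real file the stream would typically be wrapped in a `BufferedInputStream`, since the sketch reads one byte at a time. The cost is linear in the offset, which is exactly the inefficiency described above.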

[edit] So this turned into one of those splinter-in-my-mind things, and I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile.

I created a library for this; it's on GitHub and Maven Central. If you just want the important bits, the party trick is in the ByteTrackingReader.
[/edit]
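Once start and end byte offsets for an element are known, the extraction step can be sketched as below. This is not the library's code: the class `ElementExtractor`, the sample document, and the hard-coded offsets are assumptions for illustration, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads the bytes in [startByte, endByte) and decodes them as UTF-8.
final class ElementExtractor {

    static String extract(Path file, long startByte, long endByte) throws IOException {
        byte[] buf = new byte[Math.toIntExact(endByte - startByte)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(startByte); // jump straight to the element, no scanning needed
            raf.readFully(buf);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("offsets-demo", ".xml");
        try {
            String xml = "<root><item>\u00e9</item><item>b</item></root>";
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            // "<root>" is 6 bytes and the first item is 15 bytes (its text takes 2 bytes),
            // so the second item spans bytes 21 to 35.
            System.out.println(extract(file, 21, 35));
        } finally {
            Files.delete(file);
        }
    }
}
```

Because the offsets are byte positions, the non-ASCII character in the first item shifts the second item by one byte relative to its char offset, which is why the char-to-byte mapping matters.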

There is another similar question on SO about this (though the accepted answer frightened and confused me), where some people commented that this whole thing is a bad idea and asked why you would want to do it: XML is a transport mechanism, so you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML (still going strong in 2020), you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, so the ability to quickly extract a specific set of items from a massive file and verify not only the contents but also the format itself is essential.

Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution. God help you if you're finding this in 2030, trying to solve the same problem.
