Reading a single XML document from a stream using dom4j

Posted on 2024-07-07 12:37:21

I'm trying to read one XML document at a time from a stream using dom4j, process it, and then proceed to the next document on the stream. Unfortunately, dom4j's SAXReader (which uses JAXP under the covers) keeps reading and chokes on the following document element.

Is there a way to get the SAXReader to stop reading the stream once it finds the end of the document element? Is there a better way to accomplish this?

Comments (6)

小梨窩很甜 2024-07-14 12:37:21

I was able to get this to work with some gymnastics using some internal JAXP classes:

  • Create a custom scanner, a subclass of XMLNSDocumentScannerImpl
    • Create a custom driver, an implementation of XMLNSDocumentScannerImpl.Driver, inside the custom scanner that returns END_DOCUMENT when it sees a declaration or an element. Get the ScannedEntity from fElementScanner.getCurrentEntity(). If the entity has a PushbackReader, push the remaining unread characters in the entity buffer back onto the reader.
    • In the constructor, replace the fTrailingMiscDriver with an instance of this custom driver.
  • Create a custom configuration class, a subclass of XIncludeAwareParserConfiguration, that replaces the stock DOCUMENT_SCANNER with an instance of this custom scanner in its constructor.
  • Install an instance of this custom configuration class as the "com.sun.org.apache.xerces.internal.xni.parser.XMLParserConfiguration" property so it will be instantiated when dom4j's SAXReader class tries to create a JAXP XMLReader.
  • When passing a Reader to dom4j's SAXReader.read() method, supply a PushbackReader with a buffer size considerably larger than the one-character default. At least 8192 should be enough to support the default buffer size of the XMLEntityManager inside JAXP's copy of Apache Xerces2.

This isn't the cleanest solution, as it involves subclassing internal JAXP classes, but it does work.

衣神在巴黎 2024-07-14 12:37:21

Most likely, you don't want to have more than one document in the same stream at the same time. I don't think that the SAXReader is smart enough to stop when it gets to the end of the first document. Why is it necessary to have multiple documents in the same stream like this?

寄意 2024-07-14 12:37:21

I think you'd have to add an adapter: something that wraps the stream and returns end of file when it sees the beginning of the next document. As far as I know, the parsers as written will keep going until the end of the file or an error, and seeing another <?xml version="1.0"?> would certainly be an error.
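The adapter idea can be sketched roughly as follows. This is only an illustration, not a tested production class: the name XmlDeclSplitReader and its next() method are made up here, and the sketch assumes that every document begins with an XML declaration and that the text "<?xml " never appears inside a comment or CDATA section. It uses a java.io.PushbackReader so that the declaration of the following document can be put back once it has been spotted.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical adapter: reports end-of-file just before the next "<?xml " declaration.
class XmlDeclSplitReader extends Reader {
    private final PushbackReader in;
    private long delivered; // characters handed out for the current document

    XmlDeclSplitReader(Reader source) {
        // Room for '<' plus the five peeked characters ("?xml" and one more).
        this.in = new PushbackReader(source, 6);
    }

    /** Moves on to the next document; returns false when the stream is exhausted. */
    boolean next() throws IOException {
        delivered = 0;
        int c = in.read();
        if (c == -1) return false;
        in.unread(c);
        return true;
    }

    private int readOne() throws IOException {
        int c = in.read();
        if (c != '<' || delivered == 0) {
            // Ordinary character, real end of stream, or the '<' that opens this document.
            if (c != -1) delivered++;
            return c;
        }
        // Peek ahead to recognise "?xml" followed by whitespace (not "<?xml-stylesheet").
        char[] peek = new char[5];
        int n = 0;
        while (n < peek.length) {
            int r = in.read(peek, n, peek.length - n);
            if (r == -1) break;
            n += r;
        }
        in.unread(peek, 0, n);
        boolean boundary = n == peek.length
                && new String(peek, 0, 4).equals("?xml")
                && Character.isWhitespace(peek[4]);
        if (boundary) {
            in.unread('<'); // leave the whole declaration for the next document
            return -1;
        }
        delivered++;
        return '<';
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = readOne();
            if (c == -1) break;
            cbuf[off + count++] = (char) c;
        }
        return (count == 0 && len > 0) ? -1 : count;
    }

    @Override
    public void close() {
        // Deliberately a no-op so a parser cannot close the shared underlying stream.
    }

    public static void main(String[] args) throws IOException {
        String stream = "<?xml version=\"1.0\"?><a/>\n<?xml version=\"1.0\"?><b/>";
        XmlDeclSplitReader r = new XmlDeclSplitReader(new StringReader(stream));
        while (r.next()) {
            StringBuilder doc = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) doc.append((char) c);
            System.out.println("[" + doc.toString().trim() + "]");
        }
    }
}
```

In real use the reader itself would be handed to the parser after each next() call, for example to dom4j's SAXReader.read(Reader), instead of being drained into a StringBuilder. Note that on a live connection the wrapper can only recognise the end of a document once the start of the next one (or end of stream) has arrived.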

小镇女孩 2024-07-14 12:37:21

Assuming you are responsible for placing the documents into the stream in the first place, it should be easy to delimit them in some fashion. For example:

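// Needs java.io.BufferedWriter and java.io.IOException.
// Any value that is invalid for an XML character will do.
static final char DOC_TERMINATOR = 4;

// Writes one document followed by the terminator character.
static void addDocumentToStream(BufferedWriter streamOut, char[] xmlData) throws IOException
{
  streamOut.write(xmlData);
  streamOut.write(DOC_TERMINATOR);
}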

Then, when reading from the stream, read into an array until DOC_TERMINATOR is encountered.

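// Needs java.io.BufferedReader and java.io.IOException.
// Note: end of stream (-1) is not detected here.
static char[] getNextDocument(BufferedReader streamIn) throws IOException
{
  StringBuilder buffer = new StringBuilder();
  int character;

  while (true)
  {
    character = streamIn.read();
    if (character == DOC_TERMINATOR)
      break;

    // Cast to char; appending the int would append its decimal digits instead.
    buffer.append((char) character);
  }
  return buffer.toString().toCharArray();
}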

Since 4 is not a valid XML character, you won't encounter it except where you explicitly add it, which lets you split the documents. Now just wrap the resulting char array for input into SAX and you're good to go.

...
  XMLReader xmlReader = XMLReaderFactory.createXMLReader();
...
  while (true)
  {
    char[] xmlDoc = getNextDocument(streamIn);

    // An empty document marks the end of the input.
    if (xmlDoc.length == 0)
      break;

    InputSource saxInputSource = new InputSource(new CharArrayReader(xmlDoc));
    xmlReader.parse(saxInputSource);
  }
...

Note that the loop terminates when it gets a document of length 0. This means that you should either add a second DOC_TERMINATOR after the last document, or you need to add something to getNextDocument() to detect the end of the stream.
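As a rough, self-contained sketch of the second option, the reader below treats end of stream as the end of the last document, so no extra trailing terminator is needed. The class name DelimitedDocs and the methods nextDocument() and readAll() are invented for this illustration; only JDK classes are used.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class DelimitedDocs {
    // U+0004 is not a legal XML 1.0 character, so it cannot occur inside a document.
    static final char DOC_TERMINATOR = 4;

    /** Returns the text of the next document, or null once the stream is exhausted. */
    static String nextDocument(Reader in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == DOC_TERMINATOR) {
                return buffer.toString();
            }
            buffer.append((char) c);
        }
        // End of stream: return an unterminated final document if there is one.
        return buffer.length() > 0 ? buffer.toString() : null;
    }

    /** Collects every document on the stream. */
    static List<String> readAll(Reader in) throws IOException {
        List<String> docs = new ArrayList<>();
        String doc;
        while ((doc = nextDocument(in)) != null) {
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("<a/>" + DOC_TERMINATOR + "<b>text</b>" + DOC_TERMINATOR);
        for (String doc : readAll(in)) {
            System.out.println(doc);
        }
    }
}
```

Each returned string can then be wrapped in a StringReader and passed to the parser, for instance to dom4j's SAXReader.read(Reader) or to an InputSource as in the loop above.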

望笑 2024-07-14 12:37:21

I have done this before by wrapping the base reader with another reader of my own creation that had very simple parsing capability. Assuming you know the closing tag for the document, the wrapper simply scans for a match, e.g. for "</MyDocument>". When it detects that, it returns EOF. The wrapper can be made adaptive by parsing out the first opening tag and returning EOF on the matching closing tag. I found it was not necessary to actually track the nesting level of the closing tag, since none of my documents used the document tag within itself, so the first occurrence of the closing tag was guaranteed to end the document.

As I recall, one of the tricks was to have the wrapper block close(), since the DOM reader closes the input source.

So, given Reader input, your code might look like:

SubdocReader sdr = new SubdocReader(input);
while (!sdr.eof()) {
    sdr.next();
    // read doc here using DOM
    // then process document
}
input.close();

The eof() method returns true if EOF is encountered. The next() method flags the reader to stop returning -1 for read().
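The original SubdocReader was not posted, so the following is only a guess at how such a wrapper might look. It assumes the closing tag is known in advance and is passed to the constructor (an extra parameter not shown in the snippet above), that the tag is never nested inside itself, and that eof() is only called between documents, where it skips whitespace and peeks at the underlying stream.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical reconstruction: reports EOF right after the known closing tag.
class SubdocReader extends Reader {
    private final PushbackReader in;
    private final String closingTag;
    private int matched;             // how many characters of closingTag match so far
    private boolean docEnded = true; // while true, read() returns -1; next() clears it

    SubdocReader(Reader source, String closingTag) {
        this.in = new PushbackReader(source, 1);
        this.closingTag = closingTag;
    }

    /** True once only whitespace (or nothing) is left on the underlying stream. */
    boolean eof() throws IOException {
        int c;
        while ((c = in.read()) != -1 && Character.isWhitespace(c)) {
            // skip whitespace between documents
        }
        if (c == -1) return true;
        in.unread(c);
        return false;
    }

    /** Re-arms the reader so that read() delivers the next document. */
    void next() {
        docEnded = false;
        matched = 0;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int count = 0;
        while (count < len && !docEnded) {
            int c = in.read();
            if (c == -1) {
                docEnded = true;
                break;
            }
            cbuf[off + count++] = (char) c;
            // Simple matcher; sufficient because '<' only occurs at the start of the tag.
            if (c == closingTag.charAt(matched)) {
                if (++matched == closingTag.length()) docEnded = true;
            } else {
                matched = (c == closingTag.charAt(0)) ? 1 : 0;
            }
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public void close() {
        // Blocked on purpose: the parser closes its input, but the stream must stay open.
    }

    public static void main(String[] args) throws IOException {
        Reader input = new StringReader("<MyDocument>1</MyDocument>\n<MyDocument>2</MyDocument>\n");
        SubdocReader sdr = new SubdocReader(input, "</MyDocument>");
        while (!sdr.eof()) {
            sdr.next();
            StringBuilder doc = new StringBuilder();
            for (int c; (c = sdr.read()) != -1; ) doc.append((char) c);
            System.out.println(doc);
        }
        input.close();
    }
}
```

The demo drains each document into a StringBuilder only to keep the sketch free of external libraries; in practice sdr would be passed directly to the DOM or dom4j reader inside the loop.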

Hopefully this points you in a useful direction.

--
Kiwi.

乞讨 2024-07-14 12:37:21

I would read the input stream into an internal buffer. Depending on the expected total stream size I would either read the entire stream and then parse it or detect the boundary between one xml and the next (look for

The only real difference then between handling a stream with one xml and a stream with multiple xmls is the buffer and split logic.
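A small sketch of the read-everything-then-split variant follows. The boundary marker in the answer above is cut off, so the marker used here, an XML declaration starting with "<?xml ", is purely an assumption, as are the names BufferAndSplit, readAll() and splitDocuments(). Parsing uses the JDK's own DocumentBuilder just to show that each piece is a well-formed document on its own.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class BufferAndSplit {
    // Assumed boundary marker: the XML declaration that starts each document.
    static final String MARKER = "<?xml ";

    /** Drains the whole reader into one string (the internal buffer). */
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    /** Cuts the buffer into pieces that each start at a MARKER occurrence. */
    static List<String> splitDocuments(String all) {
        List<String> docs = new ArrayList<>();
        int start = all.indexOf(MARKER);
        while (start != -1) {
            int next = all.indexOf(MARKER, start + MARKER.length());
            docs.add(next == -1 ? all.substring(start) : all.substring(start, next));
            start = next;
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String stream = "<?xml version=\"1.0\"?><a/>\n<?xml version=\"1.0\"?><b/>\n";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        for (String xml : splitDocuments(readAll(new StringReader(stream)))) {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            System.out.println(doc.getDocumentElement().getTagName());
        }
    }
}
```

Buffering the entire stream only suits inputs that are bounded and comfortably fit in memory; for long-lived streams an incremental wrapper is the more practical route.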
