Parsing a malformed/incomplete/invalid XML file

Posted on 2024-11-28 23:34:00

I have a process that parses an XML file using JDOM and XPath, as shown below:

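// Imports assume JDOM 1.x package names; JDOM 2 uses org.jdom2.* instead.
import java.io.StringReader;
import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.XPath;

private static SAXBuilder builder = null;
private static Document doc = null;
private static XPath xpathInstance = null;

builder = new SAXBuilder();
Text list = null;

// Build the JDOM document; this fails if xmldocument is not well-formed.
try {
    doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
    throw new Exception(e);
}

// Evaluate the XPath expression and take the first matching text node.
try {
    xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
    list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
    throw new Exception(e);
}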

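The above works fine. The XPath expressions are stored in a properties file, so they can be changed at any time. Now I have to process additional XML files that come from a legacy system, which only sends XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, one chunk per row. (Changing the legacy system, or the processing that stores the chunks as rows, is out of the question.)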

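I can build the complete, valid XML document by extracting all the rows that belong to a specific XML document and merging them, and then parse it with the existing processing shown above.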
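For reference, the merge step could be sketched roughly as follows. This is only an illustration: `ChunkMerger` and `merge` are made-up names, and it assumes the rows have already been fetched in order as raw byte arrays encoded in UTF-8 (neither the column type nor the encoding is stated here). Joining the bytes before decoding matters because a 4000-byte boundary can fall in the middle of a multi-byte character.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of the existing processing.
final class ChunkMerger {

    /** Concatenates the raw chunks in order and decodes them once at the end. */
    static String merge(List<byte[]> chunksInOrder) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] chunk : chunksInOrder) {
            all.write(chunk, 0, chunk.length);
        }
        // Assumption: the legacy system sends UTF-8.
        return new String(all.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

If the rows are stored as character data rather than bytes, plain string concatenation in row order would do the same job.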

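The thing is, though, that the data I need to extract from the XML document will always be in the first 4000 bytes. This chunk is of course not a valid XML document, since it is incomplete, but it contains all the data I need. I can't parse just that one chunk, because the JDOM builder will reject it.

I am wondering whether I can parse the malformed XML chunk without having to merge all the parts (of which there can be quite a few) to get a valid XML document. This would save me several trips to the database to check whether a chunk is available, and I wouldn't have to merge hundreds of chunks just to be able to use the first 4000 bytes.

I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser, or even XPath? Or do both expect the XML document to be well-formed before they can parse it?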

写下不归期 2024-12-05 23:34:00

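You could try using jsoup to parse the invalid XML. By definition, XML must be well-formed; input that is not well-formed is, strictly speaking, not XML and should not be used.

UPDATE - example: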

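import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

// The class name is arbitrary; it only wraps the two methods so they compile.
public class JsoupFragmentDemo {

    public static void main(String[] args) {
        // <test> and <author> are never closed; jsoup still builds a tree from the fragment.
        for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
                new Element(Tag.valueOf("p"), ""),
                "")) {
            print(node, 0);
        }
    }

    // Recursively prints each node's name and attributes, indented by depth.
    public static void print(Node node, int offset) {
        for (int i = 0; i < offset; i++) {
            System.out.print(" ");
        }
        System.out.print(node.nodeName());
        for (Attribute attribute : node.attributes()) {
            System.out.print(", ");
            System.out.print(attribute.getKey() + "=" + attribute.getValue());
        }
        System.out.println();
        for (Node child : node.childNodes()) {
            print(child, offset + 4);
        }
    }
}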
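
As a JDK-only alternative to jsoup, a streaming (StAX) reader may also work, since it hands out events one at a time and should only fail once it reaches the point where the chunk was cut off. The sketch below copies everything readable into a new string and then closes the elements that are still open, so the result is well-formed. `ChunkRepair`, `closeOpenElements` and the sample `<library>` data are invented for illustration. Namespaces, comments and processing instructions are dropped for brevity, and the last text node may itself be cut short, so only data well inside the chunk should be trusted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: turns a truncated XML chunk into a well-formed document.
final class ChunkRepair {

    /**
     * Copies every element, attribute and text node that can be read from the
     * chunk, then closes whatever elements are still open.
     * Simplification: namespaces, comments and processing instructions are dropped.
     */
    static String closeOpenElements(String chunk) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        try {
            XMLStreamReader reader = inputFactory.createXMLStreamReader(new StringReader(chunk));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        writer.writeStartElement(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            writer.writeAttribute(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        writer.writeCharacters(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        writer.writeEndElement();
                        break;
                    default:
                        break; // skip comments, processing instructions, etc.
                }
            }
        } catch (XMLStreamException endOfChunk) {
            // Expected for a truncated chunk: stop copying and keep what was read so far.
        }
        writer.writeEndDocument(); // writes end tags for every element still open
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample chunk, cut off in the middle of the second <author> element.
        String chunk = "<?xml version=\"1.0\"?><library><book><author>Neal Stephenson</author>"
                + "<title>Snow Crash</title></book><book><author>Someone El";
        String repaired = closeOpenElements(chunk);
        String title = XPathFactory.newInstance().newXPath().evaluate(
                "//book[author='Neal Stephenson']/title/text()",
                new InputSource(new StringReader(repaired)));
        System.out.println(title);
    }
}
```

The repaired string could then be passed to `builder.build(new StringReader(repaired))`, which would leave the XPath expressions in the properties file unchanged.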
