How to parse invalid (bad / not well-formed) XML?

Posted on 2025-01-14 18:38:18

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I ran some tests against actual customer data, and it looks like the other product allows users to enter input that should be considered invalid. Either way, I still have to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder, and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). This description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Comments (4)

手心的海 2025-01-21 18:38:18

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.

Options, most desirable first:

  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)

  2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML.

  3. Process the data as text manually using a text editor or
    programmatically using character/string functions. Doing this
    programmatically can range from tricky to impossible, as what
    appears to be predictable often is not: rule breaking is rarely
    bound by rules.

    • For invalid character errors, use regex to remove/replace invalid characters:

      • PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
      • Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
      • JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
    • For ampersands, use regex to replace matches with &amp; (credit: blhsin):

      &(?!(?:#\d+|#x[0-9a-f]+|\w+);)
      

Note that the above regular expressions won't take comments or CDATA sections into account.
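
Since the question is about Java, here is a rough Java counterpart to the one-liners above. It is a minimal sketch, not part of the original answer: the class name XmlTextCleaner and its method names are made up for illustration, the character class additionally keeps supplementary code points (U+10000 to U+10FFFF), and the ampersand pattern accepts upper-case hex digits as a small variation on the quoted regex. Like the originals, it ignores comments and CDATA sections.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not from the original answer): text-level repairs applied before XML parsing. */
final class XmlTextCleaner {

    // Matches any code point outside the XML 1.0 Char production.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x{09}\\x{0A}\\x{0D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Matches an '&' that does not start a decimal, hexadecimal, or named reference.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    private XmlTextCleaner() {}

    /** Replaces each character that XML 1.0 forbids with a single space. */
    static String stripInvalidChars(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll(" ");
    }

    /** Escapes ampersands that are not already part of a reference. */
    static String escapeBareAmpersands(String s) {
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }
}
```

Both methods work on the whole document as a string, so they would be applied after reading the input and before handing it to DocumentBuilder. Note that the named-reference branch treats anything shaped like &name; as a reference, including entities the document never declares.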

一向肩并 2025-01-21 18:38:18

A standard XML parser will NEVER accept invalid XML, by design.

Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
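
The CDATA suggestion could be sketched roughly as below. This is an illustration under several assumptions, not a general repair tool: the class LeafCdataWrapper and its methods are hypothetical names, the leaf element is assumed to have no attributes, its text is assumed never to contain its own closing tag, and any references already escaped in the text (such as &amp;) would be taken literally once inside CDATA.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: wraps the raw text of a known leaf element in CDATA before parsing. */
final class LeafCdataWrapper {

    private LeafCdataWrapper() {}

    /** Wraps the content of every <tag>...</tag> element in a CDATA section. */
    static String wrapLeafInCdata(String xml, String tag) {
        String name = Pattern.quote(tag);
        Pattern leaf = Pattern.compile("<" + name + ">(.*?)</" + name + ">", Pattern.DOTALL);
        Matcher m = leaf.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" would end the section early, so split it across two sections.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = "<" + tag + "><![CDATA[" + body + "]]></" + tag + ">";
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Parses the repaired text with the standard DocumentBuilder. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

With the question's input, wrapLeafInCdata(xml, "description") followed by parse(...) should yield a description element whose text still contains the literal <THIS-IS-PART-OF-DESCRIPTION>.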

断舍离 2025-01-21 18:38:18

The accepted answer is good advice, and contains very useful links.

I'd like to add that this, and many other cases of XML that is not well-formed and/or DTD-invalid, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use, e.g., the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx:

<!DOCTYPE xml [
  <!ELEMENT xml - - ANY>
  <!ELEMENT description - - ANY>
  <!ELEMENT THIS-IS-PART-OF-DESCRIPTION -  - EMPTY>
]>
<xml>
  <description>blah blah
    <THIS-IS-PART-OF-DESCRIPTION>
  </description>
</xml>

it will output well-formed XML for further processing with the XML tools of your choice.

Note, however, that your example snippet has another problem: element names starting with the letters xml (or XML, Xml, etc.) are reserved in XML, so they should be avoided and may be flagged or rejected by some XML tools.

阳光①夏 2025-01-21 18:38:18

IMO these cases should be solved by using JSoup.

What follows is not really an answer for this specific case, but I found it on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me when I faced a similar problem with malformed XML, so I'm sharing it here.

Please do not edit what is below, as it is as it appears on the original website.

To be well-formed, an XML document must have a single root element.
So, for example, a well-formed XML document is:

<root>
     <element>...</element>
     <element>...</element>
</root>

But if you have a document like:

<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>

This will be considered malformed XML, so many XML parsers just throw an exception complaining that there is no root element.

This example shows how to work around that problem and successfully parse the malformed XML above.

Basically, what we will do is add a root element programmatically.

So, first of all, you have to open the resource that contains your "malformed" XML (e.g. a file):

File file = new File(pathtofile);

Then open a FileInputStream:

FileInputStream fis = new FileInputStream(file);

If we try to parse this stream with any XML library at this point, it will raise a malformed-document exception.

Now we create a list of InputStream objects with three elements:

  1. A ByteArrayInputStream that contains the string: <root>
  2. Our FileInputStream
  3. A ByteArrayInputStream that contains the string: </root>

So the code is:

List<InputStream> streams =
    Arrays.asList(
        new ByteArrayInputStream("<root>".getBytes()),
        fis,
        new ByteArrayInputStream("</root>".getBytes()));

Now using a SequenceInputStream, we create a container for the List created above:

InputStream cntr =
    new SequenceInputStream(Collections.enumeration(streams));

Now we can use any XML parser library on cntr, and it will be parsed without any problem. (Checked with the StAX library.)
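
Put together, the steps above might look like the following self-contained sketch. It is an illustration rather than the Coderwall author's exact code: the class RootWrapper and method parseFragments are hypothetical names, it parses with DocumentBuilder (as in the question) instead of StAX, and it assumes the input is in an ASCII-compatible encoding such as UTF-8 and carries no XML declaration or DOCTYPE of its own, since either one would end up inside the synthetic root and break the parse.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical helper: parses rootless XML by surrounding it with a synthetic root element. */
final class RootWrapper {

    private RootWrapper() {}

    /** Reads sibling elements from the stream and returns them as children of <root>. */
    static Document parseFragments(InputStream fragments) throws Exception {
        List<InputStream> streams = Arrays.asList(
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                fragments,
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        // SequenceInputStream reads the three streams back to back and closes them all.
        try (InputStream wrapped = new SequenceInputStream(Collections.enumeration(streams))) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
        }
    }
}
```

A caller would pass something like new FileInputStream(file) and then walk the children of the returned document's root element.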
