How to parse invalid (bad / not well-formed) XML?

Posted on 2025-01-14 18:38:18 · 570 words · 4 views · 0 comments

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against actual customer data, and it looks like the other product allows input from users that should be considered invalid. Anyway, I still have to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description has what appears to be an invalid tag inside it (<THIS-IS-PART-OF-DESCRIPTION>). This description tag is known to be a leaf tag and shouldn't have any nested tags inside it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
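
For reference, the failure can be reproduced in a few lines. This is only a sketch: the class name `MalformedInputRepro` and the `parses` helper are made up for illustration, and the sample string is the snippet above with the elided `...` parts left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original question.
final class MalformedInputRepro {
    // Modelled on the question's snippet; the elided "..." parts are omitted.
    static final String SAMPLE =
        "<xml><description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description></xml>";

    /** Returns true when DocumentBuilder accepts the text, false when it rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Not well-formed: here, the stray tag is never closed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses(SAMPLE));
        System.out.println(parses("<xml><description>plain text</description></xml>"));
    }
}
```

The first call should report that the sample is rejected, because the parser treats <THIS-IS-PART-OF-DESCRIPTION> as an opening tag that is never closed; the second, well-formed input should be accepted.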

手心的海 2025-01-21 18:38:18

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.

Options, most desirable first:

1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
   - Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
     ```
     xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
     ```
   - Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
   - Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
   - Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
   - .NET:
     - XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
     - @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
     - @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
     - Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
   - Go: Set Decoder.Strict to false as shown in this example by @chuckx.
   - PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
   - Ruby: Nokogiri supports “Gentle Well-Formedness”.
   - R: See htmlTreeParse() for fault-tolerant markup parsing in R.
   - Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text manually using a text editor, or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not; rule breaking is rarely bound by rules.
   - For invalid character errors, use regex to remove/replace invalid characters:
     - PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
     - Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
     - JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
   - For ampersands, use regex to replace matches with &amp; (credit: blhsin):
     ```
     &(?!(?:#\d+|#x[0-9a-f]+|\w+);)
     ```

   Note that the above regular expressions won't take comments or CDATA sections into account.
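
Since the question is about Java, here is a rough Java counterpart of the two text-level cleanups mentioned above. It is a sketch under assumptions, not a vetted implementation: the class and method names are invented for illustration, the ampersand pattern is the one quoted above (with upper-case hex digits also allowed), and the character check follows the XML 1.0 Char production. Like the regexes above, it ignores comments and CDATA sections.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for pre-parse cleanup.
final class XmlTextCleanup {
    // An ampersand that does not start a character or entity reference.
    private static final Pattern STRAY_AMPERSAND =
        Pattern.compile("&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    /** Escapes stray ampersands as &amp; and leaves existing references alone. */
    static String escapeStrayAmpersands(String text) {
        return STRAY_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    /** Replaces every code point that XML 1.0 does not allow with a space. */
    static String replaceInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            boolean allowed = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            out.appendCodePoint(allowed ? cp : ' ');
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayAmpersands("Tom & Jerry &amp; friends"));
    }
}
```

Both helpers work on the raw text before it reaches the XML parser, so the warning in #3 applies to them as well: they only fix the specific kinds of damage they were written for.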

一向肩并 2025-01-21 18:38:18

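By design, standard XML parsers will never accept invalid XML.

Your only option is to pre-process the input before parsing it, either removing the "predictably invalid" content or wrapping it in CDATA.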
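
The CDATA idea could look roughly like this in Java. This is a sketch that leans on assumptions the question only hints at: that <description> is always a leaf element, carries no attributes, and never contains a literal </description> in its text. The class and method names are invented for illustration. Note also that once the body is inside CDATA, any entity references it contained (such as &amp;amp;) are read as literal text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical pre-processor: wraps the body of every <description> in CDATA.
final class DescriptionCdataWrapper {
    // Assumes an attribute-free leaf element; DOTALL lets the body span lines.
    private static final Pattern DESCRIPTION =
        Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    static String wrap(String rawXml) {
        Matcher m = DESCRIPTION.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" may not appear inside CDATA, so split it across two sections.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = "<description><![CDATA[" + body + "]]></description>";
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Wraps, parses, and returns the text of the first <description>. */
    static String firstDescriptionText(String rawXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(wrap(rawXml).getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("description").item(0).getTextContent();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the question's sample as input, the stray tag ends up as plain character data inside the description, so DocumentBuilder no longer sees it as markup.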

断舍离 2025-01-21 18:38:18

The accepted answer is good advice, and contains very useful links.

I'd like to add that this, and many other cases of not well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use, e.g., the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx:

<!DOCTYPE xml [
  <!ELEMENT xml - - ANY>
  <!ELEMENT description - - ANY>
  <!ELEMENT THIS-IS-PART-OF-DESCRIPTION -  - EMPTY>
]>
<xml>
  <description>blah blah
    <THIS-IS-PART-OF-DESCRIPTION>
  </description>
</xml>

it will output well-formed XML for further processing with the XML tools of your choice.

Note, however, that your example snippet has another problem: element names starting with the letters xml (in any combination of upper and lower case) are reserved by the XML specification and should be avoided, even if a given parser happens to accept them.

阳光①夏 2025-01-21 18:38:18

IMO these cases should be solved by using JSoup.

What follows does not really answer this specific case, but I found it on the web (thanks to inuyasha82 on Coderwall). The code inspired my solution to a similar problem with malformed XML, so I am sharing it here.

Please do not edit what is below; it is reproduced as it appears on the original website.

To be well-formed, an XML document must have a single root element.
So, for example, a well-formed XML document is:

<root>
     <element>...</element>
     <element>...</element>
</root>

But if you have a document like:

<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>

This is considered malformed XML, so many XML parsers simply throw an exception complaining that there is no root element.

This example shows one way to work around that problem and successfully parse the malformed XML above.

Basically, what we will do is add a root element programmatically.

So, first of all, you have to open the resource that contains your "malformed" XML (e.g. a file):

File file = new File(pathtofile);

Then open a FileInputStream:

FileInputStream fis = new FileInputStream(file);

If we try to parse this stream with an XML library at this point, it will raise the malformed-document exception.

Now we create a list of InputStream objects with three elements:

- A ByteArrayInputStream that contains the string <root>
- Our FileInputStream
- A ByteArrayInputStream that contains the string </root>

So the code is:

List<InputStream> streams =
    Arrays.asList(
        new ByteArrayInputStream("<root>".getBytes()),
        fis,
        new ByteArrayInputStream("</root>".getBytes()));

Now, using a SequenceInputStream, we create a container for the list created above:

InputStream cntr =
    new SequenceInputStream(Collections.enumeration(streams));

Now we can use any XML parser library on cntr, and it will be parsed without any problem (checked with the StAX library).
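
Put together, the steps above might look like the following. This is a sketch rather than the original author's code: the class and method names are invented for illustration, and it uses DocumentBuilder instead of StAX. It also assumes the fragment carries no XML declaration, DOCTYPE, or byte-order mark, since none of those may appear after the injected <root> tag.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper: parses a rootless XML fragment by adding a root on the fly.
final class RootWrapper {
    /** Surrounds the fragment stream with <root> ... </root> without copying it. */
    static InputStream wrapInRoot(InputStream fragment) {
        List<InputStream> streams = Arrays.asList(
            new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
            fragment,
            new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    /** Parses the wrapped stream; closing it also closes the fragment stream. */
    static Document parseFragment(InputStream fragment) {
        try (InputStream wrapped = wrapInRoot(fragment)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a file, `RootWrapper.parseFragment(new FileInputStream(file))` would return a Document whose document element is the injected root, with the original top-level elements as its children.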
