Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
Comments (4)
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
   - Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest).
   - Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
   - Python: Beautiful Soup is Python-based. See the notes in its Differences between parsers section. See also the answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
   - Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
   - .NET: XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems. XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element. XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below. Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
   - Go: Set Decoder.Strict to false, as shown in this example by @chuckx.
   - PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
   - Ruby: Nokogiri supports “Gentle Well-Formedness”.
   - R: See htmlTreeParse() for fault-tolerant markup parsing in R.
   - Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text manually using a text editor, or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA sections into account.
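Since the question is about Java, here is a rough Java sketch of the two text-level cleanups above: dropping characters outside the XML 1.0 character range and escaping ampersands that do not start a reference. The class and method names are made up for illustration. The character class adds the supplementary range U+10000 to U+10FFFF because Java regexes match by code point, and the hex-digit class also accepts uppercase digits; both are my adjustments, not part of the regexes quoted above. Like those, it ignores comments and CDATA sections.

```java
import java.util.regex.Pattern;

public class XmlTextSanitizer {
    // Matches anything outside the XML 1.0 Char production. The supplementary
    // range is listed explicitly because Java regexes match by code point.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Matches an ampersand that does not begin a decimal, hex, or named reference.
    // Note: an undeclared named reference such as &foo; is left untouched.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
        "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    // Removes characters that no XML 1.0 parser is allowed to accept.
    static String stripInvalidChars(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }

    // Escapes stray ampersands while leaving existing references alone.
    static String escapeBareAmpersands(String s) {
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // (char) 0x0B is a vertical tab, which is not a legal XML 1.0 character.
        String raw = "<note>Tom & Jerry &amp; friends" + (char) 0x0B + "</note>";
        System.out.println(escapeBareAmpersands(stripInvalidChars(raw)));
        // prints: <note>Tom &amp; Jerry &amp; friends</note>
    }
}
```

Running the cleanup over the whole document as a string is the simplest arrangement; for large inputs the same patterns could be applied per chunk inside a FilterInputStream or FilterReader, with care taken around chunk boundaries.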
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
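As a minimal sketch of the CDATA route in Java, assuming the only trouble spot is the body of description, that the element never carries attributes or nests, and that its body is meant as literal text (so any existing entity references in it would become literal too): the class name, method names, and the record root element are hypothetical, and the regex-based matching is just one way to do the pre-processing.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DescriptionCdataWrapper {
    // Assumes <description> has no attributes and is never nested.
    private static final Pattern DESCRIPTION =
        Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    // Wraps the body of every <description> element in a CDATA section.
    static String wrapDescriptions(String raw) {
        Matcher m = DESCRIPTION.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" would end the CDATA section early, so split it across two sections.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            m.appendReplacement(out, Matcher.quoteReplacement(
                "<description><![CDATA[" + body + "]]></description>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Parses the repaired text with the standard DocumentBuilder.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("input is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<record><description>Example:Description:"
            + "<THIS-IS-PART-OF-DESCRIPTION></description></record>";
        Document doc = parse(wrapDescriptions(raw));
        System.out.println(
            doc.getElementsByTagName("description").item(0).getTextContent());
        // prints: Example:Description:<THIS-IS-PART-OF-DESCRIPTION>
    }
}
```

This only holds up as long as the breakage really is confined to that one leaf element; anything malformed elsewhere in the document will still make the parser throw.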
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is not really an answer for this specific case, but I found this on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me on another, similar problem while dealing with malformed XML, so I share it here.
Please do not edit what is below, as it is as it appears on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
But if you have a document like:
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and successfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
Then open a FileInputStream:
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three elements:
<root>
</root>
So the code is:
Now using a SequenceInputStream, we create a container for the List created above:
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);
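The code from the original page did not survive here, so what follows is my own rough reconstruction of the steps described, not the Coderwall author's code. It surrounds the rootless stream with an opening and a closing root tag by way of SequenceInputStream. The class and method names are hypothetical, the demo reads from an in-memory string instead of a FileInputStream, and it parses with DocumentBuilder where the original mentions StAX. It assumes the input does not begin with an XML declaration, which would no longer sit at the start of the document once the root tag is prepended.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RootWrapper {
    // Builds one stream that yields "<root>", then the fragment, then "</root>".
    static InputStream wrapInRoot(InputStream fragment) {
        List<InputStream> parts = Arrays.asList(
            new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
            fragment,
            new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Parses a rootless fragment and counts its top-level elements.
    static int countTopLevelElements(String fragment) {
        InputStream in =
            new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8));
        try (InputStream wrapped = wrapInRoot(in)) {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(wrapped);
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("fragment could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Two sibling elements and no single root: rejected by a parser as-is.
        System.out.println(countTopLevelElements("<a/><b>text</b>")); // prints 2
    }
}
```

For a file, passing a FileInputStream to wrapInRoot in place of the in-memory stream should work the same way, since SequenceInputStream only needs the three streams in order.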