Handling invalid XML hexadecimal characters
I'm trying to send an XML document over the wire but receiving the following exception:
"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
at System.Xml.XmlRawWriter.WriteValue(String value)
at System.Xml.XmlWellFormedWriter.WriteValue(String value)
at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
--- End of inner exception stack trace ---
I don't have any control over what I attempt to send because the string is gathered from an email. How can I encode my string so it's valid XML while keeping the illegal characters?
I'd like to keep the original characters one way or another.
The following code removes XML invalid characters from a string and returns a new string without them:
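A minimal sketch of that kind of removal, not the answer's own code: it assumes a regex over UTF-16 code units is acceptable, and the names XmlSanitizer and CleanInvalidXmlChars are made up for illustration. It drops the control characters XML 1.0 forbids, U+FFFE/U+FFFF, and unpaired surrogates, while leaving valid surrogate pairs alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSanitizer // hypothetical helper name
{
    // Code units that cannot appear in an XML 1.0 document:
    // C0 controls other than tab/LF/CR, U+FFFE, U+FFFF, and unpaired surrogates.
    private static readonly Regex InvalidXmlChars = new Regex(
        @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +   // high surrogate with no low one after it
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",   // low surrogate with no high one before it
        RegexOptions.Compiled);

    // Returns a copy of text with every invalid code unit removed.
    public static string CleanInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;
        return InvalidXmlChars.Replace(text, string.Empty);
    }
}
```

Note that this throws the offending characters away, so it does not satisfy the "keep the original characters" part of the question.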
is one way of doing this
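The snippet this answer refers to is not present here. One approach that does keep every original character, which is what the question asks for, is to Base64-encode the text before it goes into the element and decode it on the receiving side. This is only a sketch under that assumption; XmlSafeText, Encode and Decode are illustrative names, and both ends have to agree on the encoding.

```csharp
using System;
using System.Text;

public static class XmlSafeText // hypothetical helper name
{
    // Base64 output uses only A-Z, a-z, 0-9, '+', '/', '=', all of which are legal in XML.
    public static string Encode(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

    // Reverses Encode and restores the control characters exactly.
    public static string Decode(string base64) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(base64));
}
```

The cost is that the element is no longer human-readable and grows by roughly a third. Unpaired surrogates would not survive the UTF-8 step, but ordinary control characters such as 0x02 do.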
Another way to remove invalid XML characters in C# is the XmlConvert.IsXmlChar method (available since .NET Framework 4.0).
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab character (\v) is valid UTF-8 but not valid in XML 1.0. Even many libraries (including libxml2) miss it and silently output invalid XML.
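A short sketch of the IsXmlChar approach, in case the fiddle link goes stale. XmlCharFilter and RemoveInvalidXmlChars are illustrative names, not taken from the answer. Because IsXmlChar looks at a single UTF-16 code unit, the sketch also calls XmlConvert.IsXmlSurrogatePair so that valid surrogate pairs are kept.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter // hypothetical helper name
{
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c); // ordinary legal character
            }
            else if (i + 1 < text.Length &&
                     XmlConvert.IsXmlSurrogatePair(text[i + 1], c)) // (lowChar, highChar)
            {
                sb.Append(c).Append(text[i + 1]); // keep the whole pair
                i++;
            }
            // anything else (e.g. \v, 0x02, a lone surrogate) is dropped
        }
        return sb.ToString();
    }
}
```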
This worked for me:
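The code for this answer is missing from the page. One setting that avoids the exception without deleting anything is XmlWriterSettings.CheckCharacters = false, so here is a sketch on the assumption that you create the XmlWriter yourself (in the question the writer lives inside the EWS library, where this may not be possible). LenientXml and WriteBody are illustrative names.

```csharp
using System.IO;
using System.Xml;

public static class LenientXml // hypothetical helper name
{
    // Writes <Body>...</Body> without character checking. Characters that XML 1.0
    // forbids are emitted as numeric character references instead of throwing.
    public static string WriteBody(string body)
    {
        var settings = new XmlWriterSettings
        {
            CheckCharacters = false,
            OmitXmlDeclaration = true
        };

        using (var sw = new StringWriter())
        {
            using (var writer = XmlWriter.Create(sw, settings))
            {
                writer.WriteElementString("Body", body);
            }
            return sw.ToString();
        }
    }
}
```

The result is still not well-formed XML 1.0, so a strict parser on the other end will reject it; the reader would need XmlReaderSettings.CheckCharacters = false as well.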
The following solution removes any invalid XML characters, and I think it does so about as efficiently as possible. In particular, it does not allocate a new StringBuilder or a new string unless it has already determined that the string contains an invalid character. The hot spot is therefore a single for loop over the characters, where the check is often no more than two greater-than/less-than comparisons per char. If no invalid characters are found, it simply returns the original string. This is particularly helpful when the vast majority of strings are fine to begin with: they pass through as quickly as possible, with no wasted allocations.
-- update --
See below how one can also write an XElement that contains these invalid characters directly, though it uses this code --
Some of this code was influenced by Mr. Tom Bogle's solution here. See also the helpful information in the post by superlogical on that same thread. All of those, however, still always instantiate a new StringBuilder and string.
USAGE:
TEST:
// --- CODE --- (I have these methods in a static utility class called XML)
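The original methods did not survive on this page, so what follows is only a sketch of the idea described above: scan first, and allocate a StringBuilder only once an invalid character has actually been found. XmlFast and LegalLength are illustrative names, and the surrogate-pair handling is my assumption about how it should behave.

```csharp
using System.Text;

public static class XmlFast // hypothetical name; the answer's class is called XML
{
    // Returns 0 if the code unit at index i is illegal in XML 1.0, otherwise the
    // number of code units (1, or 2 for a surrogate pair) forming a legal character.
    private static int LegalLength(string s, int i)
    {
        char c = s[i];
        if (c >= 0x20)
        {
            if (c <= 0xD7FF || (c >= 0xE000 && c <= 0xFFFD)) return 1;
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1])) return 2;
            return 0; // lone surrogate, U+FFFE or U+FFFF
        }
        return (c == '\t' || c == '\n' || c == '\r') ? 1 : 0;
    }

    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        // Pass 1: look for the first offender. Nothing is allocated here.
        int i = 0;
        int len = 0;
        while (i < text.Length && (len = LegalLength(text, i)) > 0) i += len;
        if (i == text.Length) return text; // clean: hand back the same instance

        // Pass 2: only now allocate, copy the clean prefix, then filter the rest.
        var sb = new StringBuilder(text.Length);
        sb.Append(text, 0, i);
        i++; // skip the first offender
        while (i < text.Length)
        {
            len = LegalLength(text, i);
            if (len > 0) { sb.Append(text, i, len); i += len; }
            else i++;
        }
        return sb.ToString();
    }
}
```

For a clean input, object.ReferenceEquals(result, input) is true, which is the whole point of the two-pass layout.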
======== ======== ========
Write XElement.ToString directly
======== ======== ========
First, the usage of this extension method:
-- Fuller test --
--- code ---
-- this uses the following XmlTextWriter --
I'm on the receiving end of @parapurarajkumar's solution: the illegal characters load into XmlDocument without complaint, but break XmlWriter when I try to save the output.
My context
I'm looking at exception/error logs from the website using Elmah. Elmah returns the state of the server at the time of the exception as one large XML document. For our reporting engine I pretty-print that XML with XmlWriter.
During a website attack, I noticed that some of the XML documents weren't parsing, and I was getting this exception:
'.', hexadecimal value 0x00, is an invalid character.
Non-resolution: I converted the document to a byte[] and sanitized it of 0x00 bytes, but none were found.
When I scanned the XML document, I found that the NUL byte was there, encoded as an HTML entity: &#x0;
Resolution: to fix the encoding, I replaced the &#x0; value before loading the document into my XmlDocument, because loading it creates the actual NUL character, which is then difficult to sanitize out of the object. Here's my entire process:
Lesson learned: if your incoming data is HTML-encoded on entry, sanitize for illegal bytes using the associated HTML entity.
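The replace-before-load step described above could look roughly like this. It is a sketch rather than the poster's actual process; NulEntityCleaner and PrettyPrint are illustrative names, and the regex assumes the NUL only ever shows up as a numeric character reference (&#0;, &#x0;, &#x00; and so on).

```csharp
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class NulEntityCleaner // hypothetical helper name
{
    // Numeric character references that resolve to U+0000.
    private static readonly Regex NulReference =
        new Regex(@"&#(?:x0+|0+);", RegexOptions.IgnoreCase);

    public static string PrettyPrint(string rawXml)
    {
        // Strip the references while the document is still plain text.
        string cleaned = NulReference.Replace(rawXml, string.Empty);

        var doc = new XmlDocument();
        doc.LoadXml(cleaned);

        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        using (var sw = new StringWriter())
        {
            using (var writer = XmlWriter.Create(sw, settings))
            {
                doc.Save(writer); // no NUL left for the writer to reject
            }
            return sw.ToString();
        }
    }
}
```

The same pattern extends to other forbidden references (for example &#x2;) by widening the regex.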
There is a generic solution that works nicely:
Once this is in place, you can then create your override of THIS as follows:
where XmlUtil.RemoveInvalidXmlChars is defined as follows:
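The three snippets are missing here, so this is a reconstruction of the shape being described, under stated assumptions: a writer that runs every text value through a transform, a subclass that plugs in the cleaner, and the cleaner itself. XmlUtil.RemoveInvalidXmlChars is the name used in the answer, but its body below is my own guess; XmlTextTransformWriter and XmlRemoveInvalidCharacterWriter are illustrative names.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;

// Generic piece: an XmlTextWriter that transforms text before writing it.
public class XmlTextTransformWriter : XmlTextWriter
{
    private readonly Func<string, string> _transform;

    public XmlTextTransformWriter(TextWriter w, Func<string, string> transform) : base(w)
    {
        _transform = transform ?? (s => s);
    }

    // Element text and attribute values both pass through WriteString.
    public override void WriteString(string text) => base.WriteString(_transform(text));
}

// The "override of this": the same writer with the cleaner plugged in.
public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
    public XmlRemoveInvalidCharacterWriter(TextWriter w)
        : base(w, XmlUtil.RemoveInvalidXmlChars) { }
}

public static class XmlUtil
{
    // Assumed body: keeps only code units that XmlConvert.IsXmlChar accepts.
    // Simplification: this also drops surrogate pairs (characters above U+FFFF).
    public static string RemoveInvalidXmlChars(string text) =>
        string.IsNullOrEmpty(text)
            ? text
            : new string(text.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
}
```

Usage would be along the lines of wrapping a StringWriter in new XmlRemoveInvalidCharacterWriter(sw) and handing that writer to whatever produces the XML, for example XElement.WriteTo.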
Can't the string be cleaned with:
?