XmlReader breaks on UTF-8 BOM

Posted on 2024-09-06 22:15:30

I have the following XML Parsing code in my application:

    public static XElement Parse(string xml, string xsdFilename)
    {
        var readerSettings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = new XmlSchemaSet()
        };
        readerSettings.Schemas.Add(null, xsdFilename);
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
        readerSettings.ValidationEventHandler +=
            (o, e) => { throw new Exception("The provided XML does not validate against the request's schema."); };

        var readerContext = new XmlParserContext(null, null, null, XmlSpace.Default, Encoding.UTF8);

        return XElement.Load(XmlReader.Create(new StringReader(xml), readerSettings, readerContext));
    }

I am using it to parse strings sent to my WCF service into XML documents, for custom deserialization.

It works fine when I read in files and send them over the wire (the request); I've verified that the BOM is not sent across. In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

    System.Xml.XmlException : Data at the root level is invalid. Line 1, position 1.

In the research I've done over the last hour or so, it appears that XmlReader should honor the BOM. If I manually remove the BOM from the front of the string, the response xml parses fine.

Am I missing something obvious, or at least something insidious?

EDIT: Here is the serialization code I'm using to return the response:

    private static string SerializeResponse(Response response)
    {
        var output = new MemoryStream();
        var writer = XmlWriter.Create(output);
        new XmlSerializer(typeof(Response)).Serialize(writer, response);
        var bytes = output.ToArray();
        var responseXml = Encoding.UTF8.GetString(bytes);
        return responseXml;
    }

If it's just a matter of the xml incorrectly containing the BOM, then I'll switch to

    var responseXml = new UTF8Encoding(false).GetString(bytes);

but it was not clear at all from my research that the BOM was illegal in the actual XML string; see e.g. c# Detect xml encoding from Byte Array?

好菇凉咱不稀罕他 2024-09-13 22:15:31

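In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

So you want to prevent the BOM from being added during serialization. Unfortunately, you did not provide your serialization logic.

What you should do is create a UTF8Encoding instance with the UTF8Encoding(bool) constructor, passing false to disable generation of the BOM, and pass this Encoding instance to whichever method you are using to generate the intermediate string.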
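
A minimal sketch of that suggestion, assuming the bytes are produced by an XmlWriter over a MemoryStream as in the question's SerializeResponse. The class name BomFreeSerializer and the generic Serialize<T> method are hypothetical, introduced only for illustration; XmlWriterSettings.Encoding and UTF8Encoding(bool) are the real framework members being relied on.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper, not part of the original post.
public static class BomFreeSerializer
{
    // Serializes a value to an XML string whose first character is '<'.
    public static string Serialize<T>(T value)
    {
        var settings = new XmlWriterSettings
        {
            // false: the writer does not emit the EF BB BF preamble into the stream.
            Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)
        };

        using (var output = new MemoryStream())
        {
            // Dispose the writer first so everything is flushed to the stream.
            using (var writer = XmlWriter.Create(output, settings))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }

            // The bytes carry no BOM, so the decoded string has no leading U+FEFF.
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

The point of the design is that the BOM is suppressed where the bytes are written, so nothing has to be stripped afterwards when they are decoded.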

她说她爱他 2024-09-13 22:15:31

The XML string must not (!) contain the BOM. The BOM is only allowed in byte data (e.g. streams) that is encoded as UTF-8. This is because the string representation is not encoded; it is already a sequence of Unicode characters.

It therefore seems that you load the string incorrectly, in code that you unfortunately didn't provide.
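
To make the byte-versus-string distinction concrete, here is a small sketch. It assumes a hard-coded byte array for illustration; BomDemo, StartsWithBom and StripBom are hypothetical names. Encoding.GetString decodes every byte it is given, so the three BOM bytes become the character U+FEFF at the start of the string. The bool passed to the UTF8Encoding constructor only controls what GetPreamble returns when writing, so the change proposed in the question, new UTF8Encoding(false).GetString(bytes), would be expected to leave the character in place.

```csharp
using System.Text;

// Hypothetical demo class, not part of the original post.
public static class BomDemo
{
    // UTF-8 bytes for "<a/>" preceded by the UTF-8 BOM (EF BB BF).
    public static readonly byte[] BytesWithBom =
        { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };

    // Decodes the sample bytes and reports whether U+FEFF survived as the first character.
    public static bool StartsWithBom(Encoding encoding)
    {
        string text = encoding.GetString(BytesWithBom);
        return text.Length > 0 && text[0] == '\uFEFF';
    }

    // Defensive cleanup for a string that was already decoded with the BOM included.
    public static string StripBom(string text)
    {
        return text.TrimStart('\uFEFF');
    }
}
```

StartsWithBom returns true for both Encoding.UTF8 and new UTF8Encoding(false). StripBom is only a workaround for strings that already carry the character; avoiding the byte round trip is the cleaner fix.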

Edit:

Thanks for posting the serialization code.

You should not write the data to a MemoryStream, but rather to a StringWriter which you can then convert to a string with ToString. Since this avoids passing through a byte representation it is not only faster but also avoids such problems.

Something like this:

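    private static string SerializeResponse(Response response)
    {
        // StringWriter holds characters, not bytes, so no BOM is ever produced.
        using (var output = new StringWriter())
        {
            using (var writer = XmlWriter.Create(output))
            {
                new XmlSerializer(typeof(Response)).Serialize(writer, response);
            }
            // The XmlWriter is disposed (and flushed) before the text is read back.
            return output.ToString();
        }
    }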
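
One side effect worth knowing about: a plain StringWriter reports UTF-16 as its encoding, so the XML declaration in the result reads encoding="utf-16". If the consumer expects a UTF-8 declaration, a common workaround is a StringWriter subclass that overrides the Encoding property. The sketch below assumes that approach; Utf8StringWriter and StringSerializer are hypothetical names, not framework types.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper: a StringWriter that reports UTF-8, so the
// XML declaration written by XmlWriter says encoding="utf-8".
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

// Hypothetical helper, not part of the original post.
public static class StringSerializer
{
    public static string Serialize<T>(T value)
    {
        using (var output = new Utf8StringWriter())
        {
            using (var writer = XmlWriter.Create(output))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }

            // Still a pure character pipeline: no bytes, hence no BOM.
            return output.ToString();
        }
    }
}
```

Only the declared encoding changes here. The in-memory string is still a sequence of UTF-16 code units, and it is the caller's job to encode it as UTF-8 when it is finally written out as bytes.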

北陌 2024-09-13 22:15:31

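The BOM shouldn't be in the string in the first place.
BOMs are used to detect the encoding of a raw byte array; they have no business being in an actual string.

Where does the string come from?
You are probably reading it with the wrong encoding.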
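
If the text does start out as raw bytes, one option is to let StreamReader do the decoding, since it consumes a leading BOM instead of handing it back as a character. A minimal sketch, assuming the payload is available as a byte array; TextDecoding and ReadAllText are hypothetical names.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original post.
public static class TextDecoding
{
    // Decodes bytes to a string; a leading BOM is used for detection and then dropped.
    public static string ReadAllText(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8,
                                             detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

UTF-8 is only the fallback here: with detectEncodingFromByteOrderMarks set to true, a UTF-16 BOM at the start of the data switches the reader to UTF-16.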

你在我安 2024-09-13 22:15:31

Strings in C# are encoded as UTF-16, so a UTF-8 BOM inside one is wrong. As a general rule, always encode XML to a byte array and decode it from a byte array.
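
A sketch of the decode half of that rule, assuming the response is available as bytes rather than as a string: when XElement.Load is given a stream, the XML reader does its own encoding detection and treats the BOM as an encoding signature rather than as document content. ByteParsing and ParseBytes are hypothetical names, and schema validation from the question's Parse method is left out for brevity.

```csharp
using System.IO;
using System.Xml.Linq;

// Hypothetical helper, not part of the original post.
public static class ByteParsing
{
    // Parses XML directly from bytes; a leading BOM is handled by the reader.
    public static XElement ParseBytes(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        {
            return XElement.Load(stream);
        }
    }
}
```

The same bytes decoded to a string first and fed through a StringReader would fail with the "Data at the root level is invalid" error from the question, because by then the BOM is an ordinary character in front of the root element.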

萌面超妹 2024-09-13 22:15:31

Here is how to convert the MemoryStream to a string that can be used with XmlDocument (the Skip function is Linq):

    public static string Decode(MemoryStream ms)
    {
      var fileBytes = ms.ToArray();
      // Length checks guard against inputs shorter than the BOM being tested.
      var isUnicode = fileBytes.Length >= 2 && fileBytes[0] == 0xff && fileBytes[1] == 0xfe;  // UTF-16 little endian
      var isUtf8Bom = fileBytes.Length >= 3 && fileBytes[0] == 0xef && fileBytes[1] == 0xbb && fileBytes[2] == 0xbf;

      // GetString does not strip a BOM, so skip the BOM bytes in both cases.
      if (isUnicode)
        return Encoding.Unicode.GetString(fileBytes.Skip(2).ToArray());

      var utf8 = new UTF8Encoding(false, true);  // true: throw on invalid bytes
      return isUtf8Bom
        ? utf8.GetString(fileBytes.Skip(3).ToArray())
        : utf8.GetString(fileBytes);
    }
