Fixing the encoding in large XML files

Posted on 2024-10-07 23:46:38

I'm importing data from XML files containing this type of content:

<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>

The XML is loaded via:

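 XmlDocument doc = new XmlDocument();

 try
 {
     doc.Load(fullFilePath);
 }
 catch (XmlException ex)
 {
     // The "illegal character" error surfaces here.
     Console.WriteLine(ex.Message);
 }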

When I execute this code with the data shown above, I get an exception about an illegal character. I understand that part just fine.

I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?


Update: I do not have any encoding declaration or <?xml in this document.

I've seen some links say to add it dynamically? Is this UTF-16 encoding?


Comments (4)

要走干脆点 2024-10-14 23:46:38

It appears that:

  • The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
  • The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
  • But it was decoded using windows-1252 (the "ANSI" code page).
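
The three points above can be checked with a short round trip. This is only a sketch: the class and method names are made up for illustration, and it assumes a runtime where `CodePagesEncodingProvider` is available (.NET Core 3.0 or later, or the System.Text.Encoding.CodePages package). It encodes the presumed name with code page 437, decodes the same bytes as windows-1252, and then reverses the mistake.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encode text with one code page, then decode the same bytes with another.
    public static string Misdecode(string text, int writtenAs, int readAs)
    {
        // Makes legacy code pages such as 437 and 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] bytes = Encoding.GetEncoding(writtenAs).GetBytes(text);
        return Encoding.GetEncoding(readAs).GetString(bytes);
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // "ÖMÜR HÜNERÖZ" written as cp437 but read as windows-1252.
        string garbled = Misdecode("\u00D6M\u00DCR H\u00DCNER\u00D6Z", 437, 1252);
        Console.WriteLine(garbled);   // ™MšR HšNER™Z

        // Reversing the two code pages restores the original text.
        Console.WriteLine(Misdecode(garbled, 1252, 437));   // ÖMÜR HÜNERÖZ
    }
}
```

Passing 850 instead of 437 gives the same result for this name, which is why the two code pages cannot be told apart from this sample alone.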

仅此而已 2024-10-14 23:46:38

If you look at the file with a hex editor (HXD or Visual Studio, for instance), what exactly do you see?

Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
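
If no hex editor is at hand, the same two questions can be answered from code. The following is a rough sketch, not part of the original answer: `HexPeek` and `DetectBom` are hypothetical names, and the file path is assumed to come from the command line. It prints the first bytes of the file in hex and reports whether they start with a UTF-8 or UTF-16 byte-order mark.

```csharp
using System;
using System.IO;

static class HexPeek
{
    // Name the byte-order mark at the start of the buffer, or "none".
    public static string DetectBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return "none";
    }

    static void Main(string[] args)
    {
        byte[] head = new byte[64];
        int read;

        // Only the first bytes are needed, so the large file is not loaded in full.
        using (FileStream fs = File.OpenRead(args[0]))
        {
            read = fs.Read(head, 0, head.Length);
        }

        Array.Resize(ref head, read);
        Console.WriteLine(BitConverter.ToString(head));   // e.g. 3C-46-69-72-...
        Console.WriteLine("BOM: " + DetectBom(head));
    }
}
```

A file with one byte per character and no BOM, where the odd characters show up as single bytes such as 99 and 9A, points to a legacy single-byte code page rather than UTF-8 or UTF-16.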

The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...

Edit: dan04 hit the nail on the head. In cp-1252, ™ has hex value 99 and š is 9a. In cp-437 and cp-850, hex 99 represents Ö and 9a represents Ü.

The fix is simple: just specify this encoding when opening your XML file:

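// Requires: using System.IO; using System.Text; using System.Xml;
XmlDocument doc = new XmlDocument();

// Decode the file as DOS code page 437 before the XML parser sees it.
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
    doc.Load(reader);
}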
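
One caveat if this runs on .NET Core or .NET 5+ (an assumption, since the thread does not name the runtime): `Encoding.GetEncoding(437)` throws `NotSupportedException` there until `CodePagesEncodingProvider` has been registered. The sketch below wraps the same fix in a hypothetical helper, `OemXmlLoader.Load`, and exercises it on an in-memory sample with an invented `Person` root element instead of a real file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class OemXmlLoader
{
    // Load XML whose bytes are in a legacy code page such as 437.
    public static XmlDocument Load(Stream input, int codePage)
    {
        // Required on .NET Core / .NET 5+ before asking for code page 437.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var doc = new XmlDocument();
        using (var reader = new StreamReader(input, Encoding.GetEncoding(codePage)))
        {
            // The parser receives already-decoded characters from the reader.
            doc.Load(reader);
        }
        return doc;
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Console.OutputEncoding = Encoding.UTF8;

        // Hypothetical sample: a small document stored as cp437 bytes.
        byte[] raw = Encoding.GetEncoding(437).GetBytes(
            "<Person><FirstName>\u00D6M\u00DCR</FirstName></Person>");

        XmlDocument doc = Load(new MemoryStream(raw), 437);
        Console.WriteLine(doc.DocumentElement.InnerText);   // ÖMÜR
    }
}
```

For a file on disk, passing `File.OpenRead(fileName)` as the stream should behave like the snippet above.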

秋风の叶未落 2024-10-14 23:46:38

From here:

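// bytes holds the raw contents of the XML file, e.g. from File.ReadAllBytes(path).
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        // Reading up to the first content node lets the reader settle on an encoding.
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}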

You might want to take a look at this: How to best detect encoding in XML file?

For the actual reading, you can use StreamReader to take care of the BOM (byte-order mark):

string xml;

// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}

Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.

Edit 2: Detecting Text Encoding for StreamReader
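
To see what StreamReader actually settled on, its `CurrentEncoding` property can be inspected after the first read. A minimal sketch with a hypothetical helper, `BomSniffer.DetectedEncodingName`, fed from in-memory bytes rather than a real file:

```csharp
using System;
using System.IO;
using System.Text;

static class BomSniffer
{
    // Report the encoding StreamReader picks for the given bytes.
    public static string DetectedEncodingName(byte[] data)
    {
        using (var reader = new StreamReader(new MemoryStream(data), true))
        {
            // CurrentEncoding is only meaningful after the first read.
            reader.Read();
            return reader.CurrentEncoding.WebName;
        }
    }

    static void Main()
    {
        byte[] withBom = { 0xFF, 0xFE, 0x3C, 0x00 };   // UTF-16 LE BOM followed by '<'
        byte[] noBom = { 0x3C, 0x99, 0x3E };           // '<', cp437 'Ö', '>' with no BOM

        Console.WriteLine(DetectedEncodingName(withBom));   // utf-16
        Console.WriteLine(DetectedEncodingName(noBom));     // utf-8
    }
}
```

The second case matters for this question: a BOM-less file in a DOS code page is still reported as UTF-8, so BOM detection alone does not identify it and the code page has to be supplied explicitly.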

倾城°AllureLove 2024-10-14 23:46:38

Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an xml processing instruction at the top like <?xml version="1.0" encoding="UTF-8" ?>?
