C# encoding character-encoding xml-parsing

Correcting encoding in large XML files

Posted on 2024-10-07 23:46:38

I'm importing data from XML files containing this type of content:

<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>

The XML is loaded via:

XmlDocument doc = new XmlDocument();

try
{
    doc.Load(fullFilePath);
}
catch (XmlException)
{
    // The illegal-character exception is raised by Load.
    throw;
}

When I run this code against the data shown above, I get an exception about an illegal character. I understand that part just fine.

I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?

Update: I do not have any encoding declaration or <?xml in this document.

I've seen some links that say to add it dynamically. Is this UTF-16 encoding?

要走干脆点 2024-10-14 23:46:38

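It looks like:

The name is ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using a DOS "OEM" code page (probably 437 or 850).
But it was decoded as windows-1252 (the "ANSI" code page).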
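
The diagnosis above can be checked by reversing the mis-decoding: re-encode the garbled text as windows-1252 to recover the original bytes, then decode those bytes as code page 437. The following is a minimal sketch, assuming .NET Core or .NET 5+ (where code pages 437 and 1252 are only available after registering `CodePagesEncodingProvider`). The names `MojibakeDemo` and `Repair` are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: undoes a "cp437 bytes read as windows-1252" mix-up.
    public static string Repair(string garbled)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework these code pages
        // are built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding ansi = Encoding.GetEncoding(1252); // how the bytes were wrongly decoded
        Encoding oem = Encoding.GetEncoding(437);   // how the file was actually encoded (assumed)

        byte[] originalBytes = ansi.GetBytes(garbled); // "™" -> 0x99, "š" -> 0x9A
        return oem.GetString(originalBytes);           // 0x99 -> "Ö", 0x9A -> "Ü"
    }
}
```

With the byte values cited in this thread, `MojibakeDemo.Repair("™MšR")` should give `ÖMÜR`. This is only a way to confirm the guess on a sample string; the real fix is to decode the file correctly in the first place.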

仅此而已 2024-10-14 23:46:38

If you look at the file with a hex editor (HxD or Visual Studio, for instance), what exactly do you see?

Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?

The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...

Edit: dan04 hit the nail on the head. In cp-1252, ™ is byte 0x99 and š is byte 0x9A. In cp-437 and cp-850, 0x99 represents Ö and 0x9A represents Ü.

The fix is simple: just specify this encoding when opening your XML file:

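// Requires: using System.IO; using System.Text; using System.Xml;
XmlDocument doc = new XmlDocument();

// Decode the file as code page 437 instead of relying on auto-detection.
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
    doc.Load(reader);
}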
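
One caveat if this runs on .NET Core or .NET 5+: `Encoding.GetEncoding(437)` fails there unless the code-page provider has been registered first. Below is a sketch of the same fix wrapped in a helper, assuming that runtime. `XmlCodePageLoader` and `LoadWithCodePage` are hypothetical names, and 437 remains the thread's guess (850 maps these two bytes to the same letters).

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class XmlCodePageLoader
{
    // Hypothetical helper: loads an XML file whose bytes are in a legacy code page.
    public static XmlDocument LoadWithCodePage(string path, int codePage)
    {
        // Makes legacy code pages such as 437 and 850 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var doc = new XmlDocument();
        using (var reader = new StreamReader(path, Encoding.GetEncoding(codePage)))
        {
            // The reader has already decoded the bytes, so the parser only sees characters.
            doc.Load(reader);
        }
        return doc;
    }
}
```

A call would look like `XmlCodePageLoader.LoadWithCodePage(fullFilePath, 437)`.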

秋风の叶未落 2024-10-14 23:46:38

From here:

Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}

You might want to take a look at this: How to best detect encoding in XML file? (https://stackoverflow.com/questions/637855/how-to-best-detect-encoding-in-xml-file)
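
To make the detection snippet above easier to try, here is a sketch that wraps it in a method taking the raw bytes. `XmlEncodingProbe` and `Detect` are hypothetical names. The reader can only report what the document itself announces through a byte order mark or an encoding declaration.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class XmlEncodingProbe
{
    // Hypothetical helper around the snippet above.
    public static Encoding Detect(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var xmlreader = new XmlTextReader(stream))
        {
            // Advance past the prolog so the reader has settled on an encoding.
            xmlreader.MoveToContent();
            return xmlreader.Encoding;
        }
    }
}
```

For a file like the one in the question, which has neither a BOM nor a declaration, the XML specification has parsers assume UTF-8, so a probe like this cannot reveal a code page such as 437.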

For the actual reading, you can use StreamReader to take care of the BOM (byte order mark):

string xml;

// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}

Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it defaults to UTF-8.

Edit 2: Detecting Text Encoding for StreamReader
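
Since a file without a BOM is read as UTF-8 by default, a file like the one in the question would still be decoded incorrectly. One possible combination, sketched below, is to keep BOM detection on but pass an explicit fallback encoding for files that carry no BOM. `TextFileReader` and `ReadAllText` are hypothetical names, and the choice of fallback (for instance code page 437) is an assumption the caller has to make.

```csharp
using System.IO;
using System.Text;

static class TextFileReader
{
    // Hypothetical helper: honors a BOM when present, otherwise decodes with 'fallback'.
    public static string ReadAllText(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

On .NET Core or .NET 5+, obtaining the fallback with `Encoding.GetEncoding(437)` requires calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` first.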
