Correcting encoding in a large XML file
I'm importing data from XML files containing this type of content:
<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>
The XML is loaded via:
XmlDocument doc = new XmlDocument();
try
{
    doc.Load(fullFilePath);
}
catch (XmlException)
{
    // Load throws here with the illegal-character error.
    throw;
}
When I execute this code against the data shown above, I get an exception about an illegal character. I understand that part just fine.
I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?
Update: This document has no encoding declaration and no <?xml ... ?> line at all.
I've seen some links that suggest adding it dynamically. Is this UTF-16 encoding?
It appears that the intended text is:
ÖMÜR HÜNERÖZ
(or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
If you look at the file with a hex editor (HxD or Visual Studio, for instance), what exactly do you see?
Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...
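If no hex editor is at hand, the same two questions (what are the raw bytes, and is there a BOM) can be answered with a few lines of C#. This is only a sketch: `FileProbe`, `HexHead`, and `DescribeBom` are hypothetical names made up for illustration, not part of any library.

```csharp
using System;
using System.IO;

public static class FileProbe
{
    // Returns up to `count` leading bytes of the file as hex, e.g. "3C-46-69".
    public static string HexHead(string path, int count = 64)
    {
        byte[] buffer = new byte[count];
        int read;
        using (FileStream stream = File.OpenRead(path))
        {
            // A single Read is enough for a quick look; it may return fewer bytes than requested.
            read = stream.Read(buffer, 0, buffer.Length);
        }
        return BitConverter.ToString(buffer, 0, read);
    }

    // Names the byte-order mark at the start of the file, or "none" if there is none.
    public static string DescribeBom(string path)
    {
        byte[] b = new byte[3];
        int read;
        using (FileStream stream = File.OpenRead(path))
        {
            read = stream.Read(b, 0, b.Length);
        }
        if (read >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (read >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE"; // UTF-32 LE starts the same way
        if (read >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return "none";
    }
}
```

Calling `FileProbe.HexHead(fullFilePath)` and `FileProbe.DescribeBom(fullFilePath)` on the problem file shows whether each odd character is stored as a single byte and whether the file announces its own encoding.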
Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9A. In cp-437 and cp-850, hex 99 represents Ö, and 9A represents Ü. The fix is simple: just specify this encoding when opening your XML file:
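A minimal sketch of that fix, assuming code page 850 (code page 437 maps these two bytes to the same letters). `LegacyXmlLoader` is a hypothetical helper name. The `RegisterProvider` line is needed on .NET Core and .NET 5+, where legacy code pages are not available by default; on the classic .NET Framework the code pages are built in, so that line can be removed.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class LegacyXmlLoader
{
    // Hypothetical helper: loads an XML file that carries no encoding declaration,
    // decoding its bytes with an explicitly chosen code page (850 is an assumption).
    public static XmlDocument Load(string fullFilePath, int codePage = 850)
    {
        // Makes legacy code pages such as 437 and 850 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(codePage);

        var doc = new XmlDocument();
        using (var reader = new StreamReader(fullFilePath, encoding))
        {
            // The reader has already turned bytes into characters,
            // so XmlDocument never has to guess the encoding.
            doc.Load(reader);
        }
        return doc;
    }
}
```

With this in place, `doc.Load(fullFilePath)` in the question would become `XmlDocument doc = LegacyXmlLoader.Load(fullFilePath);`. If the names still look wrong, trying `codePage: 437` or checking the bytes in a hex editor is the next step.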
From here:
You might want to take a look at this: How to best detect encoding in XML file? (https://stackoverflow.com/questions/637855/how-to-best-detect-encoding-in-xml-file)
For the actual reading you can use a StreamReader, which takes care of the BOM (byte order mark):
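A minimal sketch of that approach, not necessarily the code this answer originally contained. `BomAwareXmlLoader` is a hypothetical name; the `StreamReader(string, bool)` constructor and `CurrentEncoding` property are standard.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class BomAwareXmlLoader
{
    // Loads the XML through a StreamReader that inspects the BOM,
    // and reports which encoding the reader ended up using.
    public static XmlDocument Load(string fullFilePath, out Encoding detected)
    {
        var doc = new XmlDocument();
        using (var reader = new StreamReader(fullFilePath, detectEncodingFromByteOrderMarks: true))
        {
            doc.Load(reader);
            // CurrentEncoding is only meaningful after the first read has happened.
            detected = reader.CurrentEncoding;
        }
        return doc;
    }
}
```

Note that this only helps when the file actually starts with a BOM. For a file without one, the reader falls back to UTF-8, in which lone bytes such as 0x99 and 0x9A are not valid, so they would not decode to Ö and Ü.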
Edit: Removed the encoding parameter. StreamReader detects the encoding of a file if the file contains a BOM. If it does not, it defaults to UTF-8.
Edit 2: Detecting Text Encoding for StreamReader
Obviously you provided only a fragment of the XML document, since it's missing a root element, so I'll assume that was your intention. Is there an XML declaration at the top, like
<?xml version="1.0" encoding="UTF-8" ?>
?