What codepage encodes 'ç' as '?º' (0x3f 0xba)?
Today I received a file from a customer that I have to read, but it contains strange characters. Using known names, I can guess the meaning of some characters.
For example:
Real name | Encoded as | Character | Hex
----------|------------|-----------|-------
Françios | Fran?ºios | ç | 3f ba
André | Andr?? | é | 3f 3f
Hélène | H??l?¿ne | è | 3f bf
etc.
- I tried importing the file with every codepage known to .NET and checked whether the result contained the words I know, but no codepage produced the expected text.
- Notepad++ detects the file as ANSI and also shows the unwanted characters. (It does have a useful hex-editor plugin, though.)
- Other files (from the same user and the same zip file) are encoded in UTF-8.
I cannot expect help from the person who sent me the files. Through Google Translate he made it clear that he found it very hard just to create the files, and that he is using software (SAP, I believe) that I do not have access to.
Is there any other way I can find the encoding of the files he just sent me?
Comments (2)
I can get those results if I take UTF-8 encoded text, pretend it is CP850, and then convert it to Latin-1, Windows-1252, or a similar encoding. The "?" comes from the fact that the CP850 character at 0xc3 is "├", which doesn't exist in Latin-1 or derived encodings, so the conversion replaces it with a "?".
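The round trip described above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `mangle` and its parameters are made up for illustration, and it assumes the conversion step substitutes "?" for unmappable characters, which is what the question's hex dump suggests.

```python
def mangle(text: str, wrong_codepage: str = "cp850", target: str = "latin-1") -> bytes:
    """Simulate the suspected damage: UTF-8 bytes are misread as
    `wrong_codepage`, then the resulting text is converted to `target`."""
    # Step 1: the real bytes are UTF-8, but they get decoded with the wrong codepage.
    misread = text.encode("utf-8").decode(wrong_codepage)
    # Step 2: converting to the target encoding turns unmappable characters into "?".
    return misread.encode(target, errors="replace")


if __name__ == "__main__":
    # 'ç' is 0xC3 0xA7 in UTF-8; CP850 reads that as '├' and 'º'.
    print(mangle("ç"))  # b'?\xba'
```

Swapping in another codepage name, for example `mangle("é", "cp437")`, makes it easy to compare the outcome with the bytes seen in the file.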
Edit: I did a somewhat wider search using iconv, and CP437, CP862, and CP865 are better matches than CP850. Since you asked, the one-liner I used this time was:
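The iconv one-liner itself is not preserved here. As a stand-in, and not a reconstruction of it, the same kind of search can be sketched in Python. The sample table is taken from the question; the candidate list and the names `SAMPLES`, `CANDIDATES`, `matches`, and `matching_codepages` are assumptions chosen for illustration.

```python
# Observed byte pairs from the question, keyed by the character they should be.
SAMPLES = {"ç": b"?\xba", "é": b"??", "è": b"?\xbf"}

# Hypothetical shortlist of codepages to try; extend as needed.
CANDIDATES = [
    "cp437", "cp850", "cp852", "cp858", "cp860", "cp861",
    "cp862", "cp863", "cp865", "cp866", "cp1252", "latin-1",
]


def matches(codepage: str) -> bool:
    """True if misreading UTF-8 as `codepage` and converting to Latin-1
    reproduces every observed sample."""
    for char, observed in SAMPLES.items():
        try:
            misread = char.encode("utf-8").decode(codepage)
        except UnicodeDecodeError:
            return False  # this codepage has no mapping for one of the bytes
        if misread.encode("latin-1", errors="replace") != observed:
            return False
    return True


def matching_codepages(candidates=CANDIDATES):
    """Return the candidates that are consistent with all samples."""
    return [cp for cp in candidates if matches(cp)]


if __name__ == "__main__":
    print(matching_codepages())
```

With these three samples, CP850 drops out because it maps 0xA9 to '®', which Latin-1 can represent, so 'é' would come out as "?®" rather than "??". CP437, CP862, and CP865 map 0xA9 to '⌐' instead, which is why they fit better.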
It should be UTF-8 or UTF-16; between them they cover practically every character in regular use.
It looks like you have a decoding/encoding problem.
Notepad++ may be confused because your files do not start with a byte-order mark.
How do you process your files?
Try reading them as binary and then decoding the bytes with different encodings to get a string.
If you do not read them as binary, a default encoding may be applied implicitly.
The "?" is a sign of that.
Maybe that helps.
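The "read as binary, then try encodings" suggestion could look roughly like this in Python. It is a sketch under assumptions: the function name `decode_candidates`, the encoding list, and the file name in the usage comment are all hypothetical.

```python
def decode_candidates(raw: bytes, encodings, expected_words):
    """Decode `raw` strictly with each encoding and keep the ones whose
    output contains every expected word."""
    hits = {}
    for enc in encodings:
        try:
            text = raw.decode(enc)  # strict: invalid byte sequences raise
        except (UnicodeError, LookupError):
            continue  # not valid in this encoding, or unknown codec name
        if all(word in text for word in expected_words):
            hits[enc] = text
    return hits


if __name__ == "__main__":
    # In practice the bytes would come from the file, read in binary mode, e.g.
    #   raw = open("customer_file.txt", "rb").read()   # hypothetical file name
    raw = "André".encode("utf-8")
    print(list(decode_candidates(raw, ["utf-8", "cp1252", "utf-16"], ["André"])))
```

Note that this only finds an encoding if the bytes are still intact. If "?" (0x3f) is literally stored in the file, as the hex values in the question indicate, the original byte has already been lost and no single decoding will bring it back.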