What character set is this?
I received a bunch of CSV files from a client (that appear to be a database dump), and many of the columns have weird characters like this:
- Alain Lefèvre
- Angèle Dubeau & La PietÃÂÂ
That seems like an awful lot of characters to represent an é. Does anyone know what encoding would produce that many characters for é? I have no idea where they're getting these CSV files from, but assuming I can't get them in a better format, how would I convert them to something like UTF-8?
It seems like it's a double-re-misdecoded UTF-8. It may be possible to recover the data by opening it as utf-8, saving it as Latin-1 (perhaps), and opening it as UTF-8 again.
It looks like it's been through a corruption process where the data was written as UTF-8 but read in as cp1252, and this happened three times. This might be recoverable (I don't know if it will work for every character, but at least for some) by putting the corrupted data through the reverse transformation: read in as UTF-8, write out as cp1252, repeat. There are plenty of ways of doing that kind of conversion: using a text editor as Tordek suggests, using command-line tools as below, or using the encoding features built in to your database or programming language.
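If you go the programming-language route, here is a rough sketch of that "encode as cp1252, decode as UTF-8, repeat" loop in Python. The function names `undo_mojibake` and `_to_cp1252_bytes` are my own, and the fallback for the five bytes that Python's cp1252 codec leaves undefined is an assumption about how the data was originally mangled (Windows passes those bytes through unchanged). Treat it as a starting point, not a guaranteed fix for every field.

```python
def _to_cp1252_bytes(text: str) -> bytes:
    """Encode text as cp1252, one character at a time.

    Python's cp1252 codec has no mapping for U+0081, U+008D, U+008F,
    U+0090 and U+009D. Assumption: the original mis-decoding passed those
    bytes through unchanged, so we map them back to the same byte value.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) < 0x100:
                out.append(ord(ch))
            else:
                raise
    return bytes(out)


def undo_mojibake(text: str, max_rounds: int = 5) -> str:
    """Repeatedly reverse a 'written as UTF-8, read as cp1252' mix-up.

    Stops as soon as a round fails or changes nothing, so text that is
    already clean is returned as-is.
    """
    for _ in range(max_rounds):
        try:
            repaired = _to_cp1252_bytes(text).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # this round is not reversible; keep the last good text
        if repaired == text:
            break  # pure ASCII or nothing left to repair
        text = repaired
    return text


# Simulate the triple corruption described above, then undo it.
mangled = "Lefèvre"
for _ in range(3):
    mangled = mangled.encode("utf-8").decode("cp1252")

print(undo_mojibake(mangled))  # prints Lefèvre
```

For a real CSV you would probably want to apply this per field rather than to the whole file at once, so that one unrecoverable value doesn't stop the rest from being repaired.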
Remember, character ≠ byte. What you're seeing in the output is characters; you'll need to do something unusual to actually see the bytes. (I suggest 'xxd', a tool that is installed with the Vim application; or 'od', one of the core utilities of the GNU operating system.) One tool that is good at guessing the character encoding of a byte stream is 'enca', the Extremely Naive Charset Analyser.
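If neither of those tools is at hand, the same character-versus-byte distinction can be seen from Python. This is only a small sketch: `hex_dump` is a helper name I made up, and its output is only loosely modelled on a hex viewer, not an exact copy of any tool's format.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a simple offset / hex / printable-ASCII dump of some bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {ascii_part}")
    return "\n".join(lines)


name = "Lefèvre"
raw = name.encode("utf-8")

print(len(name), len(raw))  # prints 7 8: seven characters, eight bytes
print(hex_dump(raw))        # the è shows up as the two bytes c3 a8
```

To look at one of the actual files, read it in binary mode (`open(path, "rb")`, with `path` being wherever your CSV lives) and pass the first few hundred bytes to the helper.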