What character set is this?

Posted 2024-08-15 15:28:10

I received a bunch of CSV files from a client (that appear to be a database dump), and many of the columns have weird characters like this:

  • Alain LefÃƒÆ’Ã‚Â¨vre
  • AngÃƒÆ’Ã‚Â¨le Dubeau & La PietÃƒÆ’Ã‚Â 

That seems like an awful lot of characters to represent an é. Does anyone know what encoding would produce that many characters for é? I have no idea where they're getting these CSV files from, but assuming I can't get them in a better format, how would I convert them to something like UTF-8?

Comments (3)

美煞众生 2024-08-22 15:28:10

It looks like UTF-8 that has been mis-decoded and re-encoded twice. It may be possible to recover the data by opening it as UTF-8, saving it as Latin-1 (perhaps), and opening it as UTF-8 again.
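To make the idea concrete, here is a minimal Python sketch of how this kind of damage can arise and be undone. It assumes the wrong codec really was Latin-1, which the answer above only guesses at, and the names `garble` and `ungarble` are made up for illustration.

```python
def garble(text: str, wrong: str = "latin-1") -> str:
    """Simulate one round of damage: UTF-8 bytes decoded with the wrong codec."""
    return text.encode("utf-8").decode(wrong)


def ungarble(text: str, wrong: str = "latin-1") -> str:
    """Undo one round: recover the original bytes, then decode them as UTF-8."""
    return text.encode(wrong).decode("utf-8")


original = "Lef\u00e8vre"                # "Lefèvre", written with an escape
damaged = garble(garble(original))       # mis-decoded twice (assumed count)
repaired = ungarble(ungarble(damaged))   # reversed the same number of times

print(len(original), len(damaged))       # 7 10
print(repaired == original)              # True
```

Each bad round replaces every non-ASCII character with two characters, which is why a single è grows so quickly.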

放飞的风筝 2024-08-22 15:28:10

It looks like the data has been through a corruption process in which it was written as UTF-8 but read in as cp1252, and this happened three times. This might be recoverable (I don't know whether it will work for every character, but it should for at least some) by putting the corrupted data through the reverse transformation: read in as UTF-8, write out as cp1252, repeat. There are plenty of ways of doing that kind of conversion: using a text editor as Tordek suggests, using command-line tools as below, or using the encoding features built into your database or programming language.

unix shell prompt> echo 'Alain LefÃƒÆ’Ã‚Â¨vre' |
iconv -f utf-8 -t cp1252 |
iconv -f utf-8 -t cp1252 |
iconv -f utf-8 -t cp1252

Alain Lefèvre

unix shell prompt>
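
The same reverse transformation can be sketched with a programming language's built-in codecs. The Python below is only an illustration under stated assumptions: the wrong codec is taken to be cp1252, the number of rounds is treated as unknown, and `repair` is a hypothetical helper name. Python's cp1252 codec leaves a few byte values (such as 0x81 and 0x8D) undefined, so, as the answer cautions, some characters may not survive the round trip.

```python
def repair(text: str, wrong: str = "cp1252", max_rounds: int = 5) -> str:
    """Repeatedly undo 'UTF-8 bytes read as `wrong`' until it no longer applies."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode(wrong).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # text is not (or is no longer) mis-decoded UTF-8
        if candidate == text:
            break  # pure ASCII, nothing left to undo
        text = candidate
    return text


# Build a damaged sample by corrupting a clean name three times.
clean = "Alain Lef\u00e8vre"
damaged = clean
for _ in range(3):
    damaged = damaged.encode("utf-8").decode("cp1252")

print(ascii(damaged))             # ascii() keeps the output safe on any terminal
print(repair(damaged) == clean)   # True
```

Stopping at the first failed round avoids having to know in advance how many times the data was corrupted, at the cost of possibly "repairing" text that legitimately contains sequences like Ã¨.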

反目相谮 2024-08-22 15:28:10

That seems like an awful lot of characters to represent an é.

Remember, character ≠ byte. What you're seeing in the output is characters; you'll need to do something unusual to actually see the bytes. (I suggest ‘xxd’, a tool that is installed with the Vim application; or ‘od’, one of the core utilities of the GNU operating system.)
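
If neither tool is at hand, the difference between characters and bytes can also be shown with a few lines of Python. This is a rough sketch rather than a replacement for a real hex dump; `describe` is a made-up helper, and the second sample assumes one round of UTF-8 read as cp1252.

```python
def describe(text: str) -> str:
    """Return the character count, UTF-8 byte count, and the bytes in hex."""
    raw = text.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in raw)
    return f"{len(text)} chars, {len(raw)} bytes: {hex_bytes}"


e_acute = "\u00e9"  # a single é
once_damaged = e_acute.encode("utf-8").decode("cp1252")  # é after one bad round

print(describe(e_acute))       # 1 chars, 2 bytes: c3 a9
print(describe(once_damaged))  # 2 chars, 4 bytes: c3 83 c2 a9
```

For a real file, reading it in binary mode (for example `open(path, "rb").read()`, with `path` being whatever file you are inspecting) gives the raw bytes without any decoding step in between.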

Does anyone know what encoding would produce that

One tool that is good at guessing the character encoding of a byte stream is ‘enca’, the Extremely Naive Charset Analyser.
