
What character set is this?

Posted 2024-08-15 15:28:10 · 232 words · 4 views · 0 comments

I received a bunch of CSV files from a client (they appear to be database dumps), and many of the columns contain odd characters like these:

Alain LefÃƒÆ’Ã‚Â¨vre
AngÃƒÆ’Ã‚Â¨le Dubeau & La PietÃƒÆ’Ã‚Â

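That seems like an awful lot of characters to represent an é. Does anyone know what encoding would produce that many characters for é? I don't know where they are getting these CSV files from, but assuming I can't get them in a better format, how would I convert them to something like UTF-8?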


美煞众生 2024-08-22 15:28:10

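It looks like UTF-8 that has been misdecoded and re-encoded twice over. You may be able to recover the data by opening it as UTF-8, saving it as Latin-1 (perhaps), and then opening it as UTF-8 again.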
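To make that round trip concrete, here is a minimal Python sketch that undoes a single misdecoding step on the sample from the question. The helper names `undo_one_round` and `try_codecs` are made up for illustration, and trying both Latin-1 and cp1252 is only a guess about which codec did the damage; nothing in the files confirms it.

```python
# Sample from the question, written with escapes so the exact code points
# are unambiguous. It displays as "Alain LefÃƒÆ’Ã‚Â¨vre".
garbled = "Alain Lef\u00c3\u0192\u00c6\u2019\u00c3\u201a\u00c2\u00a8vre"


def undo_one_round(text, wrong_encoding):
    """Reverse one 'UTF-8 bytes were read as wrong_encoding' mistake.

    Encoding with the wrong codec recovers the bytes that were misread;
    decoding those bytes as UTF-8 gives the text as it was before the mistake.
    """
    return text.encode(wrong_encoding).decode("utf-8")


def try_codecs(text, codecs=("latin-1", "cp1252")):
    """Try each candidate codec; map its name to the result, or None on failure."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = undo_one_round(text, codec)
        except UnicodeError:
            # The text contains characters this codec cannot represent, or the
            # recovered bytes are not valid UTF-8, so this codec is a poor fit.
            results[codec] = None
    return results


if __name__ == "__main__":
    for codec, result in try_codecs(garbled).items():
        print(codec, "->", result)
```

On this sample the Latin-1 attempt fails, because characters such as ƒ do not exist in Latin-1, while cp1252 yields `Alain LefÃƒÂ¨vre`, which is one layer closer to the original. That points to cp1252 as the intermediate encoding and shows that a single pass is not enough.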


放飞的风筝 2024-08-22 15:28:10

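It looks like the data went through a corruption process in which it was written as UTF-8 but read back as cp1252, and this happened three times. It may be recoverable by putting the corrupted data through the reverse transformation: read it in as UTF-8, write it out as cp1252, and repeat. I don't know whether that works for every character, but it does for at least some. There are plenty of ways to do this kind of conversion: a text editor as Tordek suggests, command-line tools as below, or the encoding features built into your database or programming language.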

unix shell prompt> echo Alain LefÃƒÆ’Ã‚Â¨vre | 
iconv -f utf-8 -t cp1252 | 
iconv -f utf-8 -t cp1252 | 
iconv -f utf-8 -t cp1252

Alain Lefèvre

unix shell prompt>
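For the programming-language route, the same three passes can be sketched in Python. This is an illustration under assumptions, not a tested recipe for the real files: `corrupt`, `repair`, and `repair_file` are hypothetical names, cp1252 is assumed to be the codec that did the damage, and the files are assumed to be readable as UTF-8. Python's cp1252 codec also rejects a few bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so some characters may not survive, in line with the caveat above.

```python
def corrupt(text, rounds=3, wrong_encoding="cp1252"):
    """Simulate the suspected damage: UTF-8 bytes misread as cp1252, repeated."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode(wrong_encoding)
    return text


def repair(text, max_rounds=5, wrong_encoding="cp1252"):
    """Undo the damage one round at a time, stopping when a round no longer applies."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode(wrong_encoding).decode("utf-8")
        except UnicodeError:
            # Either the text has characters outside cp1252 or the bytes are
            # not valid UTF-8: this layer was not produced by the mistake.
            break
        if candidate == text:
            # Pure ASCII round-trips unchanged, so there is nothing to undo.
            break
        text = candidate
    return text


def repair_file(src_path, dst_path):
    """Repair a file line by line and write the result as UTF-8.

    Lines that cannot be repaired as a whole are copied through unchanged.
    newline="" keeps the original line endings intact.
    """
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(repair(line))


if __name__ == "__main__":
    damaged = corrupt("Alain Lefèvre")
    print(damaged)
    print(repair(damaged))
```

Here `corrupt("Alain Lefèvre")` reproduces the string from the question, which supports the three-round theory, and `repair` brings it back. The loop stops on its own once a further round would fail, so the number of rounds does not need to be known in advance. The trade-off is that text which legitimately contains a sequence like `Ã¨` would be "repaired" too, so spot-check the output before trusting it.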
