How to diagnose and reverse (not prevent) Unicode mangling
Somewhere upstream of me, "something" happened that looks like Unicode mangling. One symptom is that a lowercase u umlaut (ü) gets converted to "Ã¼" (i.e., the single byte FC gets converted to the two bytes C3 BC). Assuming that I have no control over this upstream process, how can I reverse-engineer what's going on? And if that is possible, can I crank the sausage machine backwards and get the original text back?
(If it helps to understand this case, the text I received was in the form of a MySQL dump. I think somewhere in the dump/transport process it got mangled.)
Comments (2)
Your text isn't 'mangled'. It's just in UTF-8. C3 BC is what the ü is supposed to be encoded as. Just set whatever software you use to UTF-8 as well, and all the pain will go away. If you can't set your software to Unicode, seriously consider switching to newer software.
I know it's scary at first, but you will have to do that eventually, anyway. My favorite music typesetter switched to Unicode-only input a while ago (they even deliberately removed support for the old 8-bit code pages to get people to switch), and I was upset, thinking that Latin-1 was good enough for me, and it was stupid to break stuff that was working perfectly well... and then I got over it and just set emacs to Unicode buffers, and now I'll never have to think about character encoding again in my life!
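To make the byte values mentioned in this answer concrete, here is a minimal Python sketch (the variable names are illustrative, not from the original post). It shows that ü is the single byte FC in Latin-1 and the two bytes C3 BC in UTF-8, and that reading those UTF-8 bytes as Latin-1 produces the two-character symptom from the question.

```python
# Illustrative only: how U+00FC (ü) is represented in two encodings.
ch = "\u00fc"  # ü

latin1_bytes = ch.encode("latin-1")
utf8_bytes = ch.encode("utf-8")

print(latin1_bytes.hex())  # single byte in Latin-1
print(utf8_bytes.hex())    # two bytes in UTF-8

# Decoding the UTF-8 bytes with the wrong codec yields two characters
# (U+00C3 and U+00BC) instead of one.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

If the data looks like this, the bytes themselves are fine and only the decoding step on the reading side is using the wrong encoding.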
First of all, it looks like you've got UTF-8 encoded text (which is why the ü shows up as "Ã¼" when interpreted in your expected encoding, maybe Latin-1). You can guess that this encoding is being used by checking that only valid UTF-8 byte sequences appear (and no illegal ones, of course). See the Wikipedia article on UTF-8 for a reference on valid and invalid byte sequences. You can be pretty sure about the encoding if the text starts with a BOM, but a BOM is not required for UTF-8.
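The validity check described above can be sketched in Python as follows. The helper names `looks_like_utf8` and `has_utf8_bom` are hypothetical, introduced only for this illustration. The sketch relies on Python's strict UTF-8 decoder to reject illegal byte sequences instead of inspecting the bytes by hand.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


print(looks_like_utf8(b"\xc3\xbc"))  # ü encoded as UTF-8
print(looks_like_utf8(b"\xfc"))      # ü encoded as Latin-1, illegal in UTF-8
print(has_utf8_bom(codecs.BOM_UTF8 + b"dump"))
```

Pure ASCII input also passes this check, so a positive result on a long file containing non-ASCII characters is much stronger evidence than one on a short sample.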
To get the text back in your required encoding, several tools are available, GNU recode for one.
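As an alternative to a command-line tool, the reversal can be sketched in Python. This assumes the damage is exactly one round of "UTF-8 bytes decoded as Latin-1"; the function name `undo_misdecoding` and its `wrong` parameter are hypothetical, and the sample string is made up for the illustration.

```python
def undo_misdecoding(text: str, wrong: str = "latin-1") -> str:
    """Reverse one round of UTF-8 bytes having been decoded as `wrong`.

    Re-encodes the text with the wrongly assumed codec to recover the
    original bytes, then decodes those bytes as UTF-8. Text that does
    not fit this pattern is returned unchanged.
    """
    try:
        return text.encode(wrong).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


mangled = "M\u00c3\u00bcnchen"  # what the question describes seeing
print(undo_misdecoding(mangled))
```

If the misreading happened as Windows-1252 instead of true Latin-1, passing `wrong="cp1252"` may work better, since MySQL's `latin1` character set is based on cp1252. It is worth testing on a small sample of the dump first, because text that was never mangled is left as-is only when the round trip fails.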