utf-8 problem on Windows vs. Mac
OK, I have a small test file that contains utf-8 codes. Here it is (the language is Wolof):
Fˆndeen d‘kk la bu ay wolof aki seereer a fa nekk. DigantŽem ak
Cees jur—om-benni kilomeetar la. MbŽyum gerte ‘pp ci diiwaan bi mu
That is what it looks like in a vanilla editor, but in hex it is:
xxd test.txt
0000000: 46cb 866e 6465 656e 2064 e280 986b 6b20 F..ndeen d...kk
0000010: 6c61 2062 7520 6179 2077 6f6c 6f66 2061 la bu ay wolof a
0000020: 6b69 2073 6565 7265 6572 2061 2066 6120 ki seereer a fa
0000030: 6e65 6b6b 2e20 4469 6761 6e74 c5bd 656d nekk. Digant..em
0000040: 2061 6b0d 0a43 6565 7320 6a75 72e2 8094 ak..Cees jur...
0000050: 6f6d 2d62 656e 6e69 206b 696c 6f6d 6565 om-benni kilomee
0000060: 7461 7220 6c61 2e20 4d62 c5bd 7975 6d20 tar la. Mb..yum
0000070: 6765 7274 6520 e280 9870 7020 6369 2064 gerte ...pp ci d
0000080: 6969 7761 616e 2062 6920 6d75 0d0a iiwaan bi mu..
The second character [cb86] is a non-standard coding for a-grave [à] which is found quite
consistently in web documents, although in 'real' utf-8, a-grave would be c3a0. Real utf-8 works
beautifully on Macs and under Windows.
I handle the fake utf-8 by using a character map which includes the pair { ˆ à }, because that
little caret is what cb86 generates, and everything works fine ON A MAC for displaying text (in a text widget),
like this:
Fàndeen dëkk la bu ay wolof aki seereer a fa nekk. Digantéem ak
Cees juróom-benni kilomeetar la. Mbéyum gerte ëpp ci diiwaan bi mu
On a PC, using the same (shared) file, the first three characters read in are
46 cb 20 (using no fconfigure). I have run through ALL the possible encodings
and can never get the same map to work. [There are twenty that will allow 46 cb 86.]
Sorry this is so long, but if anyone has a clue, I would love to hear it.
Tel Monks
Comments (2)
I don't know Wolof at all. However, I'm sure that the problem you've got is that you've got a file that is in a mixed encoding, with non-standard code points (instead of standard Unicode) and then a conversion to bytes using the UTF-8 scheme. This is messy!
The way to deal with this is to first read the bytes into Tcl using a channel that is configured to use the utf-8 encoding. Then, you need to apply a transformation using string map that converts the “wrong” characters to the right ones. As far as I can tell, a mapping can be written for the specific characters you listed.
However, that might be all wrong! The problem is that I don't know what the contents of the file should be (at the level of characters, not bytes). Which gets back to my comment “I don't know Wolof at all”.
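A minimal Tcl sketch of the two steps described above. It assumes the only garbled characters are the four that appear in the sample (ˆ, ‘, Ž, —) and that they stand for à, ë, é, ó, as read off by comparing the garbled lines with the intended ones. The proc names readUtf8 and fixByMap are made up for illustration.

```tcl
# Hypothetical helper: read a whole file, decoding its bytes as UTF-8.
proc readUtf8 {filename} {
    set f [open $filename r]
    fconfigure $f -encoding utf-8
    set contents [read $f]
    close $f
    return $contents
}

# Hypothetical helper: replace each "wrong" character with the intended one.
# The pairs are inferred from the sample text only; other texts may need more.
proc fixByMap {text} {
    set mapping [list \
        \u02c6 \u00e0 \
        \u2018 \u00eb \
        \u017d \u00e9 \
        \u2014 \u00f3]
    return [string map $mapping $text]
}

# Usage (assumes test.txt is the file from the question):
#   set fixed [fixByMap [readUtf8 test.txt]]
```

Setting the channel encoding explicitly matters because a channel otherwise uses the system encoding, which differs between a Mac and a Windows PC. A hand-written map like this only covers the characters it lists, so it is fragile for larger texts.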
Update
Now that dan04 has identified what had been done to that poor text, I can provide how to decode.
Read the file in as above, but now use a different mapping step: encode the text back to bytes as windows-1252, then decode those bytes using the Mac encoding.
On the sample supplied, that produces the expected output.
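One way the different mapping step might look in Tcl, as a sketch rather than the answer's exact code. It assumes the text has already been read through a utf-8 channel and that the damage is the windows-1252/Mac mix-up identified by dan04. The proc name fixMacMojibake and its optional argument are hypothetical; cp1252 and macRoman are Tcl's names for those encodings.

```tcl
# Hypothetical helper: undo "Mac bytes misread as windows-1252".
# macEncoding defaults to macRoman; macTurkish or macIceland could be tried instead.
proc fixMacMojibake {text {macEncoding macRoman}} {
    # Step 1: undo the mistaken windows-1252 decoding to recover the original bytes.
    set bytes [encoding convertto cp1252 $text]
    # Step 2: decode those bytes with the Mac encoding they were really written in.
    return [encoding convertfrom $macEncoding $bytes]
}

# Usage (contents is the text read from the file with -encoding utf-8):
#   set fixed [fixMacMojibake $contents]
```

Unlike a hand-written string map, this covers every character that both encodings can represent, not just the four seen in the sample. Characters with no windows-1252 equivalent cannot be recovered this way.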
The data was originally encoded using a Mac encoding (most likely Roman, but Turkish and Icelandic are also possible for this example), misinterpreted as windows-1252, and then correctly converted to UTF-8.
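The chain described here can be replayed in Tcl to check it against the hex dump in the question. This is only an illustrative sketch that assumes Mac Roman was the original encoding; the proc name mangle is hypothetical.

```tcl
# Hypothetical helper: reproduce the corruption on correctly encoded text.
proc mangle {text} {
    # The text is written out as Mac Roman bytes ...
    set macBytes [encoding convertto macRoman $text]
    # ... which are then wrongly interpreted as windows-1252 ...
    set misread [encoding convertfrom cp1252 $macBytes]
    # ... and the misread characters are correctly saved as UTF-8.
    return [encoding convertto utf-8 $misread]
}

# "Fà" should come out as the bytes that start the hex dump.
binary scan [mangle "F\u00e0"] H* hex
puts $hex   ;# 46cb86
```

In Mac Roman, à is the single byte 88. windows-1252 reads byte 88 as ˆ (U+02C6), and U+02C6 in UTF-8 is cb 86, which matches the second character of the file.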