Text editors display a UTF-8 file created by Python as garbage

Posted on 2024-10-20 21:45:54

This is my first question here; apologies in advance if its format is not what is expected.

I have a small utility that reads ISO-8859-9 text files and produces UTF-8 copies of them. The method I found uses the encode and decode methods, but when I implement it the way the elders do, text editors show the Unicode characters as unrelated characters.

The twist is that the files are written correctly. To check, I created the same file by hand in TextEdit on the Mac. The converted version's hex dump and md5sum are the same as those of the hand-created one. However, both TextEdit and KWrite (or Kate) on KDE show absurd characters, even if I choose UTF-8 as the input encoding. Why is this happening, and how can I solve it?

Thanks a lot.

Update:

od -c outputs are below:

First of all, the ISO-8859-9 file:

0000000  374 360   i 376 347 366 334 320 335 336 307 326   T   e   s   t
0000020    T   e   s   t                                                
0000024

The Python-created UTF-8:

0000000    ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020   **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

The hand-created UTF-8:

0000000    ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020   **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

The actual code:

def convert_file(path_of_text_file):
    try:
        original_file = open(path_of_text_file, 'rb')
        file_contents = unicode(original_file.read(), 'iso-8859-9')
        original_file.close()

        new_file = open("untitled2.txt", 'w+b')
        new_file.write(file_contents.encode('utf8'))
        new_file.close()
    except IOError:
        pass
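
For readers on Python 3, where `unicode()` no longer exists, the same conversion might look like the sketch below. It is only an illustration of the decode-then-encode idea used above: the `dst_path` and `src_encoding` parameters are additions made for the example, and unlike the original it lets I/O errors propagate instead of silently ignoring them.

```python
from pathlib import Path


def convert_file(src_path, dst_path, src_encoding="iso-8859-9"):
    """Re-encode a text file as UTF-8 (Python 3 sketch, hypothetical signature)."""
    # Read the raw bytes and decode them using the source encoding.
    raw = Path(src_path).read_bytes()
    text = raw.decode(src_encoding)
    # Write the same text back out, encoded as UTF-8 (no BOM is added).
    Path(dst_path).write_bytes(text.encode("utf-8"))
    return text
```

Working in binary mode on both sides keeps Python from applying any locale-dependent default encoding or newline translation.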

Also, yes, the hand-crafted file opens just fine. It has the same md5sum and hex output as the Python-generated one.
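
The md5sum comparison can also be done from Python. The following is a minimal sketch, assuming both files are small enough to read into memory; the helper names `md5_of` and `same_bytes` are made up for this example.

```python
import hashlib


def md5_of(path):
    """Return the hex MD5 digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def same_bytes(path_a, path_b):
    """Return True if both files contain exactly the same bytes."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return a.read() == b.read()
```

If `same_bytes` returns True for the two files, any difference in how an editor displays them cannot come from the file contents and must come from something outside the data, such as metadata or editor settings.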

od -xc outputs:

Again, the original ISO-8859-9 file:

0000000      f0fc    fe69    f6e7    d0dc    dedd    d6c7    6554    7473
         374 360   i 376 347 366 334 320 335 336 307 326   T   e   s   t
0000020      6554    7473                                                
           T   e   s   t                                                
0000024

Python-generated UTF-8 file:

0000000      bcc3    9fc4    c569    c39f    c3a7    c3b6    c49c    c49e
           ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020      c5b0    c39e    c387    5496    7365    5474    7365    0074
          **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

Hand-crafted UTF-8 file:

0000000      bcc3    9fc4    c569    c39f    c3a7    c3b6    c49c    c49e
           ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020      c5b0    c39e    c387    5496    7365    5474    7365    0074
          **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

Another note of interest: BBEdit handles the Python-created files just fine.

不必你懂 2024-10-27 21:45:54

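I have solved the problem. It was a mixture of OS X resource forks, TextEdit, and a little bit of PEBKAC. Here is how I worked it out:

I copied the files to my (FAT32) flash drive, so I got the resource forks as ".filename". I noticed that the files I wrote with Python did not have resource forks. Interestingly, when I opened the files from the flash drive in TextEdit with the encoding forced to UTF-8, everything worked (oddly, it did not work when I tried this before copying the files to the flash drive).

With this evidence, I can say that TextEdit stores a file's encoding in its resource fork rather than guessing it every time, as the file command does. More interestingly, my Linux boxes now seem to behave well too, and I cannot say why.

As a result, the code works as intended and everything is fine. It was TextEdit that failed, not Python.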
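
One way to confirm, independently of any editor, that the conversion itself is sound is to decode the output strictly as UTF-8 and compare it with the decoded original. This is only a sketch of such a check; the function name `is_faithful_utf8_copy` and its parameters are invented for the illustration.

```python
def is_faithful_utf8_copy(original_bytes, converted_bytes, src_encoding="iso-8859-9"):
    """Check that converted_bytes is valid UTF-8 carrying the same text as the original."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        converted_text = converted_bytes.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Both byte strings must decode to the same sequence of characters.
    return converted_text == original_bytes.decode(src_encoding)
```

If this returns True, the file on disk is correct and a garbled display points at the viewer rather than at the script.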

Thanks, everyone.

Happy hacking.

简单爱 2024-10-27 21:45:54

I did a quick implementation of what I presume your Python conversion script is doing:

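# Python 2: a plain string literal holds raw bytes, so decode() is available.
iso = "\374\360i\376\347\366\334\320\335\336\307\326Test Test"
tmp = iso.decode('iso-8859-9')   # ISO-8859-9 bytes -> unicode
utf = tmp.encode('utf-8')        # unicode -> UTF-8 bytes
out = open('utf.txt', 'wb')
out.write(utf)
out.close()  # flush to disk before inspecting the file with od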

The od -xc output:

0000000    bcc3    9fc4    c569    c39f    c3a7    c3b6    c49c    c49e
        303 274 304 237   i 305 237 303 247 303 266 303 234 304 236 304
0000020    c5b0    c39e    c387    5496    7365    2074    6554    7473
        260 305 236 303 207 303 226   T   e   s   t       T   e   s   t
0000040
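
The hexadecimal row can look scrambled next to the octal row because od -x prints 16-bit words in the machine's byte order. The sketch below reproduces the first row, assuming a little-endian machine (which the output above is consistent with) and Python 3; the helper name `od_x_words` is made up for this example.

```python
def od_x_words(data):
    """Group bytes into 16-bit little-endian words, formatted like od -x."""
    if len(data) % 2:
        data += b"\x00"  # an odd trailing byte is shown padded with a zero byte
    return ["%04x" % int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


utf8_bytes = "üğişçöÜĞİŞÇÖTest Test".encode("utf-8")
print(" ".join(od_x_words(utf8_bytes)[:8]))
# prints: bcc3 9fc4 c569 c39f c3a7 c3b6 c49c c49e
```

So the word `bcc3` is simply the byte pair C3 BC, the UTF-8 encoding of ü, with the two bytes swapped for display.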

Screenshots from TextEdit on the Mac:

[Image: TextEdit input encoding preference pane]
[Image: TextEdit displaying utf.txt]