Text editors display UTF-8 files created by Python as garbled characters

Posted on 2024-10-20 21:45:54

This is my first question here; if its format is not what is expected, I apologize in advance.

I have a small utility that reads ISO-8859-9 text files and produces UTF-8 copies of them. The method I found uses the encode and decode methods. When I implement it the way others before me have, text editors show the non-ASCII characters as unrelated characters.

The twist is that the files are written correctly. To check, I created the same file by hand in TextEdit on the Mac. The converted version's hex dump and md5sum are the same as those of the hand-created one. However, both TextEdit and KWrite (or Kate) on KDE show absurd characters even if I choose UTF-8 as the input encoding. Why is this happening, and how can I solve it?

Thanks a lot.

Update:

od -c outputs are below:

First of all, the ISO-8859-9 file:

0000000  374 360   i 376 347 366 334 320 335 336 307 326   T   e   s   t
0000020    T   e   s   t                                                
0000024

The Python Created UTF-8:

0000000    ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020   **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

Hand Created UTF-8:

0000000    ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020   **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

The Actual Code:

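def convert_file(path_of_text_file):
    try:
        # Read the raw bytes and decode them as ISO-8859-9 (Python 2 `unicode`).
        original_file = open(path_of_text_file, 'rb')
        file_contents = unicode(original_file.read(), 'iso-8859-9')
        original_file.close()

        # Re-encode the text as UTF-8 and write the bytes out in binary mode.
        new_file = open("untitled2.txt", 'w+b')
        new_file.write(file_contents.encode('utf8'))
        new_file.close()
    except IOError:
        # I/O errors are silently ignored.
        pass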
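
For readers on Python 3, where the `unicode` builtin no longer exists, the same conversion could be sketched roughly as follows. This is an illustration, not the code used in the question; the function name `convert_to_utf8` and its `dst_path` parameter are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src_path, dst_path, src_encoding="iso-8859-9"):
    """Decode src_path using src_encoding and write a UTF-8 copy to dst_path."""
    raw = Path(src_path).read_bytes()                  # undecoded bytes
    text = raw.decode(src_encoding)                    # bytes -> str
    Path(dst_path).write_bytes(text.encode("utf-8"))   # str -> UTF-8 bytes
```

Reading and writing in binary mode keeps Python from applying any platform default encoding, so the output bytes depend only on the two codecs named in the code.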

Also, yes, the hand-crafted file opens just fine, and it has the same md5sum and hex output as the Python-generated one.
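
The md5sum comparison described here can also be done from Python. The sketch below is one way to do it with the standard `hashlib` module; the helper names `file_digest` and `same_contents` are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="md5"):
    """Return the hex digest of a file's contents, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both files have the same digest, as comparing md5sum output would show."""
    return file_digest(path_a) == file_digest(path_b)
```

If two files pass this check and still display differently, the difference has to come from something other than the bytes in the files.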

od -xc outputs:

Again the original ISO-8859-9 file:

0000000      f0fc    fe69    f6e7    d0dc    dedd    d6c7    6554    7473
         374 360   i 376 347 366 334 320 335 336 307 326   T   e   s   t
0000020      6554    7473                                                
           T   e   s   t                                                
0000024

Python generated UTF-8 file:

0000000      bcc3    9fc4    c569    c39f    c3a7    c3b6    c49c    c49e
           ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020      c5b0    c39e    c387    5496    7365    5474    7365    0074
          **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

Hand crafted UTF-8 file:

0000000      bcc3    9fc4    c569    c39f    c3a7    c3b6    c49c    c49e
           ü  **   ğ  **   i   ş  **   ç  **   ö  **   Ü  **   Ğ  **   İ
0000020      c5b0    c39e    c387    5496    7365    5474    7365    0074
          **   Ş  **   Ç  **   Ö  **   T   e   s   t   T   e   s   t    
0000037

Another note of interest: BBEdit handles the Python-created files just fine.

Comments (3)

不必你懂 2024-10-27 21:45:54

I've solved the problem. It was a mix of OS X resource forks, TextEdit, and a bit of PEBKAC. Here's how I solved it:

I copied the files to my (FAT32) flash disk, so the resource forks showed up as ".filename" files. I noticed that the file I wrote with Python came with no resource fork. Interestingly, when I opened that file from the flash disk in TextEdit with UTF-8 encoding forced, everything worked fine (strangely, it didn't work when I tried before copying the files to the flash disk).

With this evidence I can say that TextEdit stores a file's encoding in its resource fork rather than guessing it every time, as the file command does. More interestingly, my Linux boxen now seem to behave well; I can't say why.
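
One way to repeat this check is to look for the companion files that macOS writes on volumes, such as FAT32, that cannot hold metadata like resource forks and extended attributes natively. These AppleDouble files are named with a `._` prefix. The sketch below only lists which files have such a companion; it does not parse the companion's contents, and `find_appledouble_sidecars` is a hypothetical helper.

```python
import os


def find_appledouble_sidecars(directory):
    """Return the names of files in `directory` that have a `._name` companion."""
    names = set(os.listdir(directory))
    return sorted(
        name for name in names
        if not name.startswith("._") and "._" + name in names
    )
```

A file saved by TextEdit would be expected to appear in the result, while a file written directly by a script would not.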

As a result, the code works as it should and everything is fine. The dud is TextEdit, not Python.

Thanks everyone,

Happy hacking.

简单爱 2024-10-27 21:45:54

I did a quick implementation of what I presume your Python conversion script is doing:

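# Python 2: a plain string literal holds raw bytes (written here as octal escapes).
iso = "\374\360i\376\347\366\334\320\335\336\307\326Test Test"
tmp = iso.decode('iso-8859-9')  # bytes -> unicode
utf = tmp.encode('utf-8')       # unicode -> UTF-8 bytes
out = open('utf.txt', 'wb')
out.write(utf)
out.close()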

The od -xc output:

0000000    bcc3    9fc4    c569    c39f    c3a7    c3b6    c49c    c49e
        303 274 304 237   i 305 237 303 247 303 266 303 234 304 236 304
0000020    c5b0    c39e    c387    5496    7365    2074    6554    7473
        260 305 236 303 207 303 226   T   e   s   t       T   e   s   t
0000040
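
When comparing such dumps with the expected UTF-8 bytes, note that `od -x` prints two-byte words in the host's byte order, so on a little-endian machine the bytes `c3 bc` (ü) appear as `bcc3`. A small sketch that reproduces the word column, assuming a little-endian host (`od_x_words` is a hypothetical helper):

```python
def od_x_words(data):
    """Format bytes the way `od -x` shows them on a little-endian host."""
    if len(data) % 2:
        data += b"\x00"  # an odd trailing byte is shown padded with zero
    return [
        "%04x" % int.from_bytes(data[i:i + 2], "little")
        for i in range(0, len(data), 2)
    ]


print(od_x_words("üği".encode("utf-8")))
```

The first two words it prints correspond to the first two words of the UTF-8 dump.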

Screenshots from TextEdit on the Mac:

[Screenshot: TextEdit input encoding preferences pane]
[Screenshot: TextEdit displaying utf.txt]

樱花坊 2024-10-27 21:45:54

Since the file contents are identical, there must be something outside of the file contents that determines how the file is interpreted. The file name is the obvious suspect. If you give the files the same name in different directories, do they start behaving identically?

Use the file command to see how OS X is guessing the file type.
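
A rough stand-in for that kind of content-based guess can be sketched in Python. This is not how `file` works internally; it only attempts a strict decode, and `guess_text_encoding` is a hypothetical name.

```python
def guess_text_encoding(data):
    """Classify bytes as 'ascii', 'utf-8', or 'unknown-8bit' by trial decoding."""
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)  # strict mode raises on invalid sequences
        except UnicodeDecodeError:
            continue
        return encoding
    return "unknown-8bit"
```

A guess like this depends only on the bytes, so two byte-identical files always get the same answer; an editor that treats them differently must be consulting something else.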
