Converting from utf-16 to utf-8 in Python 3

Posted on 2024-09-07 02:15:38

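I am programming in Python 3 and have run into a small problem that I can't find any reference to online.

As far as I understand, the default string is utf-16, but I must work with utf-8, and I can't find the command that converts from the default one to utf-8. I'd appreciate your help very much.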

太傻旳人生 2024-09-14 02:15:38

In Python 3, two datatypes matter when you manipulate strings. The first is the str class, which represents a sequence of Unicode code points. The important point is that a str is not a run of bytes but a sequence of characters. The second is the bytes class, which is just a sequence of bytes, often holding a string stored in some encoding (such as utf-8 or iso-8859-15).

What does this mean for you? As far as I understand, you want to read and write utf-8 files. Let's write a program that replaces every 'ć' with 'ç':
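To make the difference concrete, here is a small sketch. The sample word "ćma" is just an illustration I picked, not something from the question; it shows the same text once as a str and once as utf-8 bytes.

```python
text = "ćma"                  # str: three code points
data = text.encode("utf-8")   # bytes: the same text stored as utf-8

print(type(text).__name__, len(text))   # str 3
print(type(data).__name__, len(data))   # bytes 4, because 'ć' takes two bytes in utf-8
print(data)                              # b'\xc4\x87ma'

# Indexing a str gives a character; indexing bytes gives an integer byte value.
print(text[0], data[0])                  # ć 196
```

The lengths differ because len() counts code points for a str and bytes for a bytes object.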

def main():
    # Open the output file in text mode. The encoding tells Python to
    # encode everything written to the file as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Give open() the encoding so that each line comes back as a str.
        with open('input_file', encoding='utf-8') as in_file:
            for line in in_file:
                # String literals are Unicode as well, so there is nothing
                # to worry about here. Each line keeps its own newline.
                out_file.write(line.replace('ć', 'ç'))

So when should you use bytes? Not often. One example I can think of is reading from a socket. If you have the data in a bytes object, you can turn it into a Unicode string with bytes.decode('encoding'), and go the other way with str.encode('encoding'). But as I said, you probably won't need it.

Still, because it is interesting, here is the hard way, where you encode everything yourself:
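As a rough sketch of the socket case: the chunks below are made-up data standing in for what a socket might hand you, with the two-byte utf-8 sequence for 'ç' deliberately split across two chunks. Decoding each chunk on its own would fail there, so this uses the standard library's incremental decoder. That is one possible approach, not the only one.

```python
import codecs

# Hypothetical chunks, as they might arrive from a socket. The utf-8
# sequence for 'ç' (0xC3 0xA7) is split across the chunk boundary.
chunks = [b"gar\xc3", b"\xa7on"]

# The incremental decoder buffers an incomplete sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b"", final=True)   # flush; raises if bytes are left over

print(text)                    # garçon

# And the other way round: str -> bytes.
print(text.encode("utf-8"))    # b'gar\xc3\xa7on'
```

If you already hold the complete message in one bytes object, a plain data.decode('utf-8') is enough.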

def main():
    # Open the output file in binary mode, so we write bytes instead of strings.
    with open('output_file', 'wb') as out_file:
        # Open the input in binary mode too, so each line is a bytes object.
        with open('input_file', 'rb') as in_file:
            for line_bytes in in_file:
                # Convert the bytes to a string.
                line_string = line_bytes.decode('utf-8')
                # Replace the characters we want.
                line_string = line_string.replace('ć', 'ç')
                # Encode the string back to bytes.
                out_bytes = line_string.encode('utf-8')
                # Write the bytes to the file.
                out_file.write(out_bytes)

A good read on this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Highly recommended!

Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(P.S. As you can see, I didn't mention utf-16 in this post. I actually don't know whether Python uses it as its internal encoding, but that is totally irrelevant. As long as you are working with a string, you are working with characters (code points), not bytes.)
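If what you actually have is utf-16 encoded data (bytes or a file) and you need utf-8, a minimal sketch could look like this. The function name convert_file and its path parameters are made up for illustration. The idea is the same as above: decode with the encoding the data is in, then encode with the one you want.

```python
text = "ćç"

utf16_bytes = text.encode("utf-16")   # byte order mark + 2 bytes per character here
utf8_bytes = text.encode("utf-8")     # 2 bytes per character for these two

# Transcoding bytes: decode from the source encoding, encode to the target.
converted = utf16_bytes.decode("utf-16").encode("utf-8")
print(converted == utf8_bytes)        # True


def convert_file(src_path, dst_path):
    """Copy a utf-16 text file to a utf-8 text file (hypothetical helper)."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Notice that the str in the middle never needs to know about either encoding; the encoding only matters at the point where bytes come in or go out.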
