Python encoding problem
Why am I getting this error, and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you
Comments (3)
Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode. (#65 = Latin Capital A. #19968 = Chinese Character "One"/"First".)
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
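To make that concrete, here is a minimal sketch, assuming Python 3 (where str holds Unicode code points and bytes holds raw data). The variable names are only illustrative.

```python
# Assumes Python 3: str is a sequence of Unicode code points, bytes is raw data.
text = "A\u4e00"  # LATIN CAPITAL LETTER A, then the Chinese character for "one"

# A string viewed as a list of numbers (code points).
code_points = [ord(ch) for ch in text]
print(code_points)  # [65, 19968]

# The same two code points turned into bytes by two different methods.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
print(as_utf8)   # b'A\xe4\xb8\x80'
print(as_utf16)  # b'A\x00\x00N'

# Decoding only round-trips when the matching method is named.
assert as_utf8.decode("utf-8") == text
assert as_utf16.decode("utf-16-le") == text
```

The byte streams differ even though the string is the same, which is why Python has to be told (or has to assume) which method produced the bytes.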
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
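As a rough illustration, assuming Python 3 and a made-up byte string (not your actual data), the exception itself tells you which byte failed, and decoding works once the right encoding is named:

```python
# Hypothetical input: bytes that were written as Windows-1252, not UTF-8.
raw = b"It\x92s a test"

try:
    raw.decode("utf-8")
    failure = None
except UnicodeDecodeError as err:
    # The exception records where decoding failed; look up the byte there.
    failure = (err.start, hex(raw[err.start]))

print(failure)  # (2, '0x92')

# Naming the encoding the bytes were actually written in succeeds.
text = raw.decode("cp1252")
print(ascii(text))  # 'It\u2019s a test'
```

Here cp1252 is an assumption about the sample data; for real input you still have to find out the encoding from wherever the bytes came from.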
It's important to know whether you're using Python 2 or 3, and to look at the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicitly deal with the bytes read. But you must be sure of the file encoding.
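For the file case, a small sketch under the assumption that you are on Python 3 and that the file was saved as cp1252 (the file name and contents here are invented for the example):

```python
import os
import tempfile

# Hypothetical file whose contents were saved as Windows-1252.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9 \x96 menu")

# Option 1: let open() decode, naming the encoding explicitly.
with open(path, encoding="cp1252") as f:
    from_text_mode = f.read()

# Option 2: read the raw bytes and decode them yourself.
with open(path, "rb") as f:
    from_bytes = f.read().decode("cp1252")

assert from_text_mode == from_bytes
print(ascii(from_text_mode))  # 'caf\xe9 \u2013 menu'
```

On Python 2, io.open accepts the same encoding argument, so the first option carries over with that one change.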
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, how text editors choose an encoding to write, and what is standard for your language and operating system.
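If you want to see what your interpreter is currently assuming, something like the following may help. It is only a diagnostic sketch, assuming Python 3; the variable names are mine.

```python
import locale
import os
import sys

# Encoding used when str/bytes conversion names no encoding.
default_encoding = sys.getdefaultencoding()

# Encoding the locale suggests, e.g. for text files opened without encoding=...
preferred_encoding = locale.getpreferredencoding(False)

# The environment variables mentioned above (None when unset).
env = {name: os.environ.get(name) for name in ("LC_ALL", "LANG")}

print(default_encoding, preferred_encoding, env)
```

The output depends on your system, so no expected values are shown here.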
In addition to what Joe said, chardet is a useful tool to detect the encoding of the source data.
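A hedged sketch of how that might look. chardet is a third-party package, so this falls back to None when it is not installed; guess_encoding and the sample text are invented for the example.

```python
# chardet is a third-party package (pip install chardet); this sketch
# returns None when it is not installed.
try:
    import chardet
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's best guess for the encoding of data, or None."""
    if chardet is None:
        return None
    # detect() returns a dict that includes 'encoding' and 'confidence'.
    return chardet.detect(data)["encoding"]


sample = "Ceci n\u2019est pas une tr\u00e8s longue cha\u00eene.".encode("cp1252")
print(guess_encoding(sample))  # a guess only; check it against the data
```

Detection is statistical, so treat the result as a hint, especially on short inputs.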
Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even from a Python source file of yours; you could be running Python 2.x and have a
# -*- coding: utf8 -*-
line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
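To show where the 0x92 byte comes from, here is a short sketch, assuming Python 3:

```python
quote = "\u2019"   # RIGHT SINGLE QUOTATION MARK
apostrophe = "'"   # APOSTROPHE, U+0027

print(quote.encode("cp1252"))       # b'\x92'
print(quote.encode("utf-8"))        # b'\xe2\x80\x99'
print(apostrophe.encode("cp1252"))  # b"'"

# On its own, 0x92 is a UTF-8 continuation byte, so it cannot start a character.
try:
    b"\x92".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```

So a lone 0x92 in the data strongly suggests the text was written as cp1252 rather than UTF-8.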
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?