UnicodeDecodeError: invalid continuation byte

Posted on 2024-10-30 14:52:51

Why does the item below fail to decode, and why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte


渔村楼浪 2024-11-06 14:52:51

I had the same error when I tried to open a CSV file with the pandas.read_csv method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
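
If you would rather keep reading the file as UTF-8 and simply replace the undecodable bytes, newer pandas versions (1.3 and later) also accept an encoding_errors argument. A minimal sketch, reusing the m_cols column list from the snippet above:

import pandas as pd

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols,
            encoding='utf-8', encoding_errors='replace')  # needs pandas >= 1.3
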
千纸鹤 2024-11-06 14:52:51

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
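
To make the contrast concrete, here is a small Python 3 sketch using the byte string from the question:

raw = b"a test of \xe9 char"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 'utf-8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
    print(exc)

print(raw.decode("latin-1"))  # a test of é char
print("é".encode("utf-8"))    # b'\xc3\xa9': lead byte plus continuation byte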

五里雾 2024-11-06 14:52:51

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best to choose a single codeset (hopefully UTF-8) for your protocol/application and then simply reject inputs that don't decode.

If you can't do that, you'll need heuristics.
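
A minimal sketch of such a heuristic: try the strict codec first, then fall back. Note that latin-1 maps every possible byte, so the fallback always succeeds, which is also why it can silently produce the wrong characters; libraries such as chardet take a more statistical approach.

def decode_best_effort(raw: bytes) -> str:
    # Strict first; latin-1 accepts any byte sequence (possibly wrongly).
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(decode_best_effort(b"a test of \xe9 char"))  # a test of é char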

陌路黄昏 2024-11-06 14:52:51

Because UTF-8 is a multibyte encoding, and there is no character corresponding to your combination of \xe9 plus the following space: \xe9 opens a multibyte sequence, but the space (0x20) is not a valid continuation byte.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence looks in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
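
A quick sketch of the bit patterns involved:

# 0xE9 = 11101001: a three-byte lead, so two continuation bytes of the
# form 10xxxxxx must follow -- and a space (0x20) is not one of them.
print(f"{0xE9:08b}")      # 11101001
print(f"{ord(' '):08b}")  # 00100000
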
站稳脚跟 2024-11-06 14:52:51

Use this if pandas raises a UTF-8 decoding error:

pd.read_csv('File_name.csv', encoding='latin-1')

神回复 2024-11-06 14:52:51

If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' mode.
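
A minimal sketch of the difference, with a placeholder file name:

# Text mode decodes eagerly and can raise UnicodeDecodeError on read;
# binary mode hands back raw bytes and never decodes.
with open("data.txt", "rb") as f:  # "data.txt" is a placeholder
    raw = f.read()                 # bytes; decode later, explicitly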

喜爱皱眉﹌ 2024-11-06 14:52:51

A utf-8 codec error usually comes up when numeric values fall outside the range 0 to 127.

The reason this exception is raised:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To overcome this we have a set of encodings; the most widely used is Latin-1, also known as ISO-8859-1.

In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.

When this exception occurs while you are trying to load a data set, add the encoding at the end of the call and the data set should then load:

df=pd.read_csv("top50.csv",encoding='ISO-8859-1')
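
The two numbered cases above describe a 7-bit codec such as ascii; a small sketch:

print("abc".encode("ascii"))  # every code point < 128 encodes byte-for-byte
try:
    "é".encode("ascii")       # code point 233 >= 128: not representable
except UnicodeEncodeError as exc:
    print(exc)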

冧九 2024-11-06 14:52:51

This type of error comes up when you read a particular file or data set into pandas, such as:

data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')

The error is then displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte

This type of error can be avoided by adding an encoding argument:

data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

救赎№ 2024-11-06 14:52:51

This happened to me too, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.
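
The same re-encoding can be done from Python; a sketch that assumes the source file was saved in a Hebrew legacy encoding such as cp1255 (adjust to whatever it really is) and uses a placeholder file name:

source_encoding = "cp1255"  # assumption: the file's actual legacy encoding
with open("notes.txt", "r", encoding=source_encoding) as f:  # placeholder name
    text = f.read()
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write(text)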

迷你仙 2024-11-06 14:52:51

TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.

I got this error as I was processing a large number of zip files with additional zip files in them.

My workflow was the following:

  1. Read zip
  2. Read child zip
  3. Read text from child zip

At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representations that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
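
A sketch of the sanity check that would have caught this earlier; zipfile.is_zipfile accepts a file-like object and tests the ZIP magic number:

import io
import zipfile

def is_nested_zip(raw: bytes) -> bool:
    # True if the "text" we are about to decode is actually another zip
    return zipfile.is_zipfile(io.BytesIO(raw))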

柠北森屋 2024-11-06 14:52:51

In this case, I was trying to execute a .py that ran a path/file.sql.

My solution was to change the encoding of file.sql to "UTF-8 without BOM", and it worked!

You can do it with Notepad++.

I'll leave part of my code here.

import sys
import psycopg2

# Connection parameters are taken from the command line.
con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2],
                       dbname=sys.argv[3], user=sys.argv[4],
                       password=sys.argv[5])

cursor = con.cursor()
sqlfile = open(path, 'r')  # path points at file.sql
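
A hedged alternative to re-saving the file: the utf-8-sig codec strips a leading BOM if one is present, so the same script tolerates files saved either way (this continues the snippet above):

with open(path, 'r', encoding='utf-8-sig') as sqlfile:
    cursor.execute(sqlfile.read())
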
风蛊 2024-11-06 14:52:51

I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, while in the Google Sheets file I chose Save a copy, and when my browser downloaded it I chose Open and then saved the CSV directly. This was the wrong move.

What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as a .csv. Then the error went away for pd.read_csv('myfile.csv').

鹊巢 2024-11-06 14:52:51

The solution was to change the encoding to "UTF-8 without BOM" ("UTF-8 sin BOM").
