UnicodeDecodeError: invalid continuation byte

Posted on 2024-10-30 14:52:51

Why does the item below fail to decode, and why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte


渔村楼浪 2024-11-06 14:52:51

I had the same error when I tried to open a CSV file with the pandas.read_csv method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
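
If you would rather keep reading the file as UTF-8 and simply replace the undecodable bytes, newer pandas versions (1.3 and later) also accept an encoding_errors argument. A minimal sketch, reusing the m_cols column list from the snippet above:

import pandas as pd

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols,
            encoding='utf-8', encoding_errors='replace')  # needs pandas >= 1.3
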
千纸鹤 2024-11-06 14:52:51

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
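
To make the contrast concrete, here is a small Python 3 sketch using the byte string from the question:

raw = b"a test of \xe9 char"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 'utf-8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
    print(exc)

print(raw.decode("latin-1"))  # a test of é char
print("é".encode("utf-8"))    # b'\xc3\xa9': lead byte plus continuation byte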

五里雾 2024-11-06 14:52:51

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best to choose a single codeset (hopefully UTF-8) for your protocol/application and then simply reject inputs that don't decode.

If you can't do that, you'll need heuristics.
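
A minimal sketch of such a heuristic: try the strict codec first, then fall back. Note that latin-1 maps every possible byte, so the fallback always succeeds, which is also why it can silently produce the wrong characters; libraries such as chardet take a more statistical approach.

def decode_best_effort(raw: bytes) -> str:
    # Strict first; latin-1 accepts any byte sequence (possibly wrongly).
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(decode_best_effort(b"a test of \xe9 char"))  # a test of é char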

陌路黄昏 2024-11-06 14:52:51

Because UTF-8 is a multibyte encoding, and there is no character corresponding to your combination of \xe9 plus the following space: \xe9 opens a multibyte sequence, but the space (0x20) is not a valid continuation byte.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence looks in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
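
A quick sketch of the bit patterns involved:

# 0xE9 = 11101001: a three-byte lead, so two continuation bytes of the
# form 10xxxxxx must follow -- and a space (0x20) is not one of them.
print(f"{0xE9:08b}")      # 11101001
print(f"{ord(' '):08b}")  # 00100000
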
站稳脚跟 2024-11-06 14:52:51

Use this if pandas raises a UTF-8 decoding error:

pd.read_csv('File_name.csv', encoding='latin-1')

神回复 2024-11-06 14:52:51

If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' mode.
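
A minimal sketch of the difference, with a placeholder file name:

# Text mode decodes eagerly and can raise UnicodeDecodeError on read;
# binary mode hands back raw bytes and never decodes.
with open("data.txt", "rb") as f:  # "data.txt" is a placeholder
    raw = f.read()                 # bytes; decode later, explicitly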

喜爱皱眉﹌ 2024-11-06 14:52:51

A utf-8 codec error usually comes up when numeric values fall outside the range 0 to 127.

The reason this exception is raised:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To overcome this we have a set of encodings; the most widely used is Latin-1, also known as ISO-8859-1.

In ISO-8859-1, Unicode code points 0–255 are identical to the Latin-1 byte values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.

When this exception occurs while you are trying to load a data set, add the encoding at the end of the call and the data set should then load:

df=pd.read_csv("top50.csv",encoding='ISO-8859-1')
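
The two numbered cases above describe a 7-bit codec such as ascii; a small sketch:

print("abc".encode("ascii"))  # every code point < 128 encodes byte-for-byte
try:
    "é".encode("ascii")       # code point 233 >= 128: not representable
except UnicodeEncodeError as exc:
    print(exc)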

冧九 2024-11-06 14:52:51

This type of error comes up when you read a particular file or data set into pandas, such as:

data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')

The error is then displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte

This type of error can be avoided by adding an encoding argument:

data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

救赎№ 2024-11-06 14:52:51

This happened to me too, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.
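
The same re-encoding can be done from Python; a sketch that assumes the source file was saved in a Hebrew legacy encoding such as cp1255 (adjust to whatever it really is) and uses a placeholder file name:

source_encoding = "cp1255"  # assumption: the file's actual legacy encoding
with open("notes.txt", "r", encoding=source_encoding) as f:  # placeholder name
    text = f.read()
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write(text)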

迷你仙 2024-11-06 14:52:51

TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.

I got this error as I was processing a large number of zip files with additional zip files in them.

My workflow was the following:

  1. Read zip
  2. Read child zip
  3. Read text from child zip

At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representations that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
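
A sketch of the sanity check that would have caught this earlier; zipfile.is_zipfile accepts a file-like object and tests the ZIP magic number:

import io
import zipfile

def is_nested_zip(raw: bytes) -> bool:
    # True if the "text" we are about to decode is actually another zip
    return zipfile.is_zipfile(io.BytesIO(raw))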

柠北森屋 2024-11-06 14:52:51

In this case, I was trying to execute a .py that ran a path/file.sql.

My solution was to change the encoding of file.sql to "UTF-8 without BOM", and it worked!

You can do it with Notepad++.

I'll leave part of my code here.

import sys
import psycopg2

# Connection parameters are taken from the command line.
con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2],
                       dbname=sys.argv[3], user=sys.argv[4],
                       password=sys.argv[5])

cursor = con.cursor()
sqlfile = open(path, 'r')  # path points at file.sql
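
A hedged alternative to re-saving the file: the utf-8-sig codec strips a leading BOM if one is present, so the same script tolerates files saved either way (this continues the snippet above):

with open(path, 'r', encoding='utf-8-sig') as sqlfile:
    cursor.execute(sqlfile.read())
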
风蛊 2024-11-06 14:52:51

I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, while in the Google Sheets file I chose Save a copy, and when my browser downloaded it I chose Open and then saved the CSV directly. This was the wrong move.

What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting the single sheet as a .csv. Then the error went away for pd.read_csv('myfile.csv').

鹊巢 2024-11-06 14:52:51

The solution was to change the encoding to "UTF-8 without BOM" ("UTF-8 sin BOM").
