UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

Posted 2025-01-04 06:12:11


I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:

Traceback (most recent call last):  
  File "SCRIPT LOCATION", line NUMBER, in <module>  
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode  
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>

After reading this Q&A, see How to determine the encoding of text if you need help figuring out the encoding of the file you are trying to open.


奢欲 2025-01-11 06:12:11


The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.

You specify the encoding when you open the file:

file = open(filename, encoding="utf8")
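As a quick illustration of this answer's point (the sample bytes below are made up: valid UTF-8 text in which 0x90 appears as a continuation byte of a four-byte sequence), you can probe a few candidate encodings against the raw bytes and see which ones accept byte 0x90:

```python
# Probe candidate encodings against raw bytes containing 0x90.
# Hypothetical sample: "café " followed by a four-byte emoji whose
# last byte is 0x90.
raw = b"caf\xc3\xa9 \xf0\x9f\x98\x90"

for enc in ("cp1252", "latin-1", "utf-8"):
    try:
        print(enc, "->", raw.decode(enc))
    except UnicodeDecodeError as exc:
        print(enc, "-> failed:", exc.reason)
```

Here cp1252 fails exactly as in the question's traceback, latin-1 "succeeds" but produces mojibake (it maps every byte to something), and utf-8 decodes the text correctly.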
娇纵 2025-01-11 06:12:11


If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
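For a concrete feel of what the errors parameter does (the byte string below is a made-up sample), compare errors="ignore" with errors="replace" on bytes containing the undecodable 0x90:

```python
# 0x90 is undefined in cp1252: "ignore" silently drops it, while
# "replace" substitutes the Unicode replacement character U+FFFD.
raw = b"abc\x90def"

print(raw.decode("cp1252", errors="ignore"))   # abcdef
print(raw.decode("cp1252", errors="replace"))  # abc\ufffddef
```

Both options hide real data problems, so they are best kept for files where the offending characters genuinely don't matter.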

心是晴朗的。 2025-01-11 06:12:11


Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:

open(filename, 'rb')

where r = reading, b = binary
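A minimal sketch of this (the file name is a placeholder): in binary mode read() returns bytes, so no decoding happens and no UnicodeDecodeError can occur:

```python
import os
import tempfile

# Create a sample file containing a byte (0x90) that cp1252 cannot decode.
path = os.path.join(tempfile.mkdtemp(), "upload.bin")
with open(path, "wb") as f:
    f.write(b"header \x90 payload")

# 'rb' returns the raw bytes untouched: nothing is decoded.
with open(path, "rb") as f:
    data = f.read()

print(type(data))  # <class 'bytes'>
```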

赴月观长安 2025-01-11 06:12:11


TLDR: Try: file = open(filename, encoding='cp437')

Why? When one uses:

file = open(filename)
text = file.read()

Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to its own default UTF-8. If the file contains characters with values not defined in this codepage (like 0x90) we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may not be handled by Python (e.g. cp790), and sometimes the file can contain mixed encodings.

If such characters are unneeded, one may decide to replace them with the Unicode replacement character (U+FFFD), with:

file = open(filename, errors='replace')

Another workaround is to use:

file = open(filename, errors='ignore')

The offending characters are then dropped, but other errors will be masked too.

A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which maps every single-byte value (0..255) to a character (like cp437 or latin1):

file = open(filename, encoding='cp437')

Codepage 437 is just an example. It is the original DOS encoding. All codes are mapped, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable) and one can check their ord() values.

Please note that this advice is just a quick workaround for a nasty problem. The proper solution is to use binary mode, although it is not as quick.
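The property this answer relies on can be checked directly (a standalone sketch): every one of the 256 byte values decodes under cp437, and encoding the result back reproduces the original bytes exactly:

```python
# cp437 maps all 256 byte values to characters, so decoding never fails
# and encoding back reproduces the original bytes exactly.
raw = bytes(range(256))
text = raw.decode("cp437")            # never raises
assert text.encode("cp437") == raw    # lossless round trip

# Individual bytes stay distinguishable and inspectable via ord();
# byte 0x90 decodes to cp437's 'É' (U+00C9).
print(hex(ord(text[0x90])))  # 0xc9
```

The same round-trip check passes for latin1, the other all-bytes-mapped encoding the answer mentions.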

复古式 2025-01-11 06:12:11


As an extension to @LennartRegebro's answer:

If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools that you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.

EDIT: (Copied from comment)

A quite popular text editor, Sublime Text, has a command to display the encoding if it has been set...

  1. Go to View -> Show Console (or Ctrl+`)

  2. Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)

魂牵梦绕锁你心扉 2025-01-11 06:12:11


Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:

open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')

Godspeed

想你的星星会说话 2025-01-11 06:12:11


The code below reads the file, decoding its UTF-8 content:

with open("./website.html", encoding="utf8") as file:
    contents = file.read()
可是我不能没有你 2025-01-11 06:12:11
def read_files(file_path):

    with open(file_path, encoding='utf8') as f:
        text = f.read()
        return text

OR (AND)

def write_files(text, file_path):
    # Write in binary mode ('wb', not 'rb') since we pass encoded bytes.
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

OR

from docx import Document  # python-docx

document = Document()
document.add_heading(file_path.name, 0)
file_content = file_path.read_text(encoding='UTF-8')
document.add_paragraph(file_content)

OR

def read_text_from_file(cale_fisier):
    text = cale_fisier.read_text(encoding='UTF-8')
    print("What I read: ", text)
    return text  # return the text that was read

def save_text_into_file(cale_fisier, text):
    with open(cale_fisier, "w", encoding='utf-8') as f:  # open file
        print("What I wrote: ", text)
        f.write(text)  # write the content to the file

OR

def read_text_from_file(file_path):
    with open(file_path, encoding='utf8', errors='ignore') as f:
        text = f.read()
        return text # return written text


def write_to_file(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore')) # write the content to the file

OR

import os
import glob

def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
    '''
    Read the file at path fname with its original encoding (from_encoding)
    and rewrites it with to_encoding.
    '''
    with open(fname, encoding=from_encoding) as f:
        text = f.read()

    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)
吻泪 2025-01-11 06:12:11


Before you apply the suggested solution, you can check which character the byte that appeared in your file (and in the error log), in this case 0x90, maps to: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090)

and then consider removing it from the file.
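Building on that idea, a small helper (the function name, file name, and offset below are placeholders) can dump the raw bytes around the offset reported in the traceback, so you can see the offending byte in context before deciding to remove it:

```python
def show_context(path, offset, width=20):
    """Print and return the raw bytes surrounding a byte offset."""
    with open(path, "rb") as f:
        f.seek(max(offset - width, 0))
        chunk = f.read(2 * width)
    print(chunk)
    return chunk

# Hypothetical usage, with the offset from the question's traceback:
# show_context("some_file.txt", 2907500)
```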

不必在意 2025-01-11 06:12:11


For me, encoding with utf16 worked:

file = open('filename.csv', encoding="utf16")
猫弦 2025-01-11 06:12:11


In newer versions of Python (starting with 3.7), you can add the interpreter option -X utf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -X utf8).

Or, equivalently, you can set the environment variable PYTHONUTF8 to 1.
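To confirm which mode the interpreter is actually running in, a quick standalone check:

```python
import locale
import sys

# 1 when UTF-8 mode is enabled (-X utf8 or PYTHONUTF8=1), else 0.
print(sys.flags.utf8_mode)

# The encoding open() will use by default for text files.
print(locale.getpreferredencoding(False))
```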

孤者何惧 2025-01-11 06:12:11


For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.

Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".

静谧幽蓝 2025-01-11 06:12:11

If you are on Windows, the file may start with a UTF-8 BOM, indicating that it is a UTF-8 file. Per https://bugs.python.org/issue44510, I used encoding="utf-8-sig", and the csv file was read successfully.
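A small illustration of the difference (the sample bytes are made up): utf-8-sig strips a leading byte-order mark, while plain utf-8 leaves it in the decoded text as U+FEFF:

```python
# A UTF-8 BOM (EF BB BF) followed by ordinary CSV content.
raw = b"\xef\xbb\xbfname,age\nalice,30\n"

print(repr(raw.decode("utf-8")[:5]))      # the BOM leaks in as '\ufeff'
print(repr(raw.decode("utf-8-sig")[:4]))  # 'name' -- BOM stripped
```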

秋凉 2025-01-11 06:12:11


For me, changing the MySQL character encoding to match my code helped to sort out the solution:

photo = open('pic3.png', encoding='latin1')

沉溺在你眼里的海 2025-01-11 06:12:11


This is an example of how I open and close file with UTF-8, extracted from a recent code:

def traducere_v1_txt(translator, file):
    data = []
    with open(f"{base_path}/{file}", "r", encoding='utf8', errors='ignore') as open_file:
        data = open_file.readlines()

    # ... translation happens here, producing lxml1 ...

    file_name = file.replace(".html", "")
    with open(f"Translated_Folder/{file_name}_{input_lang}.html", "w", encoding='utf8') as htmlfile:
        htmlfile.write(lxml1)
森林散布 2025-01-11 06:12:11


This check helped me solve the issue:

import csv

import chardet  # third-party: pip install chardet

with open(input_file, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
encoding = result['encoding']

print(f"Detected encoding: {encoding}")

with open(input_file, 'r', newline='', encoding=encoding, errors='replace') as csvfile:
    reader = csv.reader(csvfile)
    # read the file...