Error reading file - 'utf-8' codec can't decode byte 0xff in position 45: invalid start byte

Posted on 2025-02-02 12:28:05

I've got these two scripts right here, send.py and receive.py. send.py is the host: it opens a connection and waits for receive.py to connect. Once the connection is successful, in theory, I could send any file from one device (with the send.py script) to another (with the receive.py script). Little problem... I was trying to read from a random music file I found on my computer to make sure it works with any type of file and encountered the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 45: invalid start byte

What causes this error?

send.py:

from socket import *

port = 42069
s = socket(AF_INET, SOCK_STREAM)
s.bind(('0.0.0.0', port))
s.listen(1)

c, addr = s.accept()

buffersize = 128

fname = '✵ТГК -Гелик 2022✵ Gelik✵-160 (mp3cut.net).mp3' #input('File Path: ')

with open(fname, 'rb') as file:
    readfc = file.read()

c.send(fname.encode())

if len(readfc) > buffersize:
    for packet in range(len(readfc) % buffersize):
        c.send(readfc[0:buffersize])

and receive.py:

from socket import *

port = 42069
s = socket(AF_INET, SOCK_STREAM)
s.connect(('192.168.0.171', port))

index = 0
while True:
    data = s.recv(1024)
    if not data:
        pass
    else:
        index += 1
        if index == 1:
            filename = data.decode()
        else:
            with open(filename, 'ab') as file:
                file.write(data.decode())

And here are the first lines from the music file:

ID3     #TSSE     Lavf59.16.100           яыа                                 Info     #R ђ.3 

!$&)+.0369:=@CEGJMORUVY\_acfiknqsux{}Ђ‚…‡ЉЌЏ‘”—љњћЎЈ¦©«­°і¶ёєЅАВЕЗКМПТФЦЩЬЮбгжилортхшъэ    Lavc59.18            $@     ђ.3ЮЬмf                                                                яыаD р  i


Comments (2)

你是年少的欢喜 2025-02-09 12:28:06

This code is assuming that a single send in the sender matches a single recv in the recipient. This assumption is wrong for TCP: TCP is only an unstructured byte stream and not a structured message transport which would preserve message boundaries over send/recv.

This means that the initial data = s.recv(1024) in the recipient might not only include the filename, but might also already include parts of the music file. Thus it is a mix of the utf-8 encoded filename (multi-byte characters) followed by the binary music data (bytes). Trying to filename = data.decode() on this will successfully decode the initial filename. But it will continue to decode the data after the end of the filename and thus treat the binary music data also as multi-byte characters encoded in utf-8. This will lead to the observed decoding error.
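A small illustration of that failure mode, assuming the first recv() really does return the filename immediately followed by the start of the file content. The filename `song.mp3` and the sample bytes are made up for the example; the only fact relied on is that 0xff can never start a UTF-8 character.

```python
# Hypothetical bytes as a single recv() might return them: a UTF-8 filename
# immediately followed by the beginning of the binary file content.
filename_bytes = "song.mp3".encode()            # made-up filename
audio_bytes = b"ID3\x04\x00" + b"\xff\xfb\xe0"  # made-up sample; 0xff is invalid UTF-8

received = filename_bytes + audio_bytes

# Decoding the whole buffer fails as soon as the decoder reaches 0xff.
try:
    received.decode()
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding only the part known to be the filename works.
name = received[:len(filename_bytes)].decode()

print(error)
print(name)
```

The slice only works here because the example knows how long the filename is, which is exactly the information the receiver is missing in the original scripts.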

The fix should be to clearly mark where the filename ends and the binary data starts, and then decode only the filename as text and treat the rest as bytes. A common approach is to prefix the filename with its length so that it is clear where it ends. Another approach might be to add a \0 at the end of the filename (a zero byte is not part of any valid UTF-8 encoded character except NUL itself, which is invalid in filenames) and split the incoming data on this delimiter.
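The length-prefix approach could be sketched roughly as follows. This is only one possible layout, not part of the original scripts: the 2-byte big-endian length header and the helper names `send_file`, `recv_exact` and `recv_file` are assumptions chosen for the illustration, and the end of the file is signalled simply by the sender closing the connection.

```python
import socket
import struct

# Assumed framing: 2-byte big-endian length of the UTF-8 encoded filename.
HEADER = struct.Struct("!H")

def send_file(conn, filename, content):
    """Send the filename length, the UTF-8 filename, then the raw bytes."""
    name_bytes = filename.encode("utf-8")
    conn.sendall(HEADER.pack(len(name_bytes)) + name_bytes)
    conn.sendall(content)  # binary data goes out unchanged

def recv_exact(conn, count):
    """Read exactly `count` bytes, however TCP splits them across recv() calls."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            raise ConnectionError("connection closed before all bytes arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_file(conn):
    """Return (filename, content); the content ends when the sender closes."""
    (name_len,) = HEADER.unpack(recv_exact(conn, HEADER.size))
    filename = recv_exact(conn, name_len).decode("utf-8")  # only the name is decoded
    parts = []
    while True:
        chunk = conn.recv(1024)
        if not chunk:  # empty bytes means the peer closed the connection
            break
        parts.append(chunk)
    return filename, b"".join(parts)

if __name__ == "__main__":
    # Local demonstration over a connected socket pair instead of a real network.
    a, b = socket.socketpair()
    send_file(a, "song.mp3", b"ID3\xff\xfb\xe0")
    a.close()
    print(recv_file(b))
    b.close()
```

Because the receiver reads exactly as many bytes as the header announces before decoding, it no longer matters how the stream happens to be split across recv() calls.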

Apart from that the later data.decode() when reading the music data is plain wrong since there is no matching encode() on the sender side. And there should not be one since these are binary data, i.e. already bytes.

旧伤还要旧人安 2025-02-09 12:28:06

In addition to what @StefanUllrich said:

You receive binary data in line 9.
You open your file in binary mode in line 17.
All of this is correct.
Why do you think you need to decode the binary data to a string in line 18??? That's what's causing the exception you're seeing. Just don't call .decode(), write that data as it is!
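A minimal sketch of writing the received data as it is, using made-up sample chunks and a temporary file in place of a real socket. The helper name `append_chunk` is hypothetical. The second half shows that a file opened in binary mode does not accept str at all, so decoded text could not have been written even if decode() had succeeded.

```python
import os
import tempfile

def append_chunk(path, data):
    """Append received bytes to the file unchanged; no decode() involved."""
    with open(path, "ab") as file:
        file.write(data)

# Made-up stand-ins for what successive recv() calls might return.
chunks = [b"ID3\x04\x00", b"\xff\xfb\xe0D"]
path = os.path.join(tempfile.mkdtemp(), "copy.bin")
for chunk in chunks:
    append_chunk(path, chunk)

with open(path, "rb") as file:
    written = file.read()
print(written == b"".join(chunks))

# A binary-mode file rejects str with a TypeError.
try:
    with open(path, "ab") as file:
        file.write("text")
    str_rejected = False
except TypeError:
    str_rejected = True
print(str_rejected)
```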
