当编码采用 shift_jis 时,使用 Python 的电子邮件模块解析电子邮件时出错

发布于 2024-10-28 12:54:43 字数 1316 浏览 2 评论 0原文

当我尝试使用电子邮件解析器解码 shift_jis 编码的电子邮件并将其转换为 unicode 时,我收到一条错误消息“UnicodeDecodeError: 'shift_jis' 编解码器无法解码位置 2-3 中的字节:非法多字节序列”。代码和电子邮件可以在下面找到:

import email.header
import base64
import sys
import email

def getrawemail():
    line = ' '
    raw_email = ''
    while line:
        line = sys.stdin.readline()
        raw_email += line
    return raw_email

def getheader(subject, charsets):
    for i in charsets:
        if isinstance(i, str):
            encoding = i
            break
    if subject[-2] == "?=":
        encoded = subject[5 + len(encoding):len(subject) - 2]
    else:
        encoded = subject[5 + len(encoding):]
    return (encoding, encoded)

def decodeheader((encoding, encoded)):
    decoded = base64.b64decode(encoded)
    decoded = unicode(decoded, encoding)
    return decoded

raw_email = getrawemail()
msg = email.message_from_string(raw_email)
subject = decodeheader(getheader(msg["Subject"], msg.get_charsets()))
print subject

电子邮件:http://pastebin.com/L4jAkm5R

我已阅读另一个堆栈溢出问题,这可能与 Unicode 和 shift_jis 编码方式之间的差异有关(他们引用了 Microsoft 知识库文章)。如果有人知道我的代码中的什么可能导致它无法工作,或者如果这甚至可以合理修复,我将非常感谢找出如何解决。

I am getting an error that says "UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 2-3: illegal multibyte sequence" when I try to use my email parser to decode a shift_jis encoded email and convert it to unicode. The code and email can be found below:

import email.header
import base64
import sys
import email

def getrawemail():
    line = ' '
    raw_email = ''
    while line:
        line = sys.stdin.readline()
        raw_email += line
    return raw_email

def getheader(subject, charsets):
    for i in charsets:
        if isinstance(i, str):
            encoding = i
            break
    if subject[-2] == "?=":
        encoded = subject[5 + len(encoding):len(subject) - 2]
    else:
        encoded = subject[5 + len(encoding):]
    return (encoding, encoded)

def decodeheader((encoding, encoded)):
    decoded = base64.b64decode(encoded)
    decoded = unicode(decoded, encoding)
    return decoded

raw_email = getrawemail()
msg = email.message_from_string(raw_email)
subject = decodeheader(getheader(msg["Subject"], msg.get_charsets()))
print subject

Email: http://pastebin.com/L4jAkm5R

I have read on another Stack Overflow question that this may be related to a difference between how Unicode and shift_jis are encoded (they referenced this Microsoft Knowledge Base article). If anyone knows what in my code could be causing it to not work, or if this is even reasonably fixable, I would very much appreciate finding out how.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

江心雾 2024-11-04 12:54:44

以此字符串开头:

In [124]: msg['Subject']
Out[124]: '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'

=?ISO-2022-JP?B? 表示该字符串先进行 ISO-2022-JP 编码,然后进行 Base64 编码。

In [125]: msg['Subject'].lstrip('=?ISO-2022-JP?B?')
Out[125]: 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'

不幸的是,尝试逆转该过程会导致错误:

In [126]: base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?'))
TypeError: Incorrect padding

阅读此 SO 答案让我尝试在字符串末尾添加“?=”:

In [130]: print(base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?')+'?=').decode('ISO-2022-JP'))
貴方にとても大切なお知らせがあり

根据谷歌翻译,这可能会被翻译为“你知道有一个非常重要的”。

因此看来主题行已被截断。

Starting with this string:

In [124]: msg['Subject']
Out[124]: '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'

=?ISO-2022-JP?B? means the string is ISO-2022-JP encoded, then base64 encoded.

In [125]: msg['Subject'].lstrip('=?ISO-2022-JP?B?')
Out[125]: 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'

Unfortunately, trying to reverse that process results in an error:

In [126]: base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?'))
TypeError: Incorrect padding

Reading this SO answer lead me to try adding '?=' to the end of the string:

In [130]: print(base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?')+'?=').decode('ISO-2022-JP'))
貴方にとても大切なお知らせがあり

According to google translate, this may be translated as "You know there is a very important".

So it appears the subject line has been truncated.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文