Python IRC bot and encoding issues

Posted 2024-07-22 07:33:42

I have a simple IRC bot written in Python.

Since I migrated it to Python 3.0, which differentiates between bytes and Unicode strings, I have been running into encoding issues, specifically with other users who do not send UTF-8.

Now, I could just tell everyone to send UTF-8 (which they should anyway), but a better solution would be to have Python fall back to some other encoding.

So far the code looks like this:

data = str(irc.recv(4096), "UTF-8", "replace")

That at least doesn't throw exceptions. However, I want to go further: I want my bot to fall back to another encoding, or to detect "troublesome characters" somehow.
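
To make the effect of the "replace" error handler concrete, here is a minimal sketch. The sample bytes are an assumption for illustration: the word "café" as a CP1252 client would send it.

```python
# Hypothetical sample: "café" encoded by a client that uses CP1252.
raw = "café".encode("cp1252")  # b'caf\xe9'

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" never raises, but the character is lost:
# the bad byte becomes U+FFFD, the replacement character.
replaced = str(raw, "utf-8", "replace")

# Decoding with the encoding the sender actually used recovers the text.
recovered = raw.decode("cp1252")

print(strict_ok, repr(replaced), recovered)
```

So "replace" keeps the bot alive but silently discards the original character, which is why a fallback encoding is attractive.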

Additionally, I need to figure out what mysterious encoding mIRC actually uses, since other clients appear to work fine and send UTF-8 as they should.

How should I go about doing those things?

依靠 2024-07-29 07:33:42

chardet should help: it is the canonical Python library for detecting unknown encodings.
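
A minimal sketch of how chardet could be wired in, assuming the third-party chardet package is installed. The function name detect_and_decode, the CP1252 fallback, and the 0.5 confidence threshold are illustrative choices, not something from this thread.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # without it, this sketch always uses the fallback


def detect_and_decode(raw, fallback="cp1252", min_confidence=0.5):
    """Decode raw bytes using chardet's guess, or a fallback encoding."""
    encoding = None
    if chardet is not None:
        guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            encoding = guess["encoding"]
    try:
        return raw.decode(encoding or fallback)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess, or an encoding name Python has no codec for.
        return raw.decode(fallback, "replace")


print(detect_and_decode(b"PRIVMSG #chan :hello"))
```

The threshold matters because chardet reports a confidence along with each guess, and guesses on very short inputs tend to be weak.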

小帐篷 2024-07-29 07:33:42

chardet will probably be your best solution, as RichieHindle mentioned. However, if you want to cover about 90% of the text you will see, you can use what I use:

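def decode(data):
    # Try strict UTF-8 first, then CP1252. ISO-8859-1 goes last because it
    # maps every byte value and therefore never raises.
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        try:
            text = data.decode('cp1252')
        except UnicodeDecodeError:
            text = data.decode('iso-8859-1')
    return text


def encode(text):
    # UTF-8 can encode any str except one containing lone surrogates, so the
    # fallbacks rarely run; errors='replace' keeps the last one from raising.
    try:
        data = text.encode('utf-8')
    except UnicodeEncodeError:
        try:
            data = text.encode('iso-8859-1')
        except UnicodeEncodeError:
            data = text.encode('cp1252', 'replace')
    return data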
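
The order of the fallbacks in a chain like this matters, because ISO-8859-1 accepts every byte while CP1252 does not. A small check, using two hypothetical input bytes chosen for illustration:

```python
# Hypothetical input: 0x80 is the euro sign in CP1252; 0x81 is unassigned there.
raw = b"\x80\x81"

# ISO-8859-1 maps all 256 byte values, so decoding with it cannot fail.
latin1_text = raw.decode("iso-8859-1")

# Python's CP1252 codec leaves a few byte values undefined and raises on them.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# The same byte means different things in the two encodings.
euro = b"\x80".decode("cp1252")

print(repr(latin1_text), cp1252_ok, euro)
```

Any encoding listed after ISO-8859-1 in a try/except chain is therefore never reached.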

路弥 2024-07-29 07:33:42

Using only chardet gives poor results when messages are short, which is the case in IRC.

Combining chardet with remembering the encoding of each user across messages could make sense. For simplicity, though, I'd first try a few likely encodings (which ones depends on culture and era; see http://en.wikipedia.org/wiki/Internet_Relay_Chat#Character_encoding) and, if they all fail, fall back to chardet (this helps when someone uses one of the East Asian encodings).

For example:

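import chardet

def decode_irc(raw, preferred_encs=("UTF-8", "CP1252")):
    # ISO-8859-1 is left out of the defaults: it decodes any byte sequence,
    # so listing it would make the chardet fallback unreachable.
    for enc in preferred_encs:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    # None of the preferred encodings worked; ask chardet for a guess.
    # chardet may report None as the encoding, hence the default.
    enc = chardet.detect(raw)['encoding'] or "ISO-8859-1"
    try:
        return raw.decode(enc)
    except UnicodeDecodeError:
        return raw.decode(enc, 'ignore')
    except LookupError:
        # chardet named an encoding Python has no codec for.
        return raw.decode("ISO-8859-1")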
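
The idea of remembering an encoding per user could be sketched roughly as follows. The class name EncodingMemory, its methods, and the sample nicks are hypothetical; how a user's encoding gets learned (for example from chardet run over several of their messages) is left open here.

```python
class EncodingMemory:
    """Remember a legacy encoding per nick and try it right after UTF-8."""

    def __init__(self, fallbacks=("cp1252",)):
        self.fallbacks = tuple(fallbacks)
        self.by_nick = {}  # nick -> legacy encoding known for that user

    def remember(self, nick, encoding):
        self.by_nick[nick] = encoding

    def decode(self, nick, raw):
        # UTF-8 is strict, so a successful UTF-8 decode is strong evidence.
        order = ["utf-8"]
        if nick in self.by_nick:
            order.append(self.by_nick[nick])
        order.extend(self.fallbacks)
        for enc in order:
            try:
                return raw.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
        # Nothing decoded cleanly; keep what can be kept.
        return raw.decode("utf-8", "replace")


memory = EncodingMemory()
memory.remember("ivan", "cp1251")  # hypothetical user known to send CP1251
print(memory.decode("ivan", "привет".encode("cp1251")))
```

Trying UTF-8 first even for a remembered user means someone who switches their client to UTF-8 is still decoded correctly.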

将军与妓 2024-07-29 07:33:42

OK, after some research it turns out chardet has trouble with Python 3. The solution turned out to be simpler than I thought: I chose to fall back on CP1252 if UTF-8 doesn't cut it:

data = irc.recv(4096)
try:
    data = str(data, "UTF-8")
except UnicodeDecodeError:
    # "replace" covers the few byte values that CP1252 leaves undefined.
    data = str(data, "CP1252", "replace")

That seems to be working. It doesn't detect the encoding, though, so if somebody shows up with an encoding that is neither UTF-8 nor CP1252, I will have a problem again.

This is really just a temporary solution.
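
One refinement of the same fallback, sketched under assumptions: a single recv() chunk can contain lines from several senders, so decoding line by line keeps one CP1252 user from forcing the fallback onto everyone else's UTF-8 text. The helper names and the sample chunk are hypothetical, and the sketch ignores lines that are split across two recv() calls, which a real bot would need to buffer.

```python
def decode_line(raw_line):
    """UTF-8 first, then CP1252; never raises."""
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode("cp1252", "replace")


def decode_chunk(chunk):
    """Split a received chunk into IRC lines and decode each one separately."""
    return [decode_line(line) for line in chunk.split(b"\r\n") if line]


# Hypothetical chunk: one UTF-8 sender and one CP1252 sender.
chunk = (
    ":a!u@h PRIVMSG #c :naïve".encode("utf-8") + b"\r\n"
    + ":b!u@h PRIVMSG #c :naïve".encode("cp1252") + b"\r\n"
)
lines = decode_chunk(chunk)
print(lines)
```

Decoding the whole chunk at once would fail as UTF-8 and then turn the first sender's "ï" into mojibake under CP1252.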
