How to handle encoding correctly when passing data from a news feed to an IRC server

Posted 2024-11-24 19:20:58

Code:

import socket, feedparser

feed = feedparser.parse("http://pwnmyi.com/feed")
latest = feed.entries[0]
art_name = latest.title

network = 'irc.rizon.net'
port = 6667
irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
irc.connect((network, port))
print irc.recv(4096)
irc.send('NICK PwnBot\r\n')
irc.send('USER PwnBot PwnBot PwnBot :PwnBot by Fike\r\n')
irc.send('JOIN #pwnmyi\r\n')
while True:
    data = irc.recv(4096)
    if data.find('PING') != -1:
        irc.send('PONG ' + data.split() [1] + '\r\n')
    if data.find( '!latest' ) != -1:
        irc.send('PRIVMSG #pwnmyi :Latest Article: ' + art_name + '\r\n')

It connects and so on, but when I type !latest in the channel, it quits with this:

    irc.send('PRIVMSG #pwnmyi :Latest Article: ' + art_name + '\r\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 55: ordinal not in range(128)

Could you please help me debug this code? It used to work for me before.

思慕 2024-12-01 19:20:58

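The IRC protocol does not define a particular character set encoding for messages. Rather, it is an 8-bit protocol in which certain octets are used as control characters (see RFC 1459, section 2.2).

Apparently the popular mIRC client will decode UTF-8 sequences if it recognizes them as such. This makes good sense for IRC, since ASCII code points are encoded with the same bytes as the ASCII characters, and non-ASCII code points are all encoded as byte values > 127.

In Python, that's spelled unicode.encode(encoding='utf8'), like so: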

>>> u'\u0ca0_\u0ca0'.encode('utf8')
'\xe0\xb2\xa0_\xe0\xb2\xa0'
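
Applied to the bot in the question, this could look roughly like the sketch below. It assumes the channel's clients accept UTF-8, and the title used here is a made-up stand-in for latest.title, not real feed data. It also checks the two properties mentioned above: ASCII text keeps its byte values, and every byte of a non-ASCII character is above 127.

```python
# Hypothetical stand-in for latest.title (contains U+2013, an en dash).
art_name = u'Old \u2013 New'

# ASCII code points keep their single-byte ASCII values in UTF-8.
ascii_bytes = u'Old'.encode('utf8')

# Every byte used for a non-ASCII code point is above 127.
dash_bytes = bytearray(u'\u2013'.encode('utf8'))
high_only = all(byte > 127 for byte in dash_bytes)

# Build the whole line as unicode, then encode it once before sending.
message = (u'PRIVMSG #pwnmyi :Latest Article: ' + art_name + u'\r\n').encode('utf8')
```

Passing message to irc.send() hands the socket plain bytes, so no implicit ASCII encoding takes place.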

策马西风 2024-12-01 19:20:58

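You must encode the string you send to the IRC server. Also, depending on what feedparser returns, you may need to decode it from a specific encoding first.

Which encoding to use depends on what the feed contains.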
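
A minimal sketch of that idea follows. The traceback in the question suggests latest.title is already a unicode string, so the decoding branch may never run for this feed. The helper name to_wire and the source_encoding parameter are hypothetical, and UTF-8 as the outgoing encoding is an assumption.

```python
def to_wire(value, source_encoding='utf-8'):
    """Return UTF-8 bytes for value, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        # Byte string from the feed: decode it with the feed's own encoding.
        value = value.decode(source_encoding)
    # value is unicode text here; encode it for the socket.
    return value.encode('utf-8')
```

For example, to_wire(art_name) could replace art_name when building the PRIVMSG line, provided the rest of the line is bytes as well.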

别再吹冷风 2024-12-01 19:20:58

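latest.title has non-ASCII characters in it.

You must either remove them, escape them, or translate them.

The cheap and easy way out is to use repr():

    irc.send('PRIVMSG #pwnmyi :Latest Article: ' + repr(art_name) + '\r\n')

Or, better:

    irc.send('PRIVMSG #pwnmyi :Latest Article: {0!r}\r\n'.format( art_name ) )

In the long run, you need to address the non-ASCII characters in your input.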
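
The three options (remove, escape, translate) could be sketched as below. The sample title and the PUNCTUATION table are invented for illustration, and the table covers only a few common characters.

```python
import unicodedata

# Hypothetical title containing an accented letter and an en dash.
title = u'Caf\u00e9 \u2013 news'

# Remove: drop every character that ASCII cannot represent.
removed = title.encode('ascii', 'ignore')

# Escape: keep the information as backslash escapes.
escaped = title.encode('ascii', 'backslashreplace')

# Translate: map known punctuation to ASCII look-alikes, split accented
# letters into base letter plus combining mark, then drop what is left.
PUNCTUATION = {0x2013: u'-', 0x2014: u'-', 0x2018: u"'", 0x2019: u"'"}
decomposed = unicodedata.normalize('NFKD', title.translate(PUNCTUATION))
translated = decomposed.encode('ascii', 'ignore')
```

All three results are plain ASCII byte strings. Each one loses or alters some of the original text, which is why encoding the message as UTF-8 is usually the better long-term fix.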

莫多说 2024-12-01 19:20:58

Personally, I'd recommend converting all strings to UTF-8. You can encode and decode unicode strings with helpers like these:

def decode(raw):
    """Decode a byte string to unicode, trying the likeliest encodings first."""
    # cp1252 has to come before iso-8859-1: iso-8859-1 accepts every byte,
    # so any encoding tried after it would never be reached.
    for encoding in ('utf-8', 'cp1252'):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    return raw.decode('iso-8859-1')


def encode(text):
    """Encode a unicode string to UTF-8 bytes."""
    # UTF-8 can represent every code point, so no fallback encoding is needed.
    return text.encode('utf-8')

This is an excellent website that explains Python's Unicode handling: http://farmdev.com/talks/unicode

The three best tips from it are:

Decode early
Unicode everywhere
Encode late
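
As a rough sketch, the three tips could be applied to the bot from the question like this. The helper names decode_line, handle_line and encode_line are hypothetical, and treating incoming traffic as UTF-8 (with undecodable bytes replaced) is an assumption, since IRC itself does not fix an encoding.

```python
def decode_line(raw):
    """Decode early: turn bytes from the socket into unicode right away."""
    return raw.decode('utf-8', 'replace')


def handle_line(line, art_name):
    """Unicode everywhere: all bot logic works on unicode strings only."""
    if line.startswith(u'PING'):
        return u'PONG ' + line.split()[1]
    if u'!latest' in line:
        return u'PRIVMSG #pwnmyi :Latest Article: ' + art_name
    return None


def encode_line(reply):
    """Encode late: convert to bytes only at the moment of sending."""
    return (reply + u'\r\n').encode('utf-8')
```

In the main loop this would become something like: reply = handle_line(decode_line(irc.recv(4096)), art_name), followed by irc.send(encode_line(reply)) whenever reply is not None.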
