Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?

Posted 2024-10-30 05:16:42

Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:

In [5]: print urllib.unquote(urllib.quote(u'Cataño'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

/home/kkinder/<ipython console> in <module>()

/usr/lib/python2.6/urllib.pyc in quote(s, safe)
   1222             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
   1223         _safemaps[cachekey] = safe_map
-> 1224     res = map(safe_map.__getitem__, s)
   1225     return ''.join(res)
   1226 

KeyError: u'\xc3'

Encoding the value to UTF8 also does not work:

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño

It's recognized as a bug and there is a fix, but not for my version of Python.

What I'd like is something similar to urllib.quote/urllib.unquote that handles unicode values correctly, so that this code would work:

decode_url(encode_url(u'Cataño')) == u'Cataño'

Any recommendations?

Comments (4)

"Python's urllib.quote and urllib.unquote do not handle Unicode correctly"

urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.

IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.

"Encoding the value to UTF8 also does not work:"

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño

Ah, well now you're typing Unicode into a console, and doing print-Unicode to the console. This is generally unreliable, especially in Windows and in your case with the IPython console.
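
One way such a console mismatch can show up, as a minimal sketch: it assumes the terminal sends UTF-8 bytes and something along the way reads them as Latin-1. That is an assumption for illustration, not a diagnosis of the setup in the question. Under that assumption the misread string contains u'\xc3', the character named in the KeyError in the question.

```python
text = u'Cata\u00f1o'

# UTF-8 represents the single character U+00F1 as two bytes, C3 and B1.
utf8_bytes = text.encode('utf-8')

# Reading those bytes as Latin-1 turns each byte into its own character,
# so the string grows by one character and no longer contains U+00F1.
misread = utf8_bytes.decode('latin-1')
print(repr(misread))

# The damage is reversible as long as the wrong guess is known.
recovered = misread.encode('latin-1').decode('utf-8')
print(repr(recovered))
```

Printing with repr, rather than printing the string itself, keeps the terminal's own encoding out of the picture.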

Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:

>>> u'Cata\u00F1o'.encode('utf-8')
'Cata\xC3\xB1o'
>>> urllib.quote(_)
'Cata%C3%B1o'

>>> urllib.unquote(_)
'Cata\xC3\xB1o'
>>> _.decode('utf-8')
u'Cata\xF1o'
山田美奈子 2024-11-06 05:16:42

"Encoding the value to UTF8 also does not work": the result of your code is a str object which, at a guess, is the input encoded in UTF-8. You need to decode it, or else define "does not work". What do you expect?

Note: So that we don't need to guess the encoding of your terminal and the type of your data, use print repr(whatever) instead of print whatever.

>>> # Python 2.6.6
... from urllib import quote, unquote
>>> s = u"Cata\xf1o"
>>> q = quote(s.encode('utf8'))
>>> u = unquote(q).decode('utf8')
>>> for x in (s, q, u):
...     print repr(x)
...
u'Cata\xf1o'
'Cata%C3%B1o'
u'Cata\xf1o'
>>>

For comparison:

>>> # Python 3.2
... from urllib.parse import quote, unquote
>>> s = "Cata\xf1o"
>>> q = quote(s)
>>> u = unquote(q)
>>> for x in (s, q, u):
...     print(ascii(x))
...
'Cata\xf1o'
'Cata%C3%B1o'
'Cata\xf1o'
>>>
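
The Python 2 recipe in the first session above (encode to UTF-8, quote, unquote, decode) can be wrapped into the pair of helpers the question asks for. This is a minimal sketch, assuming UTF-8 is the encoding you want. The names encode_url and decode_url come from the question. The version switch is an addition so that the same sketch also runs on Python 3.

```python
import sys

if sys.version_info[0] == 2:
    from urllib import quote as _quote, unquote as _unquote_bytes
else:
    # On Python 3, unquote_to_bytes gives the same bytes-in, bytes-out
    # behaviour that urllib.unquote has for byte strings on Python 2.
    from urllib.parse import quote as _quote
    from urllib.parse import unquote_to_bytes as _unquote_bytes


def encode_url(text, safe='/'):
    """Percent-encode a unicode string via its UTF-8 bytes."""
    return _quote(text.encode('utf-8'), safe)


def decode_url(quoted):
    """Reverse encode_url: unquote to bytes, then decode them as UTF-8."""
    if not isinstance(quoted, bytes):
        # A percent-encoded string is pure ASCII, so this cannot lose data.
        quoted = quoted.encode('ascii')
    return _unquote_bytes(quoted).decode('utf-8')


original = u'Cata\u00f1o'
print(repr(encode_url(original)))
print(decode_url(encode_url(original)) == original)
```

Because the percent-encoded form is plain ASCII, it can be stored or sent as either a byte string or a unicode string without changing what decode_url returns.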
∞觅青森が 2024-11-06 05:16:42

I ran into the same problem and used a helper function so that urllib.urlencode (which quotes each key and value) can handle non-ASCII input:

import urllib

def utf8_urlencode(params):
    # urllib.urlencode(params.items()) is not unicode-safe in Python 2,
    # so UTF-8 encode every unicode key and value first.
    # To reverse a single value: urllib.unquote_plus(s).decode('utf-8')
    encoded = {}
    for k, v in params.items():
        if isinstance(k, unicode):
            k = k.encode('utf-8')
        if isinstance(v, unicode):
            v = v.encode('utf-8')
        # Numbers and byte strings pass through; urlencode calls str() on them.
        encoded[k] = v
    # The quoted result is pure ASCII, so decoding it yields a unicode string.
    return urllib.urlencode(encoded).decode('utf-8')

Adapted from Unicode URL encode / decode with Python
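
For the reverse direction, here is a minimal sketch of a query-string round trip. It is not the code from the linked post. The helper names to_utf8, from_utf8, encode_query and decode_query are hypothetical, and the import switch is there so the sketch runs on both Python 2 and Python 3. It assumes UTF-8 throughout and builds a new list of pairs instead of modifying the caller's dictionary.

```python
try:                                    # Python 2
    from urllib import urlencode
    from urlparse import parse_qsl
except ImportError:                     # Python 3
    from urllib.parse import urlencode, parse_qsl

TEXT = type(u'')  # unicode on Python 2, str on Python 3


def to_utf8(value):
    """UTF-8 encode text, stringify numbers, pass byte strings through."""
    if not isinstance(value, (bytes, TEXT)):
        value = str(value)
    return value.encode('utf-8') if isinstance(value, TEXT) else value


def from_utf8(value):
    """Decode byte strings as UTF-8; text is returned unchanged."""
    return value.decode('utf-8') if isinstance(value, bytes) else value


def encode_query(params):
    """Build a query string from a dict without modifying the dict."""
    pairs = [(to_utf8(k), to_utf8(v)) for k, v in sorted(params.items())]
    return urlencode(pairs)


def decode_query(query):
    """Parse a query string back into a dict of text keys and values."""
    return dict((from_utf8(k), from_utf8(v)) for k, v in parse_qsl(query))


params = {u'artist': u'Cata\u00f1o', u'page': 2}
query = encode_query(params)
print(query)
print(repr(decode_query(query)))
```

Note that numbers come back as text after the round trip, since a query string carries no type information.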

权谋诡计 2024-11-06 05:16:42

I had the same problem: I wanted to put query parameters in a URL, but some of them contained unusual characters (diacritics).

Dealing with encodings gave a messy URL and was fragile.

My solution was to replace every accented or otherwise non-ASCII character with its closest ASCII equivalent. That is straightforward thanks to unidecode: What is the best way to remove accents in a Python unicode string?

pip install unidecode

then

from unidecode import unidecode
print unidecode(u"éèê") 
# prints eee

That gives me a clean URL. It also works for Chinese, etc.
