How do I normalize a URL in Python?

Posted on 2024-07-05


I'd like to know how to normalize a URL in Python.

For example, if I have a URL string like "http://www.example.com/foo goo/bar.html",

I need a library in Python that will transform the extra space (or any other non-normalized character) into a proper URL.


Comments (9)

深府石板幽径 2024-07-12 03:08:57


Valid for Python 3.5+:

import urllib.parse

urllib.parse.quote(your_url, safe="/_-:.")

Example:

import urllib.parse

print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", safe="/_-:."))

The output will be http://www.example.com/foo%20goo/bar.html

Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote
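As a quick check of why the extra safe characters matter: `quote()`'s default safe set is only `"/"`, so without them the scheme separator would be percent-encoded too. A minimal Python 3 sketch:

```python
import urllib.parse

url = "http://www.example.com/foo goo/bar.html"

# With the default safe="/", the ":" after the scheme is encoded as %3A,
# which breaks the URL; adding ":" to the safe set keeps it intact.
print(urllib.parse.quote(url))             # http%3A//www.example.com/foo%20goo/bar.html
print(urllib.parse.quote(url, safe="/:"))  # http://www.example.com/foo%20goo/bar.html
```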

橘亓 2024-07-12 03:08:57


I encountered such a problem: I needed to quote only the space.

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]") does help, but it's too complicated.

So I used a simple approach: url = url.replace(' ', '%20'). It's not perfect, but it's the simplest way and it works for this situation.
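For this particular input the two approaches agree; a small stdlib-only comparison (the expected string is just this question's example output):

```python
from urllib.parse import quote

url = "http://www.example.com/foo goo/bar.html"

# The simple replacement handles only spaces.
simple = url.replace(' ', '%20')

# quote() with an explicit safe set yields the same result here, and would
# also percent-encode other unsafe characters if the URL contained any.
quoted = quote(url, safe="%/:=&?~#+!$,;'@()*[]")

assert simple == quoted == "http://www.example.com/foo%20goo/bar.html"
```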

楠木可依 2024-07-12 03:08:57


Just FYI, urlnorm has moved to github:
http://gist.github.com/246089

墨离汐 2024-07-12 03:08:57


A lot of answers here talk about quoting URLs, not about normalizing them.

The best tool to normalize urls (for deduplication etc.) in Python IMO is w3lib's w3lib.url.canonicalize_url util.

Taken from the official docs:

Canonicalize the given url by applying the following procedures:

 - sort query arguments, first by key, then by value
 - percent encode paths; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
 - percent encode query arguments; non-ASCII characters are percent-encoded using the passed encoding (UTF-8 by default)
 - normalize all spaces (in query arguments) to '+' (plus symbol)
 - normalize percent encodings case (%2f -> %2F)
 - remove query arguments with blank values (unless keep_blank_values is True)
 - remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).

>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'

I've used this util with great success when broad-crawling the web, to avoid duplicate requests caused by minor URL differences (different parameter order, anchors, etc.).
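If adding w3lib as a dependency is not an option, the query-sorting step of that canonicalization can be sketched with the standard library alone (a simplified illustration of one procedure, not a substitute for canonicalize_url):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def sort_query(url):
    """Sort query arguments by key, then by value (one canonicalization step)."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))

print(sort_query('http://www.example.com/do?c=3&b=5&b=2&a=50'))
# http://www.example.com/do?a=50&b=2&b=5&c=3
```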

萌辣 2024-07-12 03:08:57


Py3

from urllib.parse import urlparse, urlunparse, quote
def myquote(url):
    parts = urlparse(url)
    return urlunparse(parts._replace(path=quote(parts.path)))

>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/~user/with%20space/index.html?a=1&b=2'

Py2

import urlparse, urllib
def myquote(url):
    parts = urlparse.urlparse(url)
    return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])

>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/%7Euser/with%20space/index.html?a=1&b=2'

This quotes only the path component.
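A hypothetical Python 3 extension of myquote() that also encodes the query string (the safe sets here are a judgment call, not part of the original answer):

```python
from urllib.parse import urlparse, urlunparse, quote, quote_plus

def myquote_full(url):
    # Quote the path, and quote the query with quote_plus so spaces become
    # '+'; ':', '&', and '=' are kept safe so key=value pairs stay intact.
    parts = urlparse(url)
    return urlunparse(parts._replace(
        path=quote(parts.path),
        query=quote_plus(parts.query, safe=':&=')))

print(myquote_full('https://www.example.com/~user/with space/index.html?q=a b&x=1'))
# https://www.example.com/~user/with%20space/index.html?q=a+b&x=1
```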

A君 2024-07-12 03:08:57


Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.

When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.

I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, and all of the aforementioned test cases from the Atom wiki.

逆蝶 2024-07-12 03:08:57


A real fix in Python 2.7 for this problem.

The right solution was:

 # percent encode url, fixing lame server errors, e.g. a space
 # within url paths.
 fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

For more information, see Issue918368: "urllib doesn't correct server returned urls".

清引 2024-07-12 03:08:57


Use urllib.quote or urllib.quote_plus.

From the urllib documentation:

quote(string[, safe])

Replace special characters in string
using the "%xx" escape. Letters,
digits, and the characters "_.-" are
never quoted. The optional safe
parameter specifies additional
characters that should not be quoted
-- its default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.

quote_plus(string[, safe])

Like quote(), but also replaces spaces
by plus signs, as required for quoting
HTML form values. Plus signs in the
original string are escaped unless
they are included in safe. It also
does not have safe default to '/'.

EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:

>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python25\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "c:\python25\lib\urllib2.py", line 373, in open
    protocol = req.get_type()
  File "c:\python25\lib\urllib2.py", line 244, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html

@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the url and only encode the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and just quote the suspect part of the URL, concatenating with known safe parts.

苏辞 2024-07-12 03:08:57


Have a look at this module: werkzeug.utils. (now in werkzeug.urls)

The function you are looking for is called "url_fix" and works like this:

>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

It's implemented in Werkzeug as follows:

import urllib
import urlparse

def url_fix(s, charset='utf-8'):
    """Sometimes you get an URL by a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.  This
    function can fix some of the problems in a similar way browsers
    handle data entered by the user:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

    :param charset: The target charset for the URL if the url was
                    given as unicode string.
    """
    if isinstance(s, unicode):
        s = s.encode(charset, 'ignore')
    scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
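The snippet above is Python 2; a rough Python 3 adaptation using urllib.parse (an approximation of this old snippet, not Werkzeug's current implementation):

```python
from urllib.parse import urlsplit, urlunsplit, quote, quote_plus

def url_fix(s):
    """Percent-encode unsafe characters in the path and query of a URL.

    Rough Python 3 port of the old Werkzeug snippet; quote() encodes
    non-ASCII characters as UTF-8 by default, so no explicit str.encode
    step is needed.
    """
    scheme, netloc, path, qs, anchor = urlsplit(s)
    path = quote(path, safe='/%')
    qs = quote_plus(qs, safe=':&=')
    return urlunsplit((scheme, netloc, path, qs, anchor))

print(url_fix('http://de.wikipedia.org/wiki/Elf (Begriffsklärung)'))
# http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29
```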