How to normalize a URL in Python
I'd like to know how to normalize a URL in Python.
For example, if I have a URL string like "http://www.example.com/foo goo/bar.html",
I need a library in Python that will transform the extra space (or any other non-normalized character) into a proper URL.
Valid for Python 3.5:
Example:
The output will be http://www.example.com/foo%20goo/bar.html
Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote
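The example code itself did not survive the scrape; given the stated output, it was presumably along these lines, using urllib.parse.quote with ':' and '/' marked safe so the scheme and path separators are preserved:

```python
from urllib.parse import quote

# Percent-encode unsafe characters (such as the space), but leave
# the ':' and '/' that form the URL's structure untouched.
url = quote("http://www.example.com/foo goo/bar.html", safe=":/")
print(url)  # http://www.example.com/foo%20goo/bar.html
```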
I encountered such a problem: I needed to quote only the space.

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

does help, but it's too complicated. So I used a simpler approach:

url = url.replace(' ', '%20')

It's not perfect, but it's the simplest way, and it works for this situation.
Just FYI, urlnorm has moved to GitHub:
http://gist.github.com/246089
A lot of answers here talk about quoting URLs, not about normalizing them.
The best tool to normalize URLs (for deduplication etc.) in Python, IMO, is w3lib's
w3lib.url.canonicalize_url
util. Taken from the official docs:
I've used this util with great success when broad-crawling the web, to avoid duplicate requests caused by minor URL differences (different parameter order, anchors, etc.).
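The docs excerpt was lost in the scrape. As a rough standard-library sketch of the kind of normalization canonicalize_url performs (sorting query parameters, percent-encoding the path, dropping fragments) — the function name and details here are my own, not w3lib's API:

```python
from urllib.parse import urlsplit, urlunsplit, quote, parse_qsl, urlencode

def canonicalize(url):
    """Rough stdlib approximation of the idea behind w3lib's
    canonicalize_url: sort query parameters, percent-encode the
    path, lowercase the host, and drop the fragment."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Sort query parameters so ?b=2&a=1 and ?a=1&b=2 compare equal
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    # Percent-encode unsafe characters in the path (keep '/' and existing '%')
    path = quote(path, safe="/%") or "/"
    return urlunsplit((scheme, netloc.lower(), path, query, ""))
```

With this, "http://www.Example.com/foo goo/bar.html?b=2&a=1#frag" and "http://www.example.com/foo%20goo/bar.html?a=1&b=2" canonicalize to the same string, which is exactly what deduplication needs.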
Py3
Py2
This quotes only the path component.
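The Py3/Py2 snippets themselves were lost in the scrape; a plausible Python 3 reconstruction that quotes only the path component (the Py2 version would use urlparse and urllib.quote instead of urllib.parse):

```python
from urllib.parse import urlsplit, urlunsplit, quote

def quote_path_only(url):
    # Split the URL, percent-encode only the path, then reassemble;
    # the query string and fragment are left untouched.
    parts = urlsplit(url)
    path = quote(parts.path)  # default safe='/' keeps path separators
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))
```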
Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.
When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.
I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple of projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, as well as all of the aforementioned test cases from the Atom wiki.
Real fix in Python 2.7 for that problem.
The right solution was:
For more information, see Issue 918368: "urllib doesn't correct server returned urls"
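The solution's code block was lost in the scrape; judging by the safe-character list quoted in an earlier answer, it was presumably along these lines (shown here with the Python 3 import path; Python 2.7 imports quote from urllib):

```python
from urllib.parse import quote  # Python 2.7: from urllib import quote

# Mark every character that is legal URL syntax as safe, so only
# genuinely unsafe characters (like the space) get percent-encoded.
url = quote("http://www.example.com/foo goo/bar.html",
            safe="%/:=&?~#+!$,;'@()*[]")
```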
Use urllib.quote or urllib.quote_plus.
From the urllib documentation:
EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:
@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the URL and encode only the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and simply quote the suspect part of the URL, concatenating it with the known-safe parts.
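The second approach described here (quoting only the suspect part and concatenating) might look like this, with hypothetical values standing in for the known-safe base and the untrusted path:

```python
from urllib.parse import quote  # Python 2: from urllib import quote

base = "http://www.example.com"     # known-safe scheme + host (assumed)
suspect_path = "/foo goo/bar.html"  # untrusted path from elsewhere
url = base + quote(suspect_path)    # only the suspect part is quoted
```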
Have a look at this module: werkzeug.utils (now in werkzeug.urls).
The function you are looking for is called "url_fix" and works like this:
It's implemented in Werkzeug as follows:
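The implementation snippet was lost in the scrape; a sketch in the spirit of Werkzeug's url_fix, written against the Python 3 standard library (the historical Werkzeug source differs in detail):

```python
from urllib.parse import urlsplit, urlunsplit, quote, quote_plus

def url_fix(s):
    """Quote the path and query string of a URL that may contain
    unsafe characters such as spaces, leaving its structure intact."""
    scheme, netloc, path, qs, anchor = urlsplit(s)
    path = quote(path, safe="/%")    # keep slashes and existing escapes
    qs = quote_plus(qs, safe=":&=")  # keep query-string syntax characters
    return urlunsplit((scheme, netloc, path, qs, anchor))

url_fix("http://www.example.com/foo goo/bar.html")
# -> 'http://www.example.com/foo%20goo/bar.html'
```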