Combining a URL with urlunparse
I'm writing something to 'clean' a URL. In this case all I'm trying to do is return a faked scheme, as urlopen won't work without one. However, if I test this with www.python.org, it returns http:///www.python.org. Does anyone know why the extra / appears, and is there a way to return the URL without it?

    def FixScheme(website):
        # Python 2 module; in Python 3 these functions live in urllib.parse
        from urlparse import urlparse, urlunparse
        scheme, netloc, path, params, query, fragment = urlparse(website)
        if scheme == '':
            return urlunparse(('http', netloc, path, params, query, fragment))
        else:
            return website
Comments (2)
The problem is that in parsing the very incomplete URL www.python.org, the string you give is actually taken as the path component of the URL, with the netloc (network location) being empty as well as the scheme. For defaulting the scheme you can pass a second parameter, scheme, to urlparse (simplifying your logic), but that doesn't help with the "empty netloc" problem. So you need some logic of your own for that case.
It's because urlparse is interpreting "www.python.org" not as the hostname (netloc), but as the path, just as a browser would if it encountered that string in an href attribute. Then urlunparse seems to interpret scheme "http" specially. If you put in "x" as the scheme, you'll get "x:www.python.org".
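The parsing behaviour described above can be checked directly. The snippet below is a small sketch that assumes Python 3's urllib.parse in place of the Python 2 urlparse module used in the question; it only inspects the parsed components and does not reassemble them.

```python
from urllib.parse import urlparse

# Without a scheme or a leading "//", the whole string is parsed as the path.
bare = urlparse("www.python.org")
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))

# With a scheme, the same host name lands in netloc and the path is empty.
full = urlparse("http://www.python.org")
print(repr(full.scheme), repr(full.netloc), repr(full.path))
```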
I don't know what range of inputs you're dealing with, but it looks like you might not want urlparse and urlunparse.
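If the inputs really are just host names with an optional path, a plain string check may be enough. The following is a minimal sketch under that assumption; the function name add_default_scheme and the regular expression are illustrative choices, not something given in the thread.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "." or "-", then "://".
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def add_default_scheme(website, default_scheme="http"):
    """Prepend default_scheme unless website already starts with a scheme."""
    if _SCHEME_RE.match(website):
        return website
    # Drop any leading slashes so '//host' does not become 'http:////host'.
    return default_scheme + "://" + website.lstrip("/")


print(add_default_scheme("www.python.org"))          # http://www.python.org
print(add_default_scheme("https://www.python.org"))  # https://www.python.org
```

Anchoring the check at the start of the string means a "://" that appears later, for example inside a query parameter, is not mistaken for a scheme.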