Putting a URL together with urlunparse

Posted 2024-09-24 15:58:36

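I'm writing something to "clean" a URL. In this case, all I'm trying to do is return a faked scheme, since urlopen won't work without one. However, if I test this with www.python.org, it returns http:///www.python.org. Does anyone know why the extra / appears, and is there a way to return the URL without it?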

def FixScheme(website):

   from urlparse import urlparse, urlunparse

   scheme, netloc, path, params, query, fragment = urlparse(website)

   if scheme == '':
       return urlunparse(('http', netloc, path, params, query, fragment))
   else:
       return website

Comments (2)

爱已欠费 2024-10-01 15:58:36

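The problem is that when parsing the very incomplete URL www.python.org, the string you give is actually taken as the path component of the URL, with the netloc (network location) empty, as well as the scheme. To default the scheme, you can pass a second parameter, scheme, to urlparse (simplifying your logic), but that doesn't help with the "empty netloc" problem. So you need some logic for that case, e.g.: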

if not netloc:
    netloc, path = path, ''
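
Putting the two suggestions above together, a minimal sketch might look like the following. It assumes Python 3, where these functions live in urllib.parse (the question's "from urlparse import ..." is the Python 2 spelling). The name fix_scheme and the default_scheme parameter are illustrative and not part of the original post.

```python
from urllib.parse import urlparse, urlunparse


def fix_scheme(website, default_scheme="http"):
    """Return website with a scheme, treating a bare host as the netloc."""
    # Passing default_scheme as the second argument fills in a missing scheme.
    scheme, netloc, path, params, query, fragment = urlparse(website, default_scheme)
    if not netloc:
        # A bare "www.python.org" is parsed as the path, so move it to netloc.
        netloc, path = path, ''
    return urlunparse((scheme, netloc, path, params, query, fragment))


print(fix_scheme("www.python.org"))           # http://www.python.org
print(fix_scheme("https://www.python.org/"))  # https://www.python.org/
```

This only covers the simple case discussed here. Inputs that lack a scheme but carry a port or credentials may be parsed differently and are not handled by this sketch.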

疯到世界奔溃 2024-10-01 15:58:36

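That's because urlparse interprets "www.python.org" not as the hostname (netloc) but as the path, just as a browser would if it encountered that string in an href attribute. urlunparse then seems to treat the scheme "http" specially: if you put in "x" as the scheme, you'll get "x:www.python.org".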
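
A quick way to see this for yourself is to inspect the parsed components directly. The sketch below assumes Python 3 (urllib.parse). The variable names are illustrative. The result for the "http" case is printed without a stated expected value, because the exact string may differ between Python versions.

```python
from urllib.parse import urlparse, urlunparse

# Without a scheme, the whole string ends up in the path component.
parts = urlparse("www.python.org")
print(repr(parts.scheme), repr(parts.netloc), repr(parts.path))
# '' '' 'www.python.org'

# Rebuild with an empty netloc under two different schemes and compare.
with_x = urlunparse(("x", "", "www.python.org", "", "", ""))
with_http = urlunparse(("http", "", "www.python.org", "", "", ""))
print(with_x)     # x:www.python.org
print(with_http)  # output may vary by Python version
```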

I don't know what range of inputs you're dealing with, but it looks like you might not want urlparse and urlunparse.
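
If the inputs are always either full URLs or bare host names, one possible alternative is a plain string check. This is only a sketch under that assumption. The helper name add_default_scheme and the "://" test are illustrative choices, not something the answer spells out.

```python
def add_default_scheme(website, default_scheme="http"):
    """Prefix website with default_scheme unless it already has a scheme."""
    # Assumption: any input containing "://" already carries a scheme.
    if "://" in website:
        return website
    return default_scheme + "://" + website


print(add_default_scheme("www.python.org"))          # http://www.python.org
print(add_default_scheme("https://www.python.org"))  # https://www.python.org
```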
