How can I prepend "http://" to a URL when it has no scheme?

Posted 2024-11-15 09:18:57 | 985 characters | 7 views | 0 comments

I need to parse a URL. I'm currently using urlparse.urlparse() and urlparse.urlsplit().

The problem is that I can't get the "netloc" (host) from the URL when the scheme is not present. For example, given the following URL:

www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1

I can't get the netloc: www.amazon.com

According to the Python docs:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
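
The behaviour described in that passage is easy to confirm. A minimal sketch, assuming Python 3, where the functions from the urlparse module used in this thread live in urllib.parse; the shortened URL is only an illustration:

```python
from urllib.parse import urlparse

# Without a leading "//" the whole string is treated as a relative path.
bare = urlparse("www.amazon.com/Programming-Python-Mark-Lutz")
print(repr(bare.netloc))  # ''
print(bare.path)          # www.amazon.com/Programming-Python-Mark-Lutz

# With "//" in front, the host is recognized as the netloc.
marked = urlparse("//www.amazon.com/Programming-Python-Mark-Lutz")
print(marked.netloc)      # www.amazon.com
print(marked.path)        # /Programming-Python-Mark-Lutz
```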

So it works this way on purpose. But I still don't know how to get the netloc from that URL.

I think I could check whether the scheme is present, add it if it is not, and then parse the result. But this solution doesn't seem very good.

Do you have a better idea?

EDIT:
Thanks for all the answers. But I cannot do the "startswith" thing proposed by Corey and others, because if I get a URL with another protocol/scheme I would mess it up. See:

If I get this URL:

ftp://something.com

With the proposed code I would add "http://" to the start and mess it up.

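The solution I found:

import urlparse  # Python 2 module; renamed urllib.parse in Python 3

def parse_url(url):  # wrapper name is illustrative
    if not urlparse.urlparse(url).scheme:
        url = "http://" + url
    return urlparse.urlparse(url)

Something to note:

I do some validation first, and if no scheme is given I treat the URL as http://.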
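
For readers on Python 3, the same idea can be sketched with urllib.parse. The function name parse_with_default_scheme and its default_scheme parameter are assumptions made for this illustration, not part of the original post:

```python
from urllib.parse import urlparse

def parse_with_default_scheme(url, default_scheme="http"):
    """Parse url, assuming default_scheme when the URL carries no scheme."""
    if not urlparse(url).scheme:
        url = default_scheme + "://" + url
    return urlparse(url)

# A scheme-less URL gets the default scheme, so the host lands in netloc.
print(parse_with_default_scheme("www.amazon.com/Programming-Python-Mark-Lutz").netloc)
# An existing scheme such as ftp is left untouched.
print(parse_with_default_scheme("ftp://something.com").scheme)
```

Input with a colon before the first slash, such as a host:port pair, can confuse the scheme check, which is one reason to validate the input first as described above.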

人间不值得 2024-11-22 09:18:57

It looks like you need to specify the protocol to get the netloc.

Adding it when it's not present might look like this:

import urlparse

url = 'www.amazon.com/Programming-Python-Mark-Lutz'
if '//' not in url:
    url = '%s%s' % ('http://', url)
p = urlparse.urlparse(url)
print p.netloc

More about the issue: https://bugs.python.org/issue754016
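
One thing to watch with a check like '//' not in url: it looks at the whole string, so a scheme-less URL whose query string happens to contain "//" is skipped. A rough sketch of a slightly narrower check, assuming Python 3; the helper name add_http_if_missing and the example.com URLs are made up for illustration:

```python
from urllib.parse import urlparse

def add_http_if_missing(url):
    """Prepend http:// unless "//" appears before any query string."""
    before_query = url.split("?", 1)[0]
    if "//" not in before_query:
        url = "http://" + url
    return url

url = "www.example.com/redirect?to=http://other.example"
print("//" in url)                                # True, so the plain check would skip it
print(urlparse(add_http_if_missing(url)).netloc)  # www.example.com
```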

铃予 2024-11-22 09:18:57

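The documentation has this exact example, just below the text you pasted. Adding '//' when it is not there will get you what you want. If you don't know whether the URL will have the protocol and '//', you can use a regex (or even just check whether it already contains '//') to decide whether you need to add it.

Your other option is to use split('/') and take the first element of the list it returns, which ONLY works when the URL has no protocol or '//'.

EDIT (added for future readers): a regex for detecting the protocol would be something like re.match('(?:http|ftp|https)://', url)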
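
If the set of schemes is not known in advance, the regex can be generalized. The sketch below follows the scheme syntax from RFC 3986 (a letter followed by letters, digits, "+", "-" or "."); the names SCHEME_RE and has_scheme are assumptions for this example:

```python
import re

# Scheme syntax from RFC 3986: a letter, then letters, digits, "+", "-" or ".".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")

def has_scheme(url):
    """Return True when url starts with something shaped like 'scheme://'."""
    return SCHEME_RE.match(url) is not None

print(has_scheme("ftp://something.com"))              # True
print(has_scheme("www.amazon.com/Programming-Python")) # False
```

Because re.match anchors at the start of the string, a "://" that only appears later, for example inside a query string, does not count as a scheme.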

只想待在家 2024-11-22 09:18:57

If the protocol is always http, a single line is enough:

return "http://" + url.split("://")[-1]

A better option is to keep the protocol if one was passed:

return url if "://" in url else "http://" + url
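
To make the difference between the two one-liners concrete, here they are wrapped in functions. The names force_http and keep_or_default are invented for this sketch:

```python
def force_http(url):
    """Drop whatever precedes the last '://' and always use http."""
    return "http://" + url.split("://")[-1]

def keep_or_default(url):
    """Keep an existing scheme; fall back to http only when none is present."""
    return url if "://" in url else "http://" + url

print(force_http("ftp://something.com"))       # http://something.com
print(keep_or_default("ftp://something.com"))  # ftp://something.com
print(keep_or_default("www.amazon.com"))       # http://www.amazon.com
```

The first version rewrites ftp:// to http://, which is exactly the case the question's edit wants to avoid; the second leaves it alone.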

荆棘i 2024-11-22 09:18:57

From the docs:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

So you can just do:

In [1]: from urlparse import urlparse

In [2]: def get_netloc(u):
   ...:     if not u.startswith('http'):
   ...:         u = '//' + u
   ...:     return urlparse(u).netloc
   ...: 

In [3]: get_netloc('www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[3]: 'www.amazon.com'

In [4]: get_netloc('http://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[4]: 'www.amazon.com'

In [5]: get_netloc('https://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[5]: 'www.amazon.com'
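
Note that startswith('http') has two blind spots: a scheme-less host that itself begins with "http" (httpbin.org/get, say) is passed through unchanged and yields an empty netloc, and a URL with another scheme such as ftp://something.com gets '//' put in front of its scheme. A possible variant, assuming Python 3 and assuming that a "://" anywhere in the input marks a scheme:

```python
from urllib.parse import urlparse

def get_netloc(u):
    """Return the netloc, adding '//' only when no 'scheme://' is present."""
    if "://" not in u:
        u = "//" + u
    return urlparse(u).netloc

print(get_netloc("httpbin.org/get"))      # httpbin.org
print(get_netloc("ftp://something.com"))  # something.com
```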

逆流 2024-11-22 09:18:57

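Have you considered just checking for "http://" at the start of the URL and adding it if it's not there? Another solution, assuming the first part really is the netloc and not part of a relative URL, is to grab everything up to the first "/" and use that as the netloc.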
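
The second suggestion can be sketched in a couple of lines. The function name host_before_first_slash is hypothetical, and the approach only holds under the stated assumption that the input has no scheme and no leading "//":

```python
def host_before_first_slash(url):
    """Return everything before the first '/'; assumes a scheme-less URL."""
    return url.split("/", 1)[0]

print(host_before_first_slash("www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106"))
# www.amazon.com

# With a scheme present the assumption breaks and the result is not a host.
print(host_before_first_slash("http://www.amazon.com/dp/0596158106"))
# http:
```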

能否归途做我良人 2024-11-22 09:18:57

This one-liner would do it.

netloc = urlparse('//' + ''.join(urlparse(url)[1:])).netloc
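
The trick here is that the first parse is used only to drop the scheme: the remaining components are glued back together, '//' is put in front, and the second parse then sees the host as the netloc. A Python 3 rendering, with netloc_of as a made-up wrapper name:

```python
from urllib.parse import urlparse

def netloc_of(url):
    """Extract the host whether or not url starts with a scheme."""
    # urlparse(url)[1:] is every component except the scheme.
    return urlparse("//" + "".join(urlparse(url)[1:])).netloc

print(netloc_of("www.amazon.com/Programming-Python-Mark-Lutz"))  # www.amazon.com
print(netloc_of("ftp://something.com"))                          # something.com
```

The joined string is not a valid URL (the "?" and "#" separators are lost), so this is only suitable for pulling out the netloc, not for rebuilding the address.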

About the author: 放赐
