How can I prepend "http://" to a URL when the scheme is missing?
I need to parse a URL. I'm currently using urlparse.urlparse() and urlparse.urlsplit().
The problem is that I can't get the netloc (host) from the URL when the scheme is missing.
For example, given the following URL:
www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1
I can't get the netloc: www.amazon.com
According to the Python docs:
"Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by '//'. Otherwise the input is presumed to be a relative URL and thus to start with a path component."
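The behavior described in that passage is easy to reproduce. The following is a minimal sketch, assuming Python 3, where the module lives at `urllib.parse` (on Python 2 the import would be `import urlparse`). The shortened `bare` URL is only a stand-in for the Amazon link above.

```python
from urllib.parse import urlparse

# Shortened stand-in for the URL in the question (assumption for illustration).
bare = "www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106"

# Without a leading '//' the whole string is treated as a relative path.
no_slashes = urlparse(bare)
print(repr(no_slashes.netloc))
print(no_slashes.path)

# With '//' in front, the host is recognized as the netloc.
with_slashes = urlparse("//" + bare)
print(with_slashes.netloc)
print(with_slashes.path)
```

In the first call the host ends up inside `path`; in the second it is split out into `netloc`.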
So this behavior is intentional. But I still don't know how to get the netloc from that URL.
I think I could check whether the scheme is present, add it if it's not, and then parse the result. But this solution doesn't seem very good.
Do you have a better idea?
EDIT:
Thanks for all the answers. However, I can't use the "startswith" approach proposed by Corey and others, because a URL with a different protocol/scheme would get mangled. For example,
if I get this URL:
ftp://something.com
With the proposed code I would add "http://" to the start and break it.
The solution I found:
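import urlparse  # Python 2 module; in Python 3 this is urllib.parse

def parse_with_default_scheme(url):  # wrapper name is illustrative
    # Prepend "http://" only when the URL carries no scheme at all.
    if not urlparse.urlparse(url).scheme:
        url = "http://" + url
    return urlparse.urlparse(url)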
Something to note:
I do some validation first, and if no scheme is given I assume it is http://
Comments (6)
It looks like you need to specify the protocol to get the netloc.
Adding it if it's not present might look like this:
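A minimal sketch of that idea, assuming Python 3's `urllib.parse`. The helper name `ensure_scheme` and its `default` parameter are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlparse

def ensure_scheme(url, default="http"):
    """Return url unchanged if it already has a scheme, else prefix default://."""
    if urlparse(url).scheme:
        return url
    return default + "://" + url

parsed = urlparse(ensure_scheme("www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106"))
print(parsed.netloc)

# A URL that already carries a scheme is left alone.
print(ensure_scheme("ftp://something.com"))
```

One caveat: scheme-less input that contains a port, such as `host:8080/path`, can be misread as having a scheme depending on the Python version, so it may need separate handling.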
More about the issue: https://bugs.python.org/issue754016
The documentation has this exact example, just below the text you pasted. Adding '//' if it's not there will get what you want. If you don't know whether it'll have the protocol and '//' you can use a regex (or even just see if it already contains '//') to determine whether or not you need to add it.
Your other option would be to use split('/') and take the first element of the list it returns, which will ONLY work when the URL has no protocol or '//'.
EDIT (adding for future readers): a regex for detecting the protocol would be something like
re.match('(?:http|ftp|https)://', url)
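Putting the regex check and the '//' prefix together could look roughly like the sketch below. It assumes Python 3's `urllib.parse`; the names `SCHEME_RE` and `netloc_of` are hypothetical, and the pattern only knows the three schemes listed in it.

```python
import re
from urllib.parse import urlparse

# Same pattern as above; extend the alternation to cover other schemes.
SCHEME_RE = re.compile(r"(?:http|ftp|https)://")

def netloc_of(url):
    """Return the netloc, adding '//' first when no known scheme is present."""
    if not SCHEME_RE.match(url) and not url.startswith("//"):
        url = "//" + url
    return urlparse(url).netloc

print(netloc_of("www.amazon.com/dp/0596158106"))
print(netloc_of("ftp://something.com"))
```

Because only '//' is added, no scheme is guessed on the caller's behalf; the parsed result simply has an empty scheme for bare input.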
If the protocol is always http, a single line is enough:
A better option is to keep the protocol if one was passed:
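The two variants might be sketched as follows, assuming Python 3's `urllib.parse`. The sample `url` and the helper `parse_keeping_scheme` are hypothetical names used for illustration, not code from the answer.

```python
from urllib.parse import urlparse

url = "www.amazon.com/dp/0596158106"

# Variant 1: the input never carries a scheme, so always prepend http://.
always_http = urlparse("http://" + url)

# Variant 2: keep an existing scheme and fall back to a default otherwise.
def parse_keeping_scheme(raw, default="http"):
    # Simple substring test; it would be fooled by "://" inside a query string.
    if "://" not in raw:
        raw = default + "://" + raw
    return urlparse(raw)

print(always_http.netloc)
print(parse_keeping_scheme("ftp://something.com").scheme)
print(parse_keeping_scheme(url).geturl())
```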
From the docs:
So you can just do:
Have you considered just checking for the presence of "http://" at the start of the URL, and adding it if it's not there? Another solution, assuming the first part really is the netloc and not part of a relative URL, is to grab everything up to the first "/" and use that as the netloc.
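The second suggestion, taking everything before the first "/", can be sketched in plain Python with no parsing library. The function name `host_before_first_slash` is hypothetical.

```python
def host_before_first_slash(url):
    """Return everything before the first '/'; only valid for scheme-less input."""
    host, _, _ = url.partition("/")
    return host

print(host_before_first_slash("www.amazon.com/dp/0596158106?ie=UTF8"))

# With a scheme present, the first '/' belongs to '://', so the result is wrong.
print(host_before_first_slash("ftp://something.com"))
```

The second call shows why this only works under the stated assumption: it returns the scheme fragment rather than the host.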
This one-liner would do it.