Python urlparse: small problem
I'm making an app that parses HTML and extracts images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works fine with urllib2.
However, I have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:
>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'
As you can see, urlparse doesn't remove the ../ from the result. This causes a problem when I try to download the image:
HTTPError: HTTP Error 400: Bad Request
Is there a way to fix this problem in urllib?
".." takes you up one directory ("." is the current directory), so combining it with a URL that is already at the domain root doesn't make much sense. Maybe what you need is:
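One possible reading of this suggestion is to drop the leading "../" from the relative path before joining. The sketch below is only an illustration: it assumes Python 3, where the functions of Python 2's urlparse module live in urllib.parse, and the helper name strip_leading_parent_refs is made up for this example. Note that it strips unconditionally, so it is only appropriate when the base URL really is the site root.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def strip_leading_parent_refs(relative):
    """Drop every leading '../' from a relative reference (hypothetical helper)."""
    while relative.startswith("../"):
        relative = relative[3:]  # slice off the three characters of '../'
    return relative


base = "http://www.example.com/"
print(urljoin(base, strip_leading_parent_refs("../test.png")))
# prints http://www.example.com/test.png
```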
I think the best you can do is to pre-parse the original URL, and check the path component. A simple test is
Then you can combine it with the indexing suggested by demas. For example:
This way, you will not attempt to go to the parent of the root URL.
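A minimal sketch of that idea, assuming Python 3's urllib.parse (the Python 2 equivalent is the urlparse module): parse the base URL first, and strip the leading "../" only when its path is empty or "/". The function name join_below_root and the exact form of the test are assumptions made for illustration, not the code from the original answer.

```python
from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import ...


def join_below_root(base, relative):
    """Join base and relative without climbing above the root (hypothetical helper)."""
    # An empty path or "/" means the base URL is already at the site root,
    # so there is no parent directory to go up to.
    if urlparse(base).path in ("", "/"):
        while relative.startswith("../"):
            relative = relative[3:]
    return urljoin(base, relative)


print(join_below_root("http://www.example.com/", "../test.png"))
# prints http://www.example.com/test.png
print(join_below_root("http://www.example.com/images/index.html", "../test.png"))
# prints http://www.example.com/test.png
```

When the base URL has a deeper path, the relative reference is passed to urljoin unchanged, so a legitimate "../" still resolves against the parent directory.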
If you'd like /../test to mean the same as /test, as it does for paths in a file system, then you could use normpath():
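As a rough sketch of how normpath() could be applied here, assuming Python 3's urllib.parse and the standard posixpath module: join the URLs, normalize only the path component, and reassemble the result. The function name urljoin_normalized is hypothetical, and this is one way to do it rather than the original answer's code.

```python
import posixpath
from urllib.parse import urljoin, urlparse, urlunparse


def urljoin_normalized(base, relative):
    """Join two URLs, then collapse '.' and '..' in the path (hypothetical helper)."""
    parts = urlparse(urljoin(base, relative))
    if not parts.path:
        # normpath("") would return ".", so leave an empty path alone.
        return urlunparse(parts)
    # posixpath always uses "/" separators, regardless of the operating system.
    path = posixpath.normpath(parts.path)
    return urlunparse(parts._replace(path=path))


print(urljoin_normalized("http://www.example.com/", "../test.png"))
# prints http://www.example.com/test.png
```

One caveat: normpath() also removes a trailing slash, so a path such as /dir/ becomes /dir, which may matter for some servers.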
Is this what you need?