Python urlparse: a small problem

Posted on 2024-09-30 14:17:54

I'm making an app that parses HTML and gets images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works fine with urllib2.

I do have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't remove the ../ from the result. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

和我恋爱吧 2024-10-07 14:17:54

".." takes you up one directory ("." is the current directory), so combining it with a URL that consists only of a domain name doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
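
To illustrate the point about directories, here is a minimal sketch of how "../" and "./" resolve when the base URL does have a directory to work with. It assumes Python 3, where the old urlparse module lives at urllib.parse; the base URLs are made up for the example.

```python
from urllib.parse import urljoin  # Python 3 location of the old urlparse module

# Hypothetical base URLs, chosen only for illustration.
deep_dir = "http://www.example.com/images/gallery/"
page = "http://www.example.com/images/page.html"

# "../" climbs out of gallery/ into images/.
up_one = urljoin(deep_dir, "../test.png")

# "./" stays in the directory that contains page.html.
same_dir = urljoin(page, "./test.png")

print(up_one)
print(same_dir)
```

Both calls end up at http://www.example.com/images/test.png, because in each case there is a real directory for the relative reference to be resolved against.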

夏九 2024-10-07 14:17:54

I think the best you can do is pre-parse the original URL and check its path component. A simple test is:

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

# Drop the leading ".." only when the base URL has no directory to climb out of.
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
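
The same idea can be generalized to references with more than one leading "../". The following is only a sketch under some assumptions: it is written for Python 3 (urllib.parse instead of urlparse), the helper name join_clamped is hypothetical, and it only looks at "../" segments at the very start of the reference.

```python
from urllib.parse import urljoin, urlparse


def join_clamped(base_url, ref):
    """Join ref onto base_url, dropping leading "../" segments
    that would climb above the root of the site (hypothetical helper)."""
    # Number of directories in the base path: "/a/b/page.html" -> 2, "/" -> 0.
    base_path = urlparse(base_url).path
    depth = max(base_path.count("/") - 1, 0)

    # Count and strip the leading "../" segments of the reference.
    ups = 0
    rest = ref
    while rest.startswith("../"):
        ups += 1
        rest = rest[3:]

    # Keep only as many "../" as there are directories to climb out of.
    keep = min(ups, depth)
    return urljoin(base_url, "../" * keep + rest)


print(join_clamped("http://www.example.com/", "../test.png"))
print(join_clamped("http://www.example.com/images/page.html", "../../test.png"))
```

Clamping instead of failing is a design choice here: it mirrors how a browser treats a "../" that points above the root, but it may hide a genuinely wrong link in the source page.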

咋地 2024-10-07 14:17:54

If you'd like /../test to mean the same as /test, as it would for paths in a file system, you could use normpath():

>>> import posixpath
>>> import urlparse
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
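
The normpath() approach can be wrapped in a small helper. This is a sketch rather than a definitive implementation: it assumes Python 3 (urllib.parse), the name normalize_url is hypothetical, and it patches two normpath() behaviours that suit file paths better than URLs, namely that it drops a trailing slash and keeps exactly two leading slashes. Note also that on Python 3.5 and later urljoin itself follows RFC 3986 and already discards a "../" that would climb above the root, so a helper like this mainly matters for URLs that arrive with dot segments already in them, or for older interpreters.

```python
import posixpath
from urllib.parse import urlparse, urlunparse


def normalize_url(url):
    """Collapse "." and ".." segments in the path of url (hypothetical helper)."""
    p = urlparse(url)
    # normpath("") would return ".", so leave an empty path alone.
    path = posixpath.normpath(p.path) if p.path else ""

    # normpath keeps exactly two leading slashes (a POSIX rule); collapse them.
    if path.startswith("//"):
        path = "/" + path.lstrip("/")

    # normpath drops a trailing slash, which is significant in a URL.
    if p.path.endswith("/") and not path.endswith("/"):
        path += "/"

    return urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))


print(normalize_url("http://example.com/../test"))
print(normalize_url("http://example.com/a/./b/../c/?x=1#f"))
```

The query string and fragment are passed through untouched; only the path component is normalized.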

记忆之渊 2024-10-07 14:17:54

urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?
