Python urlparse: small problem

Posted on 2024-09-30 14:17:54

I'm making an app that parses HTML and gets images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images with urllib2 works too.

I do have a problem using urlparse to make absolute URLs out of relative ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't take the ../ away. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

Comments (4)

和我恋爱吧 2024-10-07 14:17:54

".." would bring you up one directory ("." is current directory), so combining that with a domain name url doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
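
To illustrate how "." and ".." resolve when the base URL does have directories to climb, here is a small sketch. It is written against Python 3's urllib.parse (the Python 3 home of urlparse), and the base URL with its /a/b/ path is an assumption made up for the example, not something from the question.

```python
from urllib.parse import urljoin

# Hypothetical base URL with two directory levels, chosen for illustration.
base = "http://www.example.com/a/b/page.html"

# "./" resolves relative to the directory containing page.html.
same_dir = urljoin(base, "./test.png")

# "../" climbs one directory above it.
parent_dir = urljoin(base, "../test.png")

print(same_dir)    # http://www.example.com/a/b/test.png
print(parent_dir)  # http://www.example.com/a/test.png
```

With a base that is only the site root, there is no parent directory left for ".." to climb to, which is why the original example behaves oddly.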
夏九 2024-10-07 14:17:54

I think the best you can do is to pre-parse the original URL, and check the path component. A simple test is

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

# Skip the leading ".." only when the base URL has no path beyond the root.
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
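
The same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: it targets Python 3's urllib.parse, the function name safe_urljoin is made up for this example, and unlike the one-shot slice above it strips every leading "../" segment (not just the first) when the base URL sits at the root.

```python
from urllib.parse import urljoin, urlparse


def safe_urljoin(base_url, relative_url):
    """Join two URLs, ignoring leading '../' segments if the base is at the root.

    Hypothetical helper for illustration; not part of urllib.
    """
    base_path = urlparse(base_url).path
    if len(base_path) <= 1:  # path is "" or "/", so there is no parent to climb to
        while relative_url.startswith("../"):
            relative_url = relative_url[3:]
    return urljoin(base_url, relative_url)


print(safe_urljoin("http://www.example.com/", "../test.png"))
print(safe_urljoin("http://www.example.com/a/b/", "../test.png"))
```

When the base URL has a real path, the relative URL is passed through untouched, so ordinary "../" references keep working.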

咋地 2024-10-07 14:17:54

If you'd like /../test to mean the same as /test, as it does for paths in a file system, you could use normpath():

>>> import posixpath
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query,p.fragment))
'http://example.com/test'
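
A reusable version of this approach might look like the sketch below. It assumes Python 3 (urllib.parse instead of urlparse), and join_and_normalize is a made-up name for the example. One caveat it tries to handle: posixpath.normpath drops a trailing slash, which can change the meaning of a URL, so the sketch puts it back.

```python
import posixpath
from urllib.parse import urljoin, urlparse, urlunparse


def join_and_normalize(base_url, relative_url):
    """Join two URLs, then collapse '.' and '..' segments in the resulting path.

    Hypothetical helper for illustration; not part of urllib.
    """
    joined = urljoin(base_url, relative_url)
    parts = urlparse(joined)
    # normpath("") would return ".", so treat an empty path as the root.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath removes a trailing slash; restore it if the original had one.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(parts._replace(path=path))


print(join_and_normalize("http://example.com/", "../test"))
print(join_and_normalize("http://example.com/../test", ""))
```

As far as I know, urljoin in Python 3.5 and later follows RFC 3986 and already discards a ".." that would climb above the root, so on those versions the normpath step mostly matters for URLs that arrive with dot segments already in them.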
记忆之渊 2024-10-07 14:17:54
urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?
