Python, XPath: Find All Image Links

I'm using lxml in Python to parse some HTML and I want to extract all links to images. The way I do it right now is:

//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]

There are a couple of problems with this approach:

  • you have to list all possible image extensions in all cases (both "jpg" and "JPG"), which is not elegant
  • in weird situations, the href may contain .jpg somewhere in the middle, not at the end of the string

I wanted to use a regexp, but I failed:

//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]

This returned all the links, every time...

Does anyone know the right, elegant way to do this, or what is wrong with my regexp approach?


Comments (5)

信愁 2024-10-12 07:09:00

Instead of:

a[contains(@href,'.jpg')]

Use:

a[substring(@href, string-length(@href)-3)='.jpg']

(and the same expression pattern for the other possible endings).

The above expression is the XPath 1.0 equivalent to the following XPath 2.0 expression:

a[ends-with(@href, '.jpg')]
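
As a side note, here is a minimal, illustrative sketch (not part of the original answer) of using this XPath 1.0 expression from lxml; the sample HTML is made up for demonstration:

from lxml import html

# made-up sample document for illustration
doc = html.fromstring("""
    <div>
      <a href="photo.jpg">photo</a>
      <a href="page.html">page</a>
      <a href="thumb.jpg?size=small">thumb</a>
    </div>
""")

# keep only <a> elements whose href ends with '.jpg'
links = doc.xpath("//a[substring(@href, string-length(@href)-3)='.jpg']/@href")
print(links)  # ['photo.jpg'] -- the query-string URL does not end with .jpg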
述情 2024-10-12 07:09:00

Use XPath to return all <a> elements and use a Python list comprehension to filter down to those matching your regex.
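
For example, a minimal sketch of that approach (the sample HTML and the pattern are illustrative assumptions):

import re
from lxml import html

# case-insensitive pattern for hrefs ending in an image extension
IMG_RE = re.compile(r'\.(?:png|jpg|jpeg)$', re.IGNORECASE)

doc = html.fromstring('<p><a href="a.PNG">1</a> <a href="b.txt">2</a></p>')

# fetch all <a> elements with XPath, then filter in Python
img_links = [a.get('href') for a in doc.xpath('//a')
             if a.get('href') and IMG_RE.search(a.get('href'))]
print(img_links)  # ['a.PNG']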

拥抱没勇气 2024-10-12 07:09:00

lxml supports regular expressions in the EXSLT namespace:

from lxml import html

# download & parse web page
doc = html.parse('http://apod.nasa.gov/apod/astropix.html')

# find the first <a> whose href ends with .png, .jpg or .jpeg, ignoring case
ns = {'re': "http://exslt.org/regular-expressions"}
img_url = doc.xpath(r"//a[re:test(@href, '\.(?:png|jpg|jpeg)$', 'i')]/@href",
                    namespaces=ns, smart_strings=False)[0]
print(img_url)
冷…雨湿花 2024-10-12 07:09:00

Because there's no guarantee that a link has a file extension at all, or that the file extension even matches the content (a .jpg URL returning error HTML, for example), your options are limited.

The only correct way to gather all images from a site would be to get every link and query it with an HTTP HEAD request to find out what Content-Type the server sends for it. If the content type is image/(anything), it's an image; otherwise it's not.

Scraping the URLs for common file extensions is probably going to get you 99.9% of images, though. It's not elegant, but neither is most HTML. I recommend being happy to settle for 99.9% in this case. The extra 0.1% isn't worth it.
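
A minimal sketch of that HEAD-request check, assuming the requests library (the answer names no specific tool):

import requests

def is_image(url):
    # ask the server for headers only; follow redirects to the final resource
    resp = requests.head(url, allow_redirects=True, timeout=10)
    # it's an image iff the served Content-Type is image/(anything)
    return resp.headers.get('Content-Type', '').startswith('image/')

print(is_image('http://apod.nasa.gov/apod/astropix.html'))  # False: text/html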

心安伴我暖 2024-10-12 07:09:00

Use:

//a[@href[contains('|png|jpg|jpeg|',
                   concat('|',
                          substring-after(substring(.,string-length()-4),'.'),
                          '|')]]
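
To unpack how this works: substring(., string-length()-4) takes the last five characters of the href, substring-after(..., '.') keeps whatever follows the dot in that tail, and wrapping the result in '|' delimiters lets contains() test it against the whitelist '|png|jpg|jpeg|' without partial matches. Like the substring answer above, it only inspects the end of the URL.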