Python, XPath: Find All Image Links

I'm using lxml in Python to parse some HTML and I want to extract all links to images. The way I do it right now is:

//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]

There are a couple of problems with this approach:

  • you have to list all possible image extensions in all cases (both "jpg" and "JPG"), which is not elegant
  • in weird situations, the href may contain .jpg somewhere in the middle, not at the end of the string

I wanted to use a regexp, but I failed:

//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]

This returned all the links, every time...

Does anyone know the right, elegant way to do this, or what is wrong with my regexp approach?


Comments (5)

信愁 2024-10-12 07:09:00

Instead of:

a[contains(@href,'.jpg')]

Use:

a[substring(@href, string-length(@href)-3)='.jpg']

(and the same expression pattern for the other possible endings).

The above expression is the XPath 1.0 equivalent to the following XPath 2.0 expression:

a[ends-with(@href, '.jpg')]
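
As a side note, here is a minimal, illustrative sketch (not part of the original answer) of using this XPath 1.0 expression from lxml; the sample HTML is made up for demonstration:

from lxml import html

# made-up sample document for illustration
doc = html.fromstring("""
    <div>
      <a href="photo.jpg">photo</a>
      <a href="page.html">page</a>
      <a href="thumb.jpg?size=small">thumb</a>
    </div>
""")

# keep only <a> elements whose href ends with '.jpg'
links = doc.xpath("//a[substring(@href, string-length(@href)-3)='.jpg']/@href")
print(links)  # ['photo.jpg'] -- the query-string URL does not end with .jpg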
述情 2024-10-12 07:09:00

Use XPath to return all <a> elements and use a Python list comprehension to filter down to those matching your regex.
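
For example, a minimal sketch of that approach (the sample HTML and the pattern are illustrative assumptions):

import re
from lxml import html

# case-insensitive pattern for hrefs ending in an image extension
IMG_RE = re.compile(r'\.(?:png|jpg|jpeg)$', re.IGNORECASE)

doc = html.fromstring('<p><a href="a.PNG">1</a> <a href="b.txt">2</a></p>')

# fetch all <a> elements with XPath, then filter in Python
img_links = [a.get('href') for a in doc.xpath('//a')
             if a.get('href') and IMG_RE.search(a.get('href'))]
print(img_links)  # ['a.PNG']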

拥抱没勇气 2024-10-12 07:09:00

lxml supports regular expressions in the EXSLT namespace:

from lxml import html

# download & parse web page
doc = html.parse('http://apod.nasa.gov/apod/astropix.html')

# find the first <a> whose href ends with .png, .jpg or .jpeg, ignoring case
ns = {'re': "http://exslt.org/regular-expressions"}
img_url = doc.xpath(r"//a[re:test(@href, '\.(?:png|jpg|jpeg)$', 'i')]/@href",
                    namespaces=ns, smart_strings=False)[0]
print(img_url)
冷…雨湿花 2024-10-12 07:09:00

Because there's no guarantee that a link has a file extension at all, or that the file extension even matches the content (a .jpg URL returning error HTML, for example), your options are limited.

The only correct way to gather all images from a site would be to get every link and query it with an HTTP HEAD request to find out what Content-Type the server sends for it. If the content type is image/(anything), it's an image; otherwise it's not.

Scraping the URLs for common file extensions is probably going to get you 99.9% of images, though. It's not elegant, but neither is most HTML. I recommend being happy to settle for 99.9% in this case. The extra 0.1% isn't worth it.
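
A minimal sketch of that HEAD-request check, assuming the requests library (the answer names no specific tool):

import requests

def is_image(url):
    # ask the server for headers only; follow redirects to the final resource
    resp = requests.head(url, allow_redirects=True, timeout=10)
    # it's an image iff the served Content-Type is image/(anything)
    return resp.headers.get('Content-Type', '').startswith('image/')

print(is_image('http://apod.nasa.gov/apod/astropix.html'))  # False: text/html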

心安伴我暖 2024-10-12 07:09:00

Use:

//a[@href[contains('|png|jpg|jpeg|',
                   concat('|',
                          substring-after(substring(.,string-length()-4),'.'),
                          '|')]]
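
To unpack how this works: substring(., string-length()-4) takes the last five characters of the href, substring-after(..., '.') keeps whatever follows the dot in that tail, and wrapping the result in '|' delimiters lets contains() test it against the whitelist '|png|jpg|jpeg|' without partial matches. Like the substring answer above, it only inspects the end of the URL.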