Python, XPath: find all image links
I'm using lxml in Python to parse some HTML, and I want to extract all links to images. The way I do it right now is:
//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]
There are a couple of problems with this approach:
- you have to list all possible image extensions in all cases (both "jpg" and "JPG"), which is not elegant
- in weird situations, the href may contain .jpg somewhere in the middle, not at the end of the string
I wanted to use regexp, but I failed:
//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]
This always returned all the links...
Does anyone know the right, elegant way to do this, or what is wrong with my regexp approach?
5 Answers
Instead of:
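//a[contains(@href, '.jpg')]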
Use:
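//a[substring(@href, string-length(@href) - 3) = '.jpg']

Here substring() selects the last four characters of @href, so the predicate matches only when '.jpg' sits at the very end of the string; this is the standard XPath 1.0 idiom for an ends-with test.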
(and the same expression pattern for the other possible endings).
The above expression is the XPath 1.0 equivalent to the following XPath 2.0 expression:
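//a[ends-with(@href, '.jpg')]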
Use XPath to return all <a> elements and use a Python list comprehension to filter down to those matching your regex.
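A minimal sketch of that approach; the markup here is made up for illustration:

    import re
    from lxml import html

    # toy document standing in for the real page
    page = """<html><body>
    <a href="photo.JPG">one</a>
    <a href="image.jpeg">two</a>
    <a href="page.html">three</a>
    </body></html>"""

    doc = html.fromstring(page)

    # filter in Python, where the full re syntax (case-insensitivity,
    # end-of-string anchoring) is available
    img_re = re.compile(r'\.(?:png|jpe?g)$', re.IGNORECASE)
    links = [href for href in doc.xpath('//a/@href') if img_re.search(href)]
    print(links)  # ['photo.JPG', 'image.jpeg']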
lxml supports regular expressions in the EXSLT namespace:
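A sketch of that approach, assuming a placeholder URL; re:test() comes from lxml's EXSLT support, and the 'i' flag makes the match case-insensitive:

    from lxml import html

    # EXSLT regular-expressions namespace, which lxml exposes to XPath
    ns = {'re': 'http://exslt.org/regular-expressions'}

    doc = html.parse('http://example.com/gallery.html')  # placeholder URL
    links = doc.xpath(
        r"//a[re:test(@href, '\.(?:png|jpg|jpeg)$', 'i')]/@href",
        namespaces=ns)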
There's no guarantee that a link has a file extension at all, or that the extension matches the content (.jpg URLs returning error HTML, for example), and that limits your options.
The only correct way to gather all images from a site would be to get every link and query it with an HTTP HEAD request to find out what Content-Type the server sends for it. If the content type is image/(anything) it's an image; otherwise it's not.
Scraping the URLs for common file extensions is probably going to get you 99.9% of images though. It's not elegant, but neither is most HTML. I recommend being happy to settle for 99.9% in this case. The extra 0.1% isn't worth it.
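A minimal sketch of the HEAD-request check with the standard library (is_image is a hypothetical helper):

    import urllib.request

    def is_image(url):
        # HEAD asks the server for the headers only, not the body
        req = urllib.request.Request(url, method='HEAD')
        try:
            with urllib.request.urlopen(req) as resp:
                # e.g. 'image/jpeg' -> True, 'text/html' -> False
                return resp.headers.get_content_type().startswith('image/')
        except OSError:
            return False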
Use: