Parse an HTML file and add the images found to a zip file
I am trying to parse an HTML file for all its img tags, download all the images pointed to by src, and then add those files to a zip file. I would prefer to do all this in memory, since I can guarantee there won't be that many images.
Assume the images variable is already populated from parsing the HTML. What I need help with is getting the images into the zip file.
from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()
Comments (3)
To answer your specific question about how you create the ZIP archive (others here have discussed parsing the URLs), I tested out your code. You are really remarkably close to having a finished product already.
Here's how I would augment what you have to create a Zip archive (in this example, I'm writing the archive to the drive so that I can verify it was properly written).
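A minimal sketch of that idea in Python 3, where `io.BytesIO` and `urllib.request.urlopen` stand in for the `StringIO` and `urllib2` imports used in the question. The `build_zip` helper, its `fetch` parameter, and the index-prefixed archive names are assumptions made for illustration, not the code from the original answer. It also hints at why `writestr()` fits here: `ZipFile.write()` expects the path of a file on disk, while `writestr()` accepts the bytes directly.

```python
import io
import posixpath
from urllib.parse import urlparse
from urllib.request import urlopen
from zipfile import ZipFile


def fetch_bytes(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def build_zip(image_urls, fetch=fetch_bytes):
    """Return the bytes of a zip archive holding every image in image_urls.

    `fetch` is a hypothetical hook so the download step can be swapped out.
    """
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as zip_file:
        for index, url in enumerate(image_urls):
            # Use the last path segment as the file name and prefix it with
            # the index so two images with the same name do not collide.
            name = posixpath.basename(urlparse(url).path) or "image"
            # writestr() takes in-memory bytes; write() would need a file path.
            zip_file.writestr("%03d-%s" % (index, name), fetch(url))
    return buffer.getvalue()
```

To verify the archive on disk, as suggested above, the returned bytes could be written out with something like `with open("images.zip", "wb") as f: f.write(build_zip(images))`.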
The easiest way I can think of to do this would be to use the BeautifulSoup library.
Something along the lines of:
That should give you a list of URLs derived from your img tags to loop through.
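One way this could look, as a sketch only: it assumes the third-party `bs4` package (Beautiful Soup 4) is installed and uses Python 3. The `image_urls` function name and the `base_url` parameter are hypothetical; `base_url` is there so that relative `src` values can be turned into absolute URLs before downloading.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # third-party: assumes Beautiful Soup 4 is installed


def image_urls(html, base_url):
    """Return the absolute URL of every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips <img> tags that carry no src attribute at all.
    return [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]
```

The resulting list can then play the role of the `images` variable from the question.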
I'm not quite sure what you're asking here, since you appear to have most of it sorted.
Have you investigated HtmlParser to actually perform the HTML parsing? I wouldn't try hand-rolling a parser yourself - it's a major task with numerous edge cases. Don't even think about regexps for anything but the most trivial cases.
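A rough sketch of the parser route, assuming "HtmlParser" refers to the standard library parser (the `HTMLParser` module in Python 2, `html.parser` in Python 3, which is what is used here). The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing <img/> tags also end up here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_sources(html):
    """Return the src values of all <img> tags in html, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

The values come back exactly as written in the page, so relative paths would still need to be resolved against the page URL before downloading.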
For each <img/> tag you can use HttpLib to actually get each image. It may be worth getting the images in multiple threads to speed up the compilation of the zip file.
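The threading suggestion might be sketched like this in Python 3. Note the swap: this uses `urllib.request.urlopen` with a thread pool instead of calling `httplib` directly, and the `fetch_all` helper with its `fetch` and `max_workers` parameters is an assumption for illustration. Only the downloads run in parallel; adding the results to the zip file is left to a single thread afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_bytes, max_workers=8):
    """Download urls concurrently and return (url, bytes) pairs in input order.

    `fetch` is a hypothetical hook so the download step can be replaced.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        bodies = list(pool.map(fetch, urls))
    return list(zip(urls, bodies))
```

Each returned pair can then be passed to `zip_file.writestr()` in an ordinary loop.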