Parse an HTML file and add the images it finds to a zip file

Posted 2024-08-16 08:34:05


I am trying to parse an HTML document for all its img tags, download all the images pointed to by src, and then add those files to a zip file. I would prefer to do all of this in memory, since I can guarantee there won't be that many images.

Assume the images variable is already populated from parsing the HTML. What I need help with is getting the images into the zip file.

from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()
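As an aside on the inline comment: ZipFile.write() expects the path of a file that already exists on disk and copies it into the archive, while writestr() accepts a name plus the bytes themselves, which is why writestr() is what you need when the image only exists in memory. A minimal sketch of the difference (Python 3 syntax, where the buffer must be io.BytesIO rather than StringIO):

```python
import io
from zipfile import ZipFile

data = b'pretend-these-are-image-bytes'

buf = io.BytesIO()  # in-memory binary buffer; no file on disk involved
with ZipFile(buf, 'w') as zf:
    # zf.write('photo.jpg') would try to open ./photo.jpg from disk;
    # writestr() hands the archive the data directly.
    zf.writestr('photo.jpg', data)

# The entry round-trips from the in-memory archive:
with ZipFile(buf) as zf:
    assert zf.read('photo.jpg') == data
```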


3 Answers


To answer your specific question about how you create the ZIP archive (others here have discussed parsing the URLs), I tested out your code. You are really remarkably close to having a finished product already.

Here's how I would augment what you have to create a Zip archive (in this example, I'm writing the archive to the drive so that I can verify it was properly written).

from zipfile import ZipFile, ZIP_DEFLATED
from cStringIO import StringIO
from urllib2 import urlopen
from urlparse import urlparse
from os import path

images = ['http://sstatic.net/so/img/logo.png', 
          'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png']

buf = StringIO()
# By default, zip archives are not compressed; adding ZIP_DEFLATED
# enables compression. If you don't want that, or don't have zlib on
# your system, delete the compression kwarg.
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED)

for image in images:
    internet_image = urlopen(image)
    fname = path.basename(urlparse(image).path) 
    zip_file.writestr(fname, internet_image.read())

zip_file.close()

output = open('images.zip', 'wb')
output.write(buf.getvalue())
output.close()
buf.close()
云胡 2024-08-23 08:34:05
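The answer above targets Python 2 (cStringIO, urllib2). A rough Python 3 equivalent of the same approach, sketched here as a function for readability (the URLs in the usage note are placeholders, not tested endpoints):

```python
import io
from os import path
from urllib.parse import urlparse
from urllib.request import urlopen
from zipfile import ZipFile, ZIP_DEFLATED


def zip_images(urls):
    """Download each URL and return the bytes of a zip archive containing them."""
    buf = io.BytesIO()  # BytesIO replaces cStringIO for binary data
    with ZipFile(buf, mode='w', compression=ZIP_DEFLATED) as zip_file:
        for url in urls:
            # Name each archive entry after the last path component of its URL
            fname = path.basename(urlparse(url).path)
            with urlopen(url) as resp:
                zip_file.writestr(fname, resp.read())
    return buf.getvalue()


# Usage, with a placeholder URL:
# with open('images.zip', 'wb') as f:
#     f.write(zip_images(['http://example.com/logo.png']))
```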


The easiest way I can think of to do this would be to use the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup

def getImgSrces(html):
    srcs = []
    soup = BeautifulSoup(html)

    for tag in soup('img'):
        # BeautifulSoup 3 exposes attrs as (name, value) pairs
        attrs = dict(tag.attrs)
        if 'src' in attrs:
            srcs.append(attrs['src'])

    return srcs

That should give you a list of URLs derived from your img tags to loop through.

暖伴 2024-08-23 08:34:05


I'm not quite sure what you're asking here, since you appear to have most of it sorted.

Have you investigated HTMLParser to actually perform the HTML parsing? I wouldn't try hand-rolling a parser yourself; it's a major task with numerous edge cases. Don't even think about regexps for anything but the most trivial cases.

For each <img/> tag you can use httplib to actually fetch each image. It may be worth fetching the images in multiple threads to speed up building the zip file.
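For reference, the HTMLParser approach suggested here needs no third-party dependency; in Python 3 the module lives at html.parser. A minimal sketch that collects the src attribute of every img tag (the class and function names are illustrative):

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        # Self-closing tags like <img .../> are routed here as well.
        if tag == 'img':
            attrs = dict(attrs)
            if 'src' in attrs:
                self.srcs.append(attrs['src'])


def get_img_srcs(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    return parser.srcs
```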
