Parse an HTML file and add the images it finds to a zip file

Posted 2024-08-16 08:34:05


I am trying to parse an HTML document for all its img tags, download all the images pointed to by src, and then add those files to a zip file. I would prefer to do all of this in memory, since I can guarantee there won't be that many images.

Assume the images variable is already populated from parsing the HTML. What I need help with is getting the images into the zip file.

from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()
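As an aside on the inline comment: ZipFile.write() expects the path of a file that already exists on disk and copies it into the archive, while writestr() accepts a name plus the bytes themselves, which is why writestr() is what you need when the image only exists in memory. A minimal sketch of the difference (Python 3 syntax, where the buffer must be io.BytesIO rather than StringIO):

```python
import io
from zipfile import ZipFile

data = b'pretend-these-are-image-bytes'

buf = io.BytesIO()  # in-memory binary buffer; no file on disk involved
with ZipFile(buf, 'w') as zf:
    # zf.write('photo.jpg') would try to open ./photo.jpg from disk;
    # writestr() hands the archive the data directly.
    zf.writestr('photo.jpg', data)

# The entry round-trips from the in-memory archive:
with ZipFile(buf) as zf:
    assert zf.read('photo.jpg') == data
```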


3 Answers


To answer your specific question about how you create the ZIP archive (others here have discussed parsing the URLs), I tested out your code. You are really remarkably close to having a finished product already.

Here's how I would augment what you have to create a Zip archive (in this example, I'm writing the archive to the drive so that I can verify it was properly written).

from zipfile import ZipFile, ZIP_DEFLATED
from cStringIO import StringIO
from urllib2 import urlopen
from urlparse import urlparse
from os import path

images = ['http://sstatic.net/so/img/logo.png', 
          'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png']

buf = StringIO()
# By default, zip archives are not compressed; adding ZIP_DEFLATED
# enables compression. If you don't want that, or don't have zlib on
# your system, delete the compression kwarg.
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED)

for image in images:
    internet_image = urlopen(image)
    fname = path.basename(urlparse(image).path) 
    zip_file.writestr(fname, internet_image.read())

zip_file.close()

output = open('images.zip', 'wb')
output.write(buf.getvalue())
output.close()
buf.close()
云胡 2024-08-23 08:34:05
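The answer above targets Python 2 (cStringIO, urllib2). A rough Python 3 equivalent of the same approach, sketched here as a function for readability (the URLs in the usage note are placeholders, not tested endpoints):

```python
import io
from os import path
from urllib.parse import urlparse
from urllib.request import urlopen
from zipfile import ZipFile, ZIP_DEFLATED


def zip_images(urls):
    """Download each URL and return the bytes of a zip archive containing them."""
    buf = io.BytesIO()  # BytesIO replaces cStringIO for binary data
    with ZipFile(buf, mode='w', compression=ZIP_DEFLATED) as zip_file:
        for url in urls:
            # Name each archive entry after the last path component of its URL
            fname = path.basename(urlparse(url).path)
            with urlopen(url) as resp:
                zip_file.writestr(fname, resp.read())
    return buf.getvalue()


# Usage, with a placeholder URL:
# with open('images.zip', 'wb') as f:
#     f.write(zip_images(['http://example.com/logo.png']))
```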


The easiest way I can think of to do this would be to use the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup

def getImgSrces(html):
    srcs = []
    soup = BeautifulSoup(html)

    for tag in soup('img'):
        # BeautifulSoup 3 exposes attrs as (name, value) pairs
        attrs = dict(tag.attrs)
        if 'src' in attrs:
            srcs.append(attrs['src'])

    return srcs

That should give you a list of URLs derived from your img tags to loop through.

暖伴 2024-08-23 08:34:05


I'm not quite sure what you're asking here, since you appear to have most of it sorted.

Have you investigated HTMLParser to actually perform the HTML parsing? I wouldn't try hand-rolling a parser yourself; it's a major task with numerous edge cases. Don't even think about regexps for anything but the most trivial cases.

For each <img/> tag you can use httplib to actually fetch each image. It may be worth fetching the images in multiple threads to speed up building the zip file.
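For reference, the HTMLParser approach suggested here needs no third-party dependency; in Python 3 the module lives at html.parser. A minimal sketch that collects the src attribute of every img tag (the class and function names are illustrative):

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        # Self-closing tags like <img .../> are routed here as well.
        if tag == 'img':
            attrs = dict(attrs)
            if 'src' in attrs:
                self.srcs.append(attrs['src'])


def get_img_srcs(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    return parser.srcs
```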
