如何在 python 中检索网页,包括任何图像

发布于 2024-12-03 05:13:58 字数 439 浏览 4 评论 0原文

我正在尝试检索网页的来源,包括所有图像。目前我有这个:

import urllib

page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print urlgot.php

它可以很好地检索源代码,但我还需要下载任何链接的图像。

我想我可以创建一个正则表达式来搜索 img src 或下载源中的类似内容;但是,我想知道是否有 urllib 函数也可以检索图像?类似于 wget 命令:

wget -r --no-parent http://127.0.0.1/myurl.php

我不想使用 os 模块并运行 wget,因为我希望脚本在所有系统上运行。因此,我也无法使用任何第三方模块。

非常感谢任何帮助!谢谢

I'm trying to retrieve the source of a webpage, including any images. At the moment I have this:

import urllib

page = urllib.urlretrieve('http://127.0.0.1/myurl.php', 'urlgot.php')
print urlgot.php

which retrieves the source fine, but I also need to download any linked images.

I was thinking I could create a regular expression which searched for img src or similar in the downloaded source; however, I was wondering if there was urllib function that would retrieve the images as well? Similar to the wget command of:

wget -r --no-parent http://127.0.0.1/myurl.php

I don't want to use the os module and run the wget, as I want the script to run on all systems. For this reason I can't use any third party modules either.

Any help is much appreciated! Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

成熟稳重的好男人 2024-12-10 05:13:58

当 Python 内置了一个非常好的解析器时,不要使用正则表达式:

from urllib.request import urlretrieve  # Py2: from urllib
from html.parser import HTMLParser      # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)

Don't use regex when there is a perfectly good parser built in to Python:

from urllib.request import urlretrieve  # Py2: from urllib
from html.parser import HTMLParser      # Py2: from HTMLParser

base_url = 'http://127.0.0.1/'

class ImgParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.downloads = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for attr in attrs:
                if attr[0] == 'src':
                    self.downloads.append(attr[1])

parser = ImgParser()
with open('test.html') as f:
    # instead you could feed it the original url obj directly
    parser.feed(f.read())

for path in parser.downloads:
    url = base_url + path
    print(url)
    urlretrieve(url, path)
笑,眼淚并存 2024-12-10 05:13:58

使用 BeautifulSoup 解析返回的 HTML 并搜索图像链接。您可能还需要递归地获取框架和 iframe。

Use BeautifulSoup to parse the returned HTML and search for image links. You might also need to recursively fetch frames and iframes.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文