Batch-downloading text and images from a URL with Python / urllib / beautifulsoup?

I have been browsing through several posts here but I just cannot get my head around batch-downloading images and text from a given URL with Python.

import urllib, urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)

    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            print "found image"
            src = img["src"]
            if src:
                src = urlparse.urljoin(url, src)  # absolutize/pageurl were never defined; urljoin resolves relative URLs
                f = open(src,'wb')
                f.write(urllib.urlopen(src).read())
                f.close()
        for h5 in div.findAll("h5"):
            print "found Headline"
            value = h5.contents[0]
            # headlines.txt is not an open file object; append to a real file
            with open("headlines.txt", "a") as headlines:
                print >> headlines, value


def main():
    getAllImages("http://www.nytimes.com/")


if __name__ == "__main__":
    main()  # without this, the script defines functions but never runs anything

Above is some updated code. What happens is: nothing. The code never finds any div with class "thumbnail", so, obviously, none of the print statements fire. So I am probably missing some pointers on getting to the right divs containing the images and headlines?
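
One way to sanity-check the selector is to dump the div classes the server actually returns, since the markup served to a script often differs from what a browser shows. A rough sketch on the same urllib2 / BeautifulSoup 3 stack as above:

import urllib2
from BeautifulSoup import BeautifulSoup

def list_div_classes(url):
    # Fetch the page the same way as above and print every distinct
    # class attribute found on a div, to see whether "thumbnail" is there at all.
    query = urllib2.Request(url, headers={"User-Agent": "Mozilla/4.0 (compatible)"})
    page = BeautifulSoup(urllib2.urlopen(query))
    classes = set()
    for div in page.findAll("div", {"class": True}):
        classes.add(div["class"])
    for name in sorted(classes):
        print name

list_div_classes("http://www.nytimes.com/")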

Thanks a lot!

Comments (1)

度的依靠╰つ 2024-12-19 03:44:12

The OS you are using doesn't know how to write to the file path you are passing it in src. Make sure that the name you use to save the file to disk is one the OS can actually use:

src = "abc.com/alpha/beta/charlie.jpg"
with open(src, "wb") as f:
    # IOError - cannot open file abc.com/alpha/beta/charlie.jpg

src = "alpha/beta/charlie.jpg"
os.makedirs(os.path.dirname(src))
with open(src, "wb" as f:
    # Golden - write file here

and everything will start working.
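
To make that concrete, here is one way the download step from the question could look with a filesystem-friendly name. This is a sketch only; save_image, the images output directory, and the urljoin step are illustrative additions, not part of the original code:

import os
import urllib
import urlparse

def save_image(src, page_url, out_dir="images"):
    # Resolve a possibly-relative src against the page it came from.
    src = urlparse.urljoin(page_url, src)
    # Use only the last path segment as the local filename, so the OS
    # is never asked to open "http://..." as a file path.
    filename = os.path.basename(urlparse.urlsplit(src).path)
    if not filename:
        return None  # src had no usable filename component
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    dest = os.path.join(out_dir, filename)
    with open(dest, "wb") as f:
        f.write(urllib.urlopen(src).read())
    return dest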

A couple of additional thoughts:

  1. Make sure to normalize the save file path (e.g. os.path.join(some_root_dir, relative_file_path)) - otherwise you'll be writing images all over your hard drive, depending on their src. See the first sketch below.
  2. Unless you are running tests of some kind, it's good to advertise that you are a bot in your user_agent string and to honor robots.txt files (or, alternately, provide some kind of contact information so people can ask you to stop if they need to). See the second sketch below.
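
For point 1, a sketch of pinning every save path under one root directory, mirroring the URL's path. The root-directory handling and the guard against ".." segments are my own assumptions, not from the answer:

import os
import urlparse

def local_path_for(src, root_dir):
    # Keep only the URL's path component, never its scheme or host.
    relative = urlparse.urlsplit(src).path.lstrip("/")
    dest = os.path.abspath(os.path.join(root_dir, relative))
    # Refuse paths that escape root_dir (e.g. via ".." segments).
    if not dest.startswith(os.path.abspath(root_dir) + os.sep):
        raise ValueError("unsafe save path for %r" % src)
    parent = os.path.dirname(dest)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    return dest

# e.g. local_path_for("http://example.com/images/a/b.jpg", "downloads")
# -> <cwd>/downloads/images/a/b.jpg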
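
For point 2, the Python 2 standard library's robotparser module can do the robots.txt check. A sketch, where the bot name and contact URL are placeholders you would replace with your own:

import robotparser

USER_AGENT = "MyScraperBot/0.1 (+http://example.com/contact)"  # placeholder identity

rp = robotparser.RobotFileParser()
rp.set_url("http://www.nytimes.com/robots.txt")
rp.read()

target = "http://www.nytimes.com/"
if rp.can_fetch(USER_AGENT, target):
    print "allowed to fetch", target  # proceed, sending USER_AGENT in the headers
else:
    print "robots.txt disallows", target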