Batch-downloading text and images from a URL with Python / urllib / beautifulsoup?
I have been browsing through several posts here, but I just cannot get my head around batch-downloading images and text from a given URL with Python.
    import urllib,urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup
    import os, sys

    def getAllImages(url):
        query = urllib2.Request(url)
        user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
        query.add_header("User-Agent", user_agent)

        page = BeautifulSoup(urllib2.urlopen(query))
        for div in page.findAll("div", {"class": "thumbnail"}):
            print "found thumbnail"
            for img in div.findAll("img"):
                print "found image"
                src = img["src"]
                if src:
                    src = absolutize(src, pageurl)
                    f = open(src,'wb')
                    f.write(urllib.urlopen(src).read())
                    f.close()
            for h5 in div.findAll("h5"):
                print "found Headline"
                value = (h5.contents[0])
                print >> headlines.txt, value

    def main():
        getAllImages("http://www.nytimes.com/")
Above is some updated code. Nothing happens when I run it: the code does not find any div with a thumbnail, so none of the print statements produce output. I am probably missing some pointers on how to get to the right divs containing the images and headlines.
Thanks a lot!
The OS you are using does not know how to write to the file path you are passing it in src. At that point src holds the image's URL, and a URL is not a location on your disk that open() can create a file at. Make sure that the name you use to save the file to disk is one the OS can actually use, and everything will start working.
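As a rough illustration of the point above, here is a minimal sketch of turning an img src into an absolute URL and then into a bare file name the OS can accept. It assumes Python 3 (urllib.parse instead of the question's urlparse module); absolutize is given a body here only because the question calls it without defining it, and filename_for, the default name, and the sample src value are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit, unquote


def absolutize(src, page_url):
    """Resolve a possibly relative img src against the URL of the page it came from."""
    return urljoin(page_url, src)


def filename_for(url, default="index.bin"):
    """Return a bare file name (no scheme, host, or directories) for a URL.

    The query string is ignored; 'default' is a hypothetical fallback for
    URLs whose path ends in a slash.
    """
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    return name or default


if __name__ == "__main__":
    page_url = "http://www.nytimes.com/"       # URL from the question
    src = "/images/example.jpg"                # hypothetical src attribute value
    image_url = absolutize(src, page_url)
    print(image_url)                # the address to download from
    print(filename_for(image_url))  # the name to pass to open(..., "wb")
```

The idea is to keep the two values separate: download from the absolute URL, but open the local file under the derived name rather than under the URL itself.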
A couple of additional thoughts:
1. Build the save path under a known root directory (for example, os.path.join(some_root_dir, relative_file_path)); otherwise you will be writing images all over your hard drive depending on their src.
2. Advertise that you are a bot in your user_agent string and honor robots.txt files (or, alternatively, provide some kind of contact information so people can ask you to stop if they need to).
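Both suggestions can be sketched with the standard library alone. The following is one possible approach, assuming Python 3; the helper names path_under_root and allowed_by_robots, the bot name, and the contact address in USER_AGENT are all hypothetical. The robots.txt rules are passed in as lines so the example runs without network access; in a real crawler, RobotFileParser.set_url() followed by read() would fetch them instead.

```python
import os
from urllib.parse import urlsplit, unquote
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity with contact information, as the answer suggests.
USER_AGENT = "example-image-bot/0.1 (+mailto:you@example.com)"


def path_under_root(root_dir, url):
    """Map a URL to a local path that is guaranteed to stay inside root_dir."""
    parts = urlsplit(url)
    relative = unquote(parts.path).lstrip("/")
    candidate = os.path.normpath(os.path.join(root_dir, parts.netloc, relative))
    root = os.path.abspath(root_dir)
    # Reject paths such as "/../../etc/passwd" that would climb out of root_dir.
    if os.path.commonpath([root, os.path.abspath(candidate)]) != root:
        raise ValueError("URL path escapes the download directory: %r" % url)
    return candidate


def allowed_by_robots(robots_lines, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt rules supplied as an iterable of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    target = path_under_root("downloads", "http://example.com/alpha/beta/charlie.jpg")
    # Create the intermediate directories before opening the file for writing.
    print("would create:", os.path.dirname(target))
    print("would write: ", target)

    rules = ["User-agent: *", "Disallow: /private/"]
    print(allowed_by_robots(rules, "http://example.com/private/a.jpg"))
```

Before writing, the directories would still need to exist, for example via os.makedirs(os.path.dirname(target), exist_ok=True). Sending USER_AGENT in the request's User-Agent header, in place of the browser string used in the question, is what lets site operators identify and contact you.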