Downloading images with urllib and Python
So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few programs on here that do something similar, but nothing quite like what I need. The one I found most similar is right here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images). I tried using this code:
>>> import urllib
>>> image = urllib.URLopener()
>>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg","00000001.jpg")
('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)
I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I know how to handle the rest. Essentially just use a for loop, split the string at '00000000'.'jpg', and increment the '00000000' up to the largest number, which I would have to somehow determine. Any recommendations on the best way to do this, or on how to download the file correctly?
Thanks!
EDIT 6/15/10
Here is the completed script; it saves the files to any directory you choose. For some odd reason, the files weren't downloading before, but now they are. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site, so I can get just the latest one rather than having the program quit after a certain number of exceptions are raised.
import urllib
import os

comicCounter = len(os.listdir('/file')) + 1  # count the files already in the folder to start downloading at the next comic
errorCount = 0

def download_comic(url, comicName):
    """
    download a comic in the form of
    url = http://www.example.com
    comicName = '00000000.jpg'
    """
    image = urllib.URLopener()
    image.retrieve(url, comicName)  # download comicName at url

while comicCounter <= 1000:  # not the most elegant solution
    os.chdir('/file')  # set where files download to
    try:
        # broken into 10^n segments because comic names are a run of zeros followed by a number;
        # elif (rather than repeated ifs) prevents one pass from matching two branches after the increment
        if comicCounter < 10:
            comicNumber = '0000000' + str(comicCounter)  # the eight-digit comic number
            comicName = comicNumber + ".jpg"             # the file name
            url = "http://www.gunnerkrigg.com//comics/" + comicName  # the URL for the comic
            comicCounter += 1  # increment before the download in case the download raises an exception
            download_comic(url, comicName)
            print url
        elif 10 <= comicCounter < 100:
            comicNumber = '000000' + str(comicCounter)
            comicName = comicNumber + ".jpg"
            url = "http://www.gunnerkrigg.com//comics/" + comicName
            comicCounter += 1
            download_comic(url, comicName)
            print url
        elif 100 <= comicCounter < 1000:
            comicNumber = '00000' + str(comicCounter)
            comicName = comicNumber + ".jpg"
            url = "http://www.gunnerkrigg.com//comics/" + comicName
            comicCounter += 1
            download_comic(url, comicName)
            print url
        else:  # stop if any number outside this range shows up
            break
    except IOError:  # urllib raises an IOError on a 404, i.e. when the comic doesn't exist
        errorCount += 1  # add one to the error count
        if errorCount > 3:  # if more than three errors occur during downloading, quit the program
            break
        else:
            print "comic " + str(comicCounter) + " does not exist"  # otherwise note that this comic number doesn't exist
print "all comics are up to date"  # printed once the loop finishes
Python 2: use urllib.urlretrieve.
Python 3: use urllib.request.urlretrieve (part of Python 3's legacy interface; works exactly the same).
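The code for this answer was stripped from the page; a minimal Python 3 sketch of the urlretrieve call, reusing the comic URL from the question (the wrapper name is my own):

```python
# Python 3: urlretrieve downloads the resource at `url` straight to `filename`.
import urllib.request

def fetch(url, filename):
    # Returns (filename, headers); raises on connection/HTTP errors.
    return urllib.request.urlretrieve(url, filename)

# Example (the comic URL from the question):
# fetch("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
```

In Python 2 the equivalent call is urllib.urlretrieve(url, filename).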
Just for the record, using the requests library. Though it should check for requests.get() errors.
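The snippet itself was lost in extraction; a sketch of what the answer likely showed, with the error check from the comment above folded in (the function name is my own):

```python
import requests

def save_image(url, filename):
    response = requests.get(url)
    response.raise_for_status()  # raise on 4xx/5xx instead of silently saving an error page
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename
```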
For Python 3 you will need to import urllib.request:
import urllib.request
For more info, check out the link.
Python 3 version of @DiGMi's answer:
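The code block itself did not survive extraction; a plausible Python 3 rendering of the urllib open/read/write pattern this answer referred to:

```python
import urllib.request

def download(url, filename):
    # Read the whole response, then write the raw bytes to disk.
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)
    return filename
```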
I found this answer and edited it in a more reliable way. With this, you will never get any other resource or exception while downloading.
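The edited code was also stripped from the page; one plausible reading of "never get any other resource" is verifying the response's Content-Type before saving, sketched here with urllib (the check itself is my assumption):

```python
import urllib.request

def download_image(url, filename):
    with urllib.request.urlopen(url) as response:
        content_type = response.headers.get_content_type()
        if not content_type.startswith("image/"):
            # refuse to save HTML error pages or other non-image resources
            raise ValueError("not an image: %s" % content_type)
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)
    return filename
```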
It's easiest to just use .read() to read the partial or entire response, then write it into a file you've opened in a known good location.

If you know that the files are located in the same directory dir of the website site, and have the following format: filename_01.jpg, ..., filename_10.jpg, then download all of them:
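The loop itself was lost in extraction; a sketch under the stated assumptions (site and directory are the placeholders from the answer, and the two-digit zero padding is inferred from the filename pattern):

```python
import urllib.request

def make_urls(site, directory, count=10):
    # filename_01.jpg ... filename_10.jpg in the given directory of the website
    return ["%s/%s/filename_%02d.jpg" % (site, directory, i)
            for i in range(1, count + 1)]

def download(url, filename):
    with urllib.request.urlopen(url) as response, open(filename, "wb") as f:
        f.write(response.read())

# for url in make_urls("http://site", "dir"):
#     download(url, url.rsplit("/", 1)[-1])
```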
Maybe you need a 'User-Agent':
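The accompanying code was stripped; a sketch of sending a custom User-Agent header with urllib (the header value is illustrative):

```python
import urllib.request

def fetch_with_agent(url, user_agent="Mozilla/5.0"):
    # Some sites reject urllib's default agent; send a browser-like one instead.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as response:
        return response.read()
```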
Using urllib, you can get this done instantly.
Aside from suggesting you read the docs for retrieve() carefully (http://docs.python.org/library/urllib.html#urllib.URLopener.retrieve), I would suggest actually calling read() on the content of the response and then saving it into a file of your choosing, rather than leaving it in the temporary file that retrieve creates.
None of the code above preserves the original image name, which is sometimes required. This will help in saving the images to your local drive while preserving the original image name. Try this for more details.
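Since the original snippet was lost from the page, here is a sketch of deriving the local file name from the URL's last path segment (function names are my own):

```python
import os
import urllib.request
from urllib.parse import urlparse

def filename_from_url(url):
    # Last path segment of the URL, e.g. .../comics/00000001.jpg -> 00000001.jpg
    return os.path.basename(urlparse(url).path)

def download_keep_name(url, dest_dir="."):
    path = os.path.join(dest_dir, filename_from_url(url))
    with urllib.request.urlopen(url) as response, open(path, "wb") as f:
        f.write(response.read())
    return path
```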
This worked for me using Python 3. It gets a list of URLs from a csv file and starts downloading them into a folder. In case the content or image does not exist, it catches the exception and continues working its magic.
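The script itself didn't survive extraction; a sketch of the described behaviour (the CSV layout of one URL per row in the first column, and the file naming, are my assumptions):

```python
import csv
import os
import urllib.request

def download_from_csv(csv_path, dest_dir="."):
    saved = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if not row:
                continue
            target = os.path.join(dest_dir, "image_%03d.jpg" % i)
            try:
                with urllib.request.urlopen(row[0]) as response, open(target, "wb") as out:
                    out.write(response.read())
                saved.append(target)
            except Exception:
                continue  # content or image missing: keep going
    return saved
```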
According to urllib.request.urlretrieve — Python 3.9.2 documentation, the function is ported from the Python 2 module urllib (as opposed to urllib2) and might become deprecated at some point in the future. Because of this, it may be better to use requests.get(url, params=None, **kwargs). Here is a MWE.
Refer to Download Google's WebP Images via Taking Screenshots with Selenium WebDriver.
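The MWE was lost from the page; a sketch of what a requests-based equivalent might look like, streaming the body in chunks (function name and chunk size are my own):

```python
import requests

def download(url, filename, chunk_size=8192):
    # Stream so large images are never held fully in memory.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return filename
```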
A simpler solution may be (Python 3):
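The snippet is missing from the page; a plausible "simpler" Python 3 version streams the response into a file with shutil.copyfileobj:

```python
import shutil
import urllib.request

def download(url, filename):
    # Copy the response stream straight to disk without reading it all at once.
    with urllib.request.urlopen(url) as response, open(filename, "wb") as f:
        shutil.copyfileobj(response, f)
    return filename
```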
What about this:
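The code this answer referred to was stripped; a compact Python 3 possibility using pathlib:

```python
from pathlib import Path
from urllib.request import urlopen

def download(url, filename):
    # One expression: read all the bytes and write them to the target path.
    Path(filename).write_bytes(urlopen(url).read())
```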
If you need proxy support you can do this:
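The proxy snippet did not survive extraction; a sketch using urllib's ProxyHandler (the proxy address is a placeholder):

```python
import urllib.request

def opener_with_proxy(proxy_url):
    # Route http/https requests through the given proxy.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# opener = opener_with_proxy("http://proxy.example.com:8080")
# data = opener.open("http://www.example.com/image.jpg").read()
```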
Another way to do this is via the fastai library. This worked like a charm for me. I was facing an SSL: CERTIFICATE_VERIFY_FAILED error using urlretrieve, so I tried that.
Using requests
And if you want to download images similar to the website directory structure, you can do this:
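The code was lost in extraction; a sketch that recreates the URL's path under a local root directory (helper names and the default root are my own):

```python
import os
import urllib.request
from urllib.parse import urlparse

def local_path_for(url, root="downloads"):
    # http://site/comics/00000001.jpg -> downloads/comics/00000001.jpg
    return os.path.join(root, urlparse(url).path.lstrip("/"))

def download_mirroring_path(url, root="downloads"):
    path = local_path_for(url, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with urllib.request.urlopen(url) as response, open(path, "wb") as f:
        f.write(response.read())
    return path
```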