Python 3: How do I save downloaded webpages to a specified dir?
I am trying to save all the <a> links within the Python homepage into a folder named 'Downloaded Pages'. However, after 2 iterations through the for loop I receive the following error:
www.python.org#content
<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network
<_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, as it appears the pages are being saved correctly: I see '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>', which suggests the path being written to is the right one.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # Save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
    downloadedPage.close()
Appreciate any advice, thanks.
1 Answer
The problem is that when you parse the basename of a page whose URL ends in something like an .html file it works, but when you try it with a URL that doesn't specify one, like "http://python.org/", the basename is actually empty (you can try printing first the URL and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths like @Thyebri said.
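For instance, a quick check (my own illustration, reusing the URLs from the question) shows the difference:

import os

# A fragment link like '#content' still yields a usable basename...
print(os.path.basename('https://www.python.org#content'))  # 'www.python.org#content'

# ...but a URL that ends in '/' yields an empty string, so open() points
# at the 'Downloaded Pages/' directory itself and raises IsADirectoryError.
print(os.path.basename('https://www.python.org/'))  # ''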
Also, remember that the file name you write cannot contain characters like '/', '\' or '?'.

So, I don't know if the following code is messy or not, but using the re library I would do the following: first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols that appear in URL links with a dash '-', and that is the name that will be given to the file. Hope it works!
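A minimal sketch of that approach (the helper name url_to_filename and the exact set of replaced symbols are my own guesses, not the answer's original code) could look like this:

import re

def url_to_filename(url):
    # Hypothetical helper, not the original answer's code: drop the scheme,
    # then replace the usual URL symbols with a dash so the result is a
    # legal file name.
    name = re.sub(r'^https?://', '', url)    # remove the "https://" part
    return re.sub(r'[/\\?#:&=]', '-', name)  # e.g. 'www.python.org-about-'

# In the question's loop this would replace os.path.basename():
# downloadedPage = open(os.path.join('Downloaded Pages',
#                                    url_to_filename(linkUrlToOpen)), 'wb')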