Web scraping and downloading Excel files with Python


I've been trying to scrape a website for its Excel files. I'm planning on doing this once for the bulk of the data it contains in its data archives section. I've been able to download individual files one at a time with urllib requests, and tried it manually on several different files. But when I try to create a function to download all of them, I keep getting errors. The first problem was just getting the http file addresses as a list. I set verify to False (not the best practice for security reasons) to work around the SSL certificate error it was giving me, and that worked. I then went further and attempted to scrape the files and download them to a specific folder. I've done this before in a similar project and didn't have nearly this hard a time with the SSL certificate error.

import requests
from bs4 import BeautifulSoup
import os

os.chdir(r'C:\The output path where it will go')
 
url = 'https://pages.stern.nyu.edu/~adamodar/pc/archives/'
reqs = requests.get(url, verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')
file_type = '.xls'
 
urls = []
for link in soup.find_all('a'):
    file_link = link.get('href')
    if file_type in file_link:
        print(file_link)
        with open(link.text, 'wb') as file:
            response = requests.get(url + file_link)
            file.write(response.content)

This is the error it has been giving me even after setting verify=False, which seemed to solve the problem earlier when generating the list. It grabs the first file each time I try, but it doesn't loop on to the next one.

requests.exceptions.SSLError: HTTPSConnectionPool(host='pages.stern.nyu.edu', port=443): Max retries exceeded with url: /~adamodar/pc/archives/BAMLEMPBPUBSICRPIEY19.xls (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')

What am I missing? I thought I fixed the verification issue.


甜点 2025-02-17 16:40:14


You forgot to set verify=False when you get your files:

urls = []
for link in soup.find_all('a'):
    file_link = link.get('href')
    if file_type in file_link:
        print(file_link)
        with open(link.text, 'wb') as file:
            response = requests.get(url + file_link, verify=False) # <-- This is where you forgot
            file.write(response.content)
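As a follow-up, here is a minimal sketch of a tidier variant of the same loop, assuming the same archive URL and a placeholder output folder: a requests.Session carries verify=False across every request so it can't be forgotten on the file downloads, urllib.parse.urljoin builds the absolute file URL even when the href is relative, and urllib3's InsecureRequestWarning is silenced since verification is deliberately disabled.

import os
from urllib.parse import urljoin

import requests
import urllib3
from bs4 import BeautifulSoup

# Verification is deliberately disabled below, so silence the warning it triggers.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://pages.stern.nyu.edu/~adamodar/pc/archives/'
out_dir = r'C:\your\output\folder'  # placeholder path, adjust to your machine

session = requests.Session()
session.verify = False  # applies to every request made through this session

soup = BeautifulSoup(session.get(url).text, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.endswith('.xls'):
        file_url = urljoin(url, href)  # handles relative and absolute hrefs
        target = os.path.join(out_dir, os.path.basename(href))
        with open(target, 'wb') as fh:
            fh.write(session.get(file_url).content)
        print('saved', target)

If you would rather keep certificate verification on, requests also accepts a path to a CA bundle (verify='/path/to/ca.pem', or the REQUESTS_CA_BUNDLE environment variable), which avoids disabling verification in the first place.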