How to download all PDF files from multiple URLs with Python

Posted on 2025-02-12 02:54:26

Using Python, I'd like to download all PDF files (except those whose names begin with "INS") from the website

url_asn="https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&search_content_type=&search_text={}&sort_type=date&page={}"

If link['href'] is not a PDF, open it and download any PDF files it contains. Do this for each results page, iterating through to the last page.
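
For reference, the placeholders in the template are, in order, a start year, an end year, a search string, and a page number. A quick illustration of filling them in (the 2020-2021 range and the empty search text are just example values, not part of the original question):

url_asn = "https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&search_content_type=&search_text={}&sort_type=date&page={}"
first_page = url_asn.format(2020, 2021, "", 1)
# -> https://www.asn.fr/recherche?filter_year[from]=2020&filter_year[to]=2021&limit=50&search_content_type=&search_text=&sort_type=date&page=1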

Comments (1)

纵山崖 2025-02-19 02:54:28

This will probably work? I have added a comment for every line.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = " "  # url to scrape

# If there is no such folder, the script will create one automatically
folder_location = r'/webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)                        # get the html
soup = BeautifulSoup(response.text, "html.parser")  # parse the html
for link in soup.select("a[href$='.pdf']"):         # select all the pdf links
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:                 # open the file and write the pdf
        f.write(requests.get(urljoin(url, link['href'])).content)
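
The snippet above only downloads links that already end in .pdf from a single page. A rough sketch of the remaining pieces the question asks for (skipping file names that start with "INS", opening non-PDF links to look for PDFs inside, and looping over the paginated results) is below. The .search-result selector, the 2020-2021 year range, and the empty-results stop condition are assumptions, so inspect the actual markup of the search page and adjust them.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url_asn = ("https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}"
           "&limit=50&search_content_type=&search_text={}&sort_type=date&page={}")
folder_location = r'/webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def save_pdf(pdf_url):
    # Skip file names that begin with "INS", save everything else
    name = pdf_url.split('/')[-1]
    if name.startswith("INS"):
        return
    with open(os.path.join(folder_location, name), 'wb') as f:
        f.write(requests.get(pdf_url).content)

page = 1
while True:
    page_url = url_asn.format(2020, 2021, "", page)   # example year range, empty search text
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    # ".search-result a" is a placeholder selector -- replace it after inspecting the page
    results = soup.select(".search-result a[href]")
    if not results:                                    # assumed stop condition: empty results page
        break
    for link in results:
        href = urljoin(page_url, link['href'])
        if href.lower().endswith(".pdf"):
            save_pdf(href)
        else:
            # open the non-PDF link and download any PDFs found on that page
            sub = BeautifulSoup(requests.get(href).text, "html.parser")
            for pdf in sub.select("a[href$='.pdf']"):
                save_pdf(urljoin(href, pdf['href']))
    page += 1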
