Scrape PDF links from a web page into a DataFrame with BeautifulSoup

Posted on 2025-01-11 15:08:49

I want to extract all the PDF links that take us directly to the page where all the PDFs can be downloaded, and store these PDFs in a DataFrame. Here is what I tried:

url = "https://www.volvogroup.com/en/news-and-media/press-releases.html"
source = requests.get(url)
soup = BeautifulSoup(source.text , "html.parser")
news_check = soup.find_all("a" , class_ = "articlelist__contentDownloadItem")
for i in news_check :
    print(i)
    break
    
data = set()
for i in soup.find_all('a'):
    for j in i.find_all('href'):
        pdf_link = "https://www.volvogroup.com" + j.get('.pdf')
        data.add(j)
        print(pdf_link)
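
A note on the attempt above: href is an attribute of the <a> tag, not a child element, so i.find_all('href') always returns an empty list and the inner loop body never runs. A minimal check of the difference, assuming soup from the snippet above and that the page contains at least one link with that class:

# 'href' is an attribute, not a tag, so searching for it as a tag finds nothing
a_tag = soup.find("a", class_="articlelist__contentDownloadItem")
print(a_tag.find_all("href"))  # [] -- there are no <href> child elements
print(a_tag.get("href"))       # the relative link that the answer below prepends the domain to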



昨迟人 2025-01-18 15:08:49

You can try the code below to get the PDF links:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

url = "https://www.volvogroup.com/en/news-and-media/press-releases.html"
source = requests.get(url)
soup = bs(source.text, "html.parser")

# every PDF download link on the page carries this class
news_check = soup.find_all("a", class_="articlelist__contentDownloadItem")

data = set()

for i in news_check:
    # href is an attribute of the <a> tag, so read it directly
    # instead of looping with find_all('href')
    pdf_link = "https://www.volvogroup.com" + i['href']
    data.add(pdf_link)

df = pd.DataFrame(data)
print(df)

Output:

0   https://www.volvogroup.com/content/dam/volvo-g...
1   https://www.volvogroup.com/content/dam/volvo-g...
2   https://www.volvogroup.com/content/dam/volvo-g...
3   https://www.volvogroup.com/content/dam/volvo-g...
4   https://www.volvogroup.com/content/dam/volvo-g...
5   https://www.volvogroup.com/content/dam/volvo-g...
6   https://www.volvogroup.com/content/dam/volvo-g...
7   https://www.volvogroup.com/content/dam/volvo-g...
8   https://www.volvogroup.com/content/dam/volvo-g...
9   https://www.volvogroup.com/content/dam/volvo-g...
10  https://www.volvogroup.com/content/dam/volvo-g...
11  https://www.volvogroup.com/content/dam/volvo-g...
12  https://www.volvogroup.com/content/dam/volvo-g...
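
If the goal is also to download the PDF files themselves rather than only collect the links, the answer above can be extended with a short follow-up. This is a minimal sketch, assuming data already holds the absolute PDF URLs built by the loop above; the pdfs output folder and the pdf_link column name are illustrative choices, not part of the original code:

import os
import requests
import pandas as pd

# assumes `data` is the set of absolute PDF URLs collected above
df = pd.DataFrame(sorted(data), columns=["pdf_link"])  # sorted() gives a stable row order

os.makedirs("pdfs", exist_ok=True)  # illustrative output folder

for link in df["pdf_link"]:
    filename = os.path.join("pdfs", link.rsplit("/", 1)[-1])
    resp = requests.get(link)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        f.write(resp.content)
    print("saved", filename)

Because data is a set, pd.DataFrame(data) returns the rows in arbitrary order; sorting the links first makes the resulting DataFrame reproducible between runs.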
