Python: download/scrape SSRN papers from a list of URLs

Posted 2025-01-31 18:39:16


I have a bunch of links that are the exact same except for the id at the end. All I want to do is loop through each link and download the paper as a PDF using the download as PDF button. In an ideal world, the filename would be the title of the paper but if that isn't possible I can rename them later. Getting them all downloaded is more important. I have like 200 links but I will provide 5 here for an example.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134

Is what I want to do possible? I have some familiarity with looping through URLs to scrape tables but I have never tried to do anything with a download button.

I don't have example code because I don't know where to start here. But something like

for url in urls:
    # go to each link
    # download as pdf via the "download this paper" button
    # save file as title of paper

Comments (1)

内心激荡 2025-02-07 18:39:16


Try:

import requests
from bs4 import BeautifulSoup

urls = [
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}


for url in urls:
    # fetch the abstract page and parse it
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    # the "Download This Paper" button is an <a> tag that carries a
    # data-abstract-id attribute; its href is relative to /sol3/
    pdf_url = (
        "https://papers.ssrn.com/sol3/"
        + soup.select_one("a[data-abstract-id]")["href"]
    )
    # name the file after the abstract id at the end of the URL
    filename = url.split("=")[-1] + ".pdf"

    print(f"Downloading {pdf_url} as {filename}")

    # SSRN checks the Referer header before serving the PDF
    with open(filename, "wb") as f_out:
        f_out.write(
            requests.get(pdf_url, headers={**headers, "Referer": url}).content
        )

Prints:

Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3860262_code1719241.pdf?abstractid=3860262&mirid=1 as 3860262.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2521007_code576529.pdf?abstractid=2521007&mirid=1 as 2521007.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID4066577_code104690.pdf?abstractid=3146924&mirid=1 as 3146924.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2505208_code16198.pdf?abstractid=2488552&mirid=1 as 2488552.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3506882_code16198.pdf?abstractid=3330134&mirid=1 as 3330134.pdf

and saves the PDFs as:

andrej@PC:~$ ls -alF *pdf
-rw-r--r-- 1 root root  993466 máj 24 01:10 2488552.pdf
-rw-r--r-- 1 root root 3583616 máj 24 01:10 2521007.pdf
-rw-r--r-- 1 root root 1938284 máj 24 01:10 3146924.pdf
-rw-r--r-- 1 root root  685777 máj 24 01:10 3330134.pdf
-rw-r--r-- 1 root root  939157 máj 24 01:10 3860262.pdf
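If you want the saved files named after each paper's title rather than the numeric id (as the question asks), you can pull the title from the abstract page and sanitize it for use as a filename. A minimal sketch; the `citation_title` meta tag is an assumption about SSRN's markup, so fall back to the numeric id if it is missing:

```python
import re

def safe_filename(title: str, max_len: int = 120) -> str:
    """Turn a paper title into a filesystem-safe PDF filename."""
    # drop characters that are illegal or awkward in filenames
    name = re.sub(r'[\\/:*?"<>|]', "", title)
    # collapse runs of whitespace and trim the ends
    name = re.sub(r"\s+", " ", name).strip()
    return name[:max_len] + ".pdf"

# inside the download loop, something like:
#     tag = soup.select_one('meta[name="citation_title"]')
#     filename = safe_filename(tag["content"]) if tag else url.split("=")[-1] + ".pdf"
```

With ~200 links it is also worth putting a `time.sleep(1)` or similar between requests so SSRN doesn't throttle or block you.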