Python: download/scrape SSRN papers from a list of URLs

Posted 2025-01-31 18:39:16


I have a bunch of links that are the exact same except for the id at the end. All I want to do is loop through each link and download the paper as a PDF using the download as PDF button. In an ideal world, the filename would be the title of the paper but if that isn't possible I can rename them later. Getting them all downloaded is more important. I have like 200 links but I will provide 5 here for an example.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134

Is what I want to do possible? I have some familiarity with looping through URLs to scrape tables but I have never tried to do anything with a download button.

I don't have example code because I don't know where to start here. But something like

for url in urls:
    # go to each link
    # download as pdf via the "download this paper" button
    # save file as title of paper

Comments (1)

内心激荡 2025-02-07 18:39:16


Try:

import requests
from bs4 import BeautifulSoup

urls = [
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3860262",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2521007",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3146924",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2488552",
    "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3330134",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}


for url in urls:
    # fetch the abstract page and parse it
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    # the "Download This Paper" button is an <a> tag that carries a
    # data-abstract-id attribute; its href is relative to /sol3/
    pdf_url = (
        "https://papers.ssrn.com/sol3/"
        + soup.select_one("a[data-abstract-id]")["href"]
    )
    # name the file after the abstract id at the end of the URL
    filename = url.split("=")[-1] + ".pdf"

    print(f"Downloading {pdf_url} as {filename}")

    # SSRN checks the Referer header before serving the PDF
    with open(filename, "wb") as f_out:
        f_out.write(
            requests.get(pdf_url, headers={**headers, "Referer": url}).content
        )

Prints:

Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3860262_code1719241.pdf?abstractid=3860262&mirid=1 as 3860262.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2521007_code576529.pdf?abstractid=2521007&mirid=1 as 2521007.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID4066577_code104690.pdf?abstractid=3146924&mirid=1 as 3146924.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2505208_code16198.pdf?abstractid=2488552&mirid=1 as 2488552.pdf
Downloading https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID3506882_code16198.pdf?abstractid=3330134&mirid=1 as 3330134.pdf

and saves the PDFs as:

andrej@PC:~$ ls -alF *pdf
-rw-r--r-- 1 root root  993466 máj 24 01:10 2488552.pdf
-rw-r--r-- 1 root root 3583616 máj 24 01:10 2521007.pdf
-rw-r--r-- 1 root root 1938284 máj 24 01:10 3146924.pdf
-rw-r--r-- 1 root root  685777 máj 24 01:10 3330134.pdf
-rw-r--r-- 1 root root  939157 máj 24 01:10 3860262.pdf
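If you want the saved files named after each paper's title rather than the numeric id (as the question asks), you can pull the title from the abstract page and sanitize it for use as a filename. A minimal sketch; the `citation_title` meta tag is an assumption about SSRN's markup, so fall back to the numeric id if it is missing:

```python
import re

def safe_filename(title: str, max_len: int = 120) -> str:
    """Turn a paper title into a filesystem-safe PDF filename."""
    # drop characters that are illegal or awkward in filenames
    name = re.sub(r'[\\/:*?"<>|]', "", title)
    # collapse runs of whitespace and trim the ends
    name = re.sub(r"\s+", " ", name).strip()
    return name[:max_len] + ".pdf"

# inside the download loop, something like:
#     tag = soup.select_one('meta[name="citation_title"]')
#     filename = safe_filename(tag["content"]) if tag else url.split("=")[-1] + ".pdf"
```

With ~200 links it is also worth putting a `time.sleep(1)` or similar between requests so SSRN doesn't throttle or block you.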