Download a PDF file from a direct download link using node/puppeteer/js

Posted 2025-01-11 02:09:17

I need to download some 300 files from a direct download link. When the link is opened directly in the browser, an automatic PDF download is triggered; the file gets downloaded and the browser doesn't navigate anywhere. The links are as follows:

www.link.com/store/item/123

In the link, the 123 part changes on every loop iteration.

I was thinking of using puppeteer (with goto), but I guess it fails because visiting the link automatically triggers the PDF download and doesn't actually go to a page.

This is what I tried, but it's not working at all:

const puppeteer = require('puppeteer');

const linkBeginning = 'https://www.link.com/store/item/'; // base URL as above
const links = ['123', '456'];
(async () => {
    const browser = await puppeteer.launch({
        headless: false //preferably would run with true
    });

    links.forEach( async link => {
        const page = await browser.newPage();
        await page.goto(
            linkBeginning + link
        );
    
        await browser.close();
    })
})();

I searched around, but I could not really find this specific case; all the other cases are more focused on the user side, or have the target file in the actual link (like xx/store/doc.pdf). Not quite sure whether that makes a difference, though. I just need a script that will get me the PDF files in a one-time run.

If anyone has a solution in PHP/Python, that would work as well, as this is just a one-off thing.

Edit: I ended up doing it in HTML.

<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <script src="sku.js"></script>

    <script>
        const linkStart = 'https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/';
        sku.forEach(element => {
            document.write('<a target="_blank" class="click" href="' + linkStart + element.id + '">' + element.id + '</a><br>');
        });
    </script>
</head>
<body></body>
</html>

<script>
    const clickInterval = setInterval(function () {
        const el = document.querySelector('.click:not(.clicked)');
        if (el) {
            el.classList.add('clicked');
            el.click();
        } else {
            clearInterval(clickInterval);
        }
    }, 2000);
</script>
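
For reference, sku.js isn't shown in the post; the snippets above assume it defines a global sku array of objects with an id property, roughly along these lines (a sketch only, with SKU values borrowed from the answer below):

// sku.js (assumed shape, not part of the original post)
const sku = [
    { id: '00548' },
    { id: '03575' }
];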

2 Answers

等往事风中吹 2025-01-18 02:09:17

You don't need puppeteer to do this, and you can achieve it fairly easily in NodeJS:

import http from "https";
import fs from "fs";

(async () => {
  const skus = ["00548", "03575"];

  const filesPromiseArray = skus.map(
    (sku) =>
      new Promise((resolve, reject) => {
        const file = fs.createWriteStream(`${sku}.pdf`);
        const request = http.get(`https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/${sku}`, (response) =>
          response.pipe(file)
        );
        file.on("finish", resolve);
        file.on("error", reject);
      })
  );
  try {
    await Promise.all(filesPromiseArray);
  } catch {
    console.log("There was an error downloading one of the files");
  }
})();

What is this code doing?

  1. Taking your array of skus, we are using .map() to transform them into an array of requests.
  2. Inside the .map() we're creating a Promise which will be successful (resolve) when the file finishes downloading, or unsuccessful (reject) if the download errors.
  3. We then await all of the requests that we have just created. If one of them fails, an error is logged.

Note:

If you are using CommonJS ("type": "commonjs" in your package.json), replace the two imports with:

const http = require('https');
const fs = require('fs');
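
As a side note, a minimal variation of the sketch above (same assumed URL pattern) that also rejects when the HTTP request itself fails, rather than only on file-stream errors:

import https from "https";
import fs from "fs";

// Download a single SKU to <sku>.pdf, rejecting on request or file errors.
function downloadSku(sku) {
  return new Promise((resolve, reject) => {
    const file = fs.createWriteStream(`${sku}.pdf`);
    const request = https.get(
      `https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/${sku}`,
      (response) => response.pipe(file)
    );
    request.on("error", reject); // network/DNS errors
    file.on("finish", resolve);  // file fully written
    file.on("error", reject);    // disk errors
  });
}

(async () => {
  const skus = ["00548", "03575"];
  try {
    await Promise.all(skus.map(downloadSku));
  } catch (err) {
    console.log("There was an error downloading one of the files", err);
  }
})();
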
只是我以为 2025-01-18 02:09:17

Your placement of browser.close() inside the loop isn't a good thing, so I moved it outside the forEach and changed it to page.close() instead.

const links = ['123', '456']
const linkBeginning = 'https://www.link.com/store/item/'
;(async () => {
    const browser = await puppeteer.launch({
        headless: false //preferably would run with true
    })

    links.forEach( async link => {
        const page = await browser.newPage()
        const session = await page.target().createCDPSession()
        await session.send('Page.setDownloadBehavior', {
            behavior: 'allow',
            downloadPath: './pdf/'
        })
        await page.goto(linkBeginning + link)
    
        await page.close() // Don't use browser.close() inside loop
    })
    await browser.close() // Use here instead
})()
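
One caveat worth noting as an addition to this answer: forEach does not await async callbacks, so the final browser.close() can run before the pages have finished their work. A sequential sketch using for...of avoids that; the 2-second pause is an arbitrary guess to give each download time to start, and the .catch() around goto is there because navigation is typically aborted when the response triggers a download:

const puppeteer = require('puppeteer');

const links = ['123', '456'];
const linkBeginning = 'https://www.link.com/store/item/';

(async () => {
    const browser = await puppeteer.launch({
        headless: false // preferably would run with true
    });

    for (const link of links) {
        const page = await browser.newPage();
        const session = await page.target().createCDPSession();
        await session.send('Page.setDownloadBehavior', {
            behavior: 'allow',
            downloadPath: './pdf/'
        });
        // Navigation often aborts when the response is a download, so ignore goto errors.
        await page.goto(linkBeginning + link).catch(() => {});
        // Give the download a moment to start before closing the page (arbitrary delay).
        await new Promise((resolve) => setTimeout(resolve, 2000));
        await page.close();
    }

    await browser.close();
})();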