Downloading PDF files from direct download links using Node/Puppeteer/JS
I need to download some 300 files from a direct download link. When the link is opened directly in the browser, an automatic pdf download gets triggered. The file gets downloaded and the browser doesn't go anywhere. The links are as follows:
www.link.com/store/item/123
In the link, the 123 part would be changed on every loop.
I was thinking of using Puppeteer (with goto), but I guess it fails because visiting the link automatically triggers the PDF download and doesn't actually navigate to a page.
This is what I tried, but it's not working at all:
const links = ['123', '456'];
(async () => {
  const browser = await puppeteer.launch({
    headless: false // preferably would run with true
  });
  links.forEach(async link => {
    const page = await browser.newPage();
    await page.goto(
      linkBeginning + link
    );
    await browser.close();
  })
})();
I searched around but could not find this specific case. All the other cases I found are more focused on the user side or have the target file in the actual link (like xx/store/doc.pdf), though I'm not sure whether that makes a difference. I just need a script that will get me the PDF files in a one-time run.
If anyone has a solution in PHP/Python, that would work as well, as this is just a one-off thing.
Edit: I ended up doing it in HTML:
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<script src="sku.js"></script>
<script>
const linkStart = 'https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/';
sku.forEach(element => {
document.write('<a target="_blank" class="click" href="' + linkStart + element.id + '">' + element.id + '</a><br>')
});
</script>
</head>
<body></body>
</html>
<script>
const clickInterval = setInterval(function () {
const el = document.querySelector('.click:not(.clicked)');
if(el){
el.classList.add('clicked');
el.click()
} else {
clearInterval(clickInterval);
}
}, 2000);
</script>
Comments (2)
You don't need Puppeteer to do this; you can achieve it fairly easily in Node.js:
What is this code doing?
1. We take the skus array and use .map() to transform its entries into an array of requests.
2. Inside the .map(), we create a Promise which will be successful (resolve) when the file finishes downloading, or unsuccessful (reject) if the download errors.
3. We await all of the requests that we have just created. If one of them fails, it will log.
Note: If you are using CommonJS ("type": "commonjs" in your package.json), replace the two imports with the equivalent require() calls.
Placing browser.close() inside the loop isn't a good idea: it closes the whole browser while other pages are still loading.
So I moved browser.close() outside the forEach and call page.close() inside the loop instead.