How do I recursively call multiple URLs using Puppeteer and Headless Chrome?
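I am trying to write a program that scans multiple URLs at the same time (in parallel). I have extracted the sitemap and stored it as an array in a variable, as shown below, but I am unable to open the URLs with Puppeteer. I get the following error:
originalMessage: 'Cannot navigate to invalid URL'
My code is below. Can someone please help me out?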
const sitemapper = require('@mastixmc/sitemapper');
const SitemapXMLParser = require('sitemap-xml-parser');
const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml';
/*If sitemapindex (link of xml or gz file) is written in sitemap, the URL will be accessed.
You can optionally specify the number of concurrent accesses and the number of milliseconds after processing and access to resume processing after a delay.
*/
const options = {
  delay: 3000,
  limit: 50000
};
const sitemapXMLParser = new SitemapXMLParser(url, options);
sitemapXMLParser.fetch().then(result => {
  var locs = result.map(value => value.loc)
  var locsFiltered = locs.toString().replace("[",'<br>');
  const urls = locsFiltered
  console.log(locsFiltered)
  const puppeteer = require("puppeteer");
  async function scrapeProduct(url) {
    const urls = locsFiltered
    const browser = await puppeteer.launch({
      headless: false
    });
    for (i = 0; i < urls.length; i++) {
      const page = await browser.newPage();
      const url = urls[i];
      const promise = page.waitForNavigation({
        waitUntil: "networkidle2"
      });
      await page.goto(`${url}`);
    }
  };
  scrapeProduct();
});
Comments (1)
You see the invalid URL error because you converted the array into a string with the wrong method. locs.toString() joins all the URLs into one comma-separated string, so urls[i] is a single character rather than a URL.
This line is a better one:
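As a rough sketch of what such a fix could look like (not necessarily the answer's exact line), the URLs can be kept as an array of strings instead of being joined into one string. The helper name toUrlList and the sample data are hypothetical. The sketch assumes each parser entry has a loc field that is either a string or an array of strings.

```javascript
// Hypothetical helper: turn parser entries into a flat array of URL strings.
// Assumption: each entry has a `loc` that is a string or an array of strings.
function toUrlList(entries) {
  return entries
    .flatMap(entry => entry.loc)              // arrays are flattened one level, strings pass through
    .map(loc => String(loc).trim())           // normalise to trimmed strings
    .filter(loc => /^https?:\/\//.test(loc)); // drop anything that is not an http(s) URL
}

// Made-up sample standing in for the result of sitemapXMLParser.fetch().
const sample = [
  { loc: ['https://example.com/a'] },
  { loc: 'https://example.com/b' },
  { loc: ['not a url'] },
];

const urls = toUrlList(sample);
console.log(urls);    // an array with one full URL per element
console.log(urls[0]); // a whole URL, safe to pass to page.goto()

// For contrast, the approach from the question yields one long string,
// so indexing it returns single characters.
const joined = sample.map(entry => entry.loc).toString();
console.log(joined[0]);
```

With an array like this, urls[i] in the question's loop is a complete URL, so page.goto(urls[i]) no longer receives a single character.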
So, to scrape the CNN site, I've added puppeteer-cluster for speed:
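A minimal sketch of how puppeteer-cluster might be wired in, assuming puppeteer and puppeteer-cluster are installed from npm. The function name scanAll, the options object, and the concurrency value are illustrative and not taken from the answer. Collecting page titles is only a placeholder for whatever scanning is needed.

```javascript
// Sketch: visit every URL with a small pool of workers via puppeteer-cluster.
// Assumption: `npm install puppeteer puppeteer-cluster` has been run.
// The `Cluster` option exists only so a stub can be injected; by default the
// real package is loaded lazily on the first call.
async function scanAll(
  urls,
  { Cluster = require('puppeteer-cluster').Cluster, maxConcurrency = 4 } = {}
) {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // one incognito context per worker
    maxConcurrency,                           // illustrative value, tune as needed
  });

  const titles = {};

  // The task runs once per queued URL; `data` is whatever was queued.
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'networkidle2' });
    titles[url] = await page.title();
  });

  // Log failures instead of letting one bad URL stop the whole run.
  cluster.on('taskerror', (err, data) => {
    console.error(`Failed to load ${data}: ${err.message}`);
  });

  for (const url of urls) {
    cluster.queue(url);
  }

  await cluster.idle();  // wait until the queue is drained
  await cluster.close(); // shut down the browser
  return titles;
}

// Example call (not executed here), given an array of URL strings:
// scanAll(['https://edition.cnn.com/']).then(console.log);
```

The array passed in should contain full URL strings, for example the loc values from the sitemap parser kept as an array rather than joined with toString().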