How do I recursively visit multiple URLs using Puppeteer and Headless Chrome?

Asked 2025-01-10

I am trying to write a program that scans multiple URLs at the same time (in parallel). I have extracted the sitemap and stored it in a variable as an array, as shown below, but I cannot open the URLs with Puppeteer. I get the following error:

originalMessage: 'Cannot navigate to invalid URL'

My code is below. Can someone please help me out?

const sitemapper = require('@mastixmc/sitemapper');
const SitemapXMLParser = require('sitemap-xml-parser');
const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml';

/*If sitemapindex (link of xml or gz file) is written in sitemap, the URL will be accessed.
You can optionally specify the number of concurrent accesses and the number of milliseconds after processing and access to resume processing after a delay.
*/

const options = {
    delay: 3000,
    limit: 50000
};

const sitemapXMLParser = new SitemapXMLParser(url, options);

sitemapXMLParser.fetch().then(result => {
    var locs = result.map(value => value.loc)   
    var locsFiltered = locs.toString().replace("[",'<br>');
    const urls = locsFiltered
    console.log(locsFiltered)
   

const puppeteer = require("puppeteer");

async function scrapeProduct(url) {
    const urls = locsFiltered
    const browser = await puppeteer.launch({
        headless: false
    });
    for (i = 0; i < urls.length; i++) {
        const page = await browser.newPage();
        const url = urls[i];
        const promise = page.waitForNavigation({
            waitUntil: "networkidle2"
        });
        await page.goto(`${url}`);
    }};
   
    scrapeProduct();
    
}); 



1 Answer

渡你暖光 2025-01-17 10:48:21

You are seeing the invalid URL because you converted the array into a URL string the wrong way.
These lines work better:

// var locsFiltered = locs.toString().replace("[",'<br>') // This is wrong
// const urls = locsFiltered                              // So value is invalid
// console.log(locsFiltered)

const urls = locs.map(value => value[0])  // This is better
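
To see why the original version fails: `locs.toString()` collapses the whole array into one comma-separated string, and indexing a string with `urls[i]` returns single characters, so `page.goto()` was being called with values like 'h'. A minimal sketch of the difference, assuming each `loc` comes back as a one-element array (which the `value[0]` access above suggests); the URLs are illustrative:

// Hypothetical sitemap entries, shaped like the parser output used below.
const locs = [['https://edition.cnn.com/world'], ['https://edition.cnn.com/politics']]

// Original approach: one big comma-separated string.
const flattened = locs.toString()
console.log(flattened[0])  // 'h' -- a single character, hence "Cannot navigate to invalid URL"

// Better approach: keep an array of real URL strings.
const urls = locs.map(value => value[0])
console.log(urls[0])       // 'https://edition.cnn.com/world'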

So, to scrape the CNN site, I've added puppeteer-cluster for speed:

const { Cluster } = require('puppeteer-cluster')
const sitemapper = require('@mastixmc/sitemapper')
const SitemapXMLParser = require('sitemap-xml-parser')
const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml'


async function scrapeProduct(locs) {
    const urls = locs.map(value => value[0])
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2, // You can set this to any number you like
        puppeteerOptions: {
            headless: false,
            devtools: false,
            args: [],
        }
    })

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, {timeout: 0, waitUntil: 'networkidle2'})
        const screen = await page.screenshot()
        // Store screenshot, do something else
    })

    // Declare the loop index with `let` (the original leaked a global `i`).
    for (let i = 0; i < urls.length; i++) {
        console.log(urls[i])
        await cluster.queue(urls[i])
    }

    await cluster.idle()
    await cluster.close()
}

/******
If sitemapindex (link of xml or gz file) is written in sitemap, the URL will be accessed.
You can optionally specify the number of concurrent accesses and the number of milliseconds after processing and access to resume processing after a delay.
*******/
const options = {
    delay: 3000,
    limit: 50000
}
const sitemapXMLParser = new SitemapXMLParser(url, options)
sitemapXMLParser.fetch().then(async result => {
    var locs = result.map(value => value.loc)
    await scrapeProduct(locs)
})
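
One practical note: when crawling a large sitemap, some pages will inevitably time out or fail to load. puppeteer-cluster can report per-task errors without stopping the rest of the queue; a minimal sketch, registered on the cluster before queueing URLs (the log format is illustrative):

cluster.on('taskerror', (err, data) => {
    // `data` is whatever was passed to cluster.queue(), i.e. the URL here.
    console.log(`Error crawling ${data}: ${err.message}`)
})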
