Why does Pyppeteer take so long to load a web page on AWS Lambda?

Posted on 2025-01-23 05:52:06

I am currently trying to crawl MVN Repository using Pyppeteer on AWS Lambda. However, my test function runs for the full 15 minutes and then fails with a timeout (see below). It seems the browser is launched, but the page is never crawled.

Here is my current code:

import json
import asyncio
from pyppeteer import launch
import pyppeteer
import zipfile
import boto3
import time
# import pandas as pd
import os
import logging
import subprocess
from pyppeteer.launcher import Launcher

logger = logging.getLogger()
logger.setLevel(logging.INFO)

pyppeteer.DEBUG = True

async def main(name, url):
    browser = await launch(headless=True, args=["--no-sandbox"], executablePath="/opt/python/headless-chromium")
    page = await browser.newPage()
    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36')
    await page.goto(url)


def lambda_handler(event, context):
    asyncio.get_event_loop().run_until_complete(main('lol','https://mvnrepository.com/artifact/com.adobe.xmp/xmpcore'))
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

The layers for this are:

The following is the output after the function has timed out:

Test Event Name
dd

Response
{
  "errorMessage": "2022-04-22T06:28:32.470Z e9be66b9-1fd0-4df9-a0b4-9815067169cd Task timed out after 900.10 seconds"
}

Function Logs
START RequestId: e9be66b9-1fd0-4df9-a0b4-9815067169cd Version: $LATEST
[INFO]  2022-04-22T06:13:32.424Z    e9be66b9-1fd0-4df9-a0b4-9815067169cd    Found credentials in environment variables.
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:51625/devtools/browser/1651a2a3-9b53-4f0a-883f-4850a6d693ed
END RequestId: e9be66b9-1fd0-4df9-a0b4-9815067169cd
REPORT RequestId: e9be66b9-1fd0-4df9-a0b4-9815067169cd  Duration: 900104.69 ms  Billed Duration: 900000 ms  Memory Size: 10240 MB   Max Memory Used: 364 MB Init Duration: 490.52 ms    
2022-04-22T06:28:32.470Z e9be66b9-1fd0-4df9-a0b4-9815067169cd Task timed out after 900.10 seconds

Request ID
e9be66b9-1fd0-4df9-a0b4-9815067169cd

Apart from the method I tried earlier, I also followed the following tutorials but to no avail:

P.S. I am able to run the above script with no issues on my localhost
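
One way to narrow this down is to make the function fail fast instead of waiting for the 15-minute Lambda limit. The sketch below is only an assumption about how that could look, not a confirmed fix: `run_with_deadline` and `fetch_title` are hypothetical helper names, the 30-second and 60-second limits are arbitrary, and the Chromium path and `--no-sandbox` flag are copied from the code above. It puts a timeout on `page.goto`, wraps the whole browser session in `asyncio.wait_for` (so a hang in `newPage` is also caught), and always closes the browser once it has been launched.

```python
import asyncio
import json

# Same binary path and flag as in the question; adjust to your own layer.
CHROMIUM_PATH = "/opt/python/headless-chromium"


async def run_with_deadline(coro, seconds):
    """Await coro, returning None if it does not finish within `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def fetch_title(url, nav_timeout_ms=30_000):
    # Imported lazily so run_with_deadline can be exercised without Chromium.
    from pyppeteer import launch

    browser = await launch(
        headless=True,
        args=["--no-sandbox"],
        executablePath=CHROMIUM_PATH,
    )
    try:
        page = await browser.newPage()
        # Stop waiting for navigation after nav_timeout_ms milliseconds.
        await page.goto(url, timeout=nav_timeout_ms, waitUntil="domcontentloaded")
        return await page.title()
    finally:
        # Runs on success, on error, and when the deadline cancels the task.
        await browser.close()


def lambda_handler(event, context):
    url = "https://mvnrepository.com/artifact/com.adobe.xmp/xmpcore"
    title = asyncio.run(run_with_deadline(fetch_title(url), seconds=60))
    status = 200 if title is not None else 504
    return {"statusCode": status, "body": json.dumps({"title": title})}
```

With the overall deadline in place, a hang shows up as a 504 response after about a minute, which makes it cheaper to experiment with other URLs or launch arguments.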

Comments (1)

乖乖兔^ω^ 2025-01-30 05:52:06

I built a similar configuration, but using pyppeteer 1.0.2. When I tried to generate a PDF file from the URL you mentioned (mvnrepository), I got an ugly captcha page instead of the content. Have you tried crawling other websites? The captcha could be the problem.

Please let me know if you found a workaround.
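
To check whether a captcha is what the function is actually receiving, something like the following rough sketch could be run against the HTML returned by `await page.content()`, both for mvnrepository and for some other site. The helper name `looks_like_captcha` and the marker strings are illustrative guesses, not a reliable detector; a normal page that merely embeds a captcha widget would also match.

```python
# Illustrative marker strings only; real challenge pages may use other wording.
CAPTCHA_MARKERS = ("captcha", "verify you are human", "just a moment")


def looks_like_captcha(html, markers=CAPTCHA_MARKERS):
    """Return True if the page HTML contains any of the marker strings."""
    text = html.lower()
    return any(marker in text for marker in markers)
```

If this returns True for mvnrepository but False for another site loaded from the same Lambda, the setup itself is probably working and the site is blocking the request.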
