Why do my Selenium webdrivers crash/become unresponsive in Lambda?
from selenium import webdriver

# Chrome flags commonly used to keep headless Chrome stable in a Lambda environment
lambda_options = [
    '--autoplay-policy=user-gesture-required',
    '--disable-background-networking',
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-breakpad',
    '--disable-client-side-phishing-detection',
    '--disable-default-apps',
    '--disable-dev-shm-usage',
    '--disable-extensions',
    '--disable-features=AudioServiceOutOfProcess',
    '--disable-hang-monitor',
    '--disable-notifications',
    '--disable-offer-store-unmasked-wallet-cards',
    '--disable-print-preview',
    '--disable-prompt-on-repost',
    '--disable-speech-api',
    '--disable-sync',
    '--ignore-gpu-blacklist',
    '--ignore-certificate-errors',
    '--mute-audio',
    '--no-default-browser-check',
    '--no-first-run',
    '--no-pings',
    '--no-sandbox',
    '--no-zygote',
    '--password-store=basic',
    '--use-gl=swiftshader',
    '--use-mock-keychain',
    '--single-process',
    '--headless']

options = webdriver.ChromeOptions()
for argument in lambda_options:
    options.add_argument(argument)

# Launch 10 Chrome instances in a single invocation
# ('path' is a placeholder for the chromedriver binary location)
process_ids = list(range(10))
drivers = {i: webdriver.Chrome(executable_path='path', options=options)
           for i in process_ids}
So these are my Chrome options and how I set them up to have 10 instances running in a single Lambda invocation. When I run it non-headless on my PC, the crawler misses very few sites due to page-load errors or Selenium not responding, but in Lambda I am missing a ton of data. What can I do to rectify this?
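For context, here is a simplified sketch of the kind of per-site loop this setup implies; the site list, the timeout value, and the scrape_page stub are illustrative placeholders, not the actual crawler code:

# Hypothetical per-site loop: record failures instead of silently losing data
SITES = ['https://example.com/page1', 'https://example.com/page2']

def scrape_page(driver):
    # Placeholder parse step; the real crawler extracts site-specific data
    return {'url': driver.current_url, 'title': driver.title}

results, failures = [], []
for n, url in enumerate(SITES):
    driver = drivers[n % len(drivers)]  # spread work across the 10 instances
    try:
        driver.set_page_load_timeout(30)  # fail fast instead of hanging
        driver.get(url)
        results.append(scrape_page(driver))
    except Exception as exc:  # page-load errors / unresponsive driver
        failures.append((url, repr(exc)))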
I am scraping mostly with Python Selenium, and some pages with BeautifulSoup, and the sites I visit require some actions to be performed before I can grab the data I want.
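To illustrate the action-then-parse pattern I mean, here is a minimal sketch; the button selector, the results selector, and the wait times are hypothetical stand-ins for the real interactions:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def grab_after_actions(driver, url):
    # Perform the required interaction, then parse the rendered HTML
    driver.get(url)
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.load-more')))
    button.click()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.results')))
    # Hand the rendered page to BeautifulSoup for the actual extraction
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    return [row.get_text(strip=True) for row in soup.select('div.results li')]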