Why do my Selenium webdrivers crash/become unresponsive in Lambda?
from selenium import webdriver

# Chrome flags commonly used to keep headless Chrome stable in a Lambda environment
lambda_options = [
    '--autoplay-policy=user-gesture-required',
    '--disable-background-networking',
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-breakpad',
    '--disable-client-side-phishing-detection',
    '--disable-default-apps',
    '--disable-dev-shm-usage',
    '--disable-extensions',
    '--disable-features=AudioServiceOutOfProcess',
    '--disable-hang-monitor',
    '--disable-notifications',
    '--disable-offer-store-unmasked-wallet-cards',
    '--disable-print-preview',
    '--disable-prompt-on-repost',
    '--disable-speech-api',
    '--disable-sync',
    '--ignore-gpu-blacklist',
    '--ignore-certificate-errors',
    '--mute-audio',
    '--no-default-browser-check',
    '--no-first-run',
    '--no-pings',
    '--no-sandbox',
    '--no-zygote',
    '--password-store=basic',
    '--use-gl=swiftshader',
    '--use-mock-keychain',
    '--single-process',
    '--headless']

options = webdriver.ChromeOptions()
for argument in lambda_options:
    options.add_argument(argument)

# Launch 10 Chrome instances in a single invocation
# ('path' is a placeholder for the chromedriver binary location)
process_ids = list(range(10))
drivers = {i: webdriver.Chrome(executable_path='path', options=options)
           for i in process_ids}
So these are my Chrome options and how I set them up to have 10 instances running in a single Lambda invocation. When I run it non-headless on my PC, the crawler misses very few sites due to page-load errors or Selenium not responding, but in Lambda I am missing a ton of data. What can I do to rectify this?
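For context, here is a simplified sketch of the kind of per-site loop this setup implies; the site list, the timeout value, and the scrape_page stub are illustrative placeholders, not the actual crawler code:

# Hypothetical per-site loop: record failures instead of silently losing data
SITES = ['https://example.com/page1', 'https://example.com/page2']

def scrape_page(driver):
    # Placeholder parse step; the real crawler extracts site-specific data
    return {'url': driver.current_url, 'title': driver.title}

results, failures = [], []
for n, url in enumerate(SITES):
    driver = drivers[n % len(drivers)]  # spread work across the 10 instances
    try:
        driver.set_page_load_timeout(30)  # fail fast instead of hanging
        driver.get(url)
        results.append(scrape_page(driver))
    except Exception as exc:  # page-load errors / unresponsive driver
        failures.append((url, repr(exc)))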
I am scraping mostly with Python Selenium, and some pages with BeautifulSoup, and the sites I visit require some actions to be performed before I can grab the data I want.
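To illustrate the action-then-parse pattern I mean, here is a minimal sketch; the button selector, the results selector, and the wait times are hypothetical stand-ins for the real interactions:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def grab_after_actions(driver, url):
    # Perform the required interaction, then parse the rendered HTML
    driver.get(url)
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.load-more')))
    button.click()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.results')))
    # Hand the rendered page to BeautifulSoup for the actual extraction
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    return [row.get_text(strip=True) for row in soup.select('div.results li')]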