AWS Lambda - Python web scraping: unable to bypass Cloudflare anti-bot protection from an AWS IP, but works from a local IP

Posted on 2025-02-09 14:36:19

I've built a simple Python web scraper that works as expected locally but does not work on AWS Lambda, specifically and only for the website I would like to scrape. I've tested just the scraping portion of the code and can confirm that it is a Cloudflare anti-bot issue.

I've combed through relevant SO questions and Medium articles and tried:

adding the appropriate headers
specifying a user agent
using different libraries (urllib, cloudscraper, selenium)
using a virtual display (pyvirtualdisplay with Xvfb), following this post: How to bypass Cloudflare bot protection in selenium

Example code for the urllib version, to illustrate the problem:

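import urllib.error
import urllib.request

def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    headers = {
        'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    }
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; keep the status and body (e.g. the 403 challenge page)
        status, body = err.code, err.read()
    # Lambda needs a JSON-serializable return value, so decode the raw bytes
    return {'statusCode': status, 'body': body.decode('utf-8', errors='replace')}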

The code above gets a 403 response along with a reCAPTCHA challenge page.
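
To tell a Cloudflare block apart from an ordinary 403 sent by the site itself, the response headers can be inspected. The following is a minimal sketch, not a definitive check: the function name `looks_like_cloudflare_block` is hypothetical, and the rule "a 403 or 503 served by Cloudflare counts as a block" is an assumption. It relies on the `Server: cloudflare`, `CF-RAY`, and `cf-mitigated: challenge` headers that Cloudflare attaches to its responses.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristic: True if a response looks like a Cloudflare block or challenge."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {str(k).lower(): str(v).lower() for k, v in headers.items()}

    # Responses that pass through Cloudflare carry these headers.
    served_by_cloudflare = "cloudflare" in h.get("server", "") or "cf-ray" in h
    if not served_by_cloudflare:
        return False

    # Cloudflare marks challenge pages explicitly with this header.
    if h.get("cf-mitigated") == "challenge":
        return True

    # Assumption: a 403/503 served by Cloudflare is treated as a block.
    return status in (403, 503)
```

With urllib, the function could be called as `looks_like_cloudflare_block(err.code, err.headers)` inside an `except urllib.error.HTTPError as err:` branch.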

I understand that anti-bot systems treat data center IP ranges more strictly than residential IPs. Is there any workaround for this?
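
If the data center IP is indeed the deciding factor, one way to test that hypothesis is to send the same request through a different egress IP. The sketch below shows how urllib can route a request through an HTTP proxy. The helper names and the `SCRAPER_PROXY_URL` environment variable are hypothetical, no particular proxy service is implied, and changing the egress IP alone may not be enough if the site also serves a JavaScript challenge.

```python
import os
import urllib.request


def build_opener_via_proxy(proxy_url):
    """Return a urllib opener that sends HTTP and HTTPS traffic through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy_handler)


def fetch(url, user_agent, proxy_url=None):
    """Fetch url, optionally through a proxy; returns (status, decoded body)."""
    # SCRAPER_PROXY_URL is a hypothetical variable name, e.g. set in the Lambda config.
    proxy_url = proxy_url or os.environ.get("SCRAPER_PROXY_URL")
    if proxy_url:
        opener = build_opener_via_proxy(proxy_url)
    else:
        opener = urllib.request.build_opener()
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(req, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Comparing the result of `fetch(...)` with and without a proxy from inside Lambda would show whether the 403 follows the IP or the request itself.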

Thank you in advance.
