Web scraping keeps getting blocked by the website even after using proxies

Posted on 2025-01-22 15:20:40


I am scraping the website craigslist.org, but after a certain number of requests it keeps blocking my device. I tried the solution in "Proxies with Python 'Requests' module" but didn't understand how to specify the headers every time. Here's the code:

from bs4 import BeautifulSoup
import requests, json

list_of_tuples_with_given_zipcodes = []
id_of_apartments = []

params = {
    'sort': 'dd',
    'filter': 'reviews-dd',
    'res_id': 18439027
}

http_proxy  = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy   = "ftp://10.10.1.10:3128"

proxies = {
    "http":  http_proxy,
    "https": https_proxy,
    "ftp":   ftp_proxy
}

for i in range(1, 30):
    # e.g. https://losangeles.craigslist.org/search/apa?s=120
    content = requests.get('https://losangeles.craigslist.org/search/apa?s=' + str(i), params=params)
    soup = BeautifulSoup(content.content, 'html.parser')
    # collect the result links (a set could be used so a page is not visited twice)
    URL_to_look_for_zipcode = soup.find_all("a", {"class": "result-title"})
    for each_href in URL_to_look_for_zipcode:
        content_href = requests.get(each_href['href'])
        soup_href = BeautifulSoup(content_href.content, 'html.parser')
        # the posting data lives in <script id="ld_posting_data" type="application/ld+json">
        my_script_tags = soup_href.find("script", {"id": "ld_posting_data"})
        if my_script_tags:
            res = json.loads(my_script_tags.string)
            if res and 'address' in res:
                if res['address']['postalCode'] == "90012":    # use the input zipcode entered by the user
                    list_of_tuples_with_given_zipcodes.append(each_href['href'])

I am still not sure about the value of the http_proxy variable. I specified it as given in that answer, but should it instead be my device's IP address mapped to the localhost port number? Either way, the site still keeps blocking my code.

Please help.


Comments (1)

野侃 2025-01-29 15:20:40

requests' get method lets you specify the proxies to use on the call:

r = requests.get(url, headers=headers, proxies=proxies)
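
Note that in the code in the question, the proxies dict is defined but never passed to requests.get, so every request still goes out directly from your own IP. Below is a minimal sketch of wiring it in; the proxy address and User-Agent string are placeholders you would replace with real working values, and the Session is one way to avoid re-specifying the headers on every call (which the question asked about):

import requests

# Placeholder proxy endpoints -- replace with a real, reachable proxy server.
# This should be the address of the proxy itself, not your own device's IP
# mapped to a localhost port.
proxies = {
    "http":  "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:3128",
}

# A browser-like User-Agent; the default "python-requests/x.y" one is easy to block.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# A Session applies the same headers and proxies to every request it makes,
# so you don't have to pass them on each call.
session = requests.Session()
session.headers.update(headers)
session.proxies.update(proxies)

r = session.get("https://losangeles.craigslist.org/search/apa",
                params={"s": 120}, timeout=10)
print(r.status_code)

If the proxy itself is dead or blocked, requests will raise a ProxyError, so it is worth testing the proxy with a single request before running the whole scrape.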
