Web scraping: site keeps blocking me even after specifying a proxy server
I am scraping craigslist.org, but after a certain number of requests it keeps blocking my device. I tried the solution from Proxies with Python 'Requests' module, but I didn't understand how to specify the headers on every request. Here's the code:
from bs4 import BeautifulSoup
import requests, json

list_of_tuples_with_given_zipcodes = []
id_of_apartments = []

params = {
    'sort': 'dd',
    'filter': 'reviews-dd',
    'res_id': 18439027
}

http_proxy = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy = "ftp://10.10.1.10:3128"

proxies = {
    "http": http_proxy,
    "https": https_proxy,
    "ftp": ftp_proxy
}

for i in range(1, 30):
    content = requests.get('https://losangeles.craigslist.org/search/apa?s=' + str(i), params=params)  # https://losangeles.craigslist.org/search/apa?s=120
    # content = requests.get('https://www.zillow.com/homes/for_rent/')
    soup = BeautifulSoup(content.content, 'html.parser')
    my_anchors = list(soup.find_all("a", {"class": "result-image gallery"}))
    for index, each_anchor_tag in enumerate(my_anchors):
        URL_to_look_for_zipcode = soup.find_all("a", {"class": "result-title"})  # taking a set so that a page is not visited twice.
        for each_href in URL_to_look_for_zipcode:
            content_href = requests.get(each_href['href'])  # <script id="ld_posting_data" type="application/ld+json">
            soup_href = BeautifulSoup(content_href.content, 'html.parser')
            my_script_tags = soup_href.find("script", {"id": "ld_posting_data"})
            if my_script_tags:
                res = json.loads(str(list(my_script_tags)[0]))
                if res and 'address' in list(res.keys()):
                    if res['address']['postalCode'] == "90012":  # use the zipcode entered by the user.
                        list_of_tuples_with_given_zipcodes.append(each_href['href'])
I am still not sure about the value of the http_proxy variable. I specified it exactly as given in that answer, but should it instead be my device's IP address mapped to a localhost port number? The site still keeps blocking my requests. Please help.
1 Answer
requests' get method lets you specify the proxies to use for a call:

r = requests.get(url, headers=headers, proxies=proxies)
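For example, here is a minimal sketch using a requests.Session, so the headers and proxies are attached once and reused on every call; that also covers the "how to specify the headers every time" part of the question. The User-Agent string and the proxy addresses below are placeholders, not working values:

import requests

# Placeholder proxy endpoints -- substitute proxies you actually control or rent.
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.11:1080",
}

# A browser-like User-Agent; the exact string here is only an example.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}

session = requests.Session()
session.headers.update(headers)   # sent with every request made through this session
session.proxies.update(proxies)   # every request is routed through the proxy

for i in range(1, 30):
    r = session.get("https://losangeles.craigslist.org/search/apa",
                    params={"s": i}, timeout=10)
    r.raise_for_status()

Note that defining the proxies dict by itself does nothing: requests only routes traffic through a proxy when the dict is passed as proxies= on the call, or set on a Session as above. That is why your code was still being blocked even though the dict was defined.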