Scraping with Beautiful Soup and getting the error "Access to this page has been denied"
I am parsing HTML and scraping Trulia with Beautiful Soup in Python. I am fairly new to Python and feel as though my code is correct, but I keep getting "access denied". I assume this is because I am hitting the website too many times, which is why I tried a sleep function, but even then I am denied. I want to use a for loop to scrape multiple pages in one run; I can still scrape one page at a time, but whenever I try to scrape several pages with the for loop I get access denied.
```
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse
import time

real_estate_new = pd.DataFrame(columns=['Address', 'Beds', 'Baths', 'Price', 'sqft'])
address = []
beds = []
baths = []
prices = []
sqft = []

for i in range(1, 6):
    time.sleep(5)
    website = requests.get('https://www.trulia.com/for_sale/Knoxville,TN/1p_beds/' + str(i) + '_p/')
    # print('https://www.trulia.com/for_sale/Knoxville,TN/1p_beds/' + str(i) + '_p/')
    soup = BeautifulSoup(website.content, 'html.parser')
    result = soup.find_all('li', {'class': 'Grid__CellBox-sc-144isrp-0 SearchResultsList__WideCell-b7y9ki-2 jiZmPM'})
    result_update = [k for k in result if k.has_attr('data-testid')]
    for result in result_update:
        try:
            address.append(result.find('div', {'data-testid': 'property-address'}).get_text())
        except:
            address.append('n/a')
        print(address)
        try:
            beds.append(result.find('div', {'data-testid': 'property-beds'}).get_text())
        except:
            beds.append('n/a')
        try:
            baths.append(result.find('div', {'data-testid': 'property-baths'}).get_text())
        except:
            baths.append('n/a')
        try:
            prices.append(result.find('div', {'data-testid': 'property-price'}).get_text())
        except:
            prices.append('n/a')
        try:
            # note: this looks up 'property-price' again, so sqft just duplicates the price
            sqft.append(result.find('div', {'data-testid': 'property-price'}).get_text())
        except:
            sqft.append('n/a')

for j in range(len(address)):
    # DataFrame.append was removed in pandas 2.0; with newer pandas, collect the rows
    # as a list of dicts and build the DataFrame once at the end instead
    real_estate_new = real_estate_new.append({'Address': address[j], 'Beds': beds[j],
                                              'Baths': baths[j], 'Price': prices[j],
                                              'sqft': sqft[j]}, ignore_index=True)

print(soup.prettify())
```
1 Answer
I would suggest using GraphQL. First we need a payload for the query; inside it we can change the page, the city, and everything else we want to search for. I will give an example for the first page with a limit of 190, for the city of Knoxville, TN.
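The payload itself was not captured above, so the block below is only a minimal sketch of the shape such a payload takes. The query document and the variable names are placeholders, not Trulia's real schema; the actual query should be copied from the request the site's own search page sends (visible in the browser's dev tools under the Network tab).
```
# Hypothetical GraphQL payload: the query text and variable names below are
# placeholders, not Trulia's real schema. Copy the actual query from the
# request the site itself sends (browser dev tools -> Network).
payload = {
    "query": """
        query Search($searchUrl: String!, $limit: Int!, $offset: Int!) {
            # selection set copied from the site's own request goes here
        }
    """,
    "variables": {
        "searchUrl": "/for_sale/Knoxville,TN/1p_beds/",  # city and filters to search
        "limit": 190,   # page size from the example above
        "offset": 0,    # 0 = first page; raise it to move to later pages
    },
}
```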
Now we need to set up the headers:
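As a sketch, a browser-like header set usually looks something like the block below; the exact headers (plus any cookies or tokens) the site insists on are an assumption here and may differ.
```
# Browser-like headers; the bare default python-requests User-Agent is often
# exactly what earns the "Access to this page has been denied" response.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0 Safari/537.36",
    "Content-Type": "application/json",
}
```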
A POST request then returns a huge amount of information, including all of the fields you were collecting in your example (address, beds, baths, price, sqft).
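A sketch of the request itself, assuming the payload and headers above; the `/graphql` endpoint URL and the layout of the JSON response are assumptions, so check both against what the browser actually sends and receives.
```
import requests

# Assumed endpoint; verify the real URL in the browser's network tab.
url = "https://www.trulia.com/graphql"

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()
data = response.json()

# The response structure is not documented here; inspect `data` to find the keys
# that hold the address, beds, baths, price and square footage for each listing.
print(list(data.keys()))
```
The upside of this route is that the data comes back as JSON, so you no longer depend on the generated class names (`Grid__CellBox-sc-144isrp-0 ...`) that your `find_all` call relies on and that can change whenever the site is rebuilt.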