Trouble quickly selecting a functional proxy from a list of proxies
I've created a scraper using the requests module that rotates proxies (taken from a free proxy site) to fetch content from Yellow Pages.
The script appears to work correctly, but it is terribly slow because it takes a long time to find a working proxy. I've tried to reuse the same working proxy (once found) until it dies, and for that I had to declare proxies and proxy_url as global.
Although shop_name and categories are available on the landing pages, I scraped both of them from the inner pages so that the script can demonstrate that it uses the same working proxy (once it finds one) multiple times.
This is the script I'm trying with:
import random
import requests
from bs4 import BeautifulSoup

base = 'https://www.yellowpages.com{}'
link = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

def get_proxies():
    # Scrape the free proxy list and build requests-style proxy dicts
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    for item in soup.select("table.table tbody tr"):
        if not item.select_one("td"):
            break
        ip = item.select_one("td").text
        port = item.select_one("td:nth-of-type(2)").text
        proxies.append(f"{ip}:{port}")
    return [{'https': f'http://{x}'} for x in proxies]

def fetch_resp(link, headers):
    # Keep using the current proxy until it fails, then pop a new random one
    global proxies, proxy_url
    while True:
        print("currently being used:", proxy_url)
        try:
            res = requests.get(link, headers=headers, proxies=proxy_url, timeout=10)
            print("status code", res.status_code)
            assert res.status_code == 200
            return res
        except Exception:
            proxy_url = proxies.pop(random.randrange(len(proxies)))

def fetch_links(link, headers):
    # Collect the business detail-page links from the search results page
    res = fetch_resp(link, headers)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".v-card > .info a.business-name"):
        yield base.format(item.get("href"))

def get_content(link, headers):
    # Scrape the shop name and categories from an inner (detail) page
    res = fetch_resp(link, headers)
    soup = BeautifulSoup(res.text, "lxml")
    shop_name = soup.select_one(".sales-info > h1.business-name").get_text(strip=True)
    categories = ' '.join([i.text for i in soup.select(".categories > a")])
    return shop_name, categories

if __name__ == '__main__':
    proxies = get_proxies()
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    for inner_link in fetch_links(link, headers):
        print(get_content(inner_link, headers))
How can I quickly select a functional proxy from a list of proxies?
1 Answer
Please let me point out that using free proxy IP addresses can be highly problematic. These types of proxies are notorious for connection issues, such as timeouts related to latency. These sites can also be intermittent, which means they can go down at any time. And sometimes these sites are abused, so they can get blocked.
With that being said, below are multiple methods that can be used to accomplish your use case related to scraping content from the Yellow Pages.
UPDATE 07-11-2022 16:47 GMT
I tried a different proxy validation method this morning. It is slightly faster than the proxy judge method. The issue with both of these methods is error handling: I have to catch all of the errors below when validating a proxy IP address and when passing a validated address to your function fetch_resp. Occasionally a proxy fails while extracting from a page, which causes a delay. There is nothing you can do to prevent these failures; the only thing you can do is catch the error and reprocess the request.
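The error list referred to above isn't reproduced on this page. As a rough sketch, not the answer's actual code, catching the typical proxy failures and reprocessing the request with a different proxy could look like this (fetch_with_retry and proxy_pool are illustrative names):
import random
import requests

# Illustrative sketch: retry the request with a different proxy whenever the
# current one fails (ProxyError, ConnectTimeout, ReadTimeout, SSLError and
# similar errors are all subclasses of RequestException).
def fetch_with_retry(url, headers, proxy_pool, max_attempts=10):
    proxy = random.choice(proxy_pool)
    for _ in range(max_attempts):
        try:
            res = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            res.raise_for_status()
            return res
        except requests.exceptions.RequestException as exc:
            print(f"{exc.__class__.__name__} -- rotating proxy")
            proxy = random.choice(proxy_pool)
    raise RuntimeError("no working proxy found after several attempts")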
I was able to improve the extraction time by adding threading to the function get_content. The only way you can increase the speed of your code is to redesign it to query each page element at the same time; if you don't, this is a timing bottleneck.
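The threaded version of get_content isn't shown on this page. A minimal sketch of the idea, reusing fetch_links and get_content from the question (max_workers is an arbitrary choice):
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(search_url, headers):
    # Fan the per-shop requests out across a thread pool so the inner pages
    # are fetched concurrently instead of one after another.
    inner_links = list(fetch_links(search_url, headers))
    results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(get_content, url, headers): url for url in inner_links}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                print(f"failed on {futures[future]}: {exc}")
    return results

Note that the question's fetch_resp mutates a global proxy_url, so a lock (or a separate proxy per thread) would be needed before calling it from multiple threads.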
Here is the code that I used to validate the proxy addresses.
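That validation code isn't included on this page either. One common way to check a single proxy, sketched here under the assumption of an IP-echo endpoint such as https://httpbin.org/ip (not necessarily what the answer used):
import requests

def is_working(proxy, timeout=5):
    # Treat any error or non-200 response as "this proxy is not functional".
    try:
        res = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=timeout)
        return res.status_code == 200
    except requests.exceptions.RequestException:
        return False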
UPDATE 07-10-2022 23:53 GMT
I did some more research into this question. I have noted that the website https://www.sslproxies.org provides a list of 100 HTTPS proxies. Of those, fewer than 20% pass the proxy judge test. Even after obtaining that 20%, some will still fail when passed to your function fetch_resp. They can fail for multiple reasons, including ConnectTimeout, MaxRetryError, ProxyError, etc. When that happens you can rerun the function with the same link (url) and headers and a new proxy. The best workaround for these errors is to use a commercial proxy service. In my latest test I was able to obtain a list of potentially functional proxies and extract all the content from all 25 pages related to your search. Below is the timedelta for this test:
I can speed this up if I use threading with the function fetch_resp. Below is the current code that I'm using. I need to improve the error handling, but it currently works.
UPDATE 07-06-2022 11:02 GMT
This seems to be your core question:
First, all my previous code is able to validate that a proxy is working at a given moment in time. Once validated I'm able to query and extract data from your Yellow Pages search for pizza in Los Angeles.
Using my previous method I'm able to query and extract data for all 24 pages related to your search in 0:00:45.367209 seconds.
Back to your question.
The website https://www.sslproxies.org provides a list of 100 HTTPS proxies. There is zero guarantee that all 100 are currently operational. One way to identify the working ones is to use a Proxy Judge service.
In my previous code I continually selected a random proxy from the list of 100 and passed it to a Proxy Judge for validation. Once a proxy was validated as working, it was used to query and extract data from Yellow Pages.
The method above works, but I was wondering how many of the 100 proxies pass the sniff test for the Proxy Judge service. I attempted to check using a basic for loop, which was deathly slow. I decided to use concurrent.futures to speed up the validation. The code below takes about 1 minute to obtain a list of HTTPS proxies and validate them using a Proxy Judge service.
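The concurrent validation code isn't reproduced here. A rough sketch of the approach described, with PROXY_JUDGE_URL as a placeholder for whichever proxy judge service is used and max_workers as an arbitrary choice:
from concurrent.futures import ThreadPoolExecutor
import requests

PROXY_JUDGE_URL = 'http://proxy-judge.example/azenv.php'  # placeholder URL

def check(proxy):
    # Return the proxy if a request routed through it reaches the judge, else None.
    try:
        res = requests.get(PROXY_JUDGE_URL, proxies=proxy, timeout=5)
        return proxy if res.status_code == 200 else None
    except requests.exceptions.RequestException:
        return None

def validate_all(proxies):
    # Check the whole list concurrently instead of one proxy at a time.
    with ThreadPoolExecutor(max_workers=25) as executor:
        return [p for p in executor.map(check, proxies) if p is not None]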
This is the fastest way to obtain a list of free proxies that are functional at a specific moment in time.
UPDATE CODE 07-05-2022 17:07 GMT
I added a snippet of code below to query the second page. I did this to see if the proxy stayed the same, which it did. You still need to add some error handling.
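The page-two snippet isn't included here. Assuming the search URL accepts a page query parameter, reusing the already validated proxy from the question's script (link, headers and proxy_url) might look like:
# Reuse the proxy that already worked for page 1 to fetch page 2.
page_two = link + '&page=2'
res = requests.get(page_two, headers=headers, proxies=proxy_url, timeout=10)
print(res.status_code)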
In my testing I was able to query all 24 pages related to your search in 0:00:45.367209 seconds. I don't consider this query and extraction speed slow by any means.
Concerning performing a different search: I would use the same method as below, but I would request a new proxy for that search, because free proxies do have limitations, such as lifetime and performance degradation.
truncated output
UPDATE CODE 07-05-2022 14:07 GMT
I reworked my code posted on 07-01-2022 to output these data elements: business name, business categories, and business website.
UPDATE CODE 07-01-2022
I noted that errors were being thrown when using the free proxies. I added the requests_retry_session function to handle this. I didn't rework all of your code, but I did make sure that I could query the site and produce results using a free proxy. You should be able to work my code into yours.
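The requests_retry_session function itself isn't reproduced on this page. It is presumably the well-known retry-session recipe built on urllib3's Retry; a sketch with assumed parameter values:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def requests_retry_session(retries=3, backoff_factor=0.3,
                           status_forcelist=(500, 502, 504), session=None):
    # Mount an HTTPAdapter that retries failed requests with exponential
    # backoff instead of raising on the first error.
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session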
PREVIOUS ANSWERS
06-30-2022:
During some testing I found a bug, so I updated my code to handle the bug.
06-28-2022:
You could use a proxy judge, which is used for testing the performance and the anonymity status of a proxy server.
The code below is from one of my previous answers.
I noted today that the Python package HTTP_Request_Randomizer has a couple of Beautiful Soup path problems that need to be modified, because they currently don't work in version 1.3.2 of HTTP_Request_Randomizer.
You need to modify line 27 in FreeProxyParser.py to this:
You need to modify line 27 in SslProxyParser.py to this:
I found another bug that needs to be fixed. This one is in proxy_checking.py, where I had to add the line if url != None: