Trouble quickly selecting a functional proxy from a list of proxies

I've created a scraper using requests module implementing rotation of proxies (taken from a free proxy site) within it to fetch content from yellowpages.

The script appears to work correctly but it is terribly slow as it takes a lot of time to find a working proxy. I've tried to reuse the same working proxy (when found) until it is dead and for that I had to declare proxies and proxy_url as global.

Although shop_name and categories are available in landing pages, I scraped both of them from inner pages so that the script can demonstrate that it uses the same working proxy (when it finds one) multiple times.

This is the script I'm trying with:

import random
import requests
from bs4 import BeautifulSoup

base = 'https://www.yellowpages.com{}'
link = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

def get_proxies():   
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text,"lxml")
    proxies = []
    for item in soup.select("table.table tbody tr"):
        if not item.select_one("td"):break
        ip = item.select_one("td").text
        port = item.select_one("td:nth-of-type(2)").text
        proxies.append(f"{ip}:{port}")

    return [{'https': f'http://{x}'} for x in proxies]


def fetch_resp(link,headers):
    global proxies, proxy_url

    while True:
        print("currently being used:",proxy_url)
        
        try:
            res = requests.get(link, headers=headers, proxies=proxy_url, timeout=10)
            print("status code",res.status_code)
            assert res.status_code == 200
            return res
        except Exception as e:
            proxy_url = proxies.pop(random.randrange(len(proxies)))


def fetch_links(link,headers):
    res = fetch_resp(link,headers)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".v-card > .info a.business-name"):
        yield base.format(item.get("href"))


def get_content(link,headers):
    res = fetch_resp(link,headers)
    soup = BeautifulSoup(res.text,"lxml")
    shop_name = soup.select_one(".sales-info > h1.business-name").get_text(strip=True)
    categories = ' '.join([i.text for i in soup.select(".categories > a")])
    return shop_name,categories


if __name__ == '__main__':
    proxies = get_proxies()
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    for inner_link in fetch_links(link,headers):
        print(get_content(inner_link,headers))

How can I quickly select a functional proxy from a list of proxies?


Answer by 鲜血染红嫁衣 (2025-02-17 08:49:21)

Please let me point out that using free proxy IP addresses can be highly problematic. These types of proxies are notorious for connection issues, such as timeouts related to latency. Plus, these sites can also be intermittent, which means they can go down at any time. And sometimes these sites are being abused, so they can get blocked.

With that being said, below are multiple methods that can be used to accomplish your use case related to scraping content from the Yellow Pages.

UPDATE 07-11-2022 16:47 GMT

I tried a different proxy validation method this morning. It is slightly faster than the proxy judge method. The issue with both these methods is error handling. I have to catch all the errors below when validating a proxy IP address and passing a validated address to your function fetch_resp.

ConnectionResetError
requests.exceptions.ConnectTimeout
requests.exceptions.ProxyError
requests.exceptions.ConnectionError
requests.exceptions.HTTPError
requests.exceptions.Timeout 
requests.exceptions.TooManyRedirects
urllib3.exceptions.MaxRetryError
urllib3.exceptions.ProxySchemeUnknown
urllib3.exceptions.ProtocolError
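
Since almost everything in this list lives under requests.exceptions or urllib3.exceptions, one way to keep the error handling manageable is to catch the whole family as a single tuple. The helper below is only a minimal sketch of that idea; the name try_with_proxy and the 10-second timeout are my own assumptions, not part of the answer's code.

import requests
import urllib3

# One tuple covering the failure modes listed above.
# requests.exceptions.RequestException already covers ConnectTimeout, ProxyError,
# ConnectionError, HTTPError, Timeout and TooManyRedirects, while
# urllib3.exceptions.HTTPError covers MaxRetryError, ProxySchemeUnknown and ProtocolError.
PROXY_ERRORS = (
    ConnectionResetError,
    requests.exceptions.RequestException,
    urllib3.exceptions.HTTPError,
)

def try_with_proxy(url, headers, proxy):
    # Hypothetical helper: return a Response on success, or None if this proxy
    # failed with any of the errors above, so the caller can rotate to a new one.
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        response.raise_for_status()
        return response
    except PROXY_ERRORS:
        return None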

Occasionally a proxy fails when extracting from a page, which causes a delay. There is nothing you can do to prevent these failures. The only thing you can do is catch the error and reprocess the request.

I was able to improve the extraction time by adding threading to function get_content.

Content Extraction Runtime: 0:00:03.475362
Total Runtime: 0:01:16.617862

The only way you can increase the speed of your code is to redesign it to query each page element at the same time. If you don't, this is a timing bottleneck.

Here is the code that I used to validate the proxy addresses.

import requests
from urllib3.exceptions import ProxySchemeUnknown, ProtocolError

def check_proxy(proxy):
    try:
        session = requests.Session()
        session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
        session.max_redirects = 300
        proxy = proxy.split('\n', 1)[0]
        # print('Checking ' + proxy)
        req = session.get("http://google.com", proxies={'http':'http://' + proxy}, timeout=30, allow_redirects=True)
        if req.status_code == 200:
            return proxy

    except requests.exceptions.ConnectTimeout as e:
        return None
    except requests.exceptions.ConnectionError as e:
        return None
    except ConnectionResetError as e:
        # print('Error,ConnectionReset!')
        return None
    except requests.exceptions.HTTPError as e:
        return None
    except requests.exceptions.Timeout as e:
        return None
    except ProxySchemeUnknown as e:
        return None
    except ProtocolError as e:
        return None
    except requests.exceptions.ChunkedEncodingError as e:
        return None
    except requests.exceptions.TooManyRedirects as e:
        return None
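
For completeness, here is a hedged sketch of how check_proxy could be wired into a concurrent screen of the whole list, in the same spirit as the get_proxy_address function shown further down; the screen_proxies name is mine, and proxy_list is assumed to be the "ip:port" strings scraped from sslproxies.org.

from concurrent.futures import ThreadPoolExecutor, as_completed

def screen_proxies(proxy_list, max_workers=40):
    # Run check_proxy over every candidate in parallel and keep only the
    # addresses that came back as working.
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for address in proxy_list:
            futures.append(executor.submit(check_proxy, address))
    return [f.result() for f in as_completed(futures) if f.result() is not None]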

UPDATE 07-10-2022 23:53 GMT

I did some more research into this question. I have noted that the website https://www.sslproxies.org provides a list of 100 HTTPS proxies. Of those, less than 20% pass the proxy judge test, and even after obtaining that 20%, some will still fail when passed to your function fetch_resp. They can fail for multiple reasons, including ConnectTimeout, MaxRetryError, ProxyError, etc. When this happens you can rerun the function with the same link (url), headers, and a new proxy. The best workaround for these errors is to use a commercial proxy service.

In my latest test I was able to obtain a list of potentially functional proxies and extract all the content for all 25 pages related to your search. Below is the timeDelta for this test:

Content Extraction Runtime: 0:00:34.176803
Total Runtime: 0:01:22.429338

I can speed this up if I use threading with the function fetch_resp.

Below is the current code that I'm using. I need to improve the error handling, but it currently works.

import time
import random
import requests
from datetime import timedelta
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import MaxRetryError, ProxySchemeUnknown
from concurrent.futures import ThreadPoolExecutor, as_completed

proxies_addresses = []
current_proxy = ''


def requests_retry_session(retries=5,
                           backoff_factor=0.5,
                           status_force_list=(500, 502, 503, 504),
                           session=None,
                           ):
    session = session or requests.Session()

    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def ssl_proxy_addresses():
    global proxies_addresses
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    table = soup.find('tbody')
    table_rows = table.find_all('tr')
    for row in table_rows:
        ip_address = row.find_all('td')[0]
        port_number = row.find_all('td')[1]
        proxies.append(f'{ip_address.text}:{port_number.text}')
    proxies_addresses = proxies
    return proxies


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    if proxy_status is True:
        return current_proxy_address
    else:
        return None


def get_proxy_address():
    global proxies_addresses
    proxy_addresses = ssl_proxy_addresses()
    processes = []
    with ThreadPoolExecutor(max_workers=40) as executor:
        for proxy_address in proxy_addresses:
            processes.append(executor.submit(proxy_verification, proxy_address))

    proxies = [task.result() for task in as_completed(processes) if task.result() is not None]
    proxies_addresses = proxies
    return proxies_addresses


def fetch_resp(link, http_headers, proxy_url):
    try:
        print(F'Current Proxy: {proxy_url}')

        response = requests_retry_session().get(link,
                                                headers=http_headers,
                                                allow_redirects=True,
                                                verify=True,
                                                proxies=proxy_url,
                                                timeout=(30, 45)
                                                )

        print("status code", response.status_code)
        if response.status_code == 200:
            return response
        # Non-200 response: retry with a different proxy. proxies_addresses holds
        # bare "ip:port" strings, so wrap the new pick in the dict format that
        # requests expects, and return the recursive result so callers always
        # receive a usable response object.
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        return fetch_resp(link, http_headers, {'https': f'http://{current_proxy}'})

    except (requests.exceptions.ConnectTimeout,
            requests.exceptions.ProxyError,
            requests.exceptions.ConnectionError,
            requests.exceptions.HTTPError,
            requests.exceptions.Timeout,
            requests.exceptions.TooManyRedirects,
            ProxySchemeUnknown,
            MaxRetryError) as e:
        print(f'{type(e).__name__} raised for proxy {proxy_url}, retrying with a new proxy')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        return fetch_resp(link, http_headers, {'https': f'http://{current_proxy}'})


def get_content(http_headers, proxy_url):
    start_time = time.time()
    results = []
    pages = 25
    for page_number in range(1, pages + 1):  # pages 1 through 25
        print(page_number)
        next_url = f"https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los%20Angeles%2C%20CA" \
                   f"&page={page_number}"
        res = fetch_resp(next_url, http_headers, proxy_url)
        soup = BeautifulSoup(res.text, "lxml")
        info_sections = soup.find_all('li', {'class': 'business-card'})
        for info_section in info_sections:
            shop_name = info_section.find('h2', {'class': 'title business-name'})
            categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
            results.append((shop_name.text, categories))  # ordered (name, categories) pair
    end_time = time.time() - start_time
    print(f'Content Extraction Runtime: {timedelta(seconds=end_time)}')
    return results


start_time = time.time()
get_proxy_address()

if len(proxies_addresses) != 0:
    print(proxies_addresses)
    print('\n')
    current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
    print(current_proxy)
    print('\n')

    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }

    PROXIES = {
        'https': f"http://{current_proxy}"
    }

    results = get_content(headers, PROXIES)

end_time = time.time() - start_time
print(f'Total Runtime: {timedelta(seconds=end_time)}')
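
The update above notes that threading fetch_resp would cut the runtime further. The snippet below is only a rough sketch of that idea, assuming fetch_resp and the parsing logic stay exactly as defined above; the fetch_page and threaded_get_content names and the worker count are my own, and every page reuses the same validated proxy dict, just like get_content does.

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_page(page_number, http_headers, proxy_url):
    # Fetch and parse a single result page, returning (name, categories) pairs.
    url = (f"https://www.yellowpages.com/search?search_terms=pizza"
           f"&geo_location_terms=Los%20Angeles%2C%20CA&page={page_number}")
    res = fetch_resp(url, http_headers, proxy_url)
    soup = BeautifulSoup(res.text, "lxml")
    pairs = []
    for info_section in soup.find_all('li', {'class': 'business-card'}):
        shop_name = info_section.find('h2', {'class': 'title business-name'})
        categories = ', '.join(i.text for i in info_section.find_all('a', {'class': 'category'}))
        pairs.append((shop_name.text, categories))
    return pairs


def threaded_get_content(http_headers, proxy_url, pages=25, max_workers=10):
    # Query every result page at the same time instead of one after another.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_page, n, http_headers, proxy_url)
                   for n in range(1, pages + 1)]
        for future in as_completed(futures):
            results.extend(future.result())
    return results

One caveat with this sketch: on failures fetch_resp pops from the shared proxies_addresses list, which is not synchronized across threads, so a real threaded version would want a lock or a per-thread proxy pool.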

UPDATE 07-06-2022 11:02 GMT

This seems to be your core question:

How can I quickly select a functional proxy from a list of proxies?

First, all my previous code is able to validate that a proxy is working at a given moment in time. Once validated I'm able to query and extract data from your Yellow Pages search for pizza in Los Angeles.

Using my previous method I'm able to query and extract data for all 24 pages related to your search in 0:00:45.367209 seconds.

Back to your question.

The website https://www.sslproxies.org provides a list of 100 HTTPS proxies. There is zero guarantee that all 100 are currently operational. One of the ways to identify the working ones is using a Proxy Judge service.

In my previous code I continually selected a random proxy from the list of 100 and passed this proxy to a Proxy Judge for validation. Once a proxy is validated to be working, it is used to query and extract data from Yellow Pages.

The method above works, but I was wondering how many proxies out of the 100 pass the sniff test for the Proxy Judge service. I attempted to check using a basic for loop, which was deathly slow. I decided to use concurrent.futures to speed up the validation.

The code below takes about 1 minute to obtain a list of HTTPS proxies and validate them using a Proxy Judge service.

This is the fastest way to obtain a list of free proxies that are functional at a specific moment in time.

import requests
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from concurrent.futures import ThreadPoolExecutor, as_completed

def ssl_proxy_addresses():
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    table = soup.find('tbody')
    table_rows = table.find_all('tr')
    for row in table_rows:
        ip_address = row.find_all('td')[0]
        port_number = row.find_all('td')[1]
        proxies.append(f'{ip_address.text}:{port_number.text}')
    return proxies


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    if proxy_status is True:
        return current_proxy_address
    else:
        return None


def get_proxy_address():
    proxy_addresses = ssl_proxy_addresses()
    processes = []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for proxy_address in proxy_addresses:
            processes.append(executor.submit(proxy_verification, proxy_address))

    proxies = [task.result() for task in as_completed(processes) if task.result() is not None]
    print(len(proxies))
    # example output: 13

    print(proxies)
    # example output: ['34.228.74.208:8080', '198.41.67.18:8080', '139.9.64.238:443', '216.238.72.163:59394', '64.189.24.250:3129', '62.193.108.133:1976', '210.212.227.68:3128', '47.241.165.133:443', '20.26.4.251:3128', '185.76.9.123:3128', '129.41.171.244:8000', '12.231.44.251:3128', '5.161.105.105:80']
    return proxies

UPDATE CODE 07-05-2022 17:07 GMT

I added a snippet of code below to query the second page. I did this to see if the proxy stayed the same, which it did. You still need to add some error handling.

In my testing I was able to query all 24 pages related to your search in 0:00:45.367209 seconds. I don't consider this query and extraction speed slow by any means.

Concerning performing a different search: I would use the same method as below, but I would request a new proxy for that search, because free proxies do have limitations, such as lifetime and performance degradation.

import random
import logging
import requests
import traceback
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import ProxySchemeUnknown
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

current_proxy = ''


def requests_retry_session(retries=5,
                           backoff_factor=0.5,
                           status_force_list=(500, 502, 503, 504),
                           session=None,
                           ):
    session = session or requests.Session()

    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)

        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)

        return random_proxy[0].get_address()
    except AttributeError as e:
        pass


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status


def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')

        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))

        get_proxy_address()


def fetch_resp(link, http_headers, proxy_url):
    try:
        response = requests_retry_session().get(link,
                                                headers=http_headers,
                                                allow_redirects=True,
                                                verify=True,
                                                proxies=proxy_url,
                                                timeout=(30, 45)
                                                )
        print(F'Current Proxy: {proxy_url}')
        print("status code", response.status_code)
        return response
    except requests.exceptions.ConnectTimeout as e:
        print('Error,Timeout!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.ConnectionError as e:
        print('Connection Error')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.HTTPError as e:
        print('HTTP ERROR!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.Timeout as e:
        print('Error! Connection Timeout!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except ProxySchemeUnknown as e:
        print('ERROR unknown Proxy Scheme!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.TooManyRedirects as e:
        print('ERROR! Too many redirects!')
        print(''.join(traceback.format_tb(e.__traceback__)))
       


def get_next_page(raw_soup, http_headers, proxy_urls):
    next_page_element = raw_soup.find('a', {'class': 'paginator-next arrow-next'})
    next_url = f"https://www.yellowpages.com{next_page_element['href']}"
    sub_response = fetch_resp(next_url, http_headers, proxy_urls)
    new_soup = BeautifulSoup(sub_response.text, "lxml")
    return new_soup


def get_content(link, http_headers, proxy_urls):
    res = fetch_resp(link, http_headers, proxy_urls)
    soup = BeautifulSoup(res.text, "lxml")
    info_sections = soup.find_all('li', {'class': 'business-card'})
    for info_section in info_sections:
        shop_name = info_section.find('h2', {'class': 'title business-name'})
        print(shop_name.text)
        categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
        print(categories)
        business_website = info_section.find('a', {'class': 'website listing-cta action'})
        if business_website is not None:
            print(business_website['href'])
        elif business_website is None:
            print('no website')

    # get page 2
    if soup.find('a', {'class': 'paginator-next arrow-next'}) is not None:
        soup_next_page = get_next_page(soup, http_headers, proxy_urls)
        info_sections = soup_next_page.find_all('li', {'class': 'business-card'})
        for info_section in info_sections:
            shop_name = info_section.find('h2', {'class': 'title business-name'})
            print(shop_name.text)
            categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
            print(categories)
            business_website = info_section.find('a', {'class': 'website listing-cta action'})
            if business_website is not None:
                print(business_website['href'])
            elif business_website is None:
                print('no website')


get_proxy_address()
if len(current_proxy) != 0:
    print(current_proxy)

    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }

    PROXIES = {
        'https': f"http://{current_proxy}"
    }

    get_content(current_url, headers, PROXIES)

truncated output

Current Proxy: {'https': 'http://157.185.161.123:59394'}
status code 200
1.Casa Bianca Pizza Pie
2.Palermo Italian Restaurant
... truncated


Current Proxy: {'https': 'http://157.185.161.123:59394'}
status code 200
31.Johnnie's New York Pizzeria
32.Amalfi Restaurant and Bar
... truncated

UPDATE CODE 07-05-2022 14:07 GMT

I reworked my code posted on 07-01-2022 to output these data elements: business name, business categories, and business website.

1.Casa Bianca Pizza Pie
Pizza, Italian Restaurants, Restaurants
http://www.casabiancapizza.com

2.Palermo Italian Restaurant
Pizza, Restaurants, Italian Restaurants
no website

... truncated

UPDATE CODE 07-01-2022

I noted that errors were being thrown when using the free proxies. I added the requests_retry_session function to handle this. I didn't rework all your code, but I did make sure that I could query the site and produce results using a free proxy. You should be able to work my code into yours.

import random
import logging
import requests
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

current_proxy = ''

def requests_retry_session(retries=5,
                            backoff_factor=0.5,
                            status_force_list=(500, 502, 504),
                            session=None,
                            ):
    session = session or requests.Session()

    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)

        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)

        return random_proxy[0].get_address()
    except AttributeError as e:
        pass


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status


def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')

        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))

        get_proxy_address()


def fetch_resp(link, http_headers, proxy_url):

    response = requests_retry_session().get(link,
                                            headers=http_headers,
                                            allow_redirects=True,
                                            verify=True,
                                            proxies=proxy_url,
                                            timeout=(30, 45)
                                                  )
    print("status code", response.status_code)
    return response


def get_content(link, headers, proxy_urls):
    res = fetch_resp(link, headers, proxy_urls)
    soup = BeautifulSoup(res.text, "lxml")
    info_sections = soup.find_all('li', {'class': 'business-card'})
    for info_section in info_sections:
        shop_name = info_section.find('h2', {'class': 'title business-name'})
        print(shop_name.text)
        categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
        print(categories)
        business_website = info_section.find('a', {'class': 'website listing-cta action'})
        if business_website is not None:
            print(business_website['href'])
        elif business_website is None:
            print('no website')

get_proxy_address()
if len(current_proxy) != 0:
    print(current_proxy)

    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }

    PROXIES = {
        'https': f"http://{current_proxy}"
    }

    get_content(current_url, headers, PROXIES)

PREVIOUS ANSWERS

06-30-2022:

During some testing I found a bug, so I updated my code to handle the bug.

06-28-2022:

You could use a proxy judge, which is used for testing the performance and the anonymity status of a proxy server.

The code below is from one of my previous answers.

import random
import logging
from time import sleep
from random import randint
from proxy_checking import ProxyChecker
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy


current_proxy = ''


def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)

        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)

        return random_proxy[0].get_address()
    except AttributeError as e:
        pass


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status


def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')

        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))

        get_proxy_address()


get_proxy_address()
if len(current_proxy) != 0:
    print(f'Valid proxy address: {current_proxy}')
    # output:
    # Valid proxy address: 157.100.12.138:999

I noted today that the Python package HTTP_Request_Randomizer has a couple of Beautiful Soup path problems that need to be modified, because they currently don't work in version 1.3.2 of HTTP_Request_Randomizer.

You need to modify line 27 in FreeProxyParser.py to this:

table = soup.find("table", attrs={"class": "table table-striped table-bordered"})

You need to modify line 27 in SslProxyParser.py to this:

table = soup.find("table", attrs={"class": "table table-striped table-bordered"})

I found another bug that needs to be fixed. This one is in proxy_checking.py, where I had to add the line if url != None:

    def get_info(self, url=None, proxy=None):
        info = {}
        proxy_type = []
        judges = ['http://proxyjudge.us/azenv.php', 'http://azenv.net/', 'http://httpheader.net/azenv.php', 'http://mojeip.net.pl/asdfa/azenv.php']
        if url != None:
            try:
                response = requests.get(url, headers=headers, timeout=5)
                return response
            except:
                pass
        elif proxy != None: