How to use rotating proxies?

Posted 2025-02-06 20:15:57


I am trying to use a rotating proxy in this script, but I don't have a clear idea of how to use it. I have checked previous questions on this topic and tried to implement their suggestions, but the site detects the proxy, asks for a login, and prevents me from getting the data. I developed the script below using Selenium + selenium-stealth. I also tried a CrawlSpider but got the same result.

import time

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium_stealth import stealth


class RsSpider(scrapy.Spider):
    name = 'rs'
    allowed_domains = ['www.sahibinden.com']

    def start_requests(self):
        options = webdriver.ChromeOptions()
        options.add_argument("start-maximized")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)

        driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()),
            options=options,
        )
        driver.set_window_size(1920, 1080)

        stealth(driver,
                languages=["en-US", "en"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True,
                )

        driver.get("https://www.sahibinden.com/satilik/istanbul-eyupsultan?pagingOffset=0")
        time.sleep(5)

        # Collect the listing links, then release the browser before yielding.
        links = driver.find_elements(By.XPATH, "//td[@class='searchResultsTitleValue ']/a")
        hrefs = [link.get_attribute('href') for link in links]
        driver.quit()

        for href in hrefs:
            yield SeleniumRequest(
                url=href,
                callback=self.parse,
                meta={'proxy': 'username:password@server:2000'},
                wait_time=1,
            )

    def parse(self, response):
        yield {
            'URL': response.url,
            'City': response.xpath(
                "normalize-space(//div[@class='classifiedInfo ']/h2/a[1]/text())"
            ).get(),
        }
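One thing worth noting: the 'proxy' key in meta is handled by Scrapy's own HttpProxyMiddleware when a request goes through Scrapy's downloader; the driver.get() call above is plain Selenium and does not see it, so the browser needs its own proxy configuration. A minimal sketch of pointing Chrome itself at a proxy via its --proxy-server switch (the server address is a placeholder):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder proxy address. Chrome's --proxy-server switch does not accept
# username:password credentials; an authenticated proxy typically needs a
# local forwarder or an extension to answer the browser's login prompt.
options.add_argument('--proxy-server=http://server:2000')

driver = webdriver.Chrome(options=options)
driver.get('https://httpbin.org/ip')  # should report the proxy's IP
print(driver.page_source)
driver.quit()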


任谁 2025-02-13 20:15:57


If adding the proxy to the request parameters does not work, then:

#1

You can add a proxy middleware and register it in the project settings (the better and safer option, since the credentials stay out of the spider code).

Here is working code for the middleware -

from w3lib.http import basic_auth_header
from scrapy.utils.project import get_project_settings


class ProxyMiddleware:
    def process_request(self, request, spider):
        settings = get_project_settings()
        # Route every request through the proxy host defined in settings.
        request.meta['proxy'] = settings.get('PROXY_HOST') + ':' + settings.get('PROXY_PORT')
        # Authenticate against the proxy with HTTP Basic auth.
        request.headers["Proxy-Authorization"] = basic_auth_header(
            settings.get('PROXY_USER'), settings.get('PROXY_PASSWORD'))
        spider.log('Proxy : %s' % request.meta['proxy'])
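Since the goal here is a rotating proxy, the same idea extends to picking a different proxy per request. A minimal sketch, assuming a hypothetical PROXY_LIST setting holding 'host:port' strings (the setting name and the class below are illustrative, not part of the code above):

import random

from w3lib.http import basic_auth_header


class RotatingProxyMiddleware:
    """Assign a randomly chosen proxy from PROXY_LIST to each request."""

    def __init__(self, proxy_list, user, password):
        self.proxy_list = proxy_list
        self.user = user
        self.password = password

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting, e.g. ['host1:2000', 'host2:2000'].
        return cls(
            crawler.settings.getlist('PROXY_LIST'),
            crawler.settings.get('PROXY_USER'),
            crawler.settings.get('PROXY_PASSWORD'),
        )

    def process_request(self, request, spider):
        # Pick a different proxy for each outgoing request.
        request.meta['proxy'] = random.choice(self.proxy_list)
        request.headers['Proxy-Authorization'] = basic_auth_header(self.user, self.password)

It would be registered in DOWNLOADER_MIDDLEWARES the same way as ProxyMiddleware in the settings file below.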

settings file (activate DOWNLOADER_MIDDLEWARES) -

import os
from dotenv import load_dotenv

# Pull the PROXY_* values from the .env file into the environment.
load_dotenv()

# ...

# Proxy setup
PROXY_HOST = os.environ.get("PROXY_HOST")
PROXY_PORT = os.environ.get("PROXY_PORT")
PROXY_USER = os.environ.get("PROXY_USER")
PROXY_PASSWORD = os.environ.get("PROXY_PASSWORD")

# ...

DOWNLOADER_MIDDLEWARES = {
    # 'project.middlewares.projectDownloaderMiddleware': 543,
    'project.proxy_middlewares.ProxyMiddleware': 350,
}

.env file -

PROXY_HOST=127.0.0.1
PROXY_PORT=6666
PROXY_USER=proxy_user
PROXY_PASSWORD=proxy_password

#2

Have a look at this middleware - scrapy-rotating-proxies (https://github.com/teamhg-memex/scrapy-rotating-proxies)
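For reference, a minimal settings sketch for scrapy-rotating-proxies along the lines of its README (the proxy addresses are placeholders):

# settings.py

# Proxies to rotate through; the library retires ones that appear banned.
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    # Assigns a live proxy to every request and rotates it out on failure.
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # Classifies responses so banned proxies can be detected and re-checked.
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}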
