Using Tor with the Scrapy framework

Posted on 2024-12-15 03:08:44

I am trying to crawl a website that is sophisticated enough to stop bots: it permits only a few requests, after which Scrapy hangs.

Question 1: If Scrapy hangs, is there a way to restart my crawl from the same point?
To get around this problem, I wrote my settings file like this:

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER = True
COOKIES_ENABLED = False
RANDOMIZE_DOWNLOAD_DELAY = True
SCHEDULER_ORDER = 'BFO'
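
On the restart question, one option worth checking is Scrapy's job persistence: when JOBDIR is set, the scheduler queue and the record of already-seen requests are kept on disk, so a crawl that was stopped cleanly can continue where it left off. The lines below are a minimal sketch of how this could be added to a settings file like the one above, not a tested fix for this site. The directory name crawls/yp-1 and the timeout and retry values are illustrative assumptions, and whether these settings are available depends on the installed Scrapy version.

```python
# Hypothetical additions to settings.py; values are illustrative, not tuned.

# Persist scheduler state and the duplicate filter on disk so that a crawl
# stopped with a single Ctrl-C can be resumed by running the spider again.
JOBDIR = 'crawls/yp-1'  # assumed directory name

# Give up on a stalled request instead of waiting on it for a long time.
DOWNLOAD_TIMEOUT = 30  # seconds

# Retry failed requests a few times before dropping them.
RETRY_ENABLED = True
RETRY_TIMES = 3
```

The same directory has to be used on the next run for the saved state to be picked up.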

This is my program:

# Import paths are an assumption based on Scrapy 0.x; newer releases moved these modules.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector


class ypSpider(CrawlSpider):

    name = "yp"

    start_urls = [
        # SOME URL
    ]
    rules = (
        # These are some rules
    )

    def parse_item(self, response):
        # Clean the HTML page by removing scripts and HTML tags
        hxs = HtmlXPathSelector(response)

My question is where I should configure the HTTP proxy, and whether I have to import any Tor-related classes. I am new to Scrapy and have learned a lot from this group. Now I am trying to learn how to use IP rotation or Tor.
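
As a point of reference for where a proxy can be configured: Scrapy treats request.meta['proxy'] as the proxy to use for that request, so one common pattern is a small downloader middleware that sets this key on every request. The sketch below is a minimal illustration under assumptions. The class name ProxyMiddleware, the module path in the trailing comment, and the address http://127.0.0.1:8118 (an HTTP-to-SOCKS forwarder such as Privoxy or Polipo running in front of Tor) are all hypothetical, and no Tor-specific import is involved.

```python
class ProxyMiddleware(object):
    """Hypothetical downloader middleware: route every request via one HTTP proxy."""

    # Assumed address of a local HTTP proxy that forwards traffic to Tor.
    proxy_url = 'http://127.0.0.1:8118'

    def process_request(self, request, spider):
        # Scrapy's downloader uses the 'proxy' meta key for this request.
        request.meta['proxy'] = self.proxy_url
        # Returning None lets Scrapy continue processing the request normally.
        return None


# Enabling it would go in settings.py; the module path and order value
# below are assumptions for illustration:
# DOWNLOADER_MIDDLEWARES = {'yp.middlewares.ProxyMiddleware': 100}
```

The middleware itself does not import Scrapy, which keeps the sketch easy to try with a stand-in request object before wiring it into a project.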

As one of our members suggested, I started Tor and set HTTP_PROXY to

set http_proxy=http://localhost:8118

but it throws an error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to

set http_proxy=http://localhost:9051

Now the error is

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

I checked the Firefox network settings. I couldn't see any HTTP proxy there; instead it uses SOCKS v5, and it shows 127.0.0.1:9051. (Before Tor, it worked with no proxy.) Please help me, I still don't understand how to use Tor through Scrapy.
Which Tor bundle am I supposed to use, and how?
I hope both of my questions can be resolved:

If the Scrapy crawler hangs for some reason (connection failure), I want to resume the crawl from that point
How to use rotating IPs in Scrapy
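
The two errors above point to different situations. A ConnectionRefusedError on 8118 usually means nothing is listening on that port, while a connection on 9051 that opens and then closes cleanly suggests something is listening there but is not acting as an HTTP proxy. A quick way to see which local ports accept TCP connections is sketched below. The helper name is_port_open is hypothetical, and the port list is an assumption based on common defaults (8118 for Privoxy, 9050 for Tor's SOCKS port, 9051 for Tor's control port) that may not match this installation.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == '__main__':
    # Assumed common defaults; adjust to the local Tor/Privoxy configuration.
    candidates = [('Privoxy HTTP', 8118), ('Tor SOCKS', 9050), ('Tor control', 9051)]
    for name, port in candidates:
        state = 'open' if is_port_open('127.0.0.1', port) else 'closed'
        print('%-12s %5d  %s' % (name, port, state))
```

An open port only shows that something accepts connections there. It does not show which protocol the listener speaks, so a SOCKS or control port will still fail when used as an HTTP proxy.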