Using Tor with the Scrapy framework

Posted 2024-12-15 03:08:44

I am trying to crawl a website that is sophisticated enough to block bots: it permits only a few requests, after which Scrapy hangs.

Question 1: if Scrapy hangs, is there a way to restart the crawl from the same point?
To get around this problem, I wrote my settings file like this:

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'

This is my program:

class ypSpider(CrawlSpider):

    name = "yp"

    start_urls = [
        SOME URL
    ]
    rules = (
        # These are some rules
    )

    def parse_item(self, response):
        # Clean the HTML page by removing scripts and HTML tags
        hxs = HtmlXPathSelector(response)

My question is where to configure the HTTP proxies, and whether I have to import any Tor-related classes. I am new to Scrapy and have learned a lot from this group; now I am trying to learn how to use IP rotation or Tor.

As one of our members suggested, I started Tor and set HTTP_PROXY to

set http_proxy=http://localhost:8118

but it throws an error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to

set http_proxy=http://localhost:9051

Now the error is

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

I checked the Firefox network settings. I could not see any HTTP proxy there; instead it uses SOCKS v5 and shows 127.0.0.1:9051. (Before Tor, it worked with no proxy.) Please help, I still do not understand how to use Tor through Scrapy.
Which Tor bundle am I supposed to use, and how?
I hope both of my questions can be resolved:

  1. If a Scrapy crawler hangs for some reason (connection failure), I would like to resume the crawl from the point where it stopped.
  2. How to use rotating IPs in Scrapy.

Comments (1)

再可℃爱ぅ一点好了 2024-12-22 03:08:44

Tor by itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that you don't have Privoxy [1] running properly. Try setting up Privoxy correctly, then try again with the environment variable http_proxy=http://localhost:8118.
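A quick way to confirm the diagnosis is to check whether anything is listening on the relevant ports before starting the crawl. The sketch below is only an illustration: the helper name `port_is_open` is hypothetical, 8118 is the Privoxy port from the question, and 9050 is assumed to be Tor's SOCKS port (the default for the standalone Tor daemon; bundles may use a different one). Note that 9051 is normally Tor's control port, not a proxy port, which may explain why pointing http_proxy at it produced a cleanly closed connection.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port numbers are assumptions; adjust them to the local setup.
    for name, port in (("Privoxy", 8118), ("Tor SOCKS", 9050)):
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

If the Privoxy port is reported as not reachable, Scrapy will fail with the same connection-refused error shown in the question, whatever the spider does.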

I have successfully crawled through Tor using Privoxy with Scrapy.
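As an alternative to the environment variable, the proxy can be set per request, since Scrapy reads the proxy for a request from `request.meta["proxy"]`. The following is a minimal sketch, not the setup used above: the class name `PrivoxyProxyMiddleware`, the module path in the comment, and the priority value are hypothetical, and it assumes Privoxy is listening on localhost:8118 and forwarding to Tor.

```python
class PrivoxyProxyMiddleware:
    """Hypothetical downloader middleware that routes requests via a local HTTP proxy."""

    def __init__(self, proxy_url="http://localhost:8118"):
        # Assumed Privoxy address; Privoxy in turn must forward to Tor.
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Only set the proxy if the request does not already carry one.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None tells Scrapy to continue processing the request.
        return None


# To enable it, something like this would go in settings.py
# (module path and priority are assumptions for illustration):
# DOWNLOADER_MIDDLEWARES = {"yp.middlewares.PrivoxyProxyMiddleware": 350}
```

No Tor-specific classes need to be imported for this; from Scrapy's point of view it is talking to an ordinary HTTP proxy.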

[1] http://www.privoxy.org/
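
On the first question (resuming a crawl that has stalled), Scrapy documents a job-persistence feature driven by the `JOBDIR` setting, which stores the scheduler queue and the set of seen requests on disk. A minimal settings sketch follows; the directory name `crawls/yp-1` is hypothetical, and the timeout and retry values are illustrative assumptions rather than recommendations.

```python
# settings.py (sketch): additions for a resumable, less hang-prone crawl

# Persist scheduler state and seen-request fingerprints in this directory
# so that an interrupted crawl can be continued later.
JOBDIR = "crawls/yp-1"  # hypothetical directory name

# Give up on a request that gets no response within this many seconds,
# instead of waiting for Scrapy's longer default timeout.
DOWNLOAD_TIMEOUT = 30  # illustrative value

# Retry failed requests a few times before dropping them.
RETRY_ENABLED = True
RETRY_TIMES = 3  # illustrative value
```

The same directory can also be passed on the command line, as in `scrapy crawl yp -s JOBDIR=crawls/yp-1`. Stopping the spider gracefully (a single Ctrl-C) and running the same command again should continue from the saved state.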
