Using Tor with the Scrapy framework
I am trying to crawl a website that is sophisticated enough to stop bots: it permits only a few requests, and after that Scrapy hangs.
Question 1: if Scrapy hangs, is there a way to restart my crawl from the same point?
To work around the blocking, I wrote my settings file like this:
BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'
SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'
This is my program:
class ypSpider(CrawlSpider):
    name = "yp"
    start_urls = [
        SOME URL
    ]
    rules = (
        # These are some rules
    )

    def parse_item(self, response):
        ####################################################################
        # cleaning the html page by removing scripts html tags
        ####################################################################
        hxs = HtmlXPathSelector(response)
My question is where I should configure the HTTP proxies, and whether I have to import any Tor-related classes. I am new to Scrapy and have learned a lot from this group. Now I am trying to learn how to use IP rotation or Tor.
As one of our members suggested, I started Tor and set http_proxy to
set http_proxy=http://localhost:8118
but it throws an error:
failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError' Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.
So I changed http_proxy to
set http_proxy=http://localhost:9051
Now the error is
failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.
I checked the Firefox network settings. I couldn't see any HTTP proxy there; instead it is using SOCKS v5, and it shows 127.0.0.1:9051. (Before Tor, it worked with no proxies.) Please help me, I still don't understand how to use Tor through Scrapy.
Which Tor bundle am I supposed to use, and how?
I hope both of my questions can be resolved:
- If a Scrapy crawler hangs for some reason (connection failure), I would like to resume the crawl from that point
- How to use rotating IPs in Scrapy
Comments (1)
Tor by itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that you don't have Privoxy [1] running properly. Try setting up Privoxy correctly, then try again using the environment variable
http_proxy=http://localhost:8118
I have successfully crawled through Tor using Privoxy with Scrapy.
[1] http://www.privoxy.org/
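As a rough illustration of the advice above, here is a minimal Python sketch. It is not taken from the answer and makes several assumptions: that Privoxy listens on 127.0.0.1:8118 (the port from the question), and that Privoxy has been configured to forward traffic to Tor's SOCKS port (usually via a `forward-socks5` line in Privoxy's config; Tor's default SOCKS port is 9050, but check your own Tor bundle). The names `port_is_open` and `PrivoxyProxyMiddleware` are made up for this example. The helper checks whether anything accepts connections on a port, which is the condition behind the ConnectionRefusedError. The middleware sets `request.meta["proxy"]`, the key Scrapy reads to pick a proxy for a request, as an alternative to the `http_proxy` environment variable.

```python
import socket

# Assumption: Privoxy runs locally on the port mentioned in the question.
PRIVOXY_URL = "http://127.0.0.1:8118"


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" and timeouts: nothing usable is listening.
        return False


class PrivoxyProxyMiddleware:
    """Hypothetical downloader middleware that routes requests through Privoxy."""

    def __init__(self, proxy_url=PRIVOXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Keep a proxy that was already set on the request; otherwise
        # point the request at Privoxy.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

To try it, you would first confirm that `port_is_open("127.0.0.1", 8118)` returns True, then register the class in the project's `DOWNLOADER_MIDDLEWARES` setting under whatever module path you put it in (for example a hypothetical `yp.middlewares.PrivoxyProxyMiddleware`). Whether this works unchanged on the older Scrapy release used in the question is not something this sketch verifies.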