Is a proxy preventing Scrapy from accessing a website?

Posted 2022-09-03 14:53:36

While working through the Scrapy tutorial today, my first spider, which crawls Stack Overflow, ran into a problem.
My guess is that the cause is Lantern, which I use as a proxy, because Scrapy's output includes these lines:
"[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023"
"DEBUG: Crawled (403) <GET http://stackoverflow.com/ques... (referer: None)"
When I fetched the same page with urllib2 instead, without changing the User-Agent, everything worked fine.
How do I configure Scrapy not to use a proxy? And how do I stop Scrapy from picking up Lantern's proxy the next time Lantern is running?
Thanks.


The code that was run:

import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Follow every question link on the listing page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            # The yield must be inside the loop; as originally indented,
            # at most the last question would ever be requested.
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
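
Judging from the FEED_FORMAT/FEED_URI overrides in the log below, the spider was presumably started with something like "scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json" (the script file name here is an assumption).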

The printed output:
2016-09-07 23:09:22 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-09-07 23:09:22 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'top-stackoverflow-questions.json'}
2016-09-07 23:09:22 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-09-07 23:09:23 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-09-07 23:09:23 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-09-07 23:09:23 [scrapy] INFO: Enabled item pipelines:
[]
2016-09-07 23:09:23 [scrapy] INFO: Spider opened
2016-09-07 23:09:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-07 23:09:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-07 23:09:23 [scrapy] DEBUG: Crawled (403) <GET http://stackoverflow.com/ques... (referer: None)
2016-09-07 23:09:23 [scrapy] DEBUG: Ignoring response <403 http://stackoverflow.com/ques... HTTP status code is not handled or not allowed
2016-09-07 23:09:23 [scrapy] INFO: Closing spider (finished)
2016-09-07 23:09:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1913,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 9, 7, 15, 9, 23, 955000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 9, 7, 15, 9, 23, 175000)}
2016-09-07 23:09:23 [scrapy] INFO: Spider closed (finished)
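
A side note on the "Ignoring response <403 ..." line: it comes from HttpErrorMiddleware, which drops non-2xx responses before they reach the spider. To inspect the body of the 403 for the reason behind the block, a spider can opt in to receiving it; a minimal sketch:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    # Let 403 responses through HttpErrorMiddleware so parse() can
    # inspect the body instead of the response being silently dropped.
    handle_httpstatus_list = [403]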

Comments (3)

七分※倦醒 2022-09-10 14:53:36

Source of the answer: the CSDN blog post "Scrapy使用问题整理" (notes on common Scrapy issues).

At the end, that post mentions that a proxy can make some sites unreachable; the fix is this change in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

I tested it; with this change the program runs.

======

Update 2017-09-17:
The original blog post is now a 404.
A repost is available: "Scrapy使用问题整理(转载)".
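
Worth noting: the enabled downloader middleware list in the log above does not even include HttpProxyMiddleware, so the 403 may simply be Stack Overflow rejecting Scrapy's default User-Agent, which would also explain why disabling UserAgentMiddleware helps. An alternative sketch that sets a browser-like User-Agent instead (the UA string below is only an example):

# settings.py -- send a browser-like User-Agent instead of Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'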

哆啦不做梦 2022-09-10 14:53:36

Just turn Lantern off~

悲喜皆因你 2022-09-10 14:53:36

Most likely Lantern changed the system-wide proxy, so your crawler was also forced to go through that proxy IP.
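
If so, here is a minimal sketch of two ways to keep Scrapy off the system proxy, relying on the fact that HttpProxyMiddleware discovers proxies through urllib's getproxies(), which reads the http_proxy/https_proxy environment variables:

import os

# Option 1 (settings.py): disable the proxy middleware entirely, so the
# system proxy is never applied to requests.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}

# Option 2: clear the proxy environment variables before the crawl starts
# (e.g. at the top of settings.py), so getproxies() finds nothing.
for var in ('http_proxy', 'https_proxy', 'HTTP_PROXY', 'HTTPS_PROXY'):
    os.environ.pop(var, None)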
