Crawling with an authenticated session in Scrapy
In my previous question, I wasn't very specific about my problem (scraping with an authenticated session with Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably rather have used the word crawling.
So, here is my code so far:
# imports for the snippet (Scrapy 0.14-era paths)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i = {}  # e.g. a dict or an Item instance
        i['url'] = response.url
        # ... do more things
        return i
As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function I tried to override in order to log in now no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.

Anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider.) Any help would be appreciated.
5 Answers
Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning about this in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.

Logging in before crawling:

In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from a CrawlSpider), and override the init_request function. This function will be called when the spider is initialising, and before it starts crawling.

In order for the Spider to begin crawling, you need to call self.initialized.

You can read the code that's responsible for this here (it has helpful docstrings).
An example:
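The original example code is missing from this copy of the answer; below is a sketch of the approach just described, built from the spider in the question (the post-login start URL is a placeholder, and check_login_response is the callback name referred to further down the page):

from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest


class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    login_page = 'http://www.domain.com/login/'
    # Pages to crawl once logged in (placeholder URL)
    start_urls = ['http://www.domain.com/useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """Called before crawling starts: fetch the login page first."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Fill in and submit the login form."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check whether the login worked before handing control back."""
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Returning self.initialized() tells the InitSpider machinery
            # that initialisation is done, so normal crawling can begin.
            return self.initialized()
        else:
            self.log("Login failed.")

    def parse_item(self, response):
        i = {'url': response.url}
        # ... do more things
        return i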
Saving items:

Items your Spider returns are passed along to the Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html

If you have any problems/questions in regards to Items, don't hesitate to pop open a new question and I'll do my best to help.
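Not part of the original answer, but as an illustration, a minimal pipeline could look something like this (the class name and project path are made up):

# pipelines.py -- receives every item the spider returns
class SaveUrlPipeline(object):
    def process_item(self, item, spider):
        # Do whatever you want with the data here (store it, clean it, ...)
        spider.log("Scraped url: %s" % item['url'])
        # Returning the item hands it on to any later pipelines
        return item

It then has to be enabled in settings.py, e.g. ITEM_PIPELINES = ['myproject.pipelines.SaveUrlPipeline'] in the 0.14-era list syntax (newer Scrapy versions use a dict with an order number instead).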
In order for the above solution to work, I had to make CrawlSpider inherit from InitSpider, and no longer from BaseSpider, by changing the following in the Scrapy source code. In the file scrapy/contrib/spiders/crawl.py, add

from scrapy.contrib.spiders.init import InitSpider

and change class CrawlSpider(BaseSpider) to class CrawlSpider(InitSpider).

Otherwise the spider wouldn't call the init_request method. Is there any other easier way?
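Spelled out, the edit described above amounts to roughly this (paraphrased, not a verbatim copy of the file):

# scrapy/contrib/spiders/crawl.py (0.14-era layout, abridged)
from scrapy.contrib.spiders.init import InitSpider   # newly added import

# was:  class CrawlSpider(BaseSpider):
class CrawlSpider(InitSpider):
    pass  # ... rest of the class body stays exactly as it was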
If what you need is Http Authentication, use the provided middleware hooks: in settings.py, and in your spider class, add the relevant properties.
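The concrete snippet is missing from this copy; a sketch of what that setup might look like (0.14-era import paths, placeholder credentials):

# settings.py -- HttpAuthMiddleware ships with Scrapy and is usually
# enabled by default; listing it explicitly does no harm.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 543,
}

# in the spider module
from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'
    # Picked up by HttpAuthMiddleware and sent as HTTP Basic auth
    # with every request this spider makes (placeholder credentials).
    http_user = 'someuser'
    http_pass = 'somepass'
    # ... allowed_domains, rules and callbacks as before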
Just adding to Acorn's answer above. Using his method, my script was not parsing the start_urls after the login; it was exiting after a successful login in check_login_response. I could see that I had the generator, though. I needed to use the call sketched below, and then the parse function was called.
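This is a guess at the missing snippet rather than the original: given that a generator was coming back from the init machinery, it presumably came down to returning the value of self.initialized() from the login-check callback, roughly:

# inside the InitSpider subclass (cf. the sketch further up)
def check_login_response(self, response):
    if "Hi Herman" in response.body:
        self.log("Successfully logged in.")
        # self.initialized() returns the requests that were held back during
        # initialisation; it has to be *returned*, not just called, or the
        # spider stops right after logging in.
        return self.initialized()
    else:
        self.log("Login failed.")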
For me it was quite different. I saw that I was receiving an auth token and a user token as cookies from the website after my login. That's why I searched for a quick and easy solution, which was to just pass the cookies over to the Request, as sketched below. Make sure that the cookie does not expire quickly; in my case it expires after a month. Once this request is called, I am automatically logged in to the page.
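The request itself didn't make it into this copy; based on the description it was presumably something along these lines (cookie names, values and the URL are placeholders for whatever the site sets after a normal browser login):

from scrapy.http import Request
from scrapy.spider import BaseSpider   # 0.14-era path; newer versions: scrapy.Spider


class CookieLoginSpider(BaseSpider):
    name = 'cookielogin'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Copy the auth/user token cookies from a logged-in browser session;
        # make sure they are long-lived enough for the whole crawl.
        yield Request(
            'http://www.domain.com/some_protected_page/',
            cookies={'authToken': '<auth-token>', 'userToken': '<user-token>'},
            callback=self.parse_page)

    def parse_page(self, response):
        # Already authenticated here thanks to the cookies sent above
        self.log("Fetched %s" % response.url)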