Scrapy downloader middleware (selenium + PhantomJS) cannot get cookies
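I plan to scrape data from a forum using Scrapy + selenium + PhantomJS. Login is submitted with Scrapy's FormRequest.from_response. Without a custom middleware, login works fine, but once I add a custom downloader middleware, the requests no longer get any cookies. The code is as follows: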
# spider.py
# The imports are not shown in the original post; the ones below are assumed.
import scrapy
from scrapy import FormRequest, Request
from scrapy.http.cookies import CookieJar
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class mTeamSpider(CrawlSpider):
    cookie_jar = CookieJar()
    name = '*'
    allowed_domains = ['*']
    start_urls = ['*']
    rules = (
        Rule(LinkExtractor(allow=(r'details\.php\?id=\d+',)), callback='parse_detial_item'),
    )
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2,ja;q=0.2",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Content-Length": "35",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    }

    def start_requests(self):
        # Scrapy treats meta['cookiejar'] as a key that selects a cookie session;
        # any hashable value works, so the CookieJar object only serves as that key.
        return [Request("https://tp.m-team.cc/adult.php",
                        meta={'cookiejar': self.cookie_jar},
                        callback=self.post_login)]

    def post_login(self, response):
        print('Preparing login')
        return [FormRequest.from_response(
            response,
            url='*/takelogin.php',
            meta={'cookiejar': response.meta['cookiejar']},
            # headers=self.headers,
            formdata={
                'username': settings['FROM_USERNAME'],
                'password': settings['FROM_PASSWORD'],
            },
            callback=self.after_login,
            dont_filter=True,
        )]

    def after_login(self, response):
        if '*' in str(response.body):
            print('Success')
        else:
            print('login fails')
        # Save the page returned after login for inspection.
        with open('filename.html', 'wb') as f:
            f.write(response.body)
        for url in self.start_urls:
            yield scrapy.Request(url,
                                 meta={'cookiejar': response.meta['cookiejar']},
                                 dont_filter=True)
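# middleware.py
# The imports and dcap are not shown in the original post; the ones below are assumed.
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)


class JSMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == "*":
            print("PhantomJS is starting...")
            driver = webdriver.PhantomJS(executable_path=r"./phantomjs/bin/phantomjs",
                                         desired_capabilities=dcap)
            url = str(request.url)
            driver.get(url)
            content = driver.page_source
            driver.quit()  # quit() also ends the PhantomJS process; close() only closes the window
            # Returning a Response here skips Scrapy's own downloader for this request.
            return HtmlResponse(request.url, body=content, encoding='utf-8', request=request)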
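One thing worth checking in the middleware: driver.get(url) always issues a plain GET, so if the login FormRequest (a POST) is also routed through PhantomJS, its form data is never sent and no login cookie can come back. The following is a minimal sketch of a guard for that case, assuming only ordinary GET page requests should be rendered by the browser. The function name and the render_js meta key are hypothetical and not part of Scrapy.

```python
def should_render_in_browser(method, meta):
    """Return True only for GET requests that have not opted out via meta."""
    if method.upper() != "GET":
        # POSTs such as the login form should go through Scrapy's own downloader.
        return False
    # 'render_js' is a hypothetical opt-out flag a spider could set in request.meta.
    return bool(meta.get("render_js", True))


if __name__ == "__main__":
    print(should_render_in_browser("POST", {}))
    print(should_render_in_browser("GET", {}))
    print(should_render_in_browser("GET", {"render_js": False}))
```

Inside process_request, the middleware could call should_render_in_browser(request.method, request.meta) and return None when it is False, so that Scrapy downloads the request itself and its cookie handling applies as usual.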
According to the COOKIES_DEBUG output, no cookies are received once the middleware is enabled. I also tried to pull the cookies out with extract_cookies, which failed as well:
cookie_jar = response.meta['cookiejar']
cookie_jar.extract_cookies(response, response.request)
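A likely reason extract_cookies finds nothing is that the HtmlResponse built from driver.page_source carries no Set-Cookie headers, because the cookies stay inside PhantomJS. The sketch below converts cookie dicts in the shape returned by selenium's driver.get_cookies() into Set-Cookie header values. The function name is hypothetical, and only the name, value, domain, path, secure and httpOnly keys are handled.

```python
def selenium_cookies_to_set_cookie(cookies):
    """Render selenium-style cookie dicts as Set-Cookie header values."""
    values = []
    for cookie in cookies:
        parts = ["{}={}".format(cookie["name"], cookie["value"])]
        if cookie.get("domain"):
            parts.append("Domain={}".format(cookie["domain"]))
        if cookie.get("path"):
            parts.append("Path={}".format(cookie["path"]))
        if cookie.get("secure"):
            parts.append("Secure")
        if cookie.get("httpOnly"):
            parts.append("HttpOnly")
        values.append("; ".join(parts))
    return values


if __name__ == "__main__":
    sample = [{"name": "uid", "value": "42", "domain": ".example.com",
               "path": "/", "secure": True, "httpOnly": False}]
    for value in selenium_cookies_to_set_cookie(sample):
        print(value)
```

In the middleware, the resulting list could be passed as headers={'Set-Cookie': values} when building the HtmlResponse, before the driver is shut down. Under that assumption, Scrapy's cookie handling has headers to read when it processes the response.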
Overriding the process_request method from cookies.py did not work either. Is there any other way to do this, or does the login have to be done through selenium?
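One possible direction is to keep the login in Scrapy and hand its cookies to the browser. The sketch below turns a Cookie request header value into dicts in the shape selenium's driver.add_cookie() accepts. The function name and the example domain are hypothetical. It also assumes the custom middleware is ordered after Scrapy's CookiesMiddleware (priority 700 by default), since the Cookie header is only present on the request once that middleware has run.

```python
def cookie_header_to_selenium(header, domain, path="/"):
    """Turn a 'name=value; name2=value2' Cookie header into selenium-style dicts."""
    if isinstance(header, bytes):
        header = header.decode("utf-8")
    cookies = []
    for pair in header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if not sep or not name:
            # Skip empty or malformed fragments.
            continue
        cookies.append({"name": name, "value": value, "domain": domain, "path": path})
    return cookies


if __name__ == "__main__":
    for cookie in cookie_header_to_selenium(b"uid=42; pass=abc123", domain="example.com"):
        print(cookie)
```

In process_request, the header could be read with request.headers.get('Cookie') and each resulting dict passed to driver.add_cookie() before calling driver.get(url). Selenium generally only accepts cookies for the domain the browser is currently on, so a page from that site may need to be loaded first.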