Can't get Scrapy to follow links
I am trying to scrape a website, but I can't get Scrapy to follow links: there are no Python errors, and Wireshark shows no traffic at all. I thought the regex might be the problem, but even ".*" (which should match any link) doesn't work. The "parse" method does work, but I need to follow the "sinopsis.aspx" links with parse_peliculas as the callback.
Edit: Commenting out the parse method gets the rules working and parse_peliculas runs. What I have to do now is rename the parse method and add a rule with a callback for it, but I still can't get that to work.
This is my spider code:
import re
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem

class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']

    rules = (Rule(SgmlLinkExtractor(allow=(r'sinopsis.aspx.*', )), callback='parse_peliculas', follow=True),)

    def parse(self, response):
        item = CinemarkItem()
        hxs = HtmlXPathSelector(response)
        cine = hxs.select('(//td[@class="title2"])[1]')
        direccion = hxs.select('(//td[@class="title2"])[2]')
        item['nombre'] = cine.select('text()').extract()
        item['direccion'] = direccion.select('text()').extract()
        return item

    def parse_peliculas(self, response):
        item = PeliculasItem()
        hxs = HtmlXPathSelector(response)
        titulo = hxs.select('//td[@class="pop_up_title"]')
        item['titulo'] = titulo.select('text()').extract()
        return item
Thanks
Comments (1)
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html
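The linked page points at the likely cause: the Scrapy spider documentation warns against using parse as a callback, because CrawlSpider implements its own parse method to drive the rules. Defining parse in a subclass overrides that machinery, so the rules never fire, which matches the behavior described in the edit. Below is a minimal sketch of the fix under that reading: the start-page extraction moves into CrawlSpider's parse_start_url hook (which CrawlSpider calls for each start_urls response), and the rule keeps parse_peliculas as its callback. The items and selectors are taken from the question; everything else is an assumption, not a tested spider.

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem

class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']

    # No method named parse anywhere, so CrawlSpider's own parse
    # stays intact and the rules are applied.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'sinopsis\.aspx', )),
             callback='parse_peliculas', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider invokes this for each start_urls response,
        # which is where the cinema name/address extraction belongs.
        item = CinemarkItem()
        hxs = HtmlXPathSelector(response)
        item['nombre'] = hxs.select('(//td[@class="title2"])[1]/text()').extract()
        item['direccion'] = hxs.select('(//td[@class="title2"])[2]/text()').extract()
        return item

    def parse_peliculas(self, response):
        # Reached via the rule above for every sinopsis.aspx link.
        item = PeliculasItem()
        hxs = HtmlXPathSelector(response)
        item['titulo'] = hxs.select('//td[@class="pop_up_title"]/text()').extract()
        return item

With parse gone, the crawl should visit the start pages, emit one CinemarkItem per theater from parse_start_url, and then follow every sinopsis.aspx link into parse_peliculas.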