Can't get Scrapy to follow links
I am trying to scrape a website, but I can't get Scrapy to follow links: there are no Python errors, and Wireshark shows no traffic at all. I thought the regex might be the problem, but even ".*" (which should match any link) doesn't work. The "parse" method does work, but I need to follow the "sinopsis.aspx" links with parse_peliculas as the callback.
Edit: Commenting out the parse method gets the rules working and parse_peliculas runs. What I have to do now is rename the parse method and add a rule with a callback for it, but I still can't get that to work.
This is my spider code:
import re
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem

class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']

    rules = (Rule(SgmlLinkExtractor(allow=(r'sinopsis.aspx.*', )), callback='parse_peliculas', follow=True),)

    def parse(self, response):
        item = CinemarkItem()
        hxs = HtmlXPathSelector(response)
        cine = hxs.select('(//td[@class="title2"])[1]')
        direccion = hxs.select('(//td[@class="title2"])[2]')
        item['nombre'] = cine.select('text()').extract()
        item['direccion'] = direccion.select('text()').extract()
        return item

    def parse_peliculas(self, response):
        item = PeliculasItem()
        hxs = HtmlXPathSelector(response)
        titulo = hxs.select('//td[@class="pop_up_title"]')
        item['titulo'] = titulo.select('text()').extract()
        return item
Thanks
Comments (1)
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html
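The linked page points at the likely cause: the Scrapy spider documentation warns against using parse as a callback, because CrawlSpider implements its own parse method to drive the rules. Defining parse in a subclass overrides that machinery, so the rules never fire, which matches the behavior described in the edit. Below is a minimal sketch of the fix under that reading: the start-page extraction moves into CrawlSpider's parse_start_url hook (which CrawlSpider calls for each start_urls response), and the rule keeps parse_peliculas as its callback. The items and selectors are taken from the question; everything else is an assumption, not a tested spider.

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem

class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']

    # No method named parse anywhere, so CrawlSpider's own parse
    # stays intact and the rules are applied.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'sinopsis\.aspx', )),
             callback='parse_peliculas', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider invokes this for each start_urls response,
        # which is where the cinema name/address extraction belongs.
        item = CinemarkItem()
        hxs = HtmlXPathSelector(response)
        item['nombre'] = hxs.select('(//td[@class="title2"])[1]/text()').extract()
        item['direccion'] = hxs.select('(//td[@class="title2"])[2]/text()').extract()
        return item

    def parse_peliculas(self, response):
        # Reached via the rule above for every sinopsis.aspx link.
        item = PeliculasItem()
        hxs = HtmlXPathSelector(response)
        item['titulo'] = hxs.select('//td[@class="pop_up_title"]/text()').extract()
        return item

With parse gone, the crawl should visit the start pages, emit one CinemarkItem per theater from parse_start_url, and then follow every sinopsis.aspx link into parse_peliculas.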