Can't get Scrapy to follow links


I am trying to scrape a website, but I can't get Scrapy to follow links. I don't get any Python errors, and Wireshark shows nothing happening. I thought the problem might be the regex, so I tried ".*" to follow any link, but that doesn't work either. The "parse" method does work, but I need to follow the "sinopsis.aspx" links with parse_peliculas as the callback.

Edit: Commenting out the parse method makes the rules work and parse_peliculas gets run. What I need to do now is rename the parse method and add a rule with a callback for it, but I still can't get that to work.

This is my spider code:

import re

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem

class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']


    rules = (Rule(SgmlLinkExtractor(allow=(r'sinopsis.aspx.*', )), callback='parse_peliculas', follow=True),)

    def parse(self, response):
        item = CinemarkItem()
        hxs = HtmlXPathSelector(response)
        cine = hxs.select('(//td[@class="title2"])[1]')
        direccion = hxs.select('(//td[@class="title2"])[2]')

        item['nombre'] = cine.select('text()').extract()
        item['direccion'] = direccion.select('text()').extract()
        return item

    def parse_peliculas(self, response):
        item = PeliculasItem()
        hxs = HtmlXPathSelector(response)
        titulo = hxs.select('//td[@class="pop_up_title"]')
        item['titulo'] = titulo.select('text()').extract()
        return item

Thanks



芯好空 2024-12-06 06:01:55

When writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html
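For reference, here is one way the spider could be restructured so that the inherited parse is not overridden: the theatre name/address extraction moves into parse_start_url, the hook CrawlSpider calls for the start_urls responses, while the sinopsis.aspx rule keeps parse_peliculas as its callback. One likely reason that simply renaming parse and pointing a rule at the new name still fails is that rule callbacks only run for links the spider extracts from pages, not for the start_urls responses themselves. This is only a rough, untested sketch based on the code in the question; it assumes the same Cinesillo.items module and the same legacy Scrapy API (scrapy.contrib, SgmlLinkExtractor, HtmlXPathSelector) used there.

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem

class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']

    # The inherited parse() is left untouched so CrawlSpider can apply the rules.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'sinopsis\.aspx', )),
             callback='parse_peliculas', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider calls this for each start_urls response, so the
        # theatre name/address extraction from the original parse() lives here.
        hxs = HtmlXPathSelector(response)
        item = CinemarkItem()
        item['nombre'] = hxs.select('(//td[@class="title2"])[1]/text()').extract()
        item['direccion'] = hxs.select('(//td[@class="title2"])[2]/text()').extract()
        return item

    def parse_peliculas(self, response):
        # Runs for every link matched by the sinopsis.aspx rule.
        hxs = HtmlXPathSelector(response)
        item = PeliculasItem()
        item['titulo'] = hxs.select('//td[@class="pop_up_title"]/text()').extract()
        return item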

