Scrapy is following and scraping non-allowed links

I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the link to each issue follows this URL scheme:

http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number.

I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I haven't included the rest of my code). The spider nevertheless seems determined to follow non-allowed links: it is trying to scrape issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This throws a lot of errors; it isn't terribly important, but it means a lot of data cleaning afterwards. Any suggestions as to what is going wrong?

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
        Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),
    )

EDIT: I fixed this using a much simpler regex for 2009, 2010, 2011, but I am still curious why the above doesn't work, if anyone has any suggestions.
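
For reference, the allow pattern itself does seem to match what I intended when tested in isolation; here is a minimal sketch with Python's re module and made-up issue URLs:

import re

# The allow pattern from the first Rule, as a raw string.
pattern = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')

# Hypothetical issue URLs; only 928 onwards should match.
urls = [
    'http://example.com/2009/377/index.htm',   # no match: 377 is below 928
    'http://example.com/2009/928/index.htm',   # match: 92[8-9]
    'http://example.com/2010/956/index.htm',   # match: 9[3-9][0-9]
    'http://example.com/2011/1002/index.htm',  # match: \d\d\d\d
]
for url in urls:
    print(url, bool(pattern.search(url)))

So the pattern accepts 928 onwards (and any four-digit number, meaning a zero-padded issue such as 0377 would also slip through the \d\d\d\d branch, if the site pads issue numbers that way).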

Answer (灯下孤影, 2024-12-28 17:35:30)

You need to pass the deny patterns to the same SgmlLinkExtractor that collects the links to follow. As written, your last two rules are what lets everything through: a link extractor given only a deny pattern matches every link except the denied ones, and a Rule with no callback follows the links it extracts by default. So the rule denying culture.htm happily extracts and follows issue 377, issue 398, and even feature.htm, while the rule denying feature.htm follows culture.htm. You also don't need a separate Rule for each pattern when they all call the same parse_item callback. I would write your code as:

rules = (
    Rule(
        SgmlLinkExtractor(
            allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
            deny = ('culture\.htm', 'feature\.htm'),
        ),
        follow = True,
    ),
    Rule(
        SgmlLinkExtractor(
            allow = (
                'fr[0-9].htm',
                'eg[0-9]*.htm',
                'ec[0-9]*.htm',
                'op[0-9]*.htm',
                'sc[0-9]*.htm',
                're[0-9]*.htm',
                'in[0-9]*.htm',
            )
        ),
        callback = 'parse_item',
    ),
)
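
To see the deny-only problem concretely, here is a minimal sketch against a fabricated page (it uses the modern LinkExtractor import, which replaced SgmlLinkExtractor; the matching behavior is the same):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A fabricated issues page with one out-of-range issue and one denied link.
body = b'''
<a href="/2009/377/index.htm">Issue 377</a>
<a href="/culture.htm">Culture</a>
'''
response = HtmlResponse('http://example.com/issues.htm', body=body, encoding='utf-8')

# A deny-only extractor matches everything EXCEPT culture.htm, and a Rule
# with no callback would follow whatever it extracts.
extractor = LinkExtractor(deny=(r'culture\.htm',))
print([link.url for link in extractor.extract_links(response)])
# -> ['http://example.com/2009/377/index.htm']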

If those are the real url patterns in the rules you use for parse_item, this can be simplified further:

Rule(
    SgmlLinkExtractor(allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', )),
    callback = 'parse_item',
),
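
On a current Scrapy version, where SgmlLinkExtractor has been removed, the same two rules can be written with LinkExtractor; a sketch under the same assumptions as above (parse_item is a placeholder for the real item extraction):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/issues.htm']

    rules = (
        # Follow issue index pages from 928 onwards; deny lives on the same
        # extractor, so unwanted links are never collected in the first place.
        Rule(
            LinkExtractor(
                allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',),
                deny=(r'culture\.htm', r'feature\.htm'),
            ),
            follow=True,
        ),
        # Scrape the article pages with a single combined pattern.
        Rule(
            LinkExtractor(allow=(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm',)),
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        # Placeholder: real item extraction goes here.
        pass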