Scrapy is following and scraping disallowed links
I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the link to each issue follows this URL scheme:
http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number.
I only want issue 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider nevertheless seems determined to follow non-allowed links: it is trying to scrape issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This throws a lot of errors and isn't terribly important, but it requires a lot of cleaning of the data afterwards. Any suggestions as to what is going wrong?
class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
        Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),
    )
EDIT: I fixed this using a much simpler regex for 2009, 2010, 2011, but I am still curious why the above doesn't work, if anyone has any suggestions.
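The fixed regex itself isn't shown above; assuming every issue from 928 onwards was published in 2009, 2010 or 2011, a year-based follow rule along these lines would be one way to express it (a hypothetical pattern, not necessarily the one actually used):

    # Hypothetical year-based follow rule: matches any issue index page
    # from 2009, 2010 or 2011 instead of testing the issue number itself.
    Rule(SgmlLinkExtractor(allow = ('(2009|2010|2011)/\d+/index\.htm', )),
         follow = True),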
Comments (1)
You need to pass the deny argument to the SgmlLinkExtractor which collects the links to follow. And you don't need to create so many Rules if they all call one function, parse_item. I would write your code as:
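The answer's original snippet is not preserved here, so the following is only a sketch of what it describes, reusing the patterns from the question: the deny patterns move onto the extractor whose links are followed, and the seven article rules merge into one.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        # deny is attached to the same extractor whose links are followed,
        # so culture.htm and feature.htm are filtered out of the crawl.
        Rule(SgmlLinkExtractor(
                allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
                deny = ('culture\.htm', 'feature\.htm'),
            ),
            follow = True),

        # One rule carries every article pattern, since they all go to the
        # same parse_item callback.
        Rule(SgmlLinkExtractor(
                allow = ('fr[0-9].htm', 'eg[0-9]*.htm', 'ec[0-9]*.htm',
                         'op[0-9]*.htm', 'sc[0-9]*.htm', 're[0-9]*.htm',
                         'in[0-9]*.htm', ),
            ),
            callback = 'parse_item'),
    )

A deny pattern only affects the extractor it is attached to, so deny-only rules listed after the other rules do not stop those rules from extracting the unwanted links.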
If those are the real URL patterns in the rules you are using for parse_item, it can be simplified to this:
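Again only a sketch, on the assumption that every article page matches one of the two-letter prefixes followed by digits:

rules = (
    Rule(SgmlLinkExtractor(
            allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
            deny = ('culture\.htm', 'feature\.htm'),
        ),
        follow = True),

    # All seven article prefixes collapsed into a single pattern.
    Rule(SgmlLinkExtractor(allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', )),
         callback = 'parse_item'),
)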