Scrapy is following and scraping non-allowed links

I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the link to each issue follows this URL scheme:

http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number.

I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I haven't included the rest of my code). The spider nevertheless seems determined to follow non-allowed links: it is trying to scrape issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This throws a lot of errors; it isn't terribly important, but it means a lot of data cleaning afterwards. Any suggestions as to what is going wrong?

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
        Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),
    )

EDIT: I fixed this using a much simpler regex for 2009, 2010, 2011, but I am still curious why the above doesn't work, if anyone has any suggestions.
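
For reference, the allow pattern itself does seem to match what I intended when tested in isolation; here is a minimal sketch with Python's re module and made-up issue URLs:

import re

# The allow pattern from the first Rule, as a raw string.
pattern = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')

# Hypothetical issue URLs; only 928 onwards should match.
urls = [
    'http://example.com/2009/377/index.htm',   # no match: 377 is below 928
    'http://example.com/2009/928/index.htm',   # match: 92[8-9]
    'http://example.com/2010/956/index.htm',   # match: 9[3-9][0-9]
    'http://example.com/2011/1002/index.htm',  # match: \d\d\d\d
]
for url in urls:
    print(url, bool(pattern.search(url)))

So the pattern accepts 928 onwards (and any four-digit number, meaning a zero-padded issue such as 0377 would also slip through the \d\d\d\d branch, if the site pads issue numbers that way).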

Answer (灯下孤影, 2024-12-28 17:35:30)

You need to pass the deny patterns to the same SgmlLinkExtractor that collects the links to follow. As written, your last two rules are what lets everything through: a link extractor given only a deny pattern matches every link except the denied ones, and a Rule with no callback follows the links it extracts by default. So the rule denying culture.htm happily extracts and follows issue 377, issue 398, and even feature.htm, while the rule denying feature.htm follows culture.htm. You also don't need a separate Rule for each pattern when they all call the same parse_item callback. I would write your code as:

rules = (
    Rule(
        SgmlLinkExtractor(
            allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
            deny = ('culture\.htm', 'feature\.htm'),
        ),
        follow = True,
    ),
    Rule(
        SgmlLinkExtractor(
            allow = (
                'fr[0-9].htm',
                'eg[0-9]*.htm',
                'ec[0-9]*.htm',
                'op[0-9]*.htm',
                'sc[0-9]*.htm',
                're[0-9]*.htm',
                'in[0-9]*.htm',
            )
        ),
        callback = 'parse_item',
    ),
)
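
To see the deny-only problem concretely, here is a minimal sketch against a fabricated page (it uses the modern LinkExtractor import, which replaced SgmlLinkExtractor; the matching behavior is the same):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A fabricated issues page with one out-of-range issue and one denied link.
body = b'''
<a href="/2009/377/index.htm">Issue 377</a>
<a href="/culture.htm">Culture</a>
'''
response = HtmlResponse('http://example.com/issues.htm', body=body, encoding='utf-8')

# A deny-only extractor matches everything EXCEPT culture.htm, and a Rule
# with no callback would follow whatever it extracts.
extractor = LinkExtractor(deny=(r'culture\.htm',))
print([link.url for link in extractor.extract_links(response)])
# -> ['http://example.com/2009/377/index.htm']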

If those are the real url patterns in the rules you use for parse_item, this can be simplified further:

Rule(
    SgmlLinkExtractor(allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', )),
    callback = 'parse_item',
),
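
On a current Scrapy version, where SgmlLinkExtractor has been removed, the same two rules can be written with LinkExtractor; a sketch under the same assumptions as above (parse_item is a placeholder for the real item extraction):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/issues.htm']

    rules = (
        # Follow issue index pages from 928 onwards; deny lives on the same
        # extractor, so unwanted links are never collected in the first place.
        Rule(
            LinkExtractor(
                allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',),
                deny=(r'culture\.htm', r'feature\.htm'),
            ),
            follow=True,
        ),
        # Scrape the article pages with a single combined pattern.
        Rule(
            LinkExtractor(allow=(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm',)),
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        # Placeholder: real item extraction goes here.
        pass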