Scrapy deny rules being ignored

Posted on 2024-12-26 05:04:49

I have some rules that I grab dynamically from the database and add to my spider:

        self.name =  exSettings['site']
        self.allowed_domains = [exSettings['root']]
        self.start_urls = ['http://' + exSettings['root']]

        self.rules =  [Rule(SgmlLinkExtractor(allow=(exSettings['root'] + '$',)), follow= True)]
        denyRules = []

        for rule in exSettings['settings']:
            linkRegex = rule['link_regex']

            if rule['link_type'] == 'property_url':
                propertyRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True, callback='parseProperty')
                self.rules.insert(0, propertyRule)
                self.listingEx.append({'link_regex': linkRegex, 'extraction': rule['extraction']})

            elif rule['link_type'] == 'project_url':
                projectRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True) #not set to crawl yet due to conflict if same links appear for both
                self.rules.insert(0, projectRule)

            elif rule['link_type'] == 'favorable_url':
                favorableRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True)
                self.rules.append(favorableRule)

            elif rule['link_type'] == 'ignore_url':
                denyRules.append(linkRegex)

        #somehow all urls will get ignored if allow is empty and put as the first rule
        d = Rule(SgmlLinkExtractor(allow=('testingonly',), deny=tuple(denyRules)), follow=True)

        #self.rules.insert(0,d) #I have tried with both status but same results
        self.rules.append(d)

And I have the following rules in my database:

link_regex: /listing/\d+/.+  (property_url)
link_regex: /project-listings/.+    (favorable_url)
link_regex: singapore-property-listing/   (favorable_url)
link_regex: /mrt/  (ignore_url)

And I see this in my log:

 http://www.propertyguru.com.sg/singapore-property-listing/property-for-sale/mrt/125/ang-mo-kio-mrt-station> (referer: http://www.propertyguru.com.sg/listing/8277630/for-sale-thomson-grand-6-star-development-)

Isn't /mrt/ supposed to be denied? Why is the above link still being crawled?

Comments (1)

财迷小姐 2025-01-02 05:04:49

As far as I know, the deny arguments must be in the same SgmlLinkExtractor that has the allow patterns.

In your case you created an SgmlLinkExtractor that allows the favorable_url pattern ('singapore-property-listing/'), but this extractor doesn't have any deny patterns, so it extracts /mrt/ links too.
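To illustrate, here is a minimal, self-contained sketch (not the asker's spider): the HTML body and the second link are made up, and it uses the modern LinkExtractor, which takes the same allow/deny arguments as the old SgmlLinkExtractor:

    from scrapy.http import HtmlResponse
    from scrapy.linkextractors import LinkExtractor

    # Fake page: one /mrt/ link (taken from the question's log) and one
    # ordinary listing link (made up for the example).
    body = b"""
    <html><body>
      <a href="/singapore-property-listing/property-for-sale/mrt/125/ang-mo-kio-mrt-station">mrt</a>
      <a href="/singapore-property-listing/property-for-sale/district-19">district</a>
    </body></html>
    """
    response = HtmlResponse(url="http://www.propertyguru.com.sg/", body=body, encoding="utf-8")

    # allow only -- behaves like the favorable_url rule in the question:
    allow_only = LinkExtractor(allow=(r"singapore-property-listing/",))
    print([link.url for link in allow_only.extract_links(response)])  # both links, /mrt/ included

    # allow + deny in the same extractor -- the /mrt/ link is dropped:
    allow_and_deny = LinkExtractor(allow=(r"singapore-property-listing/",),
                                   deny=(r"/mrt/",))
    print([link.url for link in allow_and_deny.extract_links(response)])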

To fix this, you should add the deny patterns to the corresponding SgmlLinkExtractors, as in the sketch below. Also, see the related question.
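For example, one way to apply that to the question's code is to collect the ignore_url patterns first and then pass them as deny= to every extractor. This is only a sketch: the variable names follow the question, the two-pass structure is my assumption, and Rule / SgmlLinkExtractor are assumed to be imported as in the original spider:

    # Collect all ignore_url patterns up front so they can be applied to
    # every link extractor that has an allow pattern.
    deny = tuple(r['link_regex'] for r in exSettings['settings']
                 if r['link_type'] == 'ignore_url')

    self.rules = [Rule(SgmlLinkExtractor(allow=(exSettings['root'] + '$',), deny=deny),
                       follow=True)]

    for rule in exSettings['settings']:
        linkRegex = rule['link_regex']

        if rule['link_type'] == 'property_url':
            self.rules.insert(0, Rule(SgmlLinkExtractor(allow=(linkRegex,), deny=deny),
                                      follow=True, callback='parseProperty'))
            self.listingEx.append({'link_regex': linkRegex,
                                   'extraction': rule['extraction']})

        elif rule['link_type'] == 'project_url':
            self.rules.insert(0, Rule(SgmlLinkExtractor(allow=(linkRegex,), deny=deny),
                                      follow=True))

        elif rule['link_type'] == 'favorable_url':
            self.rules.append(Rule(SgmlLinkExtractor(allow=(linkRegex,), deny=deny),
                                   follow=True))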

There may be ways to define global deny patterns, but I haven't seen them.
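One workaround (my own suggestion, not something built into Scrapy) is to do the "global" filtering in a process_links callback shared by all rules, so the deny list lives in one place:

    import re

    # Hypothetical global deny list; in the question's setup these would be
    # the ignore_url regexes loaded from the database.
    DENY_PATTERNS = [re.compile(p) for p in (r"/mrt/",)]

    def drop_denied(links):
        # process_links callback: drop any extracted link whose URL matches
        # one of the deny patterns.
        return [link for link in links
                if not any(p.search(link.url) for p in DENY_PATTERNS)]

    # Attach it to every rule, e.g.:
    #   Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True, process_links=drop_denied)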
