Python Scrapy: adding new domains to allowed_domains from a database

Posted on 2024-11-14 19:45:13

I need to add more domains to allowed_domains so that I don't get the "Filtered offsite request to" message.

My app gets the URLs to fetch from a database, so I can't add them manually.

I tried to override the spider's __init__ like this:

    def __init__(self):
        super(CrawlSpider, self).__init__()
        self.start_urls = []
        for destination in Phpbb.objects.filter(disable=False):
            self.start_urls.append(destination.forum_link)
            self.allowed_domains.append(destination.link)

start_urls was fine; that was the first issue I had to solve. But the change to allowed_domains has no effect.

Do I need to change some configuration to disable domain checking? I don't really want that, since I only want the domains from the database, but disabling the domain check would help me for now.

Thanks!!

明月夜 2024-11-21 19:45:13

The allowed_domains attribute is optional. To get started, you can omit it, which disables domain filtering.

In scrapy/contrib/spidermiddleware/offsite.py, you can override this method to implement your own domain filtering:

import re

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy"""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('')  # allow all by default
    # Escape the dots so they match literally instead of as regex wildcards
    domains = [d.replace('.', r'\.') for d in allowed_domains]
    regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
    return re.compile(regex)
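One thing this regex makes visible: it is matched against the host of each request, so the entries in allowed_domains have to be bare hostnames. The question does not say what destination.link contains, but if it holds full URLs, the pattern could never match. Below is a rough standalone sketch of that behaviour, using only the standard library. The names hostnames_from_urls and build_host_regex and the sample URLs are made up for illustration and are not part of Scrapy; build_host_regex just repeats the logic of get_host_regex without the spider lookup.

```python
import re
from urllib.parse import urlparse


def hostnames_from_urls(urls):
    """Reduce full URLs to bare hostnames, dropping duplicates but keeping order."""
    seen = []
    for url in urls:
        host = urlparse(url).hostname  # None when the string has no scheme/netloc
        if host and host not in seen:
            seen.append(host)
    return seen


def build_host_regex(allowed_domains):
    """Same logic as get_host_regex, taking the domain list directly."""
    if not allowed_domains:
        return re.compile('')  # allow all by default
    domains = [d.replace('.', r'\.') for d in allowed_domains]
    return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))


# Hypothetical values standing in for rows read from the database.
forum_links = [
    'http://forum.example.com/index.php',
    'http://boards.example.org/phpbb/',
]

# Pattern built from the URLs as-is, versus from their hostnames only.
url_regex = build_host_regex(forum_links)
host_regex = build_host_regex(hostnames_from_urls(forum_links))

request_host = 'forum.example.com'
print(bool(url_regex.search(request_host)))   # False
print(bool(host_regex.search(request_host)))  # True
```

If that assumption holds for your data, appending the hostname of each link (rather than the link itself) to allowed_domains may be enough, without disabling the offsite check.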
