Python Scrapy: adding new domains to allowed_domains from a database
I need to add more domains to allowed_domains so that I don't get "Filtered offsite request" errors.
My app gets the URLs to fetch from a database, so I can't add them manually.
I tried to override the spider's __init__, like this:
def __init__(self):
    super(CrawlSpider, self).__init__()
    self.start_urls = []
    for destination in Phpbb.objects.filter(disable=False):
        self.start_urls.append(destination.forum_link)
        self.allowed_domains.append(destination.link)
start_urls works fine; that was the first issue I had to solve. But allowed_domains has no effect.
Do I need to change some configuration to disable domain checking? I don't really want that, since I only want the domains from the database, but disabling the domain check would help me for now.
Thanks!!
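One likely reason allowed_domains has no effect in the code above: Scrapy expects bare hostnames (e.g. example.com), while destination.link in the question may be a full URL. A minimal sketch of extracting the host part first, using Python's standard urllib (domains_from_urls is a hypothetical helper name, not part of Scrapy or the questioner's code):

```python
from urllib.parse import urlparse

def domains_from_urls(urls):
    # allowed_domains should hold hostnames like 'example.com',
    # not full links like 'http://example.com/forum'.
    return [urlparse(url).netloc for url in urls]

print(domains_from_urls(['http://forum.example.com/index.php']))
# ['forum.example.com']
```

In the spider's __init__ you would then append domains_from_urls([destination.link])[0] instead of destination.link itself.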
Comments (1)
The allowed_domains parameter is optional. To get started, you can simply omit it, which disables domain filtering altogether.
For custom domain filtering, you can override get_host_regex in scrapy/contrib/spidermiddleware/offsite.py (the fragment below is reconstructed from that file's OffsiteMiddleware):

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy"""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('')  # allow all by default
    regex = r'^(.*\.)?(%s)$' % '|'.join(allowed_domains)
    return re.compile(regex)
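To illustrate what that filter accepts, here is a self-contained sketch in plain Python (no Scrapy required; build_host_regex is a stand-in name written in the same spirit as the middleware, with re.escape added for safety): a host passes if it equals an allowed domain or is a subdomain of one.

```python
import re

def build_host_regex(allowed_domains):
    # Match a host that equals an allowed domain,
    # or any subdomain of one (the optional '(.*\.)?' prefix).
    if not allowed_domains:
        return re.compile('')  # empty pattern matches any host
    domains = [re.escape(d) for d in allowed_domains]
    return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))

host_regex = build_host_regex(['example.com'])
print(bool(host_regex.search('forum.example.com')))  # True
print(bool(host_regex.search('evil-example.com')))   # False
```

Note that 'evil-example.com' is rejected: the pattern is anchored, so only exact matches and dot-separated subdomains get through.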