Crawling URLs inside a webpage with Scrapy
I'm using Scrapy to extract data from certain websites. The problem is that my spider can only crawl the initial start_urls; it can't crawl the URLs found inside those pages.
Here is my spider, copied exactly:
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc

from nextlink.items import NextlinkItem


class Nextlink_Spider(BaseSpider):
    name = "Nextlink"
    allowed_domains = ["Nextlink"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//body/div[2]/div[3]/div/ul/li[2]/a/@href')

        for site in sites:
            relative_url = site.extract()
            url = self._urljoin(response, relative_url)
            yield Request(url, callback=self.parsetext)

    def parsetext(self, response):
        log = open("log.txt", "a")
        log.write("test if the parsetext is called")
        hxs = HtmlXPathSelector(response)
        items = []
        texts = hxs.select('//div').extract()
        for text in texts:
            item = NextlinkItem()
            item['text'] = text
            items.append(item)
            log = open("log.txt", "a")
            log.write(text)
        return items

    def _urljoin(self, response, url):
        """Helper to convert relative urls to absolute"""
        return urljoin_rfc(response.url, url, response.encoding)
I use log.txt to test whether parsetext is called. However, after I ran my spider, there was nothing in log.txt.
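A quick way to confirm whether a callback runs at all, without relying on a manually opened file, is the spider's built-in log method. A minimal sketch, using the same old-style Scrapy API as the spider above (the spider name is just a placeholder):

from scrapy.spider import BaseSpider


class CallbackCheckSpider(BaseSpider):
    # minimal example spider, only meant to show the self.log() check;
    # the same call can be dropped into parsetext() in the spider above
    name = "callback_check"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        # shows up in Scrapy's normal console/log output, no file handling needed
        self.log("parse called for %s" % response.url)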
3 Answers
See here:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html?highlight=allowed_domains#scrapy.spider.BaseSpider.allowed_domains

So, as long as you didn't activate the OffsiteMiddleware in your settings, it doesn't matter and you can leave allowed_domains out completely. Check settings.py to see whether the OffsiteMiddleware is activated. It shouldn't be activated if you want your spider to be able to crawl any domain.
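For illustration, a minimal settings.py sketch that disables the offsite filter; the middleware path below matches the old scrapy.contrib layout used in the question (newer Scrapy versions ship it as scrapy.spidermiddlewares.offsite.OffsiteMiddleware):

# settings.py (sketch): assigning None to a middleware disables it, so
# requests are no longer filtered against the spider's allowed_domains.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}

With the middleware disabled, or with allowed_domains simply removed from the spider, the requests yielded from parse() are no longer dropped as off-site.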
I think the problem is that you didn't tell Scrapy to follow each crawled URL. For my own blog I've implemented a CrawlSpider that uses LinkExtractor-based Rules to extract all relevant links from my blog pages.

On https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described in detail how I implemented a website crawler to warm up my WordPress full-page cache.
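The linked post isn't reproduced here, but a minimal sketch of the pattern described above, using the same old-style scrapy.contrib imports as the question (MyBlogSpider, myblog.com and parse_page are placeholder names):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MyBlogSpider(CrawlSpider):
    name = "myblog"
    allowed_domains = ["myblog.com"]
    start_urls = ["http://www.myblog.com/"]

    # follow=True tells CrawlSpider to keep following links found on every
    # crawled page, which is what the plain parse()-based spider never does
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # per-page work goes here; note that a CrawlSpider must not override parse()
        self.log("crawled %s" % response.url)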
My guess would be this line:

allowed_domains = ["Nextlink"]

This isn't a domain like domain.tld, so it would reject any links.
If you take the example from the documentation:
allowed_domains = ["dmoz.org"]