Scrapy start_urls
from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below define a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items
But why does it scrape only these two pages? I see allowed_domains = ["dmoz.org"], but these two pages also contain links to other pages within the dmoz.org domain! Why doesn't it scrape those as well?
6 Answers
The start_urls class attribute contains the start URLs, nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding Requests from the parse callback, with [another] callback. If you still want to customize how the start requests are created, override the method BaseSpider.start_requests().
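A minimal sketch of that idea, reusing the question's spider setup; the DmozLinkSpider class, the bare //a/@href XPath and the parse_category callback name are illustrative placeholders, not part of the original code:

import urlparse

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider


class DmozLinkSpider(Spider):
    # Illustrative spider; in practice you would change the existing
    # DmozSpider.parse() in the same way.
    name = "dmoz_links"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        sel = Selector(response)
        # Queue every link found on the start page for a separate callback;
        # the XPath is a placeholder, narrow it to the links you actually want.
        for href in sel.xpath('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse_category)

    def parse_category(self, response):
        # Extract Website items here, the same way the original parse() does.
        pass

Requests to other domains are still dropped by the allowed_domains filter, so only dmoz.org links get followed.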
start_urls contains the links from which the spider starts crawling.
If you want to crawl recursively, you should use a CrawlSpider and define rules for it.
Look at http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
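A rough sketch of that approach, assuming the older Scrapy API this question uses (SgmlLinkExtractor under scrapy.contrib; newer releases use LinkExtractor instead); the spider name, the allow pattern and the parse_item callback are illustrative:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from dirbot.items import Website


class DmozCrawlSpider(CrawlSpider):
    name = "dmoz_crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    # Follow every link under the Python category and pass each response
    # to parse_item. CrawlSpider uses parse() internally, so the callback
    # must have a different name.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/Computers/Programming/Languages/Python/',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        for site in sel.xpath('//ul[@class="directory-url"]/li'):
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            yield item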
The class does not have a rules property. Have a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.
If you use BaseSpider, then inside the callback you have to extract the URLs you want yourself and return Request objects. If you use CrawlSpider, link extraction is taken care of by the rules and the SgmlLinkExtractor associated with them.
If you use a rule to follow links (that is already implemented in Scrapy), the spider will scrape them too. I hope this helps...
You didn't write a function to handle the URLs you want to get, so there are two ways to resolve this: 1. use rules (CrawlSpider), or 2. write a function to handle the new URLs and pass it as the callback.