Scrapy web crawler fails to follow links
I'm very new to Scrapy. Here is my spider to crawl twistedweb.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TwistedWebSpider(BaseSpider):
    name = "twistedweb3"
    allowed_domains = ["twistedmatrix.com"]
    start_urls = [
        "http://twistedmatrix.com/documents/current/web/howto/",
    ]
    rules = (
        Rule(SgmlLinkExtractor(),
             'parse',
             follow=True),
    )

    def parse(self, response):
        print response.url
        filename = response.url.split("/")[-1]
        filename = filename or "index.html"
        open(filename, 'wb').write(response.body)
When I run scrapy-ctl.py crawl twistedweb3, it fetches only the index.html content. I tried using SgmlLinkExtractor; it extracts the links as I expected, but those links cannot be followed.
Can you show me where I am going wrong?
Suppose I also want to get the CSS and JavaScript files. How do I achieve this? I mean, fetch the full website?
1 Answer
The rules attribute belongs to CrawlSpider. Use class MySpider(CrawlSpider). Also, when you use CrawlSpider you must not override the parse method; instead, use parse_response or some other similar name.
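
For reference, here is a minimal sketch of the spider rewritten under that advice. The callback name parse_page is an arbitrary choice of mine (anything other than parse works), and the import paths assume the same SgmlLinkExtractor-era Scrapy that the question uses:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TwistedWebSpider(CrawlSpider):
    name = "twistedweb3"
    allowed_domains = ["twistedmatrix.com"]
    start_urls = [
        "http://twistedmatrix.com/documents/current/web/howto/",
    ]
    # CrawlSpider defines parse() itself to drive the rules, so the
    # callback must have a different name or the rules never fire.
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):  # hypothetical name; anything but 'parse'
        filename = response.url.split("/")[-1] or "index.html"
        open(filename, 'wb').write(response.body)

As for the CSS and JavaScript files: by default SgmlLinkExtractor only scans <a> and <area> tags and filters out common static-asset extensions, so, if I remember the defaults correctly, you would have to widen it, e.g. SgmlLinkExtractor(tags=('a', 'area', 'link', 'script'), attrs=('href', 'src'), deny_extensions=[]).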