Python CrawlSpider
I've been learning how to use Scrapy, though I had minimal experience in Python to begin with. I started out scraping with BaseSpider. Now I'm trying to crawl websites, but I've run into a problem that has really confused me. Here is the example code from the official site at http://doc.scrapy.org/topics/spiders.html.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class TestItem(Item):
    # Item definition with the fields parse_item fills in below;
    # the docs snippet references TestItem without defining it.
    id = Field()
    name = Field()
    description = Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        print "WHY WONT YOU WORK!!!!!!!!"
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = TestItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item
The only change I made is the statement:
print "WHY WONT YOU WORK!!!!!!!!"
But since I'm not seeing this print statement at runtime, I fear that this function isn't being reached. This is the code I took directly from the official Scrapy site. What am I doing wrong or misunderstanding?
2 Answers
example.com doesn't have any links for categories or items; it's just a placeholder for what a scraped site's URL might look like. It is a non-working example in the documentation.
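You can see the point without running a crawl at all: the `allow` and `deny` arguments are regular expressions matched against each extracted URL, so if no link on the page matches `item\.php`, the `parse_item` callback is never scheduled. Here is a minimal sketch of that matching logic in plain Python; the URL lists are made up for illustration, and real CrawlSpider dispatch is more involved than this:

```python
import re

# The patterns from the two rules in the question.
FOLLOW = re.compile(r'category\.php')  # followed, no callback
PARSE = re.compile(r'item\.php')       # sent to parse_item

def classify(urls):
    """Split extracted links the way the two rules would."""
    followed = [u for u in urls if FOLLOW.search(u)]
    parsed = [u for u in urls if PARSE.search(u)]
    return followed, parsed

# Hypothetical links found on a page that actually has categories and items:
page_links = [
    'http://www.example.com/category.php?id=7',
    'http://www.example.com/item.php?id=42',
    'http://www.example.com/about.html',
]
print(classify(page_links))

# A page with no category.php/item.php links matches nothing,
# so parse_item would never be called:
print(classify(['http://www.iana.org/domains/example']))
```

Running something like this against the links you actually expect your target site to contain is a quick way to sanity-check your `allow`/`deny` patterns before starting a crawl.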
You might try making a spider that you know works and see whether print statements do anything where you've placed them. I seem to remember trying the same thing a long time ago and finding that they won't show up, even when the code is executed.
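A more reliable alternative to bare `print` is routing diagnostics through Python's standard `logging` module, which is also the machinery behind the spider's `self.log()`: log output goes to configured handlers and is shown or filtered by level, rather than depending on where stdout ends up. A small self-contained sketch, with no Scrapy involved, so the handler setup here is purely illustrative:

```python
import logging
from io import StringIO

# Capture log output in a buffer so it can be inspected afterwards.
buffer = StringIO()
handler = logging.StreamHandler(buffer)
logger = logging.getLogger('myspider')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

def parse_item_stub(url):
    # Stand-in for self.log(...) in the spider: the message goes
    # through a handler instead of relying on stdout.
    logger.info('Hi, this is an item page! %s' % url)

parse_item_stub('http://www.example.com/item.php?id=1')
print(buffer.getvalue())
```

If the logged message shows up but the print never does, you know the callback is being reached and the problem is only with where the print output is going.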