Scrapy Yahoo Group Spider
I'm trying to scrape a Y! Group, and I can get data from one page, but that's it. I've got some basic rules, but clearly they aren't right. Has anyone already solved this one?
class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups',))),
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        item = Item()
        for site in sites:
            item = YgroupItem()
            item['title'] = site.select('//title').extract()
            item['pubDate'] = site.select('//abbr[@class="updated"]/text()').extract()
            item['desc'] = site.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
        return item
Comments (1)
Looks like you have almost no idea what you are doing. I'm pretty new to Scrapy, but I think you will want something like:
Rule(SgmlLinkExtractor(allow=(r'http://example\.com/message/.*\.aspx',)), callback='parse_item'),
Try to write a regular expression that matches the complete link URL you want. Also, it looks like you only need one rule, so add the callback to the first one. The link extractor takes every link matched by the regular expressions in allow, drops those matched by deny, and each remaining page is then loaded and passed to parse_item. I'm saying all this without really knowing anything about the page you are data mining or the nature of the data you want. You want this sort of spider for a page that has links to the pages containing the data you want.
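Putting that advice together, here is a minimal sketch of the single-rule version, kept on the same old-style Scrapy APIs this thread uses (SgmlLinkExtractor, HtmlXPathSelector). The allow pattern is only a guess at the message-URL format for this group, and YgroupItem is defined inline so the sketch is self-contained:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class YgroupItem(Item):
    # Declared inline for completeness; the question's project defines its own.
    title = Field()
    pubDate = Field()
    desc = Field()

class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    # One rule: follow only links whose URL looks like a message page
    # (the pattern below is an assumed example), skip 'mygroups' links,
    # and hand every matching page to parse_item. follow=True keeps the
    # crawl going from matched pages; with a callback set, a Rule stops
    # following links by default, which fits the "one page and that's it"
    # symptom described above.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/group/random_public_ygroup/message/\d+',),
                               deny=(r'mygroups',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = YgroupItem()
        item['title'] = hxs.select('//title/text()').extract()
        item['pubDate'] = hxs.select('//abbr[@class="updated"]/text()').extract()
        item['desc'] = hxs.select("//div[contains(concat(' ', normalize-space(@class), ' '), ' entry-content ')]/text()").extract()
        return item

Whether a single rule with follow=True or a pair of rules is right depends on the actual link structure of the group pages, so check the real message URLs before settling on the allow pattern.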