Scrapy Yahoo Group Spider
I'm trying to scrape a Y! Group, and I can get data from one page, but that's it. I've got some basic rules, but clearly they aren't right. Has anyone already solved this one?
class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups',))),
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        item = Item()
        for site in sites:
            item = YgroupItem()
            item['title'] = site.select('//title').extract()
            item['pubDate'] = site.select('//abbr[@class="updated"]/text()').extract()
            item['desc'] = site.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
        return item
Comments (1)
Looks like you have almost no idea what you are doing. I'm pretty new to Scrapy, but I think you will want something like:
Rule(SgmlLinkExtractor(allow=(r'http://example\.com/message/.*\.aspx',)), callback='parse_item'),
Try to write a regular expression that matches the complete link URL you want. Also, it looks like you only need one rule, so add the callback to the first one. The link extractor takes every link matched by the regular expressions in allow, drops those matched by deny, and each remaining page is then loaded and passed to parse_item. I'm saying all this without really knowing anything about the page you are data mining or the nature of the data you want. You want this sort of spider for a page that has links to the pages containing the data you want.
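Putting that advice together, here is a minimal sketch of the single-rule version, kept on the same old-style Scrapy APIs this thread uses (SgmlLinkExtractor, HtmlXPathSelector). The allow pattern is only a guess at the message-URL format for this group, and YgroupItem is defined inline so the sketch is self-contained:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class YgroupItem(Item):
    # Declared inline for completeness; the question's project defines its own.
    title = Field()
    pubDate = Field()
    desc = Field()

class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    # One rule: follow only links whose URL looks like a message page
    # (the pattern below is an assumed example), skip 'mygroups' links,
    # and hand every matching page to parse_item. follow=True keeps the
    # crawl going from matched pages; with a callback set, a Rule stops
    # following links by default, which fits the "one page and that's it"
    # symptom described above.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/group/random_public_ygroup/message/\d+',),
                               deny=(r'mygroups',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = YgroupItem()
        item['title'] = hxs.select('//title/text()').extract()
        item['pubDate'] = hxs.select('//abbr[@class="updated"]/text()').extract()
        item['desc'] = hxs.select("//div[contains(concat(' ', normalize-space(@class), ' '), ' entry-content ')]/text()").extract()
        return item

Whether a single rule with follow=True or a pair of rules is right depends on the actual link structure of the group pages, so check the real message URLs before settling on the allow pattern.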