Scrapy Yahoo Group Spider

Posted 2024-10-25 22:17:12


Trying to scrape a Y! Group and I can get data from one page but that's it. I've got some basic rules but clearly they aren't right. Anyone already solved this one?

# Imports for the (pre-1.0) Scrapy API used here; SgmlLinkExtractor and
# HtmlXPathSelector come from the old scrapy.contrib modules.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import YgroupItem  # project-specific item (path is illustrative)


class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    rules = (
        # Follow message links (no callback, so these pages are only crawled).
        Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups',))),
        # Every other extracted link is parsed with parse_item.
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = YgroupItem()
        item['title'] = hxs.select('//title').extract()
        item['pubDate'] = hxs.select('//abbr[@class="updated"]/text()').extract()
        item['desc'] = hxs.select("//div[contains(concat(' ', normalize-space(@class), ' '), ' entry-content ')]/text()").extract()
        return item
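
For reference, the spider assumes a YgroupItem with the three fields used in parse_item above. The asker's actual items.py isn't shown, so this is an inferred sketch:

from scrapy.item import Item, Field

class YgroupItem(Item):
    # Fields inferred from the assignments in parse_item above.
    title = Field()
    pubDate = Field()
    desc = Field()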


帅气尐潴 2024-11-01 22:17:12


It looks like you have almost no idea what you're doing. I'm pretty new to Scrapy myself, but I think you'll want something like:

    Rule(SgmlLinkExtractor(allow=(r'http://example\.com/message/.*\.aspx',)), callback='parse_item'),

Try writing a regular expression that matches the complete link URLs you want. Also, it looks like you only need one rule, so add the callback to the first one. The link extractor takes every link that matches a regex in allow, excludes any that match deny, and each remaining page is then loaded and passed to parse_item.
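
A minimal sketch of that single combined rule, assuming Yahoo Groups message permalinks look like /group/<group>/message/<n> (the allow regex is a guess and should be checked against the real URLs). One more thing worth knowing: when a Rule has a callback, CrawlSpider's follow defaults to False, which by itself would explain only ever getting data from one page; follow=True keeps the crawl going:

# Hypothetical single rule; the allow pattern must be adapted to the real site.
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/group/[^/]+/message/\d+',),
                           deny=('mygroups',)),
         callback='parse_item',
         follow=True),  # follow defaults to False once a callback is set
)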

I'm saying all this without really knowing anything about the page you're mining or the nature of the data you want. This sort of spider is for a site where pages of links lead to the pages that hold the data you're after.
