Scrapy - parse pages to extract items - then follow and store the item URL contents


I have a question about how to do this in Scrapy. I have a spider that crawls listing pages of items.
Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield items. So far so good, everything works great.

But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).

I'm not sure how to organize the code to achieve this, since the two links (the listings link and an individual item link) are followed differently, with callbacks called at different times, yet I have to correlate them when processing the same item.

My code so far looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# ExampleItem and ExampleLoader are defined elsewhere in the project.

class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        # Listing pages, reached through the pagination links.
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort=',),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item'),
        # Item detail pages.
        Rule(SgmlLinkExtractor(allow=('item\/detail',)), follow=False),
    )


    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'

        sub_selectors = main_selector.select(xpath)

        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
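
For context, the ExampleItem used above needs a field to hold the fetched detail-page content. A minimal sketch of what the item definition might look like (only title and url_contents come from the question; the rest is an assumption):

from scrapy.item import Item, Field

class ExampleItem(Item):
    title = Field()
    url_contents = Field()  # will hold the fetched contents of the item's detail URL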

2 Answers

栩栩如生 2024-11-11 03:58:41

After some testing and thinking, I found a solution that works for me.
The idea is to use just the first rule, the one that gives you the listings of items, and, very importantly, to add follow=True to that rule.
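
With that change, the rules tuple is reduced to the listings rule alone. A minimal sketch, reusing the allow/deny patterns and restrict_xpaths from the question:

rules = (
    # Only the listings rule remains; follow=True keeps the crawler paginating,
    # while item detail pages are requested manually from parse_item() instead.
    Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort=',),
                           restrict_xpaths='//div[@class="pagination"]'),
         callback='parse_item', follow=True),
)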

In parse_item() you yield a request instead of an item, but only after you load the item. The request is for the item's detail URL, and you have to pass the loaded item along to that request's callback. You do your work with the response, and that is where you yield the item.

So the end of parse_item() will look like this:

itemloaded = l.load_item()

# fill url contents: follow the item's detail URL and pass the loaded item along in meta
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback=self.parse_url_contents)  # Request comes from scrapy.http
request.meta['item'] = itemloaded

yield request

And then parse_url_contents() will look like this:

def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
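
For what it's worth, on newer Scrapy versions (roughly 1.7 and later) the same hand-off can be written with response.follow and cb_kwargs instead of request.meta. A rough sketch, where the detail-link XPath ('a/@href') is only a placeholder:

def parse_item(self, response):
    for sel in response.xpath('//h2[@class="title"]'):
        loader = ExampleLoader(item=ExampleItem(), selector=sel)
        loader.add_xpath('title', 'a[@title]/@title')
        item = loader.load_item()

        # response.follow resolves relative URLs and builds the Request;
        # cb_kwargs passes the partially filled item to the next callback.
        detail_url = sel.xpath('a/@href').get()
        yield response.follow(detail_url, callback=self.parse_url_contents,
                              cb_kwargs={'item': item})

def parse_url_contents(self, response, item):
    item['url_contents'] = response.text
    yield item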

If anyone has another (better) approach, let us know.

Stefan

决绝 2024-11-11 03:58:41

I'm sitting with exactly the same problem, and from the fact that no one has answered your question for two days I take it that the only solution is to follow that URL manually, from within your parse_item function.

I'm new to Scrapy, so I wouldn't attempt it that way (although I'm sure it's possible), but my solution would be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the item. Yes, it's more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
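
A minimal sketch of that manual-fetch idea, assuming Python 3's urllib.request and BeautifulSoup 4 (the helper name and the placeholder XPath are illustrative, not from the answer):

import urllib.request
from bs4 import BeautifulSoup

def fetch_url_contents(url):
    """Synchronously download an item's detail page (this blocks the crawler while it waits)."""
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Extract whatever detail you need from the parsed page here;
    # returning the visible text is just a placeholder.
    return soup.get_text()

# Usage inside parse_item(), after the item has been loaded:
#     detail_url = sel.select('a/@href').extract()[0]
#     item['url_contents'] = fetch_url_contents(detail_url)
#     yield item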
