Scrapy :: JSON export issue

Published 2024-12-10 18:41:46


So, I have spent quite a bit of time going through the Scrapy documentation and tutorials, and I have since been plugging away at a very basic crawler. However, I am not able to get the output into a JSON file. I feel like I am missing something obvious, but I haven't been able to turn anything up after looking at a number of other examples, and trying several different things out.

To be thorough, I will include all of the relevant code. What I am trying to get here is some specific items and their associated prices. The prices will change fairly often, and the items will change with much lower frequency.

Here is my items.py:

from scrapy.item import Item, Field

class CartItems(Item):
    url = Field()
    name = Field()
    price = Field()

And here is the spider:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

from Example.items import CartItems

class DomainSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/path/to/desired/page']


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cart = CartItems()
        cart['url'] = hxs.select('//title/text()').extract()
        cart['name'] = hxs.select('//td/text()').extract()[1]
        cart['price'] = hxs.select('//td/text()').extract()[2]
        return cart

If, for example, I run hxs.select('//td/text()').extract()[1] in the Scrapy shell on the URL http://www.example.com/path/to/desired/page, I get the following response:

u'Text field I am trying to download'

EDIT:

Okay, so I wrote a pipeline that follows one I found in the wiki (I somehow missed this section when I was digging through this the last few days), just altered to use JSON instead of XML.

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonItemExporter

class JsonExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This does output a file "example.com_items.json", but all it contains is "[]". So, something is still not right here. Is the issue with the spider, or is the pipeline not done correctly? Clearly I am missing something, so if someone could nudge me in the right direction, or link to any examples that might help, it would be much appreciated.
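One thing worth double-checking that is not shown in the question: whether the pipeline is registered in the project settings at all. If ITEM_PIPELINES does not list it, process_item never runs, while spider_opened/spider_closed signals still fire, so start_exporting and finish_exporting write exactly the empty "[]". A sketch, assuming the project package is named Example (as in `from Example.items import CartItems`) and the pipeline lives in Example/pipelines.py; adjust the dotted path to wherever JsonExportPipeline is actually defined:

```python
# settings.py -- register the pipeline so Scrapy routes items through
# process_item. The dotted path below is an assumption based on the
# `from Example.items import CartItems` import in the spider.
ITEM_PIPELINES = {
    'Example.pipelines.JsonExportPipeline': 300,  # value = ordering priority
}
```

Older Scrapy releases (the same era as the scrapy.contrib imports used here) accepted a plain list of class paths instead of a dict.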

Comments (2)

旧伤慢歌 2024-12-17 18:41:46


JsonItemExporter is fairly simple:

class JsonItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("[")

    def finish_exporting(self):
        self.file.write("]")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))

So, I have two conclusions:

  1. File is created - your pipeline is active and hooks spider_opened and spider_closed events.

  2. process_item is never called. Maybe no item is scraped, so no item is passed to this pipeline?

Also, I think there is a bug in the code:

def spider_opened(self, spider):
    file = open('%s_items.json' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = JsonItemExporter(file)
    self.exporter.start_exporting()

self.exporter = JsonItemExporter(file) - doesn't this mean that only one exporter is ever active? Once a spider is opened you create an exporter, but while that spider is active another spider can open, and self.exporter will be overwritten by the new exporter.
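The fix for that is to key the exporter by spider, exactly as self.files already does. The sketch below shows the pattern self-contained: the dispatcher wiring is omitted, and a minimal stand-in (mirroring the JsonItemExporter source quoted above) replaces the real exporter so the pattern runs outside Scrapy; in a real project you would keep using scrapy's JsonItemExporter and signal hooks.

```python
import io
import json

class SimpleJsonExporter:
    """Minimal stand-in mirroring the quoted JsonItemExporter: writes a JSON array."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write("[")

    def finish_exporting(self):
        self.file.write("]")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        self.file.write(json.dumps(item))

class JsonExportPipeline:
    """One exporter *per spider*, instead of a single self.exporter
    that gets overwritten when a second spider opens."""

    def __init__(self):
        self.files = {}
        self.exporters = {}

    def spider_opened(self, spider, file=None):
        # In a real project: file = open('%s_items.json' % spider.name, 'w')
        file = file or io.StringIO()
        self.files[spider] = file
        self.exporters[spider] = SimpleJsonExporter(file)
        self.exporters[spider].start_exporting()

    def spider_closed(self, spider):
        self.exporters.pop(spider).finish_exporting()
        return self.files.pop(spider)  # returned here so the sketch is inspectable

    def process_item(self, item, spider):
        # Look up the exporter belonging to *this* spider.
        self.exporters[spider].export_item(item)
        return item
```

With two spiders open at once, each one's items now land in its own file rather than whichever exporter was created last.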

慵挽 2024-12-17 18:41:46


I copied your code from JsonExportPipeline and tested on my machine.
It works fine with my spider.

So I think you should check the page.

start_urls = ['http://www.example.com/path/to/desired/page']

Maybe your parse function has a problem extracting the content. That is this function:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    cart = CartItems()
    cart['url'] = hxs.select('//title/text()').extract()
    cart['name'] = hxs.select('//td/text()').extract()[1]
    cart['price'] = hxs.select('//td/text()').extract()[2]
    return cart
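One concrete way that function can fail silently: if //td/text() matches fewer than three text nodes on the live page, extract()[1] or extract()[2] raises IndexError, Scrapy logs the traceback, and no item ever reaches the pipeline, which would produce exactly the empty "[]". A small defensive helper makes that visible (the name nth_or_none is ours, not Scrapy's):

```python
def nth_or_none(values, index):
    """Return values[index], or None when the selector matched too few nodes."""
    return values[index] if len(values) > index else None

# In parse(), instead of indexing the raw extract() result directly:
#     tds = hxs.select('//td/text()').extract()
#     cart['name'] = nth_or_none(tds, 1)
#     cart['price'] = nth_or_none(tds, 2)
```

Items with None fields still reach the pipeline, so the JSON output tells you whether the spider or the XPath is at fault.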