Scrapy custom item exporter
I am defining an item exporter that pushes items to a message queue. Below is the code.
from scrapy.contrib.exporter import JsonLinesItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy import log
from scrapy.conf import settings
from carrot.connection import BrokerConnection, Exchange
from carrot.messaging import Publisher

log.start()

class QueueItemExporter(JsonLinesItemExporter):

    def __init__(self, **kwargs):
        log.msg("Initialising queue exporter", level=log.DEBUG)
        self._configure(kwargs)
        host_name = settings.get('BROKER_HOST', 'localhost')
        port = settings.get('BROKER_PORT', 5672)
        userid = settings.get('BROKER_USERID', "guest")
        password = settings.get('BROKER_PASSWORD', "guest")
        virtual_host = settings.get('BROKER_VIRTUAL_HOST', "/")
        self.encoder = settings.get('MESSAGE_Q_SERIALIZER', ScrapyJSONEncoder)(**kwargs)
        log.msg("Connecting to broker", level=log.DEBUG)
        self.q_connection = BrokerConnection(hostname=host_name, port=port,
                                             userid=userid, password=password,
                                             virtual_host=virtual_host)
        self.exchange = Exchange("scrapers", type="topic")
        log.msg("Connected", level=log.DEBUG)

    def start_exporting(self):
        spider_name = "test"
        log.msg("Initialising publisher", level=log.DEBUG)
        self.publisher = Publisher(connection=self.q_connection,
                                   exchange=self.exchange,
                                   routing_key="scrapy.spider.%s" % spider_name)
        log.msg("done", level=log.DEBUG)

    def finish_exporting(self):
        self.publisher.close()

    def export_item(self, item):
        log.msg("In export item", level=log.DEBUG)
        itemdict = dict(self._get_serialized_fields(item))
        self.publisher.send({"scraped_data": self.encoder.encode(itemdict)})
        log.msg("sent to queue - scrapy.spider.naukri", level=log.DEBUG)
I'm having a few problems. The items are not being submitted to the queue. I've added the following to my settings:
FEED_EXPORTERS = {
    "queue": 'scrapers.exporters.QueueItemExporter'
}
FEED_FORMAT = "queue"
LOG_STDOUT = True
The code does not raise any errors, and I can't see any of the log messages either. I'm at my wits' end on how to debug this.
Any help would be much appreciated.
3 Answers
"Feed Exporters" are quick (and somewhat dirty) shortcuts to invoke some "standard" item exporters. Instead of setting up a feed exporter from settings, hard-wire your custom item exporter into a custom pipeline, as explained at http://doc.scrapy.org/en/0.14/topics/exporters.html#using-item-exporters :
Tip: A good example to start from is "Writing items to MongoDB" in the official docs.

I've done a similar thing. I created a pipeline that places each item into an S3-like service (I'm using Minio here, but you get the idea). It creates a new bucket for each spider and places each item into an object with a random name. Full source code can be found in my repo.

Starting with the simple quotes spider from the tutorial: in settings.py, enable the pipeline, and in scrapy_quotes/pipelines.py, create a pipeline and an item exporter:
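This answer's original code blocks did not survive extraction. What follows is a minimal sketch of such a pipeline, assuming the minio client library and Scrapy's built-in JsonItemExporter; the class name, endpoint, and credentials are placeholders, not taken from the answer's repo:

# settings.py
ITEM_PIPELINES = {
    'scrapy_quotes.pipelines.S3ExportPipeline': 300,
}

# scrapy_quotes/pipelines.py
import uuid
from io import BytesIO

from minio import Minio
from scrapy.exporters import JsonItemExporter

class S3ExportPipeline(object):
    """Put each scraped item into a per-spider bucket under a random name."""

    def open_spider(self, spider):
        # Endpoint and credentials are illustrative defaults for a local Minio.
        self.client = Minio('localhost:9000',
                            access_key='minioadmin',
                            secret_key='minioadmin',
                            secure=False)
        # One bucket per spider, created on first run.
        if not self.client.bucket_exists(spider.name):
            self.client.make_bucket(spider.name)

    def process_item(self, item, spider):
        # Serialize the single item to JSON in memory...
        buf = BytesIO()
        exporter = JsonItemExporter(buf)
        exporter.start_exporting()
        exporter.export_item(item)
        exporter.finish_exporting()
        data = buf.getvalue()
        # ...and store it under a random object name.
        self.client.put_object(spider.name, '%s.json' % uuid.uuid4(),
                               BytesIO(data), len(data))
        return item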
I hit the same problem, although probably a few version updates ahead. The little detail that solved it for me was setting FEED_URI = "something" in settings.py. Without this, the entry in FEED_EXPORTERS wasn't respected at all.
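Put together with the question's settings, that looks something like this (the FEED_URI value is a placeholder, exactly as in the answer):

# settings.py
FEED_EXPORTERS = {
    "queue": 'scrapers.exporters.QueueItemExporter',
}
FEED_FORMAT = "queue"
FEED_URI = "something"  # any non-empty value; without it, FEED_EXPORTERS is ignored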