Modifying the CSV export in Scrapy

Posted on 2024-10-21 04:03:46

I seem to be missing something very simple. All I want to do is use ; as the
delimiter in the CSV exporter instead of ,.

I know the CSV exporter passes kwargs to the csv writer, but I can't
figure out how to pass it the delimiter.

I am calling my spider like so:

scrapy crawl spidername --set FEED_URI=output.csv --set FEED_FORMAT=csv

Comments (3)

凶凌 2024-10-28 04:03:46

In contrib/feedexport.py,

class FeedExporter(object):

    ...

    def open_spider(self, spider):
        file = TemporaryFile(prefix='feed-')
        exp = self._get_exporter(file)  # <-- this is where the exporter is instantiated
        exp.start_exporting()
        self.slots[spider] = SpiderSlot(file, exp)

    def _get_exporter(self, *a, **kw):
        return self.exporters[self.format](*a, **kw)  # <-- not passed in :(

You will need to write your own exporter; here's an example:

from scrapy.conf import settings
# In Scrapy 1.0 and later this class lives in scrapy.exporters.
from scrapy.contrib.exporter import CsvItemExporter


class CsvOptionRespectingItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
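
The subclass above works because the CSV exporter hands its leftover keyword arguments to csv.writer. The sketch below mimics that pattern with the standard library only, so it runs without Scrapy. MiniCsvExporter, SemicolonExporter and export_rows are hypothetical stand-ins for illustration, not Scrapy classes.

```python
import csv
import io


class MiniCsvExporter:
    """Hypothetical stand-in for an exporter that forwards kwargs to csv.writer."""

    def __init__(self, stream, **kwargs):
        # Any extra keyword argument (delimiter, quotechar, ...) reaches csv.writer.
        self.writer = csv.writer(stream, **kwargs)

    def export_row(self, row):
        self.writer.writerow(row)


class SemicolonExporter(MiniCsvExporter):
    """Injects the delimiter before delegating to the parent constructor."""

    def __init__(self, *args, **kwargs):
        kwargs['delimiter'] = ';'
        super().__init__(*args, **kwargs)


def export_rows(rows):
    """Export rows to an in-memory buffer and return the resulting text."""
    buffer = io.StringIO()
    exporter = SemicolonExporter(buffer, lineterminator='\n')
    for row in rows:
        exporter.export_row(row)
    return buffer.getvalue()
```

Calling export_rows([['name', 'price'], ['a,b', '1']]) produces semicolon-separated lines; the comma inside 'a,b' is left alone because it is no longer the delimiter.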

In the settings.py file of your crawler directory, add this:

FEED_EXPORTERS = {
    'csv': 'importable.path.to.CsvOptionRespectingItemExporter',
}

Now, you can execute your spider as follows:

scrapy crawl spidername --set FEED_URI=output.csv --set FEED_FORMAT=csv --set CSV_DELIMITER=';'

HTH.
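
On newer Scrapy releases a custom exporter class may not be needed at all. As far as I know, Scrapy 2.4 and later accept an item_export_kwargs entry in the FEEDS setting, and those keyword arguments are handed to the exporter. Please check this against the documentation for your installed version. A settings.py sketch under that assumption:

```python
# settings.py (assumes Scrapy >= 2.4, where FEEDS supports item_export_kwargs)
FEEDS = {
    'output.csv': {
        'format': 'csv',
        # Forwarded to the CSV item exporter, which passes it on to csv.writer.
        'item_export_kwargs': {'delimiter': ';'},
    },
}
```

With this in place, scrapy crawl spidername should write output.csv without any --set flags.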

李不 2024-10-28 04:03:46

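scraper/exporters.py:

from scrapy.exporters import CsvItemExporter
from scraper.settings import CSV_SEP  # read once, at import time


class CsvCustomSeperator(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Force UTF-8 output and the project-wide separator.
        kwargs['encoding'] = 'utf-8'
        kwargs['delimiter'] = CSV_SEP
        super(CsvCustomSeperator, self).__init__(*args, **kwargs)

scraper/settings.py:

CSV_SEP = '|'
FEED_EXPORTERS = {
    'csv': 'scraper.exporters.CsvCustomSeperator',
}

In the terminal:

$ scrapy crawl spider -o file.csv -s CSV_SEP=<delimiter>

Note that the exporter imports CSV_SEP from the settings module directly, so the value in settings.py is the one that is used; an -s override on the command line does not change the imported constant.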
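
A file written with a custom separator has to be read back with the same one. Here is a minimal standard-library check, assuming the '|' separator from settings.py above; read_rows and the sample text are made up for illustration.

```python
import csv
import io


def read_rows(text, delimiter='|'):
    """Parse delimited text into a list of rows using the given separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The comma inside the second row is ordinary data, not a separator.
sample = 'title|price\nWidget, large|9.99\n'
rows = read_rows(sample)
```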
标点 2024-10-28 04:03:46

I also tried the following, which works too:

Step 1: Change line 21 of C:\Python27\Lib\site-packages\scrapy\exporters.py to:

__all__ = ['BaseItemExporter', 'PprintItemExporter', 'PickleItemExporter',
           'CsvItemExporter', 'TxtItemExporter', 'XmlItemExporter',
           'JsonLinesItemExporter', 'JsonItemExporter', 'MarshalItemExporter']

This adds 'TxtItemExporter' to the original __all__ list.

Step 2: Add a new class named TxtItemExporter to C:\Python27\Lib\site-packages\scrapy\exporters.py:

class TxtItemExporter(BaseItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        self._configure(kwargs, dont_fail=True)
        if not self.encoding:
            self.encoding = 'utf-8'
        self.include_headers_line = include_headers_line
        self.stream = io.TextIOWrapper(
            file,
            line_buffering=False,
            write_through=True,
            encoding=self.encoding
        ) if six.PY3 else file
        self.csv_writer = csv.writer(self.stream, delimiter='\t', **kwargs)
        self._headers_not_written = True
        self._join_multivalued = join_multivalued

    def serialize_field(self, field, name, value):
        serializer = field.get('serializer', self._join_if_needed)
        return serializer(value)

    def _join_if_needed(self, value):
        if isinstance(value, (list, tuple)):
            try:
                return self._join_multivalued.join(value)
            except TypeError:  # list in value may not contain strings
                pass
        return value

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))
        self.csv_writer.writerow(values)

    def _build_row(self, values):
        for s in values:
            try:
                yield to_native_str(s, self.encoding)
            except TypeError:
                yield s

    def _write_headers_and_set_fields_to_export(self, item):
        if self.include_headers_line:
            if not self.fields_to_export:
                if isinstance(item, dict):
                    # for dicts try using fields of the first item
                    self.fields_to_export = list(item.keys())
                else:
                    # use fields declared in Item
                    self.fields_to_export = list(item.fields.keys())
            row = list(self._build_row(self.fields_to_export))
            self.csv_writer.writerow(row)

The new class is a copy of CsvItemExporter; the only change is passing delimiter='\t' to csv.writer().

Step 3: Add the following settings to settings.py:

FEED_EXPORTERS = {
    'txt': 'scrapy.contrib.exporter.TxtItemExporter',
}
FEED_FORMAT = 'txt'
FEED_URI = "your_output_file.txt"

Step 4: Run scrapy crawl your_spider; the output .txt file will then be in your spider project directory.
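
For reference, the two behaviours this exporter combines, joining multi-valued fields with ',' and writing tab-separated rows, can be reproduced with the standard library alone. This is only an illustration and not the Scrapy class; join_if_needed and to_tsv are hypothetical helper names.

```python
import csv
import io


def join_if_needed(value, sep=','):
    """Join list/tuple values into one string; leave other values untouched."""
    if isinstance(value, (list, tuple)):
        try:
            return sep.join(value)
        except TypeError:  # the sequence holds non-string items
            pass
    return value


def to_tsv(fields, items):
    """Write a header line plus one tab-separated row per dict item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    writer.writerow(fields)
    for item in items:
        # Missing fields are exported as empty strings.
        writer.writerow([join_if_needed(item.get(f, '')) for f in fields])
    return buffer.getvalue()
```

A list-valued field such as ['x', 'y'] ends up as the single cell x,y, which is safe here because the comma is not the column separator.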
