Python DictWriter: writing UTF-8 encoded CSV files

Posted 2024-11-04 08:25:50

I have a list of dictionaries containing unicode strings.
csv.DictWriter can write a list of dictionaries into a CSV file.
I want the CSV file to be encoded in UTF8.
The csv module cannot handle converting unicode strings into UTF8.

The csv module documentation has an example for converting everything to UTF8:

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

It also has a UnicodeWriter class.

But... how do I make DictWriter work with these? Wouldn't they have to inject themselves in the middle of it, to catch the disassembled dictionaries and encode them before it writes them to the file? I don't get it.

无所的.畏惧 2024-11-11 08:26:05

You can use some proxy class to encode dict values as needed, like this:

# -*- coding: utf-8 -*- 
import csv
d = {'a':123,'b':456, 'c':u'Non-ASCII: проверка'}

class DictUnicodeProxy(object):
    def __init__(self, d):
        self.d = d
    def __iter__(self):
        return self.d.__iter__()
    def get(self, item, default=None):
        i = self.d.get(item, default)
        if isinstance(i, unicode):
            return i.encode('utf-8')
        return i

with open('some.csv', 'wb') as f:
    writer = csv.DictWriter(f, ['a', 'b', 'c'])
    writer.writerow(DictUnicodeProxy(d))

睫毛溺水了 2024-11-11 08:26:05

The idea is to pass your content through utf_8_encoder when you hand it to csv.writer, because it gives you back UTF-8 encoded content.
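
As a rough sketch of that pipeline shape applied to dict rows, here is a Python 3 variant. It assumes Python 3, where csv works on text and the file wrapper does the encoding, so the generator only coerces values to text. The names `text_rows` and `write_utf8_csv` are hypothetical, not part of the csv module.

```python
import csv
import io


def text_rows(dict_rows):
    """Yield each row dict with every value coerced to text (hypothetical helper)."""
    for row in dict_rows:
        yield {key: str(value) for key, value in row.items()}


def write_utf8_csv(dict_rows, fieldnames):
    """Return the rows as UTF-8 encoded CSV bytes (hypothetical helper)."""
    raw = io.BytesIO()
    # The text wrapper does the encoding; newline="" leaves csv's "\r\n" untouched.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    # Rows pass through the generator on their way into the writer.
    writer.writerows(text_rows(dict_rows))
    text.flush()
    return raw.getvalue()


csv_bytes = write_utf8_csv([{"a": 123, "c": "проверка"}], ["a", "c"])
```

The generator sits between the data and the writer, which is the same position utf_8_encoder occupies in the Python 2 documentation example.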

陪你搞怪i 2024-11-11 08:26:05

My solution is a bit different. While the solutions above focus on making the dict Unicode-compatible, mine makes DictWriter itself Unicode-compatible. This approach is even suggested in the Python docs (1).

The UTF8Recoder, UnicodeReader, and UnicodeWriter classes are taken from the Python docs. UnicodeWriter.writerow was changed slightly as well.

Use it like a regular DictWriter/DictReader.

Here is the code:

import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

class UnicodeDictWriter(csv.DictWriter, object):
    def __init__(self, f, fieldnames, restval="", extrasaction="raise", dialect="excel", *args, **kwds):
        # Forward the caller's arguments instead of hardcoding the defaults
        super(UnicodeDictWriter, self).__init__(f, fieldnames, restval, extrasaction, dialect, *args, **kwds)
        self.writer = UnicodeWriter(f, dialect, **kwds)

空宴 2024-11-11 08:26:04

You can convert the values to UTF-8 on the fly as you pass the dict to DictWriter.writerow(). For example:

import csv

rows = [
    {'name': u'Anton\xedn Dvo\u0159\xe1k','country': u'\u010cesko'},
    {'name': u'Bj\xf6rk Gu\xf0mundsd\xf3ttir', 'country': u'\xcdsland'},
    {'name': u'S\xf8ren Kierkeg\xe5rd', 'country': u'Danmark'}
    ]

# implement this wrapper on 2.6 or lower if you need to output a header
class DictWriterEx(csv.DictWriter):
    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)

out = open('foo.csv', 'wb')
writer = DictWriterEx(out, fieldnames=['name','country'])
# DictWriter.writeheader() was added in 2.7 (use class above for <= 2.6)
writer.writeheader()
for row in rows:
    writer.writerow(dict((k, v.encode('utf-8')) for k, v in row.iteritems()))
out.close()

Output foo.csv:

name,country
Antonín Dvořák,Česko
Björk Guðmundsdóttir,Ísland
Søren Kierkegård,Danmark
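
One caveat with the loop above: it calls .encode on every value, so a row containing a non-string value such as an integer would raise AttributeError. Below is a minimal sketch of a guard, assuming the Python 2 workflow shown above, where byte strings are handed to csv. `encode_text_values` is a hypothetical helper name. In Python 3 this step is unnecessary, and passing bytes to csv would not produce the intended text.

```python
TEXT_TYPE = type(u"")  # `unicode` on Python 2, `str` on Python 3


def encode_text_values(row, encoding="utf-8"):
    """Return a copy of `row` with text values encoded; other values untouched."""
    return dict(
        (key, value.encode(encoding) if isinstance(value, TEXT_TYPE) else value)
        for key, value in row.items()
    )


encoded = encode_text_values({"name": u"Bj\xf6rk", "albums": 9})
```

It would slot into the loop as writer.writerow(encode_text_values(row)).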

青芜 2024-11-11 08:26:03

UPDATE: The third-party unicodecsv module implements this seven-year-old answer for you; an example follows the code below. There's also a Python 3 solution that doesn't require a third-party module.

Original Python 2 Answer

If using Python 2.7 or later, use a dict comprehension to remap the dictionary to utf-8 before passing to DictWriter:

# coding: utf-8
import csv

D = {'name': u'马克', 'pinyin': u'mǎkè'}

f = open('out.csv', 'wb')
f.write(u'\ufeff'.encode('utf8'))  # BOM (optional...Excel needs it to open UTF-8 file properly)
w = csv.DictWriter(f, sorted(D.keys()))
w.writeheader()
w.writerow({k:v.encode('utf8') for k, v in D.items()})
f.close()

You can use this idea to update UnicodeWriter to DictUnicodeWriter:

# coding: utf-8
import csv
import cStringIO
import codecs

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, D):
        self.writer.writerow({k:v.encode("utf-8") for k, v in D.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for D in rows:
            self.writerow(D)

    def writeheader(self):
        self.writer.writeheader()

D1 = {'name': u'马克', 'pinyin': u'Mǎkè'}
D2 = {'name': u'美国', 'pinyin': u'Měiguó'}
f = open('out.csv', 'wb')
f.write(u'\ufeff'.encode('utf8'))  # BOM (optional...Excel needs it to open UTF-8 file properly)
w = DictUnicodeWriter(f, sorted(D1.keys()))  # D1 supplies the field names; D is not defined in this snippet
w.writeheader()
w.writerows([D1, D2])
f.close()
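
For comparison, the same queue-and-re-encode mechanism can be sketched for Python 3. This is an illustration under stated assumptions rather than a recommended approach, since opening the file with an encoding argument normally makes it unnecessary. `QueueDictWriter` is a hypothetical name. The sketch writes the header through to the stream immediately, and it rewinds the queue before truncating because io.StringIO.truncate does not move the stream position.

```python
import codecs
import csv
import io


class QueueDictWriter:
    """Hypothetical Python 3 analogue: buffer each row as text, then re-encode it."""

    def __init__(self, stream, fieldnames, encoding="utf-8", **kwds):
        self.queue = io.StringIO()  # csv writes text rows here first
        self.writer = csv.DictWriter(self.queue, fieldnames, **kwds)
        self.stream = stream  # binary target stream
        self.encoder = codecs.getincrementalencoder(encoding)()

    def _drain(self):
        # Encode whatever csv queued and pass it on to the target stream.
        self.stream.write(self.encoder.encode(self.queue.getvalue()))
        # io.StringIO.truncate() keeps the position, so rewind first.
        self.queue.seek(0)
        self.queue.truncate(0)

    def writeheader(self):
        self.writer.writeheader()
        self._drain()

    def writerow(self, row):
        self.writer.writerow(row)
        self._drain()

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


target = io.BytesIO()
w = QueueDictWriter(target, ["name", "pinyin"])
w.writeheader()
w.writerows([
    {"name": "马克", "pinyin": "Mǎkè"},
    {"name": "美国", "pinyin": "Měiguó"},
])
```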

Python 2 unicodecsv Example:

# coding: utf-8
import unicodecsv as csv

D = {u'name': u'马克', u'pinyin': u'mǎkè'}

with open('out.csv','wb') as f:
    w = csv.DictWriter(f, fieldnames=sorted(D.keys()), encoding='utf-8-sig')
    w.writeheader()
    w.writerow(D)

Python 3:

Additionally, Python 3's built-in csv module supports Unicode natively:

import csv

D = {'name': '马克', 'pinyin': 'mǎkè'}

# Use 'w' and newline='' instead of 'wb' in Python 3.
# Use 'utf-8-sig' for UTF-8 w/ BOM for Excel to read as UTF-8 properly.
# Use 'utf8' for UTF-8 (no BOM) otherwise.
with open('out.csv', 'w', encoding='utf-8-sig', newline='') as f: 
    w = csv.DictWriter(f, fieldnames=sorted(D))
    w.writeheader()
    w.writerow(D)
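
As a quick check of the Python 3 approach, the sketch below writes a file with 'utf-8-sig', confirms that the BOM bytes are present, and reads the rows back with csv.DictReader. The temporary directory and the variable names are illustrative assumptions.

```python
import codecs
import csv
import os
import tempfile

rows = [{'name': '马克', 'pinyin': 'mǎkè'}]
path = os.path.join(tempfile.mkdtemp(), 'out.csv')

# Write with a BOM so that Excel detects UTF-8.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=['name', 'pinyin'])
    w.writeheader()
    w.writerows(rows)

# The raw file starts with the three UTF-8 BOM bytes.
with open(path, 'rb') as f:
    has_bom = f.read(3) == codecs.BOM_UTF8

# Reading with 'utf-8-sig' strips the BOM, so the first field name stays clean.
with open(path, encoding='utf-8-sig', newline='') as f:
    round_trip = [dict(row) for row in csv.DictReader(f)]
```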
