带有 UTF-8 数据的 Python CSV DictReader

发布于 2024-10-17 15:02:39 字数 282 浏览 1 评论 0原文

AFAIK,Python (v2.6) csv 模块默认无法处理 unicode 数据,对吗?在 Python 文档中,有一个关于如何读取 UTF-8 编码文件的示例。但此示例仅将 CSV 行作为列表返回。 我想按名称访问行列,因为它是由 csv.DictReader 完成的,但使用 UTF-8 编码的 CSV 输入文件。

谁能告诉我如何有效地做到这一点?我必须处理大小为 100 MB 的 CSV 文件。

AFAIK, the Python (v2.6) csv module can't handle unicode data by default, correct? In the Python docs there's an example on how to read from a UTF-8 encoded file. But this example only returns the CSV rows as a list.
I'd like to access the row columns by name as it is done by csv.DictReader but with UTF-8 encoded CSV input file.

Can anyone tell me how to do this in an efficient way? I will have to process CSV files in 100's of MByte in size.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

丘比特射中我 2024-10-24 15:02:39

我自己想出了一个答案:

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'):unicode(value, 'utf-8') for key, value in row.iteritems()}

注意:这已更新,因此密钥将根据评论中的建议进行解码

I came up with an answer myself:

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'):unicode(value, 'utf-8') for key, value in row.iteritems()}

Note: This has been updated so keys are decoded per the suggestion in the comments

迷迭香的记忆 2024-10-24 15:02:39

对我来说,关键不在于操作 csv DictReader 参数,而是文件打开器本身。这成功了:

with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)

不需要特殊的课程。现在我可以打开带或不带 BOM 的文件而不会崩溃。

For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:

with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)

No special class required. Now I can open files either with or without BOM without crashing.

别想她 2024-10-24 15:02:39

@LMatter 答案的基于类的方法,通过这种方法,您仍然可以获得 DictReader 的所有好处,例如获取字段名和获取行号,再加上它处理 UTF-8

import csv

class UnicodeDictReader(csv.DictReader, object):

    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}

A classed based approach to @LMatter answer, with this approach you still get all the benefits of DictReader such as getting the fieldnames and getting the line number plus it handles UTF-8

import csv

class UnicodeDictReader(csv.DictReader, object):

    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
春花秋月 2024-10-24 15:02:39

首先,使用2.6版本的文档。每个版本都可能会发生变化。它明确表示它不支持 Unicode,但支持 UTF-8。 从技术上讲,这些不是一回事。正如文档所说:

csv模块不直接支持读写Unicode,但是它是8位干净保存的,可以解决ASCII NUL字符的一些问题。因此,只要避免像 UTF-16 这样使用 NUL 的编码,您就可以编写函数或类来处理编码和解码。建议使用 UTF-8。

下面的示例(来自文档)展示了如何创建两个函数,将 UTF-8 文本正确读取为 CSV。您应该知道 csv.reader() 始终返回 DictReader 对象。

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

First of all, use the 2.6 version of the documentation. It can change for each release. It says clearly that it doesn't support Unicode but it does support UTF-8. Technically, these are not the same thing. As the docs say:

The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.

The example below (from the docs) shows how to create two functions that correctly read text as UTF-8 as CSV. You should know that csv.reader() always returns a DictReader object.

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]
深爱成瘾 2024-10-24 15:02:39

使用 unicodecsv 包可以轻松实现这一点。

# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)

That's easy with the unicodecsv package.

# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
极度宠爱 2024-10-24 15:02:39

csvw还有其他功能(用于元数据丰富的 CSV对于 Web),但它定义了一个 UnicodeDictReader 类来包裹它的 UnicodeReader 类,它的核心正是这样做的:

class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]

它确实让我失望了几次,但是 < code>csvw.UnicodeDictReader 真的,真的需要在 with 块中使用,否则会中断。除此之外,该模块非常通用并且与 py2 和 py3 兼容。

The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:

class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]

It did catch me off a few times, but csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.

尛丟丟 2024-10-24 15:02:39

答案没有 DictWriter 方法,所以这里是更新的类:

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames    # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)

The answer doesn't have the DictWriter methods, so here is the updated class:

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames    # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文