Python module like csv DictReader with full UTF-8 support

Posted on 2024-10-27 18:23:08

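I need to import data from a CSV file in my project, and I need an object like DictReader but with full UTF-8 support. Does anyone know of a module or app that provides this?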

夜唯美灬不弃 2024-11-03 18:23:09

Your data is NOT encoded in UTF-8. It is (mostly) encoded in cp1252. The data appears to include Spanish names. The most prevalent non-ASCII character is '\xd1' (i.e. Latin capital letter N with tilde) -- this is the character that caused the exception.

One of the non-ASCII characters in the file is '\x8d'. It is NOT in cp1252. It appears where the letter A should appear in the name VASQUEZ. Of the others, '\x94' (curly double quote in cp1252) appears in the middle of a name. The remaining ones may also represent errors.

I suggest that you run this little code fragment to print lines with suspicious characters in them:

# Python 2: print every line that contains one of the suspicious bytes.
for lino, line in enumerate(open('sampleresults.csv')):
    if any(c in line for c in '\x8d\x94\xc1\xcf\xd3'):
        print "%d %r\n" % (lino+1, line)

and fix up the data.
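
For readers on Python 3 (an assumption; this answer targets Python 2.7), the same check can be sketched on raw bytes. The helper names `find_suspicious_lines` and `undecodable_bytes` and the sample rows are made up for illustration. The second helper only makes sense for single-byte encodings such as cp1252, because it tests each byte value on its own.

```python
def find_suspicious_lines(raw, suspects=b"\x8d\x94\xc1\xcf\xd3"):
    """Return (line_number, line_bytes) pairs for lines containing a suspect byte."""
    hits = []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        # Iterating over a bytes object yields integers; `int in bytes` is a byte search.
        if any(byte in line for byte in suspects):
            hits.append((lineno, line))
    return hits


def undecodable_bytes(raw, encoding="cp1252"):
    """Return the sorted byte values in `raw` that `encoding` cannot decode.

    Only meaningful for single-byte encodings: each byte is decoded in isolation.
    """
    bad = set()
    for value in set(raw):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.add(value)
    return sorted(bad)


# Hypothetical sample data, not taken from the real file.
sample = b"NOMBRE,APELLIDO\nJUAN,MU\xd1OZ\nANA,V\x8dSQUEZ\n"
for lineno, line in find_suspicious_lines(sample):
    print(lineno, line)
print([hex(value) for value in undecodable_bytes(sample)])
```

To run it against the real file, read the bytes first, for example with `open('sampleresults.csv', 'rb').read()`.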

Then you need a csv DictReader with full and generalised decoding support. Full means decoding the fieldnames aka dict keys as well as the data. Generalised means no hardcoding of the encoding.

import csv

def UnicodeDictReader(str_data, encoding, **kwargs):
    csv_reader = csv.DictReader(str_data, **kwargs)
    # Decode the keys once
    keymap = dict((k, k.decode(encoding)) for k in csv_reader.fieldnames)
    for row in csv_reader:
        yield dict((keymap[k], v.decode(encoding)) for k, v in row.iteritems())

dozedata = ['\xd1,\xff', '\xd2,\xfe', '3,4']
print list(UnicodeDictReader(dozedata, 'cp1252'))

Output:

[{u'\xd1': u'\xd2', u'\xff': u'\xfe'}, {u'\xd1': u'3', u'\xff': u'4'}]

and here is what you get with your sample file (first data row only, Python 2.7.1, Windows 7):

>>> import csv
>>> from pprint import pprint as pp
>>> def UnicodeDictReader(str_data, encoding, **kwargs):
...     csv_reader = csv.DictReader(str_data, **kwargs)
...     # Decode the keys once
...     keymap = dict((k, k.decode(encoding)) for k in csv_reader.fieldnames)
...     for row in csv_reader:
...         yield dict((keymap[k], v.decode(encoding)) for k, v in row.iteritems())
...
>>> f = open('sampleresults.csv', 'rb')
>>> drdr = UnicodeDictReader(f, 'cp1252')
>>> pp(drdr.next())
{u'APELLIDO': u'=== family names redacted ===',
 u'CATEGORIA': u'ABIERTA',
 u'CEDULA': u'10000640',
 u'DELAY': u' 0:20',
 u'EDAD': u'25',
 u'EMAIL': u'mimail640',
 u'NO.': u'640',
 u'NOMBRE': u'=== given names redacted ===',
 u'POSICION CATEGORIA': u'1',
 u'POSICION CATEGORIA EN KM.5': u'11',
 u'POSICION GENERAL CHIP': u'1',
 u'POSICION GENERAL EN KM.5': u'34',
 u'POSICION GENERAL GUN': u'1',
 u'POSICION GENERO': u'1',
 u'PRIMEROS 5KM.': u'0:32:55',
 u'PROMEDIO/KM.': u' 5:44',
 u'SEGUNDOS KM.': u'0:24:05',
 u'SEX': u'M',
 u'TIEMPO CHIP': u'0:56:59',
 u'TIEMPO GUN': u'0:57:19'}
>>>
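
On Python 3 (again an assumption; the session above is Python 2.7.1), `csv.DictReader` works on text, so decoding can happen in the stream itself and the field names and values come out as `str` without a key map. A minimal sketch, where `decoded_dict_reader` is a hypothetical name and the sample bytes mirror `dozedata` above:

```python
import csv
import io


def decoded_dict_reader(byte_stream, encoding, **kwargs):
    """Yield one dict per CSV row, with keys and values decoded from `encoding`."""
    # newline="" leaves line endings untouched so the csv module can handle
    # newlines embedded in quoted fields itself.
    # Note: the wrapper closes byte_stream when the wrapper itself is closed
    # or garbage-collected.
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding, newline="")
    yield from csv.DictReader(text_stream, **kwargs)


# Same bytes as the `dozedata` example, presented as a binary stream.
sample = io.BytesIO(b"\xd1,\xff\r\n\xd2,\xfe\r\n3,4\r\n")
rows = list(decoded_dict_reader(sample, "cp1252"))
print(rows)
```

For a file on disk, passing `open('sampleresults.csv', encoding='cp1252', newline='')` straight to `csv.DictReader` should give the same result; the wrapper is only needed when the input is already an open binary stream.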

白鸥掠海 2024-11-03 18:23:09

As the answer to this post said:

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield dict([(key, unicode(value, 'utf-8')) for key, value in row.iteritems()])

You can see my example code below. I'm using your csv file (see comments).

import csv

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield dict([(key, unicode(value, 'utf-8')) for key, value in row.iteritems()])

f = open('sampleresults.csv', 'r')
a = UnicodeDictReader(f)
for i in a:
    if i['NOMBRE'] == 'GUIDO ALEJANDRO':
        print i['APELLIDO']

Output: