A resilient, actually-working CSV implementation for non-ASCII?



[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files, you are king (or queen).

[Update 2] Thanks for the excellent answers and discussion. What I need to do with these is to read them in, parse them, and save parts of them in Django model instances. I believe that means converting them from their native encoding to unicode so Django can deal with them, right?
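
For what it's worth, a hypothetical sketch of that last step. "BankRow" and its field names are invented for illustration, and the rows are assumed to be lists of already-decoded unicode cells:

# Hypothetical sketch (Python 2 / Django): "BankRow" and its fields are
# made-up names. Once each cell is unicode, saving is routine.
for row in unicode_rows:  # each row is a list of unicode cells
    BankRow.objects.create(date=row[0], description=row[1], amount=row[3])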

There are several questions on Stack Overflow already on the subject of non-ASCII Python CSV reading, but the solutions shown there and in the Python documentation don't work for the input files I'm trying to read.

The gist of the solution seems to be to encode('utf-8') the input to the CSV reader and unicode(item, 'utf-8') the output of the reader. However, this runs into UnicodeDecodeError issues (see above questions):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected

The input file is not necessarily in utf8; it can be ISO-8859-1, cp1251, or just about anything else.

So, the question: what's a resilient, cross-encoding capable way to read CSV files in Python?

The root of the issue seems to be that the csv module is a C extension; is there a pure-Python CSV reading module?

If not, is there a way to confidently detect the encoding of the input file so that it can be processed?

Basically I'm looking for a bulletproof way to read (and hopefully write) CSV files in any encoding.

Here are two sample files: European, Russian.

And here's the recommended solution failing:

Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
...     # csv.py doesn't do Unicode; encode temporarily as UTF-8:
...     csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
...                             dialect=dialect, **kwargs)
...     for row in csv_reader:
...         # decode UTF-8 back to Unicode, cell by cell:
...         yield [unicode(cell, 'utf-8') for cell in row]
...
>>> def utf_8_encoder(unicode_csv_data):
...     for line in unicode_csv_data:
...         yield line.encode('utf-8')
...
>>> r = unicode_csv_reader(file('sample-euro.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in unicode_csv_reader
  File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 14: ordinal not in range(128)
>>> r = unicode_csv_reader(file('sample-russian.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in unicode_csv_reader
  File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 28: ordinal not in range(128)


花落人断肠 2024-10-24 14:47:59


You are attempting to apply a solution to a different problem. Note this:

def utf_8_encoder(unicode_csv_data)

You are feeding it str objects.

The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the encoding (and it's an ASCII-compatible encoding, e.g. cp125x, any East Asian encoding, or UTF-8; not UTF-16 and not UTF-32) and the delimiter, this will work:

with open("foo.csv", "rb") as f:  # csv.reader needs a file object, not a filename
    for row in csv.reader(f, delimiter=known_delimiter):
        row = [item.decode(encoding) for item in row]

Your sample-euro.csv looks like cp1252 with a comma delimiter. The Russian one looks like cp1251 with a semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used, and maybe the currency as well -- the Russian sample has money amounts followed by a space and the Cyrillic abbreviation for "roubles".

Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.
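
A quick way to see the difference in a Python 2 shell (an illustration, not from the original answer): bytes in the 0x80-0x9F range are printable characters in cp1252 but invisible control characters in ISO-8859-1:

>>> '\x80'.decode('cp1252')      # Euro sign in cp1252
u'\u20ac'
>>> '\x80'.decode('iso-8859-1')  # a C1 control character in ISO-8859-1
u'\x80'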

Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""

You must know the encoding for ANY file-reading exercise to work.

Guessing the encoding correctly all the time, for any encoding in any size of file, is not very difficult -- it's impossible. However, if you restrict the scope to CSV files of reasonable size, saved out of Excel or OpenOffice in the default encoding of the user's locale, it's not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.

Update 2 in response to """working code would be most welcome"""

Working code (Python 2.x):

from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

# Exercise for the reader: replace the above with a class

import csv    
import sys
from pprint import pprint

pathname = sys.argv[1]
delim = sys.argv[2] # allegedly known
print "delim=%r pathname=%r" % (delim, pathname)

with open(pathname, 'rb') as f:
    cd_result = charset_detect(f)
    encoding = cd_result['encoding']
    confidence = cd_result['confidence']
    print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
    # insert actions contingent on encoding and confidence here
    f.seek(0)
    csv_reader = csv.reader(f, delimiter=delim)
    for bytes_row in csv_reader:
        unicode_row = [x.decode(encoding) for x in bytes_row]
        pprint(unicode_row)

Output 1:

delim=',' pathname='sample-euro.csv'
chardet: encoding=windows-1252 confidence=0.500
[u'31-01-11',
 u'Overf\xf8rsel utland',
 u'UTLBET; ID 9710032001647082',
 u'1990.00',
 u'']
[u'31-01-11',
 u'Overf\xf8ring',
 u'OVERF\xd8RING MELLOM EGNE KONTI',
 u'5750.00',
 u';']

Output 2:

delim=';' pathname='sample-russian.csv'
chardet: encoding=windows-1251 confidence=0.602
[u'-',
 u'04.02.2011 23:20',
 u'300,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041c\u0422\u0421',
 u'']
[u'-',
 u'04.02.2011 23:15',
 u'450,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
 u'']
[u'-',
 u'13.01.2011 02:05',
 u'100,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041c\u0422\u0421 kolombina',
 u'']

Update 3 What is the source of these files? If they are being "saved as CSV" from Excel or OpenOffice Calc or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and use xlrd to read them. This would also save the hassles of having to inspect each csv file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
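
A minimal sketch of that route, assuming the files have been re-saved as .xls (the file name is illustrative):

import xlrd

# Sketch: read the first sheet of an .xls workbook. xlrd returns text
# cells as unicode, so there is no encoding guesswork at all.
book = xlrd.open_workbook('sample-euro.xls')  # hypothetical re-saved file
sheet = book.sheet_by_index(0)
for rowx in range(sheet.nrows):
    print sheet.row_values(rowx)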

帅的被狗咬 2024-10-24 14:47:59


I don't know if you've already tried this, but in the examples section of the official Python documentation for the csv module you'll find a pair of classes, UnicodeReader and UnicodeWriter. They have worked fine for me so far.
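
For reference, here is the UnicodeReader recipe from the Python 2 csv docs, lightly condensed (the docs also include the matching UnicodeWriter). Note that it still has to be told the encoding; it does not guess it:

import csv, codecs

class UTF8Recoder:
    """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode('utf-8')

class UnicodeReader:
    """CSV reader for a file in the given encoding; yields rows of unicode cells."""
    def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
        self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)
    def next(self):
        return [unicode(s, 'utf-8') for s in self.reader.next()]
    def __iter__(self):
        return self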

Correctly detecting the encoding of a file seems to be a very hard problem. You can read the discussion in this StackOverflow thread.

悸初 2024-10-24 14:47:59


You are doing the wrong thing in your code by trying to .encode('utf-8'); you should be decoding it instead. And by the way, unicode(bytestr, 'utf-8') == bytestr.decode('utf-8').

But most importantly, WHY are you trying to decode the strings?

Sounds a bit absurd, but you can actually work with those CSVs without caring whether they are cp1251, cp1252 or UTF-8. The beauty of it all is that the regional characters are >0x7F, and UTF-8, too, uses sequences of >0x7F bytes to represent non-ASCII symbols.

Since the separators CSV cares about (be it , or ; or \n) are all within ASCII, its parsing won't be affected by the encoding used (as long as it is a single-byte encoding or UTF-8!).

The important thing to note is that you should give the Python 2.x csv module files opened in binary mode - that is, either 'rb' or 'wb' - because of the peculiar way it was implemented.
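
A minimal sketch of that byte-level approach (the file name, delimiter and final encoding are assumptions for illustration):

import csv

# Sketch: open in binary mode as recommended, parse the raw byte
# strings, and defer decoding until the encoding is actually known.
with open('sample-russian.csv', 'rb') as f:       # assumed file name
    for raw_row in csv.reader(f, delimiter=';'):  # assumed delimiter
        # Cells are byte strings here; ASCII-only fields (dates, amounts)
        # are already usable, and text fields can be decoded at the end.
        row = [cell.decode('cp1251') for cell in raw_row]  # assumed encoding
        print row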

聆听风音 2024-10-24 14:47:59


What you are asking is impossible. There is no way to write a program, in any language, that will accept input in an unknown encoding and correctly convert it to an internal Unicode representation.

You have to find a way to tell the application which encoding to use.

It is possible to recognize many, but not all, encodings with a detector like chardet, but it really depends on what the content of the files is and whether there are enough data points. This is similar to the issue of correctly decoding filenames on network servers. When a file is created on a network server, there is no way to tell the server what encoding is used, so if you have a folder with names in multiple encodings, they are guaranteed to look odd to some, if not all, users, and different files will look odd to different users.

However, don't give up. Try the chardet encoding detector mentioned in this question: https://serverfault.com/questions/82821/how-to-tell-the-language-encoding-of-a-filename-on-linux and if you are lucky, you won't get many failures.
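
A minimal chardet usage sketch, using its convenience detect() helper (the file name is assumed; the result shown is the one reported for the euro sample earlier in this thread):

import chardet

# Sketch: read the raw bytes and let chardet guess the encoding.
raw = open('sample-euro.csv', 'rb').read()  # assumed file name
guess = chardet.detect(raw)
print guess  # e.g. {'encoding': 'windows-1252', 'confidence': 0.5}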
