Python CSV DictReader with UTF-8 data
AFAIK, the Python (v2.6) csv module can't handle unicode data by default, correct? In the Python docs there's an example of how to read from a UTF-8 encoded file, but that example only returns the CSV rows as lists.
I'd like to access the row columns by name, as csv.DictReader does, but with a UTF-8 encoded CSV input file.
Can anyone tell me how to do this in an efficient way? I will have to process CSV files that are hundreds of megabytes in size.
Comments (7)
I came up with an answer myself:
Note: This has been updated so keys are decoded per the suggestion in the comments
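The snippet this answer refers to did not survive on this page. As a rough sketch of the idea it describes (wrap csv.DictReader in a generator and decode both keys and values), something like the following could work. The names `unicode_dict_reader` and `_to_text` are made up for illustration, and the `isinstance(value, bytes)` check is an assumption chosen so the same code runs on Python 2 (where csv yields byte strings) and Python 3 (where it already yields text).

```python
import csv


def _to_text(value, encoding):
    # Hypothetical helper: Python 2's csv module yields byte strings,
    # Python 3's yields text, so only decode when bytes are seen.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def unicode_dict_reader(utf8_lines, encoding="utf-8", **kwargs):
    """Yield one dict per CSV row with decoded keys and values.

    utf8_lines is any iterable of lines: a file opened in 'rb' mode on
    Python 2, or a text-mode file opened with encoding='utf-8' on Python 3.
    """
    reader = csv.DictReader(utf8_lines, **kwargs)
    for row in reader:
        yield dict(
            (_to_text(key, encoding), _to_text(value, encoding))
            for key, value in row.items()
        )
```

Because this is a generator, rows are decoded one at a time as they are consumed, so the whole file is never held in memory. That matters for inputs of several hundred megabytes.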
For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:
No special class required. Now I can open files either with or without BOM without crashing.
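The original snippet is missing here, so the following is only a sketch of the approach described: let the file opener do the decoding. It assumes Python 3, and the function name `read_dict_rows` is invented for the example. The `utf-8-sig` codec strips a leading BOM if one is present and otherwise behaves like plain `utf-8`.

```python
import csv
import io


def read_dict_rows(path):
    """Yield dict rows from a UTF-8 CSV file, with or without a BOM."""
    # newline="" is what the csv docs recommend so that embedded
    # newlines inside quoted fields are handled by the csv module.
    with io.open(path, "r", encoding="utf-8-sig", newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            yield row
```

A caller would iterate it like any other generator, for example `for row in read_dict_rows("data.csv"): print(row["name"])`, where the file name and column name are placeholders.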
A class-based approach to @LMatter's answer. With this approach you still get all the benefits of DictReader, such as getting the fieldnames and the line number, plus it handles UTF-8:
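The class from this answer is not preserved on this page. A minimal sketch of what a class-based variant might look like is a csv.DictReader subclass that decodes each row as it is returned. This is an illustration, not the answerer's code. It defines both `__next__` and `next` so the same class works under the Python 3 and Python 2 iterator protocols. One limitation of the sketch: on Python 2 the inherited `fieldnames` attribute would still hold undecoded byte strings.

```python
import csv


class UnicodeDictReader(csv.DictReader):
    """csv.DictReader that returns rows with decoded keys and values."""

    def __init__(self, f, encoding="utf-8", **kwargs):
        csv.DictReader.__init__(self, f, **kwargs)
        self.encoding = encoding

    def _to_text(self, value):
        # Only byte strings (Python 2 csv output) need decoding.
        if isinstance(value, bytes):
            return value.decode(self.encoding)
        return value

    def _decode_row(self, row):
        return dict(
            (self._to_text(k), self._to_text(v)) for k, v in row.items()
        )

    def __next__(self):  # Python 3 iterator protocol
        return self._decode_row(csv.DictReader.__next__(self))

    def next(self):  # Python 2 iterator protocol
        return self._decode_row(csv.DictReader.next(self))
```

Since it subclasses DictReader, attributes such as `fieldnames` and `line_num` remain available on the reader object while iterating.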
First of all, use the 2.6 version of the documentation, since it can change with each release. It says clearly that the module doesn't support Unicode but does support UTF-8. Technically, these are not the same thing. As the docs say:
The example in the docs shows how to create two functions that correctly read UTF-8 text as CSV. Note that csv.reader() always returns each row as a list; csv.DictReader is the separate class that returns each row as a dict keyed by field name.
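The docs example itself is not reproduced here. Since any list-yielding reader (including one built the way the docs describe) only gives positional access, one way to get access by column name is to pair each row with the header. The sketch below assumes the first row is the header; `rows_as_dicts` is a hypothetical name.

```python
import csv


def rows_as_dicts(row_iter):
    """Turn an iterable of list rows into dicts keyed by the header row."""
    row_iter = iter(row_iter)
    try:
        header = next(row_iter)  # assumption: first row holds column names
    except StopIteration:
        return  # empty input, nothing to yield
    for row in row_iter:
        # zip() silently truncates if a row is shorter or longer than
        # the header; csv.DictReader handles that case via restkey/restval.
        yield dict(zip(header, row))
```

This is essentially what csv.DictReader does internally, minus its handling of missing and extra fields, so DictReader is usually the better choice when it can be used directly.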
That's easy with the unicodecsv package.
The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:
It did catch me out a few times, but csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.
The answer doesn't have the DictWriter methods, so here is the updated class:
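The updated class is missing from this page, and it is not clear which earlier answer it extended. As a hedged sketch of what a UTF-8 aware writer counterpart could look like, the class below subclasses csv.DictWriter and encodes text to bytes only where the csv module needs bytes (Python 2). The name `UnicodeDictWriter` and the `NEEDS_BYTES` / `TEXT` helpers are assumptions for illustration. On Python 3 the encoding is expected to come from the file object, for example `open(path, "w", encoding="utf-8", newline="")`.

```python
import csv

TEXT = type(u"")            # unicode on Python 2, str on Python 3
NEEDS_BYTES = str is bytes  # True only on Python 2


class UnicodeDictWriter(csv.DictWriter):
    """csv.DictWriter that accepts text keys and values on Python 2 and 3."""

    def __init__(self, f, fieldnames, encoding="utf-8", **kwargs):
        self.encoding = encoding
        # Field names are encoded too, so they match the encoded row keys.
        names = [self._encode(name) for name in fieldnames]
        csv.DictWriter.__init__(self, f, names, **kwargs)

    def _encode(self, value):
        # Python 2's csv writer wants byte strings; Python 3's wants text.
        if NEEDS_BYTES and isinstance(value, TEXT):
            return value.encode(self.encoding)
        return value

    def writerow(self, rowdict):
        encoded = dict(
            (self._encode(k), self._encode(v)) for k, v in rowdict.items()
        )
        return csv.DictWriter.writerow(self, encoded)

    def writerows(self, rowdicts):
        # Route every row through writerow() so each one gets encoded.
        for rowdict in rowdicts:
            self.writerow(rowdict)
```

On Python 2 the target file would be opened in 'wb' mode instead, so the encoded bytes are written unchanged.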