Disadvantages of using encode('utf-8') on strings read from Excel in Python

Posted on 2024-12-09 16:55:21


I am reading a large amount of data from an Excel spreadsheet, which I read (and reformat and rewrite) using the following general structure:

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
     z = i + 1
     toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
     out.write(toprint)
     out.write("\n")

where x and y are arbitrary column indices in this case, with x being less arbitrary: that column contains non-ASCII (UTF-8) characters.

So far I have only been using .encode('utf-8') on cells where I know there would otherwise be errors, or where I foresee an error without it.

My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells, even where it is unnecessary? Efficiency is not an issue. The main concern is that the code keeps working even if a UTF-8 character turns up in a place where there shouldn't be one. If no errors would occur from lumping .encode('utf-8') onto every cell read, I will probably end up doing that.


幻梦 2024-12-16 16:55:21

The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are likely reading files newer than Excel 97, the cells contain Unicode code points anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert it to ASCII (which is what the str() function does in Python 2). Encode once, when writing the output. Use the code below:

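from xlrd import open_workbook  # Python 2 code: unicode() does not exist in Python 3

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
# x and y are the column indices from the question. Start at row 1, as the
# original z = i + 1 did, but stop before running past the last row.
for z in range(1, sheettwo.nrows):
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()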
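The same idea, sketched for Python 3 as an assumption (the code above is Python 2): keep every value as text and let the file object do the UTF-8 encoding once, at the output boundary. The `rows` list is a made-up stand-in for values read from the sheet, and `format_row` and `write_rows` are hypothetical helper names, so this runs without xlrd or a workbook.

```python
import os
import tempfile

# Hypothetical stand-ins for cell values: xlrd returns text cells as
# Unicode strings and number cells as floats.
rows = [
    (1.0, "caf\u00e9"),
    (2.5, "\u0400"),
]

def format_row(number, text):
    # Build the line as text. Floats have no .encode() method, so a blanket
    # cell_value.encode('utf-8') on every cell would raise AttributeError
    # for numeric cells.
    return "value -> {} more formatting! {} and done\n".format(number, text)

def write_rows(path, rows):
    # encoding="utf-8" makes the file object encode each written string,
    # so no per-cell encode() call is needed.
    with open(path, "w", encoding="utf-8") as out:
        for number, text in rows:
            out.write(format_row(number, text))

path = os.path.join(tempfile.mkdtemp(), "output.file")
write_rows(path, rows)
```

With this layout the question of which cells need encoding goes away: nothing is encoded until it is written.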


嘴硬脾气大 2024-12-16 16:55:21

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
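For readers on Python 3 (an assumption; the session above is Python 2), str() on a float already returns the shortest string that converts back to the same float, so the 12-digit truncation no longer occurs. A minimal sketch follows; `to_text` is a hypothetical helper name, and the "%.12g" format is used only to reproduce the old 12-digit result for comparison.

```python
def to_text(value):
    """Turn a cell value (text, int or float) into text without losing float precision."""
    # In Python 3, str(float) matches repr(float): the shortest string that
    # round-trips to exactly the same float.
    return value if isinstance(value, str) else str(value)

x = 123456.78901234567
legacy = "%.12g" % x   # 12 significant digits, like Python 2's str()/unicode()
full = to_text(x)      # round-trips back to x
```

Here float(full) == x holds, while float(legacy) does not recover x.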

(3) xlrd builds Cell objects on the fly when demanded, so asking for the value directly skips creating a Cell that is thrown away immediately:

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
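A toy model of why the second form can be cheaper. This is not xlrd's implementation: `ToySheet` and `ToyCell` are made-up classes that only imitate the behaviour described in point (3), with raw values stored in lists and a wrapper object built per call.

```python
class ToyCell(object):
    """Small wrapper around one value (illustrative only, not xlrd's Cell)."""
    def __init__(self, value):
        self.value = value

class ToySheet(object):
    """Toy sheet that stores raw values in nested lists."""
    def __init__(self, rows):
        self._rows = rows

    def cell_value(self, rowx, colx):
        # Direct lookup: no intermediate object is created.
        return self._rows[rowx][colx]

    def cell(self, rowx, colx):
        # A fresh wrapper is built on every call and is usually discarded
        # as soon as .value has been read.
        return ToyCell(self._rows[rowx][colx])

sheet = ToySheet([[u"caf\u00e9", 1.5], [u"\u0400", 2.0]])
same_value = sheet.cell(0, 0).value == sheet.cell_value(0, 0)
distinct_objects = sheet.cell(0, 0) is not sheet.cell(0, 0)
```

Both calls return the same value; the difference is only the extra object allocation per cell, which adds up when looping over a large sheet.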


About the author: 忆伤
