How to handle Japanese text with Python xlrd

Posted 2024-11-08 14:50:11 · 424 words · 4 views · 0 comments

This is my code:

#!/usr/bin/python   
#-*-coding:utf-8-*-   

import xlrd,sys,re

data = xlrd.open_workbook('a.xls',encoding_override="utf-8")
a = data.sheets()[0]
s=''
for i in range(a.nrows):
    if 9<i<20:
        #stage
        print a.row_values(i)[1].decode('shift_jis')+'\n'

But it prints:

????
????????
??????
????
????
????
????????

So what can I do?

Thanks.

Comments (2)

皇甫轩 2024-11-15 14:50:11

Background: In a "modern" (Excel 97-2003) XLS file, text is effectively stored as Unicode. In older files, text is stored as 8-bit strings, and a "codepage" record tells how it is encoded e.g. the integer 1252 corresponds to the encoding known as cp1252 or windows-1252. In either case, xlrd presents extracted text as unicode objects.
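To make the codepage idea concrete, here is a minimal sketch of how a numeric code page could be turned into a Python codec name. The helper `codepage_to_encoding` and its small lookup table are hypothetical and written only for illustration; this is not xlrd's own code, and it assumes Python 3 and the common `cpNNN` naming convention.

```python
import codecs

# Hypothetical table: code pages whose Python codec name is not simply "cpNNN".
# 1200 is the value reported for Unicode (BIFF8) files; 10000 is Mac Roman.
_SPECIAL_CODEPAGES = {1200: "utf_16_le", 10000: "mac_roman"}


def codepage_to_encoding(codepage):
    """Return a Python codec name for a numeric code page (illustrative only)."""
    name = _SPECIAL_CODEPAGES.get(codepage, "cp%d" % codepage)
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name


print(codepage_to_encoding(1252))  # Western European Windows code page
print(codepage_to_encoding(932))   # Microsoft's Shift-JIS variant
print(codepage_to_encoding(1200))  # Unicode, as in Excel 97-2003 files
```

For a Japanese workbook saved by a pre-Excel-97 program, a codepage of 932 would be the expected value; 1200 means the text is already Unicode and no override is needed.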

Please insert this line into your code:

print data.biff_version, data.codepage, data.encoding

If you have a new file, you should see

80 1200 utf_16_le

In any case, please edit your question to report the outcome.

Problem 1: encoding_override is required ONLY if the file is an old file AND you know/suspect that the codepage record is omitted or wrong. It is ignored if the file is a new file. Do you really know that the file is pre-Excel-97 and the text is encoded in UTF-8? If so, it can only have been created by some seriously deluded 3rd-party software, and Excel will blow up if you try to open it with Excel; visit the author with a baseball bat. Otherwise, don't use encoding_override.

Problem 2: You should have unicode objects. To display them, you need to encode (not decode) them from unicode to str using a suitable encoding. It is very surprising that print unicode_object.decode('shift-jis') doesn't raise an exception and instead prints question marks.
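The difference between encoding and decoding can be sketched as follows. This is an illustration written for Python 3 (where the unicode type is called str and the encoded form is bytes), using a made-up sample string rather than data from the asker's file. It also shows one way question marks can arise: encoding with a codec that cannot represent the characters while asking for replacement.

```python
# Sample Unicode text (three Japanese characters), standing in for a cell value.
text = u"\u65e5\u672c\u8a9e"

# Encoding turns Unicode text into bytes in a chosen character set.
sjis_bytes = text.encode("shift_jis")

# Decoding is the reverse step: bytes back to Unicode text.
round_tripped = sjis_bytes.decode("shift_jis")
print(round_tripped == text)

# A codec that cannot represent the characters, combined with
# errors="replace", substitutes one "?" per character.
ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)
```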

To help understand this, please change your code to be like this:

text = a.row_values(i)[1]
print i, repr(text)
print repr(text.decode('shift-jis'))

and report the outcome.

So that we can help you choose an appropriate encoding (if any), tell us which operating system and version you are using, and what the following displays:

print sys.stdout.encoding
import locale
print locale.getpreferredencoding()
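Once those two values are known, picking an output encoding could look roughly like this. It is a sketch for Python 3; the helper names `console_encoding` and `encode_for_console` are hypothetical, and falling back to `errors="replace"` is an assumption about the desired behaviour, not something the question specifies.

```python
import locale
import sys


def console_encoding():
    """Best guess at the encoding the terminal expects."""
    # sys.stdout.encoding can be None when output is redirected.
    return getattr(sys.stdout, "encoding", None) or locale.getpreferredencoding(False)


def encode_for_console(text, encoding=None):
    """Encode Unicode text for output, replacing unrepresentable characters."""
    return text.encode(encoding or console_encoding(), errors="replace")


sample = u"\u65e5\u672c\u8a9e"
print(console_encoding())
print(encode_for_console(sample, "utf-8"))
```

If the reported console encoding cannot represent Japanese (for example an ASCII or Western European code page), the replacement characters come from the console side rather than from xlrd.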

Further reading:

(1) the xlrd documentation (the section on Unicode, right at the front), which is included in the distribution.

(2) the Python Unicode HOWTO.

少女净妖师 2024-11-15 14:50:11

Why isn't the encoding override in your open call shift-jis?

data = xlrd.open_workbook('a.xls',encoding_override="shift-jis")

If the file is really Shift-JIS, lots of its byte sequences (frankly, almost all of the multi-byte ones) are not valid UTF-8. If you are getting illegal characters (?) while your file really is UTF-8 and you want to output Shift-JIS, I'd suggest that your output shell can't handle the encoding; the problem is print, and writing to a file would probably be fine.
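Writing to a file instead of the shell could be sketched like this. The helper `dump_text` and the file name `out.txt` are hypothetical, the sample string stands in for values read from the sheet, and the code assumes Python 3.

```python
import io
import os
import tempfile


def dump_text(lines, path, encoding="utf-8"):
    """Write Unicode lines to a file in the given encoding."""
    with io.open(path, "w", encoding=encoding) as fh:
        for line in lines:
            fh.write(line + u"\n")


# Usage: write one sample value as Shift-JIS into a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
dump_text([u"\u65e5\u672c\u8a9e"], out_path, encoding="shift_jis")
print(out_path)
```

Opening the resulting file in an editor that understands the chosen encoding sidesteps whatever the terminal can or cannot display.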
