Handling UTF-8 numbers in Python
Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file was in UTF-8 and it had one row with the values 115,113,12, then:
with open(file) as f:
    a,b,c=map(int,f.readline().split(','))
would throw this:
invalid literal for int() with base 10: '\xef\xbb\xbf115'
The first number always comes out mangled with these '\xef\xbb\xbf' characters. For the other two numbers the conversion works fine. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.
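For reference, that manual replacement amounts to something like the following sketch (assuming Python 2, where readline() returns a byte string):

with open(file) as f:
    # strip the raw UTF-8 BOM bytes before converting
    line = f.readline().replace('\xef\xbb\xbf', '')
    a, b, c = map(int, line.split(','))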
Is there a better way of doing this for any type of encoded file?
2 Answers
This works in Python 2.6.4. The codecs.open call opens the file and returns data as unicode, decoding from UTF-8 and ignoring the initial BOM.
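The snippet this answer refers to is not reproduced above; a minimal sketch of that approach, assuming the 'utf-8-sig' codec (which decodes UTF-8 and drops a leading BOM if one is present) and a placeholder filename, could look like:

import codecs

# 'utf-8-sig' decodes UTF-8 and silently skips a leading BOM if present;
# filename is a placeholder for the path being read.
f = codecs.open(filename, encoding='utf-8-sig')
a, b, c = map(int, f.readline().split(','))
f.close()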
What you're seeing is a UTF-8 encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec, and skip over the U+FEFF character if present.
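The answer does not include code, so the exact calls are an assumption; a minimal sketch of the suggestion, using codecs.open under Python 2, might be:

import codecs

# Decode as plain UTF-8; if the file starts with a BOM, the decoded
# line begins with the single character U+FEFF.
f = codecs.open(filename, encoding='utf-8')   # filename is a placeholder
line = f.readline()
if line.startswith(u'\ufeff'):
    line = line[1:]                           # skip the BOM if present
a, b, c = map(int, line.split(','))
f.close()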