How to seek in a UTF-16 file in Python?
For some reason I cannot seek in my UTF-16 file. It produces 'UnicodeException: UTF-16 stream does not start with BOM'. My code:
import codecs

f = codecs.open(ai_file, 'r', 'utf-16')
seek = self.ai_map[self._cbClass.Text]  # seek is a valid int (byte offset)
f.seek(seek)
while True:
    ln = f.readline().strip()
I tried random things, such as first reading something from the stream, but that didn't help. I checked the offset I seek to in a hex editor: the string starts at a character, not at a null byte (I guess that's a good sign, right?)
So how do I seek in a UTF-16 file in Python?
Comments (1)
Well, the error message is telling you why: it's not reading a byte order mark. The byte order mark is at the beginning of the file. Without having read the byte order mark, the UTF-16 decoder can't know what order the bytes are in. Apparently it does this lazily, the first time you read, instead of when you open the file, or else it is assuming that the seek() is starting a new UTF-16 stream.
If your file doesn't have a BOM, that's definitely the problem and you should specify the byte order when opening the file (see #2 below). Otherwise, I see two potential solutions:
1. Read the first two bytes of the file to get the BOM before you seek. You seem to say this didn't work, indicating that perhaps it's expecting a fresh UTF-16 stream after the seek, so:
2. Specify the byte order explicitly by using utf-16-le or utf-16-be as the encoding when you open the file.
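Here is a minimal sketch that combines the two suggestions, in case it helps. It peeks at the first two bytes to pick a BOM-less codec name, then seeks to a byte offset and reads one line with that codec. The helper names detect_utf16_codec and read_line_at, and the sample file, are made up for illustration and are not from the question. It wraps a binary file with codecs.getreader, which should behave much like codecs.open with an explicit encoding.

```python
import codecs
import os
import tempfile


def detect_utf16_codec(path):
    """Return a BOM-less codec name based on the file's first two bytes."""
    with open(path, "rb") as raw:
        head = raw.read(2)
    if head == codecs.BOM_UTF16_LE:
        return "utf-16-le"
    if head == codecs.BOM_UTF16_BE:
        return "utf-16-be"
    # Assumption for this sketch: refuse to guess when there is no BOM.
    raise ValueError("no UTF-16 BOM found; choose the byte order yourself")


def read_line_at(path, byte_offset, encoding):
    """Seek to byte_offset and return the next line, stripped."""
    # With "utf-16-le" / "utf-16-be" the decoder never looks for a BOM,
    # so it can start decoding in the middle of the file.
    with open(path, "rb") as raw:
        reader = codecs.getreader(encoding)(raw)
        reader.seek(byte_offset)
        return reader.readline().strip()


# Build a small sample file: BOM followed by little-endian UTF-16 text.
text = "first\nsecond\nthird\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as out:
    out.write(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

encoding = detect_utf16_codec(path)

# Byte offset of "second": the BOM plus 2 bytes per preceding character.
offset = len(codecs.BOM_UTF16_LE) + 2 * len("first\n")
line = read_line_at(path, offset, encoding)
print(line)
```

Note that the offset passed to seek() is a byte offset, so it has to land on an even boundary relative to the start of the text, otherwise the decoder will pair up the wrong bytes.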