Ruby 1.8 Iconv UTF-16 to UTF-8 fails with "\000" (Iconv::InvalidCharacter)
I am having trouble handling text files of tabulated data generated on a Windows machine. I'm working in Ruby 1.8. The following gives an error ("\000" (Iconv::InvalidCharacter)) when processing the SECOND line from the file. The first line is converted properly.
require 'iconv'
conv = Iconv.new("UTF-8//IGNORE", "UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets)
  line = conv.iconv(line.strip) # FAILS HERE
  puts line
  # DO MORE STUFF HERE
end
The strange thing is that it reads and converts the first line in the file with no problem.
I have the //IGNORE flag in the Iconv constructor -- I thought this was supposed to suppress this kind of error.
I've been going in circles for a while. Any advice would be highly appreciated.
Thanks!
EDIT: hobbs's solution fixes this. Thank you. Simply change the code to:
require 'iconv'
conv = Iconv.new("UTF-8//IGNORE", "UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets("\x0a\x00"))
  line = conv.iconv(line.strip) # NO LONGER FAILS HERE
  # DOES MORE STUFF HERE
end
Now I'll just need to find a way to automatically determine which gets separator to use.
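One way to pick the separator automatically is to sniff the byte-order mark at the start of the file. This is only a sketch under the assumption that the UTF-16 files carry a BOM; detect_separator is a hypothetical helper, not part of any library:

```ruby
# Hypothetical helper: guess the gets separator from the file's BOM.
def detect_separator(path)
  bom = File.open(path, "rb") { |f| (f.read(2) || "").unpack("C*") }
  case bom
  when [0xFF, 0xFE] then "\x0a\x00"  # UTF-16LE: LF is 0x0a 0x00
  when [0xFE, 0xFF] then "\x00\x0a"  # UTF-16BE: LF is 0x00 0x0a
  else "\n"                          # assume a single-byte or UTF-8 file
  end
end
```

If a UTF-16 file has no BOM you would have to fall back on a heuristic, e.g. checking whether every second byte of a sample is 0x00.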
The error message is pretty vague, but I think it's unhappy about the fact that it found an odd number of bytes on a line, since every character in UTF-16 is two (or occasionally four) bytes. And I think the reason for that is your use of gets -- the lines in your file are separated by a UTF-16LE newline, which is 0x0a 0x00, but gets is splitting on (and strip is removing) 0x0a only.

To illustrate: suppose the file contains

ab
cd

encoded in UTF-16LE. That's

0x61 0x00 0x62 0x00 0x0a 0x00 0x63 0x00 0x64 0x00 0x0a 0x00

gets reads up to the first 0x0a, which strip removes, so the first line read is 0x61 0x00 0x62 0x00, which iconv happily accepts and encodes to UTF-8 as 0x61 0x62 -- "ab". gets then reads up to the next 0x0a, which strip again removes, so the second time line gets 0x00 0x63 0x00 0x64 0x00, and now everything is screwed up -- we're out of sync by one byte and there's an odd number of bytes to convert, and iconv blows up because that's incompatible with what you asked it to do.

Absent an actual working file encoding/decoding layer, I think what you want is to change the gets separator from "\n" ("\x0a") to "\x0a\x00", abandon all use of strip since it's not encoding-clean, and use print instead of puts so that you don't add extra line ends (since you'll be converting the ones you've already got).

If you're working with Windows files, a Windows CRLF in UTF-16LE is "\x0d\x00\x0a\x00".
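To make the byte-alignment argument concrete, here is a small self-contained sketch (using StringIO in place of the real file, so it doesn't need iconv at all) showing that the two-byte separator keeps every chunk at an even byte count:

```ruby
require 'stringio'

data = "a\x00b\x00\x0a\x00c\x00d\x00\x0a\x00"  # "ab\ncd\n" in UTF-16LE
io = StringIO.new(data)

lines = []
while (line = io.gets("\x0a\x00"))   # split on the full UTF-16LE newline
  lines << line.chomp("\x0a\x00")    # chomp the two-byte separator, not strip
end

lines.map { |l| l.bytesize }  # => [4, 4] -- even counts, safe to hand to iconv
```

With the default "\n" separator the second chunk would come out as 0x00 0x63 0x00 0x64 0x00 -- five bytes, which is exactly what iconv rejects.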
The answer above is good. You could also convert the entire file to UTF-8 before processing it line by line, but that might have worse streaming behaviour on large files.
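A sketch of that whole-file approach, shown here with String#encode (which replaced Iconv from Ruby 1.9 on; on 1.8 the same shape works by calling Iconv.iconv once on the full buffer). The inline byte string is a stand-in for File.read(tabfile):

```ruby
data = "a\x00b\x00\x0a\x00c\x00d\x00\x0a\x00"  # stands in for File.read(tabfile)

# Reinterpret the raw bytes as UTF-16LE, then transcode the whole buffer.
utf8 = data.dup.force_encoding("UTF-16LE").encode("UTF-8")
lines = utf8.lines.map { |l| l.chomp }          # => ["ab", "cd"]
```

Once the whole buffer is valid UTF-8, the gets-separator and strip problems disappear, at the cost of holding the entire file in memory.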