Ruby 1.8 Iconv UTF-16 to UTF-8 fails with "\000" (Iconv::InvalidCharacter)

Posted on 2024-11-10 03:17:15

I am having trouble handling text files of tabulated data generated on a Windows machine.
I'm working in Ruby 1.8. The following gives an error ("\000" (Iconv::InvalidCharacter)) when processing the SECOND line from the file. The first line is converted properly.

require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets)
  line = conv.iconv(line.strip)  # FAILS HERE
  puts line
  # DO MORE STUFF HERE
end

The strange thing is that it reads and converts the first line in the file with no problem.
I have the //IGNORE flag in the Iconv constructor -- I thought this was supposed to suppress this kind of error.

I've been going in circles for a while. Any advice would be highly appreciated.

Thanks!

EDIT:
hobbs's solution fixes this. Thank you.
Simply change the code to:

require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets("\x0a\x00"))
  line = conv.iconv(line.strip)  # NO LONGER FAILS HERE
  # DOES MORE STUFF HERE
end

Now I'll just need to find a way to automatically determine which gets separator to use.
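
One possible way to choose the separator automatically is to look at the byte-order mark (BOM) at the start of the file. The sketch below assumes the file begins with a BOM; the helper name utf16_gets_separator is made up for illustration, and the fallback to little-endian when no BOM is found is an assumption, not something the file itself guarantees.

```ruby
# Hypothetical helper: choose the gets separator from the first two bytes of the file.
def utf16_gets_separator(head)
  case head.unpack("C2")
  when [0xFF, 0xFE] then "\x0a\x00"   # little-endian BOM: newline is 0x0a 0x00
  when [0xFE, 0xFF] then "\x00\x0a"   # big-endian BOM: newline is 0x00 0x0a
  else "\x0a\x00"                     # ASSUMPTION: no BOM, treat as little-endian
  end
end

# Stand-ins for the first bytes of a file; with a real file this could be
# head = File.open(tabfile, "rb") { |f| f.read(2) }
le_head = [0xFF, 0xFE, 0x61, 0x00].pack("C*")
be_head = [0xFE, 0xFF, 0x00, 0x61].pack("C*")

le_sep = utf16_gets_separator(le_head)
be_sep = utf16_gets_separator(be_head)
```

The returned string can then be passed to infile.gets in place of the hard-coded "\x0a\x00".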

Comments (2)

2024-11-17 03:17:15

The error message is pretty vague, but I think it's unhappy that it found an odd number of bytes on a line, since every character in UTF-16 is two (or occasionally four) bytes. I think the reason is your use of gets: the lines in your file are separated by a UTF-16le newline, which is 0x0a 0x00, but gets splits on (and strip removes) 0x0a only.

To illustrate: suppose the file contains

ab
cd

encoded in UTF-16le. That's

0x61 0x00 0x62 0x00 0x0a 0x00 0x63 0x00 0x64 0x00 0x0a 0x00
    a         b         \n        c         d         \n

gets reads up to the first 0x0a, which strip removes, so the first line read is 0x61 0x00 0x62 0x00, which iconv happily accepts and encodes to UTF-8 as 0x61 0x62 ("ab"). gets then reads up to the next 0x0a, which strip again removes, so the second time line gets 0x00 0x63 0x00 0x64 0x00, and now everything is out of step: we're off by one byte and there's an odd number of bytes to convert, so iconv blows up because that's incompatible with what you asked it to do.
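
The byte walk-through can be reproduced with a small script. This is only a sketch: the 12 bytes are the "ab\ncd\n" example from this answer, a StringIO stands in for the real file, and chomp("\n") stands in for the trailing 0x0a that strip removes in the question's code.

```ruby
require 'stringio'

# "ab\ncd\n" in UTF-16le, written out byte by byte as in the dump in this answer.
data = [0x61, 0x00, 0x62, 0x00, 0x0a, 0x00,
        0x63, 0x00, 0x64, 0x00, 0x0a, 0x00].pack("C*")

io = StringIO.new(data)   # stand-in for File.open(tabfile, "rb")
chunks = []
while (line = io.gets)    # default separator is the single byte 0x0a
  # chomp("\n") drops only the trailing 0x0a, mimicking what strip removed there
  chunks << line.chomp("\n").unpack("C*")
end

# Show the length and contents of every chunk that gets handed back.
chunks.each do |bytes|
  parity = (bytes.length % 2 == 0) ? "even" : "odd"
  puts "#{bytes.length} bytes (#{parity}): " + bytes.map { |b| "0x%02x" % b }.join(" ")
end
```

The first chunk is the four bytes of "ab"; the second starts with the 0x00 left over from the first newline and has five bytes, which is the misaligned input that the converter rejects.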

Absent an actual working file encoding/decoding layer, I think what you want is to change the gets separator from "\n" ("\x0a") to "\x0a\x00", abandon all use of strip since it's not encoding-clean, and use print instead of puts so that you don't add extra line-ends (since you'll be converting the ones you've already got).

If you're working with Windows files, a Windows CRLF in UTF-16le is "\x0d\x00\x0a\x00".
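
Putting those suggestions together, a minimal sketch might look like the following. The helper names (utf16le_to_utf8, each_utf8_line) are made up for illustration, the input is assumed to be little-endian, and a StringIO stands in for the real file. Because Iconv was removed from the standard library in later Ruby versions, the sketch falls back to String#encode when Iconv cannot be loaded.

```ruby
require 'stringio'

begin
  require 'iconv'   # available on Ruby 1.8
rescue LoadError
  # Newer Rubies without Iconv: String#encode is used below instead.
end

# Hypothetical helper: convert a binary string of UTF-16le bytes to UTF-8.
def utf16le_to_utf8(bytes)
  if defined?(Iconv)
    Iconv.conv("UTF-8", "UTF-16LE", bytes)
  else
    bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
  end
end

# Hypothetical helper: yield each line of a UTF-16le stream as UTF-8.
# The default separator also works for CRLF files; 0x0d 0x00 simply stays in the line.
def each_utf8_line(io, separator = "\x0a\x00")
  first = true
  while (line = io.gets(separator))
    # Drop a leading byte-order mark (0xFF 0xFE) on the first line only.
    line = line[2..-1] if first && line.unpack("C2") == [0xFF, 0xFE]
    first = false
    yield utf16le_to_utf8(line)   # no strip here: it is not encoding-clean
  end
end

# BOM + "ab\r\ncd\r\n", packed as 16-bit little-endian units.
sample = [0xFEFF, 0x61, 0x62, 0x0d, 0x0a, 0x63, 0x64, 0x0d, 0x0a].pack("v*")
lines = []
each_utf8_line(StringIO.new(sample)) { |line| lines << line }
lines.each { |line| print line }   # print, not puts: the line ending is already there
```

One caveat: a byte-level separator can in principle match across two adjacent characters (a code unit whose second byte is 0x0a followed by one whose first byte is 0x00), so this is not a full substitute for a real decoding layer.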

红玫瑰 2024-11-17 03:17:15

Answer above is good. You could also convert the entire file to UTF-8 before processing it line-by-line, but that might have worse streaming behaviour on large files.
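
A rough sketch of that whole-file approach, assuming little-endian input that fits comfortably in memory. The helper name utf16le_buffer_to_utf8 is made up for illustration, and String#encode is used as a fallback on Rubies where Iconv cannot be loaded.

```ruby
begin
  require 'iconv'   # available on Ruby 1.8
rescue LoadError
  # Newer Rubies without Iconv: String#encode is used below instead.
end

# Hypothetical helper: convert a whole UTF-16le buffer (with optional BOM) to UTF-8.
def utf16le_buffer_to_utf8(bytes)
  bytes = bytes[2..-1] if bytes.unpack("C2") == [0xFF, 0xFE]   # skip the BOM
  if defined?(Iconv)
    Iconv.conv("UTF-8", "UTF-16LE", bytes)
  else
    bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
  end
end

# Stand-in for: raw = File.open(tabfile, "rb") { |f| f.read }
raw = [0xFEFF, 0x61, 0x62, 0x0d, 0x0a, 0x63, 0x64, 0x0d, 0x0a].pack("v*")

rows = []
# After conversion the text is plain UTF-8, so ordinary line handling is safe again.
utf16le_buffer_to_utf8(raw).each_line { |line| rows << line.chomp }
```

The trade-off is the one mentioned above: the whole file is read and converted before the first row is available.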
