Displaying ISO-8859-1 encoded data gives strange characters
I have an ISO-8859-1 encoded CSV file that I am trying to open and parse with Ruby:
require 'csv'
filename = File.expand_path('~/myfile.csv')
file = File.open(filename, "r:ISO-8859-1")
CSV.parse(file.read, col_sep: "\t") do |row|
puts row
end
If I leave out the encoding from the call to File.open, I get an error
ArgumentError: invalid byte sequence in UTF-8
My problem is that the call to puts row displays strange characters instead of the Norwegian characters æ, ø, å:
BOKF�RINGSDATO
I get the same if I open the file in textmate, forcing it to use UTF-8 encoding.
By assigning the file content to a string, I can check the encoding used for the string. As expected, it shows ISO-8859-1.
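That check can be reproduced in a few lines; since the real myfile.csv isn't shown, this sketch writes a hypothetical stand-in file containing ISO-8859-1 bytes first:

```ruby
require 'tmpdir'

# Hypothetical stand-in for ~/myfile.csv: raw ISO-8859-1 bytes (0xD8 = Ø)
path = File.join(Dir.tmpdir, "myfile_latin1_demo.csv")
File.binwrite(path, "BOKF\xD8RINGSDATO\t123\n")

# Read with the external encoding declared, then inspect the string
content = File.open(path, "r:ISO-8859-1", &:read)
puts content.encoding        # ISO-8859-1
puts content.valid_encoding? # true
```

The string is correctly tagged ISO-8859-1; the garbling only appears when those bytes are written unchanged to a terminal that expects UTF-8.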
So when I puts each row, why does it output the string as UTF-8? Is it something to do with the CSV library?
I am using Ruby 1.9.2.
3 Answers
Found myself an answer by trying different things from the documentation:
As you can see, all I have done is encode the string to a UTF-8 string before the CSV parser gets it.
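Based on that description, the fix presumably looked something like the following; this sketch substitutes a hypothetical sample file for ~/myfile.csv so it is self-contained:

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample file standing in for ~/myfile.csv (0xD8 = Ø in ISO-8859-1)
filename = File.join(Dir.tmpdir, "myfile_demo.csv")
File.binwrite(filename, "BOKF\xD8RINGSDATO\tBEL\xD8P\n")

file = File.open(filename, "r:ISO-8859-1")
rows = []
# The key step: transcode the whole string to UTF-8 before CSV.parse sees it
CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row|
  rows << row
end
file.close

puts rows.first.join(" | ")  # BOKFØRINGSDATO | BELØP
```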
Edit:
Trying this solution on macruby-head, I get the following error message from encode( ):
Even though I specify the encoding when opening the file, macruby still uses UTF-8.
This seems to be a known macruby limitation: Encoding is always UTF-8
Maybe you could use Iconv to convert the file contents to UTF-8 before parsing?
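Note that Iconv was removed from the standard library in Ruby 1.9.3, so a sketch of this suggestion needs a fallback; assuming an ISO-8859-1 string is already in hand:

```ruby
# Sketch of the Iconv suggestion. Iconv shipped with Ruby up to 1.9.2;
# on later Rubies String#encode performs the same conversion.
latin1 = "BOKF\xD8RINGSDATO".force_encoding("ISO-8859-1")  # 0xD8 = Ø

begin
  require 'iconv'
  utf8 = Iconv.conv("UTF-8", "ISO-8859-1", latin1)
rescue LoadError
  # Iconv is gone from the stdlib; fall back to the modern equivalent
  utf8 = latin1.encode("UTF-8")
end

puts utf8  # BOKFØRINGSDATO
```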
ISO-8859-1 and Win-1252 are really close in their character sets. Could some app have processed the file and converted it? Or could it have been received from a machine that was defaulting to Win-1252, which is Windows' standard setting?
Software that senses the code-set can get the encoding wrong if there are no characters in the 0x80 to 0x9F byte range, so you might try changing
file = File.open(filename, "r:ISO-8859-1")
to
file = File.open(filename, "r:Windows-1252")
(I think "Windows-1252" is the right encoding name.) I used to write spiders, and HTML is notorious for being mis-labeled or for having binary characters from one character set embedded in another. Several years ago, before most languages had implemented UTF-8 and Unicode, I used some bad language many times over these problems, so I understand the frustration.
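A single byte from that 0x80–0x9F range is enough to see the difference; 0x96 is a C1 control code in ISO-8859-1 but an en dash in Windows-1252:

```ruby
# The same byte, interpreted under each encoding and transcoded to UTF-8
as_latin1 = "\x96".b.force_encoding("ISO-8859-1").encode("UTF-8")
as_cp1252 = "\x96".b.force_encoding("Windows-1252").encode("UTF-8")

puts "%04x" % as_latin1.ord  # 0096 (invisible C1 control character)
puts as_cp1252               # – (U+2013 EN DASH)
```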
ISO/IEC_8859-1,
Windows-1252