Displaying ISO-8859-1 encoded data gives strange characters

Posted on 2024-10-07 06:05:55, 670 characters, 4 views, 0 comments

I have an ISO-8859-1 encoded CSV file that I am trying to open and parse with Ruby:

require 'csv'

filename = File.expand_path('~/myfile.csv')
file = File.open(filename, "r:ISO-8859-1")
CSV.parse(file.read, col_sep: "\t") do |row| 
  puts row 
end

If I leave out the encoding in the call to File.open, I get an error:

ArgumentError: invalid byte sequence in UTF-8

My problem is that the call to puts row displays strange characters instead of the Norwegian characters æ, ø, å:

BOKF�RINGSDATO

I get the same result if I open the file in TextMate and force it to use UTF-8 encoding.

If I assign the file content to a string, I can check which encoding the string uses. As expected, it shows ISO-8859-1.

So when I puts each row, why is the string output as UTF-8?
Does it have something to do with the csv library?

I use Ruby 1.9.2.

Comments (3)

旧伤还要旧人安 2024-10-14 06:05:55

I found an answer by trying different things from the documentation:

require 'csv'

filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1") do |file|
  CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row| 
    #                    ↳  returns a copy transcoded to UTF-8.
    puts row 
  end
end

As you can see, all I have done is transcode the string to UTF-8 before the CSV parser gets it.
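
The effect of that encode call can be seen without a file on disk. The following is a minimal sketch, assuming a made-up tab-separated header row (the BELØP column is hypothetical) whose bytes are what an ISO-8859-1 file would contain:

```ruby
require 'csv'

# Hypothetical sample data: one tab-separated header row stored as
# ISO-8859-1 bytes (0xD8 is "Ø" in that encoding).
latin1 = "BOKF\xD8RINGSDATO\tBEL\xD8P\n".dup.force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8 before parsing, as in the answer above.
utf8 = latin1.encode(Encoding::UTF_8)

rows = CSV.parse(utf8, col_sep: "\t")
rows.each { |row| puts row.join(" | ") }

# "Ø" is one byte in ISO-8859-1 and two bytes in UTF-8, so the
# transcoded string is longer in bytes but identical in characters.
puts latin1.bytesize
puts utf8.bytesize
```

A terminal that expects UTF-8 cannot interpret the single byte 0xD8 on its own, which is consistent with the replacement character seen in the question. On standard Ruby, opening the file with mode "r:ISO-8859-1:UTF-8" should have the same effect, since the second name asks Ruby to transcode to UTF-8 while reading.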


Edit:
When I try this solution on macruby-head, I get the following error message from encode():

Encoding::InvalidByteSequenceError: "\xD8" on UTF-8

Even though I specify the encoding when opening the file, MacRuby still uses UTF-8.
This seems to be a known MacRuby limitation: the encoding is always UTF-8.

無處可尋 2024-10-14 06:05:55

Maybe you could use Iconv to convert the file contents to UTF-8 before parsing?
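
Iconv shipped with Ruby 1.9 but was removed from the standard library in Ruby 2.0, so the sketch below shows the same conversion with String#encode and leaves the Iconv call as a comment. The helper name to_utf8 is hypothetical, and the sample bytes are the ISO-8859-1 codes for æ, ø, å:

```ruby
# Hypothetical helper: treat raw bytes as text in a known source
# encoding and return a UTF-8 copy.
# On Ruby 1.9 the Iconv equivalent would be:
#   Iconv.conv('UTF-8', 'ISO-8859-1', text)
def to_utf8(bytes, from)
  bytes.dup.force_encoding(from).encode(Encoding::UTF_8)
end

# 0xE6, 0xF8, 0xE5 are "æ", "ø", "å" in ISO-8859-1.
converted = to_utf8("\xE6\xF8\xE5", "ISO-8859-1")
puts converted
puts converted.encoding
```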

壹場煙雨 2024-10-14 06:05:55

ISO-8859-1 and Win-1252 are really close in their character sets. Could some app have processed the file and converted it? Or could it have been received from a machine that was defaulting to Win-1252, which is Windows' standard setting?

Software that auto-detects the character set can get the encoding wrong if there are no bytes in the 0x80 to 0x9F range, so you might try changing file = File.open(filename, "r:ISO-8859-1") to file = File.open(filename, "r:Windows-1252"). (I think "Windows-1252" is the right encoding name.)

I used to write spiders, and HTML is notorious for being mislabeled or for having encoded binary characters from one character set embedded in another. I used some bad language many times over these problems several years ago, before most languages had implemented UTF-8 and Unicode, so I understand the frustration.
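
To illustrate why the 0x80 to 0x9F range matters, here is a small sketch that decodes the same two bytes under both encodings. The sample bytes are chosen for illustration only: 0x80 lies inside that range, 0xD8 lies outside it.

```ruby
# 0x80 is interpreted differently by the two encodings; 0xD8 is not.
bytes = "\x80 \xD8"

as_latin1 = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cp1252 = bytes.dup.force_encoding("Windows-1252").encode("UTF-8")

# ISO-8859-1 maps 0x80 to the control character U+0080,
# Windows-1252 maps it to the euro sign U+20AC.
p as_latin1
p as_cp1252

# Both encodings agree on 0xD8 ("Ø"), the byte from the question.
puts as_latin1[-1] == as_cp1252[-1]
```

Because both encodings decode 0xD8 to the same character, switching to Windows-1252 would only change the result if the file also contains bytes in the 0x80 to 0x9F range.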

ISO/IEC_8859-1,
Windows-1252
