Problems parsing a CSV file containing umlauts and similar characters with Ruby 1.8 / FasterCSV

Posted 2024-10-18 08:32:45

I have a CSV file with lines like this in it:

...,"Städtische Galerie im Lenbachhaus",...

I am using Ruby 1.8, with the FasterCSV gem, like so:

FasterCSV.foreach(file, :encoding => 'u', :headers => :first_row) do |r|
    as = ImportObject.create!(r.to_hash)
end

For most rows it's working fine, but for these rows the field with the special character is getting truncated, so we get "St" saved in the DB.

I have tried setting $KCODE="u", both with and without the :encoding option, to no avail.

The DB is MySQL.

EDIT:

I tried pushing the code up to Heroku (Postgres) and now getting a new error:

2011-02-19T17:19:01-08:00 app[web.1]: ActiveRecord::StatementInvalid (PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xe46474

2011-02-19T17:19:01-08:00 app[web.1]: HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

2011-02-19T17:19:01-08:00 app[web.1]: : INSERT INTO "import_objects" (... "title", ...) VALUES (..., 'St?dtische Galerie im Lenbachhaus', ...) RETURNING "id"):

:(


Comments (3)

初相遇 2024-10-25 08:32:46

You don't say what database type you're using, but it's very possible the DB is not configured for UTF-8 and is expecting ASCII instead. Throwing non-ASCII characters at it could result in a truncated string, a missing character, or a character replaced with a placeholder, depending on the database and on the gem or ORM you're using to talk to it. When I build a database, I either make sure it's configured for UTF-8 or make sure the text I push into it is encoded so that it can make a round trip without corruption or loss. I learned that lesson the same way you are learning it: the hard way.

Check the database's log, and check your code to see whether you can enable logging of error and warning messages for the database inserts.

It's easy to disable warnings and errors with a lot of databases, but during development you don't want to do that. Those messages are important and can signal big problems to come. Ignoring them and pushing code to production can be a real recipe for sleepless nights.

痴骨ら 2024-10-25 08:32:45

The problem is likely a file encoding issue, as you have surmised. The most likely scenario is your file is not actually encoded with UTF-8, so the rest of your application cannot recognize the foreign encoding. It's also possible -- but I believe quite unlikely -- that one of the bytes used in the encoding is a quote or comma in ASCII, which will mess up FasterCSV parsing the data.
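One way to test this hypothesis is to ask Ruby whether the bytes are valid UTF-8. The following is only a minimal sketch and assumes Ruby 2.0 or later, not the 1.8 from the question (1.8 strings carry no encoding, so these methods do not exist there). The sample bytes are a hypothetical reconstruction, modeled on the 0xe46474 sequence in the Postgres error, which is what "ädt" looks like in Latin-1:

```ruby
# Assumption: Ruby 2.0+ (String#b); Ruby 1.8 has no per-string encodings.
# Hypothetical sample: the problem field as it would be stored in a Latin-1 file.
raw = "St\xE4dtische Galerie im Lenbachhaus".b

# Reinterpret the same bytes as UTF-8 and ask whether they form valid UTF-8.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
valid = as_utf8.valid_encoding?

# Hex dump of the three bytes starting at the umlaut, for comparison with the log.
offending = raw.byteslice(2, 3).unpack("H*").first

puts "valid UTF-8? #{valid}"
puts "bytes at offset 2: 0x#{offending}"
```

If the check reports invalid UTF-8 for the real file, the data needs converting before it reaches the database. Note also that every non-ASCII character in Latin-1 is a single byte of 0x80 or above, so it cannot collide with an ASCII quote or comma.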

First, make a test file with just the "problem row" in your CSV file. Next, read the data in the file:

text_in = File.read('data.csv')

Now you have to convert it. The problem is, you don't really know what it is. You'll have to try a few different things. My best guess is the text is Latin-1 encoded.

require 'iconv'
text_out = Iconv.conv("UTF8", "LATIN1", text_in)

Now try to import this data. Alternatively, you can write to disk and open it, and see if it's encoded properly.
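The Iconv library was removed from Ruby's standard library in 2.0, so on a newer Ruby the same conversion and the "write it to disk and check" step might look like the sketch below. It assumes Ruby 2.0 or later; the temporary files are hypothetical stand-ins for data.csv and data_conv.csv, and Latin-1 is still only a guess at the source encoding:

```ruby
require 'tempfile'

# Assumption: Ruby 2.0+; the Latin-1 source encoding is a guess.
# Hypothetical stand-in for data.csv containing one Latin-1 "problem row".
source = Tempfile.new(['data', '.csv'])
source.binmode
source.write("a,\"St\xE4dtische Galerie im Lenbachhaus\",b\n".b)
source.close

# Read the raw bytes, label them as Latin-1, then transcode to UTF-8.
text_in  = File.binread(source.path)
text_out = text_in.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Write the converted text out and re-read it to confirm it is valid UTF-8.
converted = Tempfile.new(['data_conv', '.csv'])
converted.binmode
converted.write(text_out)
converted.close

round_trip = File.read(converted.path, encoding: 'UTF-8')
puts round_trip.valid_encoding?
puts round_trip
```

If the guess is wrong, the conversion still succeeds (every byte is a valid Latin-1 character), so inspect the output by eye rather than relying on the absence of an exception.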

But honestly, you can do this outside of Ruby much more easily.

$ iconv -t UTF8 -f LATIN1 < data.csv > data_conv.csv

菊凝晚露 2024-10-25 08:32:45

The problem is not FasterCSV, as in my testing, FasterCSV does not have a problem reading this data. For instance:

>> FasterCSV.parse("a,Städtische Galerie im Lenbachhaus,b,ä", :headers => [:a,:b,:c,:d]) do |r|
|    r = r.to_hash
|    p r
|    puts r[:d]
|  end  
{:c=>"b", :a=>"a", :d=>"\303\244", :b=>"Städtische Galerie im Lenbachhaus"}
ä

Note that Ruby 1.8 is not Unicode-aware: it treats strings as sequences of bytes. This mainly affects things like String#length, which counts bytes rather than characters. For instance, Ruby 1.8 returns a length of 34 for this string instead of 33, because "ä" takes two bytes in UTF-8. However, this doesn't have an effect until you do something with the string that depends on its length, like running a validation on it.

>> "Städtische Galerie im Lenbachhaus".length
=> 34
>> "Stadtische Galerie im Lenbachhaus".length
=> 33
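For comparison, here is a small sketch of the same measurement on a newer Ruby, where strings are encoding-aware. It assumes Ruby 1.9 or later and a UTF-8 string, so it will not behave this way on the 1.8.7 used in the tests above:

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 string; on 1.8, String#length returns the byte count.
title = "St\u00E4dtische Galerie im Lenbachhaus"  # \u00E4 is "ä"

chars = title.length    # counts characters
bytes = title.bytesize  # counts bytes; "ä" takes two bytes in UTF-8

# The two bytes of "ä" in octal, matching the "\303\244" shown by Ruby 1.8's inspect.
umlaut_octal = title.bytes.to_a[2, 2].map { |b| format('%o', b) }.join(' ')

puts "characters: #{chars}"
puts "bytes:      #{bytes}"
puts "umlaut:     #{umlaut_octal}"
```

The one-byte difference between the two counts is the second byte of the umlaut, which is where the 34 reported by Ruby 1.8 comes from.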

So my guess is it's something about ImportObject or how your database connection is configured.


Ruby version used in these tests:

>> RUBY_DESCRIPTION 
=> "ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02"