Using Ruby's FasterCSV with character encodings

Posted 2024-10-19 22:14:38

Using Ruby 1.8.7, I want to accept CSV files into my system. Even though this is an admin application, it seems I can get several different kinds of CSV. On my Mac, if I export from Excel using the "Windows CSV" option, FasterCSV can read the file by default. On Windows I seem to be getting UTF-16 encoded CSVs, which I haven't figured out how to parse yet.

It seems like a pretty common need to let users upload a CSV that could be UTF-8, UTF-16, ASCII and so on, then detect the encoding and parse the file. Has anyone figured this out?

I started looking at UniversalDetector to help me detect the encoding, and then at Iconv to convert, but this seems tricky and I was hoping someone had already figured it out :)

Comments (1)

枯寂 2024-10-26 22:14:38

According to FasterCSV's docs, the initialize method takes an :encoding option:

The encoding to use when parsing the file. Defaults to your $KCODE setting. Valid values: 'n' or 'N' for none, 'e' or 'E' for EUC, 's' or 'S' for SJIS, and 'u' or 'U' for UTF-8 (see Regexp.new()).

Because that list is limited, you might want to use iconv to pre-process the contents and then pass them to the CSV parser. You can use Ruby's interface to iconv ("Iconv") or the command-line version of it. Iconv is very powerful and flexible, and it can convert UTF-16 among other things.
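As a rough sketch of that "convert first, then parse" idea, assuming the source encoding is already known: the helper names to_utf8 and parse_csv are made up for illustration. The Ruby 1.8.7 branch is written from the documented Iconv.conv(to, from, str) and FasterCSV.parse signatures, and the other branch uses String#encode and the standard CSV library that replaced FasterCSV in Ruby 1.9.

```ruby
# Convert raw bytes from a known source encoding to UTF-8.
def to_utf8(raw, from)
  if raw.respond_to?(:encode)
    # Ruby 1.9+: tag the bytes with their real encoding, then transcode.
    raw.dup.force_encoding(from).encode('UTF-8')
  else
    # Ruby 1.8.7: Iconv.conv takes (to, from, string).
    require 'iconv'
    Iconv.conv('UTF-8', from, raw)
  end
end

# Parse UTF-8 text with whichever CSV library this Ruby provides.
def parse_csv(utf8_text)
  if RUBY_VERSION < '1.9'
    require 'rubygems'
    require 'fastercsv'
    # 'u' tells FasterCSV to treat the data as UTF-8.
    FasterCSV.parse(utf8_text, :encoding => 'u')
  else
    require 'csv'
    CSV.parse(utf8_text)
  end
end

# "id,name\n1,é\n" spelled out as UTF-16LE bytes (no byte-order mark).
raw  = "id,name\n1,".unpack('C*').push(0xE9, 0x0A).pack('v*')
rows = parse_csv(to_utf8(raw, 'UTF-16LE'))
p rows
```

The point of the split is that the CSV parser only ever sees UTF-8, so its short list of supported encodings stops mattering.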

Actually detecting the encoding of the document is more problematic, but the command-line version may help you there. If I remember right, it can help identify the encoding. It can also convert between encodings or, if you want, convert to ASCII, either substituting the closest matching characters or dropping the ones it cannot convert.
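One cheap detection step that does not depend on any tool is to look for a byte-order mark at the start of the upload. The sketch below is only that: sniff_bom and decode_with_bom are hypothetical helper names, it targets Ruby 1.9.3 or later (it uses String#byteslice), and a file without a BOM still has to be guessed some other way. The UTF-8 default for that case is an assumption, not something the BOM check can confirm.

```ruby
# Returns [encoding_name, bom_length], or [nil, 0] when no BOM is present.
def sniff_bom(raw)
  boms = [
    [[0xEF, 0xBB, 0xBF], 'UTF-8'],
    [[0xFF, 0xFE],       'UTF-16LE'],
    [[0xFE, 0xFF],       'UTF-16BE']
  ]
  head = raw.unpack('C3') # first three bytes as integers (nil-padded if short)
  boms.each do |bytes, name|
    return [name, bytes.length] if head[0, bytes.length] == bytes
  end
  [nil, 0]
end

# Strip the BOM if there is one and return the text as UTF-8.
# `default` is an assumed encoding for input that carries no BOM.
def decode_with_bom(raw, default = 'UTF-8')
  name, skip = sniff_bom(raw)
  body = raw.byteslice(skip, raw.bytesize - skip)
  body.force_encoding(name || default).encode('UTF-8')
end
```

Note that a UTF-32LE file also starts with FF FE, so this sketch would misread it as UTF-16LE; that case is left out here for brevity.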

Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues involved in handling character sets and multibyte characters, you should read James Gray's blog.
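To give a feel for what 1.9 adds, here is a small sketch using String#valid_encoding? and String#encode. The helper names are invented for illustration, and the Windows-1252 fallback is purely an assumption about where the files come from (it is a common choice for CSVs saved on Western-locale Windows machines), so adjust it to your users.

```ruby
# Treat the bytes as UTF-8 if they validate; otherwise assume `fallback`
# (an assumption, not a detection) and transcode, replacing anything
# that cannot be converted.
def normalize_to_utf8(raw, fallback = 'Windows-1252')
  text = raw.dup.force_encoding('UTF-8')
  return text if text.valid_encoding?
  raw.dup.force_encoding(fallback).encode('UTF-8', :invalid => :replace, :undef => :replace)
end

# Lossy down-conversion: characters with no ASCII equivalent become "?".
def to_ascii_lossy(utf8_text)
  utf8_text.encode('US-ASCII', :invalid => :replace, :undef => :replace, :replace => '?')
end
```

Plain ASCII input passes the UTF-8 check unchanged, since ASCII is a subset of UTF-8, so the same helper covers both of those cases.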
