Using Ruby's FasterCSV and character encodings
Using Ruby 1.8.7, I want to accept CSV uploads into my system. Even though this is an admin application, it seems I can receive several different kinds of CSV. On my Mac, if I export from Excel using the "Windows CSV" option, FasterCSV can read the file by default. On Windows I seem to be getting UTF-16 encoded CSVs, which I haven't figured out how to parse yet.
It seems like a pretty common requirement: let users upload a CSV that could be UTF-8, UTF-16, ASCII and so on, then detect the encoding and parse the file. Has anyone figured this out?
I started looking at UniversalDetector to help me detect the encoding and then Iconv to convert, but this seems tricky and I was hoping someone had already worked it out :)
Comments (1)
According to FasterCSV's docs, the initialize method takes an :encoding option. Because its list of supported encodings is limited, you might want to look into using iconv to pre-process the contents and then pass them to CSV. You can use Ruby's interface to iconv ("Iconv") or the command-line version of it. Iconv is very powerful and flexible, and it can convert UTF-16 among other things.
Actually detecting the encoding of the document is more problematic, but the command-line version can help you there. If I remember right, it can help identify the encoding. It can also convert between encodings or, if you want, convert to ASCII, either substituting the closest matching characters (the //TRANSLIT suffix) or dropping unconvertible ones entirely (the //IGNORE suffix).
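Since detection is the awkward part, here is a minimal sketch of one cheap heuristic: checking the first bytes for a byte-order mark before converting. This is not something FasterCSV does for you, and as far as I know iconv does not either. The name sniff_bom is hypothetical, and a file without a BOM simply yields nil, in which case you still need a detector or a sensible default.

```ruby
# Hypothetical helper: guess an encoding name from a leading byte-order mark.
# Returns nil when there is no BOM (plain ASCII and most UTF-8 files have none).
# Assumption: UTF-32 uploads are not expected; a UTF-32LE BOM also starts
# with FF FE and would be reported as UTF-16LE here.
def sniff_bom(data)
  head = data.unpack('C3') # first three bytes as integers, nil-padded if short
  return 'UTF-8'    if head == [0xEF, 0xBB, 0xBF]
  return 'UTF-16LE' if head[0, 2] == [0xFF, 0xFE]
  return 'UTF-16BE' if head[0, 2] == [0xFE, 0xFF]
  nil
end
```

On 1.8.7 the returned name could then be passed as the source encoding to something like Iconv.conv('UTF-8', name, data), falling back to a detector such as UniversalDetector when the result is nil.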
Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues involved in handling character sets and multibyte characters, you should read James Gray's blog.
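To illustrate what upgrading buys you, here is a rough sketch for Ruby 1.9 or later, where strings carry an encoding and the standard csv library takes over from FasterCSV. It assumes the caller already knows or has guessed the source encoding. parse_upload is a hypothetical helper name, not part of any library.

```ruby
require 'csv' # in Ruby 1.9+ the standard csv library replaces FasterCSV

# Hypothetical helper: transcode raw uploaded bytes to UTF-8 and parse them.
# source_encoding is supplied by the caller (for example from a detector).
def parse_upload(raw, source_encoding)
  # Interpret the bytes as source_encoding and convert them to UTF-8.
  text = raw.encode('UTF-8', source_encoding)
  # Drop a leading byte-order mark if one survived the conversion.
  text = text.sub(/\A\uFEFF/, '')
  CSV.parse(text)
end

# Usage sketch (the path is an assumption):
#   rows = parse_upload(File.binread('upload.csv'), 'UTF-16LE')
```

String#encode raises if the bytes are not valid in the claimed encoding, so a wrong guess usually fails loudly instead of silently producing garbage rows.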