How to avoid being tripped up by the UTF-8 BOM when reading files
I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.
I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby which can handle this correctly, whether a BOM is present or not?
3 Answers
With Ruby 1.9.2 you can use the mode r:bom|utf-8 when opening the file; it doesn't matter whether the BOM is present in the file or not.
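For example (a minimal sketch; file.txt stands in for your feed file):

    # "bom|utf-8" consumes a leading BOM when one is present and reads
    # the rest as UTF-8; without a BOM the file is read unchanged.
    text = File.open("file.txt", "r:bom|utf-8") { |f| f.read }

or, equivalently, with File.read:

    text = File.read("file.txt", encoding: "bom|utf-8")

or

    text = File.read("file.txt", mode: "r:bom|utf-8")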
You may also use the encoding option with other commands:
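    # For example (a sketch; "file.txt" is again a placeholder):
    lines = File.readlines("file.txt", encoding: "bom|utf-8")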
(You get an array with all lines).
Or with CSV:
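    # A sketch with the stdlib CSV class; "file.csv" is a placeholder.
    # The same bom|utf-8 mode string works when opening the CSV.
    require "csv"
    CSV.open("file.csv", "r:bom|utf-8") do |csv|
      csv.each { |row| p row }
    end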
I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM at some point? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream, because BOM handling is so inconsistent from one language/tool/framework to the next.
In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order, it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.
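A minimal sketch of that byte-level check in Ruby (open_without_bom is a hypothetical helper name, not a standard API):

    # Peek at the first three bytes and skip them only when they are
    # the UTF-8 encoding of U+FEFF, i.e. 0xEF 0xBB 0xBF.
    def open_without_bom(path)
      File.open(path, "rb") do |f|
        bom = f.read(3)
        f.rewind unless bom == "\xEF\xBB\xBF".force_encoding(Encoding::BINARY)
        f.set_encoding(Encoding::UTF_8)  # decode the rest as UTF-8
        yield f
      end
    end

    open_without_bom("file.txt") { |f| puts f.read }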
I wouldn't "trust" a file to be encoded as UTF-8 just because a BOM of 0xEF 0xBB 0xBF is present; you might fail. Usually a file that announces itself with a UTF-8 BOM really is UTF-8 encoded, of course. But if, for example, someone has simply prepended a UTF-8 BOM to an ISO-8859 file, decoding it as UTF-8 will break wherever the file contains bytes above 0x7F. You can trust the file if it contains only bytes up to 0x7F, because in that case it is a plain ASCII file, and every ASCII file is also a valid UTF-8 file.
If the file does contain bytes above 0x7F (after the BOM), then to be sure it is properly UTF-8 encoded you have to check for valid multi-byte sequences, and, even when all sequences are valid, also check that each codepoint is encoded with the shortest sequence possible (no overlong forms) and that no codepoint falls in the high- or low-surrogate range. Also check that no sequence is longer than 4 bytes and that the highest codepoint is 0x10FFFF. That upper limit also caps the start byte's payload bits at 0x4 (so the highest legal start byte is 0xF4), and after a 0xF4 start byte the first continuation byte's payload may not exceed 0xF (so that byte may not exceed 0x8F). If all of these checks pass, your UTF-8 BOM is telling the truth.
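In Ruby you don't have to hand-roll all of those checks: String#valid_encoding? rejects ill-formed sequences, overlong forms, surrogate codepoints, and codepoints above U+10FFFF. A minimal sketch (file.txt is a placeholder):

    # Read raw bytes, strip a UTF-8 BOM if present, then verify
    # that the remainder really is well-formed UTF-8.
    bytes = File.open("file.txt", "rb") { |f| f.read }
    bom   = "\xEF\xBB\xBF".force_encoding(Encoding::BINARY)
    bytes = bytes[3..-1] if bytes.start_with?(bom)

    text = bytes.force_encoding(Encoding::UTF_8)
    raise "BOM present, but not valid UTF-8" unless text.valid_encoding?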