Detecting non-ASCII characters in Rails
I am wondering if there's a way to detect non-ASCII characters in Rails.
I have read that Rails does not use Unicode by default, and that characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails? Or can I just specify the range of characters I am expecting?
Is there a plugin for this? Thanks in advance!
Comments (2)
All ideographic language encodings use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).
You may compare the character length to the byte length of the string as a quick and dirty detector. It is probably not foolproof though.
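A minimal sketch of that length comparison, assuming Ruby 1.9+ and a string in a variable-width encoding such as UTF-8. The helper name `multibyte_chars?` is made up for illustration. The last lines show why the check is not foolproof: a single-byte encoding such as ISO-8859-1 stores non-ASCII characters in one byte each, so the counts match. The built-in `String#ascii_only?` does not have that blind spot.

```ruby
# encoding: utf-8

# Hypothetical helper: true when the string has more bytes than characters,
# i.e. at least one character is stored in more than one byte.
def multibyte_chars?(str)
  str.bytesize > str.length
end

multibyte_chars?("hello")   # plain ASCII: one byte per character
multibyte_chars?("日本語")   # UTF-8 CJK: three bytes per character

# Where the comparison falls short: "é" is a single byte in ISO-8859-1,
# so byte count and character count are equal even though it is not ASCII.
latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')
multibyte_chars?(latin1)    # false, although the string is not pure ASCII
latin1.ascii_only?          # false, the built-in check still catches it
```

If the goal is simply "does this string contain anything outside ASCII", `ascii_only?` is probably the more direct test.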
This is pretty easy with 1.9.2, as regular expressions are character-based in 1.9.2 and 1.9.2 knows the difference between bytes and characters top to bottom. You're in Rails, so you should get everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range, so you can just remove everything that isn't between ' ' and '~' when you have UTF-8 encoded text.
There's really no reason to go to all this trouble though. Ruby 1.9 works great with Unicode, as does Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.
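The range filtering described above, keeping only what lies between ' ' and '~', can be sketched as follows. The helper names are made up for illustration, and the code assumes Ruby 1.9+ and valid UTF-8 input (a regexp match against invalid UTF-8 raises an error). The last helper is one way to "specify the range of characters you are expecting" using Unicode script properties rather than raw code point ranges.

```ruby
# encoding: utf-8

# Hypothetical helper: drop every character outside ' '..'~'.
# Note that this also drops tabs and newlines, not just non-ASCII text.
def printable_ascii_only(str)
  str.gsub(/[^ -~]/, '')
end

# Hypothetical helper: true if any character lies outside the ASCII range.
def non_ascii?(str)
  !!(str =~ /[^\x00-\x7F]/)
end

# Hypothetical helper: true if the string contains Chinese or Japanese script
# characters, matched by Unicode script property.
def cjk?(str)
  !!(str =~ /[\p{Han}\p{Hiragana}\p{Katakana}]/)
end

printable_ascii_only("héllo")   # the "é" is removed
non_ascii?("héllo")             # true
cjk?("héllo")                   # false: "é" is non-ASCII but not CJK
cjk?("こんにちは")               # true
```

Stripping is lossy, so detection with `non_ascii?` or `cjk?` is usually the safer first step.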
If you do manage to get text data that isn't UTF-8 then you have some options. If the encoding is ASCII-8BIT or BINARY then you can probably get away with s.force_encoding('utf-8'). If you end up with something other than UTF-8 and ASCII-8BIT then you can use Iconv to re-encode it.
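A rough sketch of those options. The helper `to_utf8` and its fallback parameter are made up for illustration. It uses `String#encode` rather than Iconv, which is a substitution on my part: `String#encode` is built into Ruby 1.9+, while Iconv was later removed from the standard library (Ruby 2.0). The ISO-8859-1 fallback is purely an assumption; the real source encoding has to be known from context.

```ruby
# encoding: utf-8

# Hypothetical helper: return a UTF-8 version of str.
# fallback_encoding is an assumption about what unlabelled bytes really are.
def to_utf8(str, fallback_encoding = 'ISO-8859-1')
  case str.encoding
  when Encoding::UTF_8
    str
  when Encoding::ASCII_8BIT
    # BINARY is an alias for ASCII-8BIT. Relabel the bytes as UTF-8 and
    # keep that if they form valid UTF-8.
    relabeled = str.dup.force_encoding(Encoding::UTF_8)
    if relabeled.valid_encoding?
      relabeled
    else
      # Otherwise treat the bytes as the fallback encoding and transcode.
      str.dup.force_encoding(fallback_encoding).encode(Encoding::UTF_8)
    end
  else
    # Any other labelled encoding: transcode the characters to UTF-8.
    str.encode(Encoding::UTF_8)
  end
end

raw = "\xE6\x97\xA5".dup.force_encoding('BINARY')    # UTF-8 bytes, no label
to_utf8(raw).encoding                                # UTF-8

latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')
to_utf8(latin1).encoding                             # UTF-8, "é" now two bytes
```

`force_encoding` only changes the label on the bytes, while `encode` actually converts them, which is why the two cases are handled differently.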