Detecting non-ASCII characters in Rails

Posted on 2024-12-01 06:59:34

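I was wondering if there is a way to detect non-ASCII characters in Rails.

I have read that Rails does not use Unicode by default, and that characters such as Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails, or to just specify the range of characters I am expecting?

Is there a plugin for this? Thanks in advance!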

堇年纸鸢 2024-12-08 06:59:34

Every encoding used for ideographic languages needs more than one byte per ideographic character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).

You can compare a string's character length with its byte length as a quick and dirty detector. It isn't foolproof, though: in UTF-8 it flags any non-ASCII character, accented Latin letters included, not just ideographs.

class String
  # True if any character takes more than one byte in the string's encoding.
  def multibyte?
    length < bytesize
  end
end

"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
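
Ruby 1.9+ also ships a built-in String#ascii_only?, which answers the original question directly without reopening String. A minimal sketch, assuming the input is UTF-8; the helper name non_ascii? is made up for illustration and is not part of Ruby or Rails:

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Ruby or Rails): true when the string
# contains at least one character outside the 7-bit ASCII range.
def non_ascii?(str)
  !str.ascii_only?
end

non_ascii?("可口可樂") #=> true
non_ascii?("qwerty")   #=> false
```

Unlike the length comparison, this does not depend on how many bytes the encoding happens to use per character.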

一直在等你来 2024-12-08 06:59:34

This is pretty easy with 1.9.2, since regular expressions are character-based there and 1.9.2 knows the difference between bytes and characters from top to bottom. You're in Rails, so you should be getting everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range, so when you have UTF-8 encoded text you can just remove everything that isn't between ' ' and '~', i.e. everything outside printable ASCII (this also drops tabs and newlines):

>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"
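
To detect rather than strip, or to look for one particular script, the same character-class idea works with scan and Unicode script properties. This is only a sketch, assuming UTF-8 strings; non_ascii_chars and cjk? are illustrative names, and the three scripts listed are just an example of "the range of characters I am expecting":

```ruby
# encoding: utf-8

# Illustrative helper: every character outside 7-bit ASCII (U+0000..U+007F).
def non_ascii_chars(str)
  str.scan(/[^\x00-\x7F]/)
end

# Illustrative helper: true if the string contains any Han, Hiragana,
# or Katakana character.
def cjk?(str)
  !!(str =~ /\p{Han}|\p{Hiragana}|\p{Katakana}/)
end

non_ascii_chars("Wheré is µ~pancakes ho元use?") #=> ["é", "µ", "元"]
cjk?("ho元use") #=> true
cjk?("Wheré")   #=> false
```

Script properties such as \p{Han} save you from hard-coding code point ranges, which are spread over several Unicode blocks for Chinese and Japanese.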

There's really no reason to go to all this trouble, though. Ruby 1.9 works great with Unicode, as does Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.

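If you do manage to get text data that isn't UTF-8, then you have some options. If the encoding is ASCII-8BIT (also known as BINARY) and the bytes are really UTF-8, then you can probably get away with s.force_encoding('utf-8'), which only relabels the string without converting any bytes. If you end up with something other than UTF-8 and ASCII-8BIT, then you can re-encode it with Iconv or, on Ruby 1.9+, with the built-in String#encode.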
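
The options above could look roughly like this on Ruby 1.9+. It is a sketch under stated assumptions: to_utf8 is a hypothetical helper, and the ISO-8859-1 fallback is chosen purely for illustration, since the real source encoding has to be known from wherever the data came from. It uses String#encode rather than Iconv, which was dropped from the standard library in Ruby 2.0.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +str+.
# +fallback+ is the encoding ASSUMED when the bytes are not valid UTF-8.
def to_utf8(str, fallback = Encoding::ISO_8859_1)
  utf8 = str.dup.force_encoding(Encoding::UTF_8) # relabel the bytes, no conversion
  return utf8 if utf8.valid_encoding?            # the bytes already were UTF-8

  # Otherwise treat the bytes as +fallback+ and transcode them to UTF-8.
  str.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

binary = "caf\xC3\xA9".dup.force_encoding(Encoding::BINARY) # UTF-8 bytes tagged ASCII-8BIT
latin1 = "caf\xE9".dup.force_encoding(Encoding::BINARY)     # ISO-8859-1 bytes

to_utf8(binary) #=> "café"
to_utf8(latin1) #=> "café"
```

Checking valid_encoding? after force_encoding is what separates "mislabelled UTF-8" from "genuinely another encoding"; guessing the fallback wrong yields valid but incorrect text, so it should come from the data source whenever possible.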
