Detecting non-ASCII characters in Rails

Posted 2024-12-01 06:59:34

I am wondering if there's a way to detect non-ASCII characters in Rails.

I have read that Rails does not use Unicode by default, and that characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails? Or can I just specify the range of characters I am expecting?

Is there a plugin for this? Thanks in advance!

Comments (2)

堇年纸鸢 2024-12-08 06:59:34

Encodings for ideographic scripts use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).

You can compare the character length to the byte length of the string as a quick and dirty detector. It is not foolproof though: in UTF-8 it flags any non-ASCII character (an accented Latin letter such as é is also two bytes), not only Chinese or Japanese text.

class String
  # True when at least one character occupies more than one byte
  # in the string's current encoding.
  def multibyte?
    length < bytesize
  end
end

"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
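
If the goal is simply "does this string contain anything outside ASCII", the same check can be sketched without reopening String. `String#ascii_only?` is part of core Ruby (1.9+); the helper names `non_ascii?` and `non_ascii_chars` below are made up for illustration and are not part of Ruby or Rails.

```ruby
# Hypothetical helpers, not part of Ruby or Rails.

# True if the string contains at least one non-ASCII character.
def non_ascii?(text)
  !text.ascii_only?
end

# Returns the distinct non-ASCII characters, in order of first appearance.
def non_ascii_chars(text)
  text.scan(/[^\x00-\x7F]/).uniq
end

non_ascii?("qwerty")        # false
non_ascii?("可口可樂")       # true
non_ascii_chars("Wheré is µ~pancakes ho元use?")  # ["é", "µ", "元"]
```

Like the byte-length comparison, this treats every non-ASCII character the same way, so it does not distinguish Chinese or Japanese text from accented Latin letters.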

一直在等你来 2024-12-08 06:59:34

This is pretty easy with Ruby 1.9.2, because its regular expressions are character-based and it knows the difference between bytes and characters top to bottom. You're in Rails, so you should get everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range, so when you have UTF-8 encoded text you can just remove everything that isn't between ' ' and '~':

>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"

There's really no reason to go to all this trouble though. Ruby 1.9 works great with Unicode, as does Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.
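
For the other half of the question (detecting Chinese or Japanese characters, or accepting only an expected range), a regex-based sketch could look like the following. It assumes UTF-8 input and relies on the Unicode script properties `\p{Han}`, `\p{Hiragana}` and `\p{Katakana}` that Ruby's regex engine supports. The constant and method names are hypothetical, and the script list is an illustrative choice that does not cover every CJK-related block.

```ruby
# Hypothetical names; the script list below is an illustrative choice.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}]/

# Matches only strings made entirely of printable ASCII (' ' through '~').
PRINTABLE_ASCII_ONLY = /\A[ -~]*\z/

# True if the string contains at least one Han, Hiragana or Katakana character.
def contains_cjk?(text)
  !!(text =~ CJK_PATTERN)
end

# True if every character falls in the expected printable-ASCII range.
def printable_ascii?(text)
  !!(text =~ PRINTABLE_ASCII_ONLY)
end

contains_cjk?("ho元use")      # true
contains_cjk?("Wheré")        # false
printable_ascii?("house?")    # true
printable_ascii?("Wheré")     # false
```

The whitelist form (`printable_ascii?`) fits validation, where anything unexpected should be rejected; the script-property form fits cases where accented Latin text is fine and only CJK input needs special handling.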


If you do manage to get text data that isn't UTF-8 then you have some options. If the encoding is ASCII-8BIT (BINARY is an alias for it) then you can probably get away with s.force_encoding('utf-8'). If you end up with something other than UTF-8 or ASCII-8BIT then you can use Iconv to re-encode it; on Ruby 2.0 and later, where Iconv is no longer part of the standard library, String#encode does the same job.
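
A minimal sketch of those options, using `String#force_encoding`, `String#valid_encoding?` and `String#encode` from core Ruby. The method name `to_utf8` and the ISO-8859-1 fallback are assumptions made for illustration; the real source encoding has to come from knowledge of where the data originated.

```ruby
# Hypothetical helper. The fallback encoding is a guess that must be
# replaced with whatever the data source actually uses.
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  text = raw.dup
  return text if text.encoding == Encoding::UTF_8 && text.valid_encoding?

  if text.encoding == Encoding::ASCII_8BIT || text.encoding == Encoding::UTF_8
    # Unlabelled or mislabelled bytes: first see whether they are valid UTF-8.
    text.force_encoding(Encoding::UTF_8)
    return text if text.valid_encoding?
    # Not valid UTF-8, so relabel the bytes with the assumed source encoding.
    text.force_encoding(fallback)
  end

  # Transcode from the (known or assumed) source encoding to UTF-8.
  text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

to_utf8("caf\xC3\xA9".b)   # binary bytes that are already UTF-8
to_utf8("caf\xE9".b)       # binary bytes treated as ISO-8859-1
```

`force_encoding` only changes the label on the bytes, while `encode` rewrites the bytes themselves, which is why the sketch relabels first and transcodes second.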
