How can I detect certain Unicode characters in a string in Ruby?

Posted on 2024-10-11 19:57:05

Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.

class String
  def contains_cjk?
    ...
  end
end

>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false

I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.

魔法唧唧 2024-10-18 19:57:05

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings= ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each{|s| puts s.contains_cjk?}

#true
#true
#true
#false

\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. Ruby Regexp source.
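Because each script has its own property, the same approach can also report which CJK scripts a string contains. The following is a minimal sketch for Ruby 1.9 or later; the constant CJK_SCRIPTS and the method cjk_scripts_in are hypothetical names chosen for illustration and are not part of the answer above.

```ruby
# encoding: UTF-8

# Hypothetical lookup table (not from the answer): one regexp per
# CJK script, each built on the matching Unicode script property.
CJK_SCRIPTS = {
  :han      => /\p{Han}/,
  :hiragana => /\p{Hiragana}/,
  :katakana => /\p{Katakana}/,
  :hangul   => /\p{Hangul}/
}

# Returns the names of the CJK scripts that occur at least once in str,
# in the order they are listed in CJK_SCRIPTS.
def cjk_scripts_in(str)
  CJK_SCRIPTS.keys.select { |name| str =~ CJK_SCRIPTS[name] }
end

p cjk_scripts_in('日本語')                          # [:han]
p cjk_scripts_in('ひらがなとカタカナ')              # [:hiragana, :katakana]
p cjk_scripts_in('광고 프로그램')                   # [:hangul]
p cjk_scripts_in('Watashi ha bakana gaijin desu.')  # []
```

An empty result corresponds to contains_cjk? returning false.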

假装爱人 2024-10-18 19:57:05

Given my Ruby 1.8.7 constraint, this is the best I could do:

class String
  CJKV_RANGES = [
      (0xe2ba80..0xe2bbbf),
      (0xe2bfb0..0xe2bfbf),
      (0xe38080..0xe380bf),
      (0xe38180..0xe383bf),
      (0xe38480..0xe386bf),
      (0xe38780..0xe387bf),
      (0xe38880..0xe38bbf),
      (0xe38c80..0xe38fbf),
      (0xe39080..0xe4b6bf),
      (0xe4b780..0xe4b7bf),
      (0xe4b880..0xe9bfbf),
      (0xea8080..0xea98bf),
      (0xeaa080..0xeaaebf),
      (0xeaaf80..0xefbfbf),
  ]

  def contains_cjkv?
    each_char do |ch|
      return true if CJKV_RANGES.any? {|range| range.member? ch.unpack('H*').first.hex }
    end
    false
  end
end


strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each {|s| puts s.contains_cjkv? }

#true
#true
#true
#false

Pretty hacktacular, but it works. It actually detects a variety of Indic scripts as well, so it should probably be called contains_asian? instead.

Maybe I should gem this up for other poor I18N hackers stuck with Ruby 1.8.
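The range boundaries in CJKV_RANGES are not Unicode code points. They are the UTF-8 bytes of a character read as a single hexadecimal integer, which is what ch.unpack('H*').first.hex produces. The sketch below shows the mapping; utf8_key and utf8_key_for_codepoint are hypothetical helper names introduced here. The lookup presumably also relies on each_char yielding whole characters rather than bytes, which on Ruby 1.8.7 depends on $KCODE being set to 'u'.

```ruby
# encoding: UTF-8

# Hypothetical helper: the integer the lookup above computes for one
# character, i.e. its UTF-8 bytes read as a single big-endian number.
def utf8_key(char)
  char.unpack('H*').first.hex
end

# Hypothetical helper: the same key derived from a Unicode code point,
# useful for turning block boundaries into range boundaries.
def utf8_key_for_codepoint(codepoint)
  utf8_key([codepoint].pack('U'))
end

printf("%x\n", utf8_key('日'))                   # e697a5
printf("%x\n", utf8_key_for_codepoint(0x4E00))  # e4b880
printf("%x\n", utf8_key_for_codepoint(0x9FFF))  # e9bfbf
```

The last two values match the range 0xe4b880..0xe9bfbf, so that entry corresponds to U+4E00..U+9FFF. By the same arithmetic the final entry, 0xeaaf80..0xefbfbf, spans U+ABC0..U+FFFF, which is considerably wider than the CJK blocks alone.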

空名 2024-10-18 19:57:05

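I wrote a small gem that packages up the method from steenslag's answer above:

https://github.com/jpatokal/script_detector

It can also attempt to distinguish between Japanese, Korean, Simplified Chinese and Traditional Chinese, although because of the complexities of Han unification it only works reliably on larger chunks of text.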
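To illustrate why Han unification makes short strings ambiguous, here is a rough heuristic for Ruby 1.9 or later. It is not the script_detector API; guess_cjk_language is a hypothetical name, and the rules are assumptions made for this sketch: kana suggests Japanese, Hangul suggests Korean, and text containing only Han characters cannot be attributed from script information alone.

```ruby
# encoding: UTF-8

# Hypothetical heuristic (not the gem's implementation): classify a
# string by which CJK scripts it contains.
def guess_cjk_language(str)
  return :japanese if str =~ /[\p{Hiragana}\p{Katakana}]/
  return :korean   if str =~ /\p{Hangul}/
  # Han characters are shared by Chinese and Japanese, so script
  # properties alone cannot decide between them.
  return :han_only if str =~ /\p{Han}/
  :none
end

p guess_cjk_language('日本語のテキスト')   # :japanese
p guess_cjk_language('광고 프로그램')      # :korean
p guess_cjk_language('日本語')             # :han_only
p guess_cjk_language('艾弗森将退出篮坛')   # :han_only
```

The Japanese word and the Chinese sentence both come back as :han_only, which is why a detector needs more text, or more than script information, to tell them apart.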

箹锭⒈辈孓 2024-10-18 19:57:05

Ruby 1.8 solution based on this code and using the API from Josh Glover's solution on this thread:

class String
  CJKV_RANGES = [
    (0x4E00..0x9FFF),
    (0x3400..0x4DBF),
    (0x20000..0x2A6DF),
    (0x2A700..0x2B73F),
  ]

  def contains_cjkv?
    unpack("U*").any? { |char|
      CJKV_RANGES.any? { |range| range.member?(char) }
    }
  end
end
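Note that the four ranges above cover only Han ideograph blocks (CJK Unified Ideographs and Extensions A, B and C), so a string written purely in Hangul or kana would return false. A possible extension, sketched below, adds the Hiragana, Katakana, Hangul Jamo and Hangul Syllables blocks. The names CJK_CODEPOINT_RANGES and contains_cjk_codepoint? are hypothetical, and the check is written as a standalone method rather than a String patch to keep the sketch self-contained.

```ruby
# encoding: UTF-8

# Unicode block ranges, given as code points. This is an assumed
# selection for illustration, not an exhaustive list of CJK blocks.
CJK_CODEPOINT_RANGES = [
  (0x1100..0x11FF),   # Hangul Jamo
  (0x3040..0x309F),   # Hiragana
  (0x30A0..0x30FF),   # Katakana
  (0x3400..0x4DBF),   # CJK Unified Ideographs Extension A
  (0x4E00..0x9FFF),   # CJK Unified Ideographs
  (0xAC00..0xD7AF),   # Hangul Syllables
  (0x20000..0x2A6DF), # CJK Unified Ideographs Extension B
  (0x2A700..0x2B73F)  # CJK Unified Ideographs Extension C
]

# Decodes the UTF-8 string into code points and reports whether any of
# them falls inside one of the ranges above.
def contains_cjk_codepoint?(str)
  str.unpack('U*').any? do |codepoint|
    CJK_CODEPOINT_RANGES.any? { |range| range.include?(codepoint) }
  end
end

strings = ['日本', '광고 프로그램', 'かな', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts contains_cjk_codepoint?(s) }

#true
#true
#true
#false
```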
