How can I detect certain Unicode characters in a string in Ruby?
Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.
class String
  def contains_cjk?
    ...
  end
end
>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false
I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.
(ruby 1.9.2)
\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
Wow. Ruby Regexp source.
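As a minimal sketch of how the script properties listed above could answer the question on Ruby 1.9 or later: the constant name `CJK_SCRIPT_PATTERN` is made up for illustration, and treating Han, Hiragana, Katakana, and Hangul as "CJK" is an assumption, not a definition from the Ruby docs. This will not work on 1.8.7, whose regex engine does not understand `\p{}`.

```ruby
# encoding: utf-8
# Sketch for Ruby 1.9+ only. Assumes the receiver is a valid UTF-8 string.
class String
  # Hypothetical constant: matches any single character whose Unicode
  # script is Han, Hiragana, Katakana, or Hangul.
  CJK_SCRIPT_PATTERN = /\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/

  # Returns true if at least one character belongs to one of those scripts.
  def contains_cjk?
    !!(self =~ CJK_SCRIPT_PATTERN)
  end
end

puts '日本語'.contains_cjk?
puts 'Watashi ha bakana gaijin desu.'.contains_cjk?
```

Other scripts from the list, such as Bopomofo or Yi, could be added to the pattern if they should count as well.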
Given my Ruby 1.8.7 constraint, this is the best I could do:
Pretty hacktacular, but it works. It actually detects a variety of Indic scripts as well, so it should probably really be called contains_asian?
Maybe I should gem this up for other poor I18N hackers stuck with Ruby 1.8.
I've written a little gem that packages up the approach in steenslag's answer above:
https://github.com/jpatokal/script_detector
It can also take a stab at differentiating between Japanese, Korean, simplified Chinese and traditional Chinese, although due to the complexities of Han unification it only works reliably with large slabs of text.
Ruby 1.8 solution based on this code and using the API from Josh Glover's solution on this thread:
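For readers who want something concrete to start from, here is a rough sketch of a codepoint-range check that avoids `\p{}` entirely. It is not the code referred to above. The constant name `CJK_CODEPOINT_RANGES` is hypothetical, the selection of Unicode blocks is illustrative rather than exhaustive, and the string is assumed to be valid UTF-8. `String#unpack('U*')` is used because it exists in Ruby 1.8.7 as well as in later versions.

```ruby
# encoding: utf-8
# Sketch only: decode the UTF-8 string into codepoints and test each one
# against a hand-picked list of Unicode blocks.
class String
  # Hypothetical constant; an illustrative, non-exhaustive block list.
  CJK_CODEPOINT_RANGES = [
    0x1100..0x11FF,   # Hangul Jamo
    0x2E80..0x2FDF,   # CJK Radicals Supplement, Kangxi Radicals
    0x3040..0x30FF,   # Hiragana, Katakana
    0x3100..0x312F,   # Bopomofo
    0x3130..0x318F,   # Hangul Compatibility Jamo
    0x3400..0x4DBF,   # CJK Unified Ideographs Extension A
    0x4E00..0x9FFF,   # CJK Unified Ideographs
    0xAC00..0xD7AF,   # Hangul Syllables
    0xF900..0xFAFF,   # CJK Compatibility Ideographs
    0x20000..0x2A6DF  # CJK Unified Ideographs Extension B
  ]

  # Returns true if any codepoint falls inside one of the ranges above.
  def contains_cjk?
    unpack('U*').any? do |codepoint|
      CJK_CODEPOINT_RANGES.any? { |range| range.include?(codepoint) }
    end
  end
end

puts '광고 프로그램'.contains_cjk?
puts 'Watashi ha bakana gaijin desu.'.contains_cjk?
```

Because the check works on explicit ranges, narrowing or widening what counts as CJK is a matter of editing the list.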