How to specify a Regexp for Unicode Cyrillic characters in Ruby 1.9
# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding               # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)   # finds all Cyrillic characters
str2.gsub!(/\w/u, '')         # removes only the Latin characters
puts str2
The question is: why does \w ignore Cyrillic characters?
I have installed the latest Ruby package from http://rubyinstaller.org/.
Here is the output of ruby -v:
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]
As far as I know, the Oniguruma regular expression library in 1.9 has full support for Unicode characters.
Comments (1)
This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII Unicode character. You probably want to use [[:alnum:]] instead, which includes all Unicode alphabetic and numeric characters. See also [[:word:]] and [[:alpha:]].
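To make the difference concrete, here is a minimal sketch that runs the three kinds of pattern against the string from the question. The variable names are made up for illustration, and the results assume a Ruby in which \w is ASCII-only, as the documentation quoted above describes.

```ruby
# coding: utf-8
str = "asdfМикимаус"

# \w is ASCII-only, so only the Latin letters are removed.
ascii_stripped = str.gsub(/\w/, '')

# POSIX bracket expressions are Unicode-aware: both scripts are removed.
all_stripped = str.gsub(/[[:alnum:]]/, '')

# Unicode script property: only the Cyrillic letters are removed.
cyrillic_stripped = str.gsub(/\p{Cyrillic}/, '')

puts ascii_stripped      # Микимаус
puts all_stripped        # (empty line)
puts cyrillic_stripped   # asdf
```

The /u flag from the question is left out on purpose: it fixes the regexp's encoding to UTF-8 but does not widen what \w matches.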