Scanning for Unicode numerals in a string with \d
According to the Oniguruma documentation, the \d character type matches:
decimal digit char
Unicode: General_Category -- Decimal_Number
However, scanning for \d in a string with all the Decimal_Number characters results in only Latin 0-9 digits being matched:
#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')
puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…
p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?
Comments (3)
Noted by Brian Candler on ruby-talk:
\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers.
Reading up on \w in the same Oniguruma doc, we see text marked "Not Unicode" alongside text marked "Unicode". In light of the real behavior of Ruby and that "Not Unicode" text, it would appear that the documentation is describing two modes (a Unicode mode and a Not Unicode mode) and that Ruby is operating in the Not Unicode mode.
This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion.
Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc
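The [[:digit:]] workaround can be sketched roughly as follows. The sample string is an assumption chosen for illustration (one ASCII digit, one Arabic-Indic digit, one Devanagari digit), written with \u escapes so the snippet does not need to download the full Nd list.

```ruby
# Illustrative sample (not from the original post):
# ASCII 0, ARABIC-INDIC DIGIT THREE (U+0663), DEVANAGARI DIGIT NINE (U+096F).
# All three belong to General_Category Decimal_Number (Nd).
sample = "0\u0663\u096F"

ascii_only = sample.scan(/\d/)          # \d is ASCII-only by default
all_digits = sample.scan(/[[:digit:]]/) # the POSIX bracket follows Unicode rules

p ascii_only.length  # => 1
p all_digits.length  # => 3
```

The only change from the question's code is swapping \d for [[:digit:]]; the string and its encoding stay the same.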
Try the Unicode character class \p{N} instead. That matches all Unicode digits. No idea why \d isn't working.
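A short sketch of \p{N} in use, with one caveat that may matter for your data: \p{N} covers the whole Unicode Number category, so besides decimal digits it also matches characters such as Roman numerals and fractions, while \p{Nd} is limited to Decimal_Number, the category the question asks about. The sample characters are assumptions picked for illustration, and the snippet assumes a UTF-8 source file (the default since Ruby 2.0).

```ruby
# Illustrative sample (not from the original post):
# ASCII 7 and ARABIC-INDIC DIGIT THREE (both Nd),
# ROMAN NUMERAL EIGHT U+2167 (Nl), VULGAR FRACTION ONE HALF U+00BD (No).
mixed = "7\u0663\u2167\u00BD"

any_number   = mixed.scan(/\p{N}/)  # every character in the Number category
decimal_only = mixed.scan(/\p{Nd}/) # Decimal_Number only

p any_number.length    # => 4
p decimal_only.length  # => 2
```

If only digits that can form decimal numbers are wanted, \p{Nd} is probably the closer match to what \d was expected to do.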
\d will only match ASCII digits by default. You can manually turn on Unicode matching in a regex using the (counter-intuitive) (?u) syntax.
Alternatively, you can use "posix" or "unicode property" style in your regex, which don't require you to manually turn on Unicode matching.
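The code samples that originally accompanied this answer are not present here, so the following is only a sketch of the three approaches it names, not the author's own examples. It assumes Ruby 2.0 or later, where the regexp engine is Onigmo and the (?u) inline option is available; the sample string is likewise an assumption chosen for illustration.

```ruby
# Illustrative sample (not from the original post):
# ASCII 4, ARABIC-INDIC DIGIT TWO (U+0662), DEVANAGARI DIGIT FIVE (U+096B).
text = "4\u0662\u096B"

default_mode   = text.scan(/\d/)          # default: ASCII digits only
unicode_mode   = text.scan(/(?u)\d/)      # (?u) switches \d to Unicode rules
posix_style    = text.scan(/[[:digit:]]/) # "posix" style bracket expression
property_style = text.scan(/\p{Nd}/)      # "unicode property" style

p default_mode.length
p unicode_mode.length
p posix_style.length
p property_style.length
```

On Ruby 1.9, as used in the question, the (?u) line should not be expected to work; the posix and property forms are the safer choice there.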
You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post:
https://idiosyncratic-ruby.com/30-regex-with-class.html