Scanning for Unicode digits in a string with \d

Posted 2024-11-28 13:40:38

According to the Oniguruma documentation, the \d character type matches:

decimal digit char
Unicode: General_Category -- Decimal_Number

However, scanning for \d in a string containing all the Decimal_Number characters matches only the Latin digits 0-9:

#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?

凹づ凸ル 2024-12-05 13:40:38

Noted by Brian Candler on ruby-talk:

\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
    Not Unicode: alphanumeric, "_" and multibyte char.  
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
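
The same comparison can be run for digits without fetching the code point list over the network. This is only a minimal sketch: the nine code points below are an arbitrary sample chosen for illustration (three ASCII, three Arabic-Indic, three Devanagari digits), not the full Decimal_Number set used in the question.

```ruby
# Build a small UTF-8 sample from explicit code points (illustrative subset only).
# U+0030..U+0032: ASCII digits, U+0660..U+0662: Arabic-Indic, U+0966..U+0968: Devanagari
sample = [0x30, 0x31, 0x32, 0x660, 0x661, 0x662, 0x966, 0x967, 0x968].pack('U*')

ascii_only = sample.scan(/\d/)           # \d stays ASCII-only by default
all_digits = sample.scan(/[[:digit:]]/)  # the POSIX bracket follows Unicode Decimal_Number

p ascii_only         #=> ["0", "1", "2"]
p all_digits.length  #=> 9
```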

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc


标点 2024-12-05 13:40:38

Try the Unicode character class \p{N} instead. It matches all Unicode numerals. I don't know why \d doesn't work.
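
One caveat that may matter here: \p{N} covers every Unicode Number category (Nd, Nl and No), whereas the question asks about Decimal_Number, which is \p{Nd} only. The sketch below illustrates the difference; the four sample characters and the `samples`/`results` names are chosen purely for illustration.

```ruby
# Sample characters, one per line, with their Unicode general category.
samples = {
  "ascii digit"     => "7",       # Nd
  "arabic-indic"    => "\u0667",  # Nd (ARABIC-INDIC DIGIT SEVEN)
  "roman numeral"   => "\u2167",  # Nl (ROMAN NUMERAL EIGHT)
  "vulgar fraction" => "\u00BD"   # No (VULGAR FRACTION ONE HALF)
}

# For each sample, record [matches \p{N}?, matches \p{Nd}?].
results = {}
samples.each do |label, ch|
  results[label] = [!!(ch =~ /\p{N}/), !!(ch =~ /\p{Nd}/)]
  puts "#{label}: \\p{N}=#{results[label][0]} \\p{Nd}=#{results[label][1]}"
end
```

If only characters usable as decimal digits are wanted, \p{Nd} is probably the closer fit; \p{N} also accepts letter-like numerals and fractions.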


め七分饶幸 2024-12-05 13:40:38

By default, \d matches only ASCII digits. You can manually turn on Unicode matching in a regex using the (counter-intuitive) (?u) syntax:

"٣".match(/(?u)\d/) # => #<MatchData "٣">

Alternatively, you can use "posix" or "unicode property" style in your regex, which don't require you to manually turn on Unicode matching:

/[[:digit:]]/ # posix style
/\p{Nd}/ # unicode property/category style

You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post:
https://idiosyncratic-ruby.com/30-regex-with-class.html
