Scanning for Unicode digits in a string with \d

Posted 2024-11-28 13:40:38

According to the Oniguruma documentation, the \d character type matches:

decimal digit char
Unicode: General_Category -- Decimal_Number

However, scanning for \d in a string containing all the Decimal_Number characters matches only the Latin digits 0-9:

#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?

Comments (3)

凹づ凸ル 2024-12-05 13:40:38

Noted by Brian Candler on ruby-talk:

  • \w only matches ASCII letters, digits, and the underscore, while [[:alpha:]] matches the full set of Unicode letters.
  • \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode decimal digits.
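
The second point can be checked with a minimal sketch. The sample string is my own choice, not from the original post, and it assumes Ruby 1.9 or later with a UTF-8 string:

```ruby
# Hypothetical sample: ASCII "4", ARABIC-INDIC DIGIT TWO (U+0662),
# and DEVANAGARI DIGIT SEVEN (U+096D). All three are in category Nd.
sample = "4\u0662\u096D"

ascii_only = sample.scan(/\d/)           # \d stays ASCII-only by default
all_digits = sample.scan(/[[:digit:]]/)  # the POSIX bracket is Unicode-aware

p ascii_only  #=> ["4"]
p all_digits  #=> ["4", "٢", "७"]
```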

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
    Not Unicode: alphanumeric, "_" and multibyte char.  
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
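
A small sketch of that observation about the /u flag, under the assumption of Ruby 1.9 or later. The sample character is my own choice. As far as I can tell, /u only fixes the regexp's encoding to UTF-8 and leaves the meaning of \d unchanged:

```ruby
# Hypothetical sample: ARABIC-INDIC DIGIT THREE (U+0663), category Nd.
digit = "\u0663"

plain  = /\d/
with_u = /\d/u

p plain.encoding.name     #=> "US-ASCII"
p with_u.encoding.name    #=> "UTF-8"
p with_u.fixed_encoding?  #=> true
p digit.match(with_u)     #=> nil (the /u flag does not widen \d)
```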

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc

标点 2024-12-05 13:40:38

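Try the Unicode character class \p{N} instead. That matches all Unicode number characters, which is a superset of the decimal digits; \p{Nd} is the narrower class for the Decimal_Number category. No idea why \d isn't working.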
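
To illustrate how \p{N} relates to the Decimal_Number category from the question, here is a hedged sketch. The sample characters are my own picks, not from the thread:

```ruby
# Hypothetical sample, one character per line of this comment:
#   "7"     ASCII digit                        (Nd)
#   U+0663  ARABIC-INDIC DIGIT THREE           (Nd)
#   U+2167  ROMAN NUMERAL EIGHT                (Nl)
#   U+00BD  VULGAR FRACTION ONE HALF           (No)
sample = "7\u0663\u2167\u00BD"

numbers  = sample.scan(/\p{N}/)   # every Number category: Nd, Nl, No
decimals = sample.scan(/\p{Nd}/)  # Decimal_Number only

p numbers.length  #=> 4
p decimals        #=> ["7", "٣"]
```

If only decimal digits are wanted, \p{Nd} is probably the closer fit than \p{N}.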

め七分饶幸 2024-12-05 13:40:38

By default, \d matches only ASCII digits. You can manually turn on Unicode matching in a regex using the (counter-intuitive) (?u) syntax:

"𝟛".match(/(?u)\d/) # => #<MatchData "𝟛">

Alternatively, you can use "posix" or "unicode property" style in your regex, which don't require you to manually turn on Unicode matching:

/[[:digit:]]/ # posix style
/\p{Nd}/ # unicode property/category style
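
As a rough comparison of the three forms mentioned in this answer, the sketch below runs each against the same string. The sample is my own, and the (?u) form assumes Ruby 2.0 or later:

```ruby
# Hypothetical sample: a letter, ASCII "1", ARABIC-INDIC DIGIT THREE (U+0663),
# and DEVANAGARI DIGIT ONE (U+0967).
sample = "a1\u0663\u0967"

inline_u = sample.scan(/(?u)\d/)       # inline Unicode option
posix    = sample.scan(/[[:digit:]]/)  # POSIX bracket
property = sample.scan(/\p{Nd}/)       # Unicode property

p inline_u == posix && posix == property  #=> true
p property                                #=> ["1", "٣", "१"]
```

The POSIX and property forms do not depend on an inline option, so they may be the more portable choice when a pattern is assembled from several pieces.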

You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post:
https://idiosyncratic-ruby.com/30-regex-with-class.html
