Python regex: can a pattern with re.ASCII still match Unicode characters?

Posted 2025-02-06 16:32:57

I am new to Python regex and am trying to match non-whitespace ASCII characters in Python.

The following is my code:

import re

p = re.compile(r"[\S]{2,3}", re.ASCII)

p.search('1234')    # returns a match, as expected

p.search('你好吗')  # also returns a match, but why?

I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?

心奴独伤 2025-02-13 16:32:57

The re.A flag only affects what shorthand character classes match.

In Python 3.x, shorthand character classes are Unicode-aware for str patterns: the behavior that Python 2.x enabled with re.UNICODE/re.U is ON by default. That means:

\d - Matches any Unicode decimal digit (that is, any character in the Unicode character category [Nd]).
\D - Matches any character that is not a decimal digit (so, every character outside the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the string "My name is Виктор".)
\W - Matches any character that is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter, digit, or the underscore.)
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.).
\S - Matches any character that is not a whitespace character. (So, no match for NEL, hard spaces, etc.)
\b - Word boundaries match locations between Unicode letters/digits and non-letters/digits, or between a letter/digit and the start/end of the string.
\B - Non-word boundaries match locations between two Unicode letters/digits, between two non-letters/digits, or between a Unicode non-letter/digit and the start/end of the string.
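
The default, Unicode-aware behavior described in the list above can be checked with a short sketch. The sample strings and variable names below are chosen purely for illustration and are not part of the original question.

```python
import re

# Illustrative sample data (assumptions, not taken from the question).
hindi_digits = "\u0967\u0968\u0969"   # DEVANAGARI DIGITS ONE, TWO, THREE (category Nd)
sentence = "My name is Виктор"
nbsp = "\xa0"                          # NO-BREAK SPACE ("hard space")
nel = "\x85"                           # NEXT LINE (NEL)

# No flags: str patterns use full Unicode matching in Python 3.
digits_match = re.fullmatch(r"\d+", hindi_digits) is not None
words = re.findall(r"\w+", sentence)
nbsp_is_space = re.fullmatch(r"\s", nbsp) is not None
nel_is_space = re.fullmatch(r"\s", nel) is not None
nbsp_is_nonspace = re.search(r"\S", nbsp) is not None

print(digits_match)       # True
print(words)              # ['My', 'name', 'is', 'Виктор']
print(nbsp_is_space)      # True
print(nel_is_space)       # True
print(nbsp_is_nonspace)   # False
```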

If you want to disable this behavior, you use re.A or re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
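
As a minimal sketch of the documentation excerpt above, the snippet below compares the flag argument with the inline (?a) form and shows that a bytes pattern is ASCII-only for these classes regardless of the flag. The sample text and variable names are illustrative assumptions.

```python
import re

word = "Виктор"  # illustrative sample, not from the question

# The flags argument and the inline flag (?a) select the same ASCII-only matching.
with_flag = re.findall(r"\w+", word, flags=re.ASCII)
with_inline = re.findall(r"(?a)\w+", word)

# re.A is the short spelling of re.ASCII.
same_flag = (re.A == re.ASCII)

# Bytes patterns match \w against ASCII only, with or without re.ASCII.
data = "Wiktor Виктор".encode("utf-8")
bytes_words = re.findall(rb"\w+", data)
bytes_words_ascii = re.findall(rb"\w+", data, flags=re.ASCII)

print(with_flag, with_inline)   # [] []
print(same_flag)                # True
print(bytes_words)              # [b'Wiktor']
```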

That means that:

\d = [0-9] - it no longer matches Hindi, Bengali, etc. digits.
\D = [^0-9] - it matches any character other than an ASCII digit, so it now also matches the non-ASCII digits that (?u)(?![0-9])\d would match.
\w = [A-Za-z0-9_] - it only matches ASCII word characters now: Wiktor is matched by \w+, but Виктор is not.
\W = [^A-Za-z0-9_] - it matches any character except ASCII letters, digits and _ (i.e. it matches 你好吗, Виктор, etc.).
\s = [ \t\n\r\f\v] - it matches a regular space, tab, linefeed, carriage return, form feed and vertical tab.
\S = [^ \t\n\r\f\v] - it matches any character other than a space, tab, linefeed, carriage return, form feed and vertical tab, so it matches all Unicode letters, digits and punctuation, and also Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) returns '{\xa0} ': the \S now matches the hard space. That is why [\S]{2,3} still matches 你好吗 under re.ASCII.
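
A short sketch of the ASCII-mode behavior described above, applied to the pattern from the question. The explicit range [!-~] is not part of the original answer; it is one hypothetical way to restrict matching to ASCII, assuming that "non-whitespace ASCII" means the printable characters from "!" to "~" (so ASCII control characters are excluded as well).

```python
import re

# Pattern from the question: under re.ASCII, \S means "not ASCII whitespace",
# which still includes every non-ASCII character.
p = re.compile(r"[\S]{2,3}", re.ASCII)
m = p.search("你好吗")
print(m.group())                 # 你好吗

# The hard space (U+00A0) is not ASCII whitespace, so \S matches it in ASCII mode.
hard_space = re.sub(r"\S+", r"{\g<0>}", "\xa0 ", flags=re.A)
print(repr(hard_space))          # '{\xa0} '

# Hypothetical alternative (an assumption, not from the answer):
# an explicit range of printable, non-space ASCII characters.
ascii_only = re.compile(r"[!-~]{2,3}")
a1 = ascii_only.search("1234")
a2 = ascii_only.search("你好吗")
print(a1.group())                # 123
print(a2)                        # None
```

With the explicit range no flag is needed, because the character class no longer relies on a shorthand whose meaning depends on re.ASCII.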