Python regex: a pattern compiled with re.ASCII can still match Unicode characters?
I am new to Python regex and am trying to match non-whitespace ASCII characters in Python.
Here is my code:
import re
p = re.compile(r"[\S]{2,3}", re.ASCII)
p.search('1234')   # returns a match
p.search('你好吗')  # also returns a match, but why?
I specified ASCII mode in re.compile, but p.search('你好吗') still returns a result. What am I doing wrong here?
Comments (1)
The re.A flag only affects what the shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode-aware: the behavior of the Python 2.x re.UNICODE / re.U flag is ON by default. That means:
\d - Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]).
\D - Matches any character which is not a decimal digit (so, all characters other than those in the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore (so, \w+ matches each word in a "My name is Виктор" string).
\W - Matches any character which is not a word character. This is the opposite of \w (so, it will not match any Unicode letter or digit).
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.).
\S - Matches any character which is not a whitespace character (so, no match for NEL, hard space, etc.).
\b - Word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
\B - Non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
If you want to disable this behavior, you use re.A or re.ASCII. That means that:
\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits.
\D = [^0-9] - and matches any character other than an ASCII digit (i.e. it now also matches the non-ASCII digits that \d no longer matches).
\w = [A-Za-z0-9_] - and it only matches ASCII word characters now: Wiktor is matched with \w+, but Виктор is not.
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.).
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab.
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation and Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{\xa0} ': as you see, \S now matches hard spaces.
That is why [\S]{2,3} still matches '你好吗' even with re.ASCII: none of those characters is an ASCII whitespace character.
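A minimal sketch of the behavior described in this answer, plus one possible way to get what the question asks for. The explicit range [!-~] (printable ASCII without the space) is my assumption about what "non-whitespace ASCII characters" should mean; it deliberately excludes control characters, so adjust the range if you need them. The variable names are illustrative only.

```python
import re

# \S is a negated class, so under re.ASCII it still accepts
# anything that is not one of the six ASCII whitespace characters.
ascii_S = re.compile(r"\S{2,3}", re.ASCII)
cjk_match = ascii_S.search("你好吗")  # a match object, not None

# Positive classes are the ones that re.ASCII actually narrows.
words_unicode = re.findall(r"\w+", "Wiktor Виктор")
words_ascii = re.findall(r"\w+", "Wiktor Виктор", re.ASCII)

# "\u0967" is DEVANAGARI DIGIT ONE (Unicode category Nd).
digits_unicode = re.findall(r"\d", "1\u0967")
digits_ascii = re.findall(r"\d", "1\u0967", re.ASCII)

# Assumed fix: spell out the ASCII range instead of relying on \S.
# [!-~] covers U+0021..U+007E, i.e. printable ASCII except the space.
printable_ascii = re.compile(r"[!-~]{2,3}")

print(cjk_match)                          # matches despite re.ASCII
print(words_unicode, words_ascii)
print(digits_unicode, digits_ascii)
print(printable_ascii.search("1234"))     # matches '123'
print(printable_ascii.search("你好吗"))    # None
```

With an explicit character range, the pattern no longer depends on the re.ASCII flag at all, since no shorthand class is involved.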