Matching case-sensitive unicode strings with regular expressions in Python
Suppose I want to match a lowercase letter followed by an uppercase letter, I could do something like
re.compile(r"[a-z][A-Z]")
Now I want to do the same thing for unicode strings, i.e. match something like 'aÅ' or 'yÜ'.
Tried
re.compile(r"[a-z][A-Z]", re.UNICODE)
but that does not work.
Any clues?
Comments (1)
This is hard to do with Python regex because the current implementation doesn't support Unicode property shortcuts like \p{Lu} and \p{Ll}. The class [A-Za-z] will of course only match ASCII letters, regardless of whether the Unicode option is set or not.
So until the re module is updated (or you install the regex package currently in development), you either need to do it programmatically (iterate through the string and call char.islower() / char.isupper() on the characters), or specify all the Unicode code points manually, which probably isn't worth the effort...
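For what it's worth, here is a minimal sketch of both workarounds using only the standard library. It assumes Python 3, where str is already Unicode. The names find_lower_upper, char_class and LOWER_UPPER are made up for illustration and are not part of any library. The second approach builds the "all code points" character class automatically from unicodedata instead of typing it by hand.

```python
import re
import sys
import unicodedata


def find_lower_upper(text):
    """Return every adjacent pair made of a lowercase then an uppercase character."""
    return [
        text[i:i + 2]
        for i in range(len(text) - 1)
        if text[i].islower() and text[i + 1].isupper()
    ]


def char_class(category):
    """Build a regex character class holding every code point in a Unicode category."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    members = (re.escape(c) for c in chars if unicodedata.category(c) == category)
    return "[" + "".join(members) + "]"


# Ll = lowercase letter, Lu = uppercase letter (Unicode general categories).
# Building the classes scans the whole code point range, so do it once and reuse.
LOWER_UPPER = re.compile(char_class("Ll") + char_class("Lu"))

print(find_lower_upper("aÅ xY yÜ ab"))
print(LOWER_UPPER.findall("aÅ xY yÜ ab"))
```

Note that the two are not strictly equivalent: islower() / isupper() follow the Unicode Lowercase / Uppercase properties, which cover a few characters outside the Ll / Lu categories, and findall() returns non-overlapping matches while the loop reports every adjacent pair.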