Regex handling of a special character (a letter with a line on top)
I am trying to write a regex in Python that replaces every non-ASCII character with an underscore, but if one of the characters is "S̄" (an 'S' with a line on top), the result contains an extra 'S'. Is there a way to account for this character as well? I believe it is a valid UTF-8 character, but it is not ASCII.
Here's the code:
import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))
I would expect it to output:
ra_ndom_word_
But instead I get:
ra_ndom_wordS__
Comments (2)
The reason Python works this way is that you are actually looking at two distinct characters: there is an S, and it is followed by a combining macron, U+0304.
In the general case, if you want to replace a base character together with the sequence of combining characters that follows it with a single underscore, you can walk the string with the standard unicodedata module.
By the by, \W does not need square brackets around it; it is already a regex character class.
Python's re module lacks support for important Unicode properties, though if you really want to use specifically a regex for this, the third-party regex library has proper support for Unicode categories.
"Ll" is the Unicode category for lowercase letters and "Lu" is the one for uppercase letters. There are other Unicode L categories, so you may want to tweak the check to suit your requirements (unicodedata.category(char).startswith("L"), maybe?); see also https://www.fileformat.info/info/unicode/category/index.htm
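As a rough illustration of the combining-character handling described in this answer, the sketch below walks the string once and folds each combining mark into the character before it. The function name replace_non_ascii_letters and the decision to keep only ASCII letters in the "Ll" and "Lu" categories are assumptions made for this example, not code taken from the answer.

```python
import unicodedata


def replace_non_ascii_letters(line: str) -> str:
    """Replace everything except ASCII letters with "_".

    A base character followed by combining marks becomes a single "_".
    (Hypothetical helper; name and policy are assumptions.)
    """
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # Non-zero combining class: fold the mark into the previous
            # character so the base letter does not survive on its own.
            if cleaned:
                cleaned[-1] = "_"
            else:
                cleaned.append("_")
            continue
        keep = char.isascii() and unicodedata.category(char) in ("Ll", "Lu")
        cleaned.append(char if keep else "_")
    return "".join(cleaned)


# "S\u0304" is 'S' followed by U+0304 COMBINING MACRON.
print(replace_non_ascii_letters("ra*ndom wordS\u0304"))  # ra_ndom_word_
```

Note that this version also replaces digits and literal underscores, since neither is in "Ll" or "Lu"; widening the category test is one way to keep them.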
You can use the following script to get the desired output:
Output:
This approach works with other non-ASCII characters as well:
Output:
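One possible shape for such a script, as a sketch only: it uses the standard re module with the re.ASCII flag so that \W covers every non-ASCII character, and it first consumes any character together with the combining marks that follow it. The function name to_ascii_word_chars is hypothetical, and treating only U+0300 to U+036F (the Combining Diacritical Marks block) as combining marks is a simplifying assumption; other combining blocks exist.

```python
import re

# Assumption: only U+0300..U+036F count as combining marks here.
# First alternative: any character followed by one or more such marks.
# Second alternative: any character outside [A-Za-z0-9_] (re.ASCII mode).
PATTERN = re.compile(r".[\u0300-\u036f]+|\W", flags=re.ASCII | re.DOTALL)


def to_ascii_word_chars(line: str) -> str:
    """Replace non-ASCII and non-word characters with "_" (hypothetical helper)."""
    return PATTERN.sub("_", line)


print(to_ascii_word_chars("ra*ndom wordS\u0304"))   # ra_ndom_word_
print(to_ascii_word_chars("na\u00efve caf\u00e9"))  # na_ve_caf_
```

Because the first alternative is tried before \W, the base letter and its marks are replaced as one unit, which is what removes the stray 'S' from the question's output.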