Regex handling of a special character (line on top)

Posted on 2025-01-19 06:53:03 · 350 characters · 1 view · 0 comments

I was trying to write a regex in Python to replace all non-ASCII characters with an underscore, but if one of the characters is "S̄" (an 'S' with a line on top), the output keeps an extra 'S'. Is there a way to account for this character as well? I believe it's a valid UTF-8 character, but not ASCII.

Here's the code:

import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))

I would expect it to output:

ra_ndom_word_

But instead I get:

ra_ndom_wordS__


眉黛浅 2025-01-26 06:53:03

Python behaves this way because you are actually looking at two distinct characters: an S followed by a combining macron, U+0304. \W matches the combining mark but not the S, so the S survives the substitution.
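To see the two code points for yourself, a minimal sketch using only the standard unicodedata module might look like the following. The variable names are illustrative, and the string is written with an explicit \u0304 escape on the assumption that the input really is "S" plus a combining macron.

```python
import unicodedata

# "S" followed by U+0304, written with an escape so both code points are visible
text = "S\u0304"

# One visible glyph, but two code points
code_points = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]
for code_point, name in code_points:
    print(code_point, name)

# A non-zero canonical combining class marks a combining character
is_combining = [unicodedata.combining(ch) != 0 for ch in text]
print(is_combining)
```

Only the second code point reports a non-zero combining class, which is what a cleanup loop can key on.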

In the general case, if you want to replace a base character together with the combining characters that follow it with a single underscore, try e.g.

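import unicodedata

def cleanup(line):
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # A combining mark modifies the base character before it:
            # turn that base (already appended) into an underscore
            # and drop the mark itself.
            if cleaned:
                cleaned[-1] = "_"
            else:
                cleaned.append("_")
            continue
        # Keep only lowercase and uppercase letters
        if unicodedata.category(char) not in ("Ll", "Lu"):
            char = "_"
        cleaned.append(char)
    return ''.join(cleaned)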

By the by, \W does not need square brackets around it; it's already a regex character class.
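As a quick check of that point, the sketch below compares the two spellings on the question's input. The raw-string prefix and the variable names are my additions, and the input is again assumed to be "S" plus U+0304.

```python
import re

line = "ra*ndom wordS\u0304"  # "S" followed by a combining macron

# \W is already a character class, so wrapping it in [...] changes nothing
with_brackets = re.sub(r"[\W]", "_", line)
without_brackets = re.sub(r"\W", "_", line)

print(with_brackets == without_brackets)
print(without_brackets)
```

Both calls leave the S in place, since S is a word character and only the combining mark after it matches \W.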

Python's re module lacks support for important Unicode properties, but if you specifically want a regex for this, the third-party regex library has proper support for Unicode categories.

"Ll" is the category for lowercase letters and "Lu" the one for uppercase letters. There are other Unicode L categories, so you may want to tweak this to suit your requirements (perhaps unicodedata.category(char).startswith("L")); see also https://www.fileformat.info/info/unicode/category/index.htm

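To illustrate how the two category tests differ, here is a small sketch. The helper names is_letter_strict and is_letter_broad are hypothetical, and the sample characters are just ones I picked to cover several categories.

```python
import unicodedata

def is_letter_strict(char):
    # Only cased letters: lowercase (Ll) and uppercase (Lu)
    return unicodedata.category(char) in ("Ll", "Lu")

def is_letter_broad(char):
    # Any L* category: Ll, Lu, Lt, Lm, Lo
    return unicodedata.category(char).startswith("L")

# a, S, a titlecase digraph (Lt), a CJK ideograph (Lo), "*", combining macron (Mn)
samples = ["a", "S", "\u01c5", "\u4e2d", "*", "\u0304"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_letter_strict(ch), is_letter_broad(ch))
```

The broad test keeps titlecase letters and scripts without case, such as CJK, while both tests still reject punctuation and combining marks.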

一瞬间的火花 2025-01-26 06:53:03

You can use the following script to get the desired output:

import re

line="ra*ndom wordS̄"
# "[" is escaped inside the class so Python does not warn about a possible nested set
print(re.sub(r'[^\[-~]+\]*', '_', line))

Output:

ra_ndom_word_

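This approach works with other non-ASCII characters as well. Note that the character class spares only the code points from "[" through "~", so uppercase ASCII letters and digits are replaced too: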

import re

line="ra*ndom ¡¢£Ä wordS̄.  another non-ascii: Ä and Ï"
# "[" is escaped inside the class so Python does not warn about a possible nested set
print(re.sub(r'[^\[-~]+\]*', '_', line))

Output:

ra_ndom_word_another_non_ascii_and_
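The sketch below shows the side effect of that range on ordinary ASCII input, next to one possible alternative. The alternative pattern is my own assumption, not part of this answer: it treats an ASCII letter followed by marks from the Combining Diacritical Marks block (U+0300 to U+036F) as one unit and replaces every run of such units or non-letters with a single underscore.

```python
import re

line = "ra*ndom wordS\u0304"  # "S" followed by a combining macron

# The pattern from above with "[" escaped; it only spares "[" through "~"
range_pattern = re.compile(r"[^\[-~]+\]*")
range_on_ascii = range_pattern.sub("_", "Word2")
print(range_on_ascii)  # the uppercase "W" and the digit "2" are replaced too

# Hypothetical alternative: an ASCII letter plus its combining marks,
# or any character that is not an ASCII letter
letter_pattern = re.compile(r"(?:[A-Za-z][\u0300-\u036f]+|[^A-Za-z])+")
letters_on_line = letter_pattern.sub("_", line)
letters_on_ascii = letter_pattern.sub("_", "Word2")
print(letters_on_line)
print(letters_on_ascii)
```

The alternative keeps uppercase ASCII letters but still replaces digits; widen the second branch if digits should survive. Combining marks outside U+0300 to U+036F are not treated as part of the preceding letter.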
