Stripping non-printable characters from a string in Python
I used to run
$s =~ s/[^[:print:]]//g;
on Perl to get rid of non-printable characters.
In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable.
What would you do?
EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output, and curses.ascii.isprint will return false for any Unicode character.
Comments (16)
In Python 3, this can be done with str.translate().
See this StackOverflow post on removing punctuation for how .translate() compares to regex and .replace().
The ranges can be generated via
nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc')
using the Unicode character database categories as shown by @Ants Aasma.
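A minimal sketch of the .translate() approach, assuming the goal is to delete every character in the Unicode control category 'Cc'. The names CONTROL_TABLE and strip_control_chars are hypothetical, chosen only for illustration. The table is built once and reused for every call.

```python
import sys
import unicodedata

# Map every code point in category 'Cc' (control) to None so that
# str.translate() deletes it. Building the table scans all code points
# once; after that each call is a single pass over the input.
CONTROL_TABLE = {
    i: None
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == "Cc"
}

def strip_control_chars(text):
    """Return text with all 'Cc' characters removed (hypothetical helper)."""
    return text.translate(CONTROL_TABLE)

print(strip_control_chars("caf\u00e9\x00 ok\x7f"))
```

Note that tab and newline are also in 'Cc', so this sketch removes them too; drop their code points from the table if they should survive.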
This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):
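As a sketch of that pattern (not necessarily the answer's own code), the function below filters against string.printable and joins once at the end. The names ASCII_PRINTABLE and keep_ascii_printable are assumptions made for illustration. As the question's edit points out, a string.printable filter also drops every non-ASCII character.

```python
import string

# Set membership is O(1), so the whole filter is a single linear pass.
ASCII_PRINTABLE = set(string.printable)

def keep_ascii_printable(text):
    """Keep only characters found in string.printable (ASCII only)."""
    # Collect the kept characters in a list and join once at the end;
    # building the result with repeated `result += ch` can degrade to
    # quadratic time.
    return "".join([ch for ch in text if ch in ASCII_PRINTABLE])
```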
Yet another option in Python 3:
Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:
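One way this could look, as a sketch that assumes "control characters" means exactly the Unicode category 'Cc'. The function name remove_control_characters is hypothetical.

```python
import unicodedata

def remove_control_characters(text):
    """Drop characters whose Unicode category is 'Cc'; keep everything else."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
```

Non-ASCII letters pass through untouched, while newline, tab and BEL are removed because they are all classified as 'Cc'.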
An elegant, Pythonic way to strip non-printable characters from a string is to use the isprintable() string method together with a generator expression or a list comprehension, depending on the use case (i.e. the size of the string).
str.isprintable()
Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
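A minimal sketch of the isprintable() approach with a generator expression. The function name strip_nonprintable is an assumption for illustration.

```python
def strip_nonprintable(text):
    """Keep only the characters for which str.isprintable() is true."""
    # A generator expression avoids building an intermediate list,
    # which may matter for very large strings.
    return "".join(ch for ch in text if ch.isprintable())
```

Following the definition quoted above, this keeps the ASCII space but removes other separators such as the no-break space U+00A0, along with tabs and newlines.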
The best I've come up with so far is (thanks to the python-izers above)
This is the only way I've found that works with Unicode characters/strings.
Any better options?
There are, if you use the regex library: https://pypi.org/project/regex/
It is well maintained and supports Unicode regex, POSIX regex and much more. The usage (method signatures) is very similar to Python's re.
From the documentation:
(I'm not affiliated, just a user.)
The one below performs faster than the others above. Take a look:
Adapted from answers by Ants Aasma and shawnrad:
Tested on Python 3.7.7.
To remove 'whitespace',
Running copied and pasted Python code reports:
Python invalid non-printable character U+00A0
Cause of the error
The copied code contains no-break spaces (U+00A0), which are not the ordinary spaces Python expects.
Solution
Delete the offending spaces and retype them as ordinary spaces; the code then runs. In the original post's screenshot, the part highlighted in red is such an abnormal space.
Source: Python invalid non-printable character U+00A0
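Retyping the spaces by hand can be avoided by replacing them programmatically. The sketch below assumes the pasted source is available as a string; replace_nbsp and the sample snippet are hypothetical.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, the character named in the error message

def replace_nbsp(source):
    """Replace every no-break space in source with an ordinary space."""
    return source.replace(NBSP, " ")

# Hypothetical pasted snippet whose indentation uses no-break spaces.
pasted = "def f():\n\u00a0\u00a0\u00a0\u00a0return 1\n"
cleaned = replace_nbsp(pasted)

# After cleaning, the snippet compiles like normal Python source.
compile(cleaned, "<pasted>", "exec")
```

This also rewrites no-break spaces inside string literals, so check the result if the pasted code contains them on purpose.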
I used this:
The following will work with Unicode input and is rather fast...
My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See the Unicode Character Database for descriptions of the categories.
For Python 2
For some use cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
Edit: added suggestions from the comments.
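A Python 3 sketch of building the character class from unicodedata.category() and compiling it into a regular expression. The names CATEGORIES, control_char_re and remove_control_chars are assumptions for illustration; only 'Cc' is included here, and adding the larger categories listed above makes the pattern correspondingly bigger.

```python
import re
import sys
import unicodedata

# Categories to strip. Extend with "Cf", "Cs", "Co" or "Cn" if needed,
# at the cost of a much larger character class.
CATEGORIES = {"Cc"}

# Collect every character in the chosen categories (one-time scan).
control_chars = "".join(
    chr(i)
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) in CATEGORIES
)

# re.escape() protects characters that are special inside [...].
control_char_re = re.compile("[%s]" % re.escape(control_chars))

def remove_control_chars(text):
    """Delete every character matched by the generated class."""
    return control_char_re.sub("", text)
```

The pattern is compiled once at import time, so each call pays only for the substitution itself.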
As far as I know, the most pythonic/efficient method would be:
You could try setting up a filter using the unicodedata.category() function:
See Table 4-9 on page 175 in the Unicode database character properties for the available categories.
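A sketch of such a filter written as a whitelist, assuming that letters, marks, numbers, punctuation and symbols count as printable. The choice of categories, the keep_space parameter and the name filter_by_category are all assumptions for illustration, not the answer's own code.

```python
import unicodedata

# Major classes to keep: Letter, Mark, Number, Punctuation, Symbol.
KEEP_MAJOR = set("LMNPS")

def filter_by_category(text, keep_space=True):
    """Keep characters whose Unicode category is on the whitelist."""
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        # cat is a two-letter code such as 'Lu' or 'Zs'; its first
        # letter is the major class.
        if cat[0] in KEEP_MAJOR or (keep_space and cat == "Zs"):
            kept.append(ch)
    return "".join(kept)
```

A whitelist errs on the side of removing too much, while a blacklist of 'Cc' alone errs on the side of keeping too much; which is better depends on the input. Note that 'Zs' covers more than the ASCII space, including the no-break space U+00A0.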