Stripping non-printable characters from a string in Python

Posted 2024-07-05 17:53:40, 324 words, 9 views, 0 comments

I used to run

$s =~ s/[^[:print:]]//g;

on Perl to get rid of non-printable characters.

In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output, and curses.ascii.isprint returns false for any non-ASCII Unicode character.
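
To make the Unicode problem concrete, here is a minimal sketch of the string.printable approach mentioned in the edit. The function name and the sample string are made up for illustration; the point is only that every non-ASCII character is dropped along with the control characters.

```python
import string

def strip_with_string_printable(text):
    # string.printable contains ASCII characters only, so anything
    # outside ASCII is treated as "non-printable" and removed.
    return ''.join(ch for ch in text if ch in string.printable)

# Hypothetical sample: an accented letter, a NUL byte, and two CJK characters.
sample = 'Caf\u00e9\x00 \u4f60\u597d'
print(repr(strip_with_string_printable(sample)))
```

The accented letter and the CJK characters disappear together with the NUL byte, which is why the answers below that rely on the Unicode database are a better fit for this question.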

十二 2024-07-12 17:53:40

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex and .replace().

Instead of hard-coding the ranges, the code points can be generated from the Unicode character database categories, as shown by @Ants Aasma: nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc').
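
As a rough sketch of that last idea, the translation table can be built once from the 'Cc' category and reused. The names CONTROL_TABLE and strip_control_chars are illustrative, not from the answer above.

```python
import sys
import unicodedata

# Build the table once: every code point whose general category is
# 'Cc' (control) maps to None, which str.translate treats as "delete".
CONTROL_TABLE = {
    i: None
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
}

def strip_control_chars(text):
    # Note that tab and newline are also category 'Cc' and are removed.
    return text.translate(CONTROL_TABLE)

print(repr(strip_control_chars('a\x00b\x7fc\x9fd\u00e9')))
```

Scanning the whole code point range takes a noticeable moment at import time, so the hard-coded ranges in the answer are the cheaper choice when only 'Cc' is needed.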

生生漫 2024-07-12 17:53:40

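This function uses a generator expression and str.join, so it runs in linear time instead of O(n^2). Note that curses.ascii.isprint only recognizes ASCII, so non-ASCII characters are removed as well: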

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))
自控 2024-07-12 17:53:40

Yet another option in Python 3 (ASCII only, since it relies on string.printable):

import re, string
re.sub(f'[^{re.escape(string.printable)}]', '', my_string)
像你 2024-07-12 17:53:40

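Based on @Ber's answer, I suggest removing only the characters that the Unicode character database places in the "Other" categories, i.e. those whose category code starts with C (control, format, surrogate, private-use and unassigned):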

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
故事与诗 2024-07-12 17:53:40

An elegant, Pythonic way to strip non-printable characters from a string is to use the str.isprintable() method together with a generator expression or a list comprehension, depending on the use case (for example, the size of the string):

    ''.join(c for c in my_string if c.isprintable())

From the documentation for str.isprintable():
Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
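
One consequence of that definition is easy to miss: tabs, newlines and non-ASCII spaces such as U+00A0 count as non-printable, so they are stripped too. A small sketch (the function name and sample string are illustrative):

```python
def strip_nonprintable(text):
    # Keep only characters for which str.isprintable() is true.
    return ''.join(ch for ch in text if ch.isprintable())

# Tab, newline, no-break space and NUL are all removed;
# the ASCII space and the accented letter survive.
sample = 'Caf\u00e9\tline one\nline\u00a0two\x00'
print(repr(strip_nonprintable(sample)))
```

If line breaks or tabs need to be preserved, the condition could be widened, for example to `ch.isprintable() or ch in '\n\t'`.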

撩人痒 2024-07-12 17:53:40

The best I've come up with so far is this (thanks to the python-izers above):

def filter_non_printable(s):
  # keep the tab character (9) and everything from the space character (32) upwards
  return ''.join([c for c in s if ord(c) > 31 or ord(c) == 9])

This is the only way I've found that works with Unicode characters/strings.

Any better options?

迷迭香的记忆 2024-07-12 17:53:40

"In Python there's no POSIX regex classes"

There are if you use the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, POSIX regex and much more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)

萌逼全场 2024-07-12 17:53:40

The one below performs faster than the others above (it is ASCII only, since it relies on string.printable). Take a look:

import string
''.join([x if x in string.printable else '' for x in Str])
梦毁影碎の 2024-07-12 17:53:40

Adapted from answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

# usage (the variable is named s so that it does not shadow the built-in str)
s = "this is my string"
s = filter_nonprintable(s)
print(s)

Tested on Python 3.7.7.

叹沉浮 2024-07-12 17:53:40

To remove whitespace such as tabs and newlines:

import re
t = """
\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))
终陌 2024-07-12 17:53:40
  1. Error description
    Running copied-and-pasted Python code reports:

Python invalid non-printable character U+00A0

  2. Cause of the error
    The copied code contains no-break spaces (U+00A0), which Python does not accept in place of ordinary spaces.

  3. Solution
    Delete the offending spaces and retype them as ordinary spaces. (The original post marked the abnormal spaces in red in a screenshot, which is not reproduced here.) The code then runs.

Source: Python invalid non-printable character U+00A0
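
Retyping the spaces by hand gets tedious for longer snippets. As a sketch, the same fix can be applied programmatically by replacing U+00A0 with an ordinary space before the code is used. The helper name and the pasted snippet are hypothetical.

```python
NBSP = '\u00a0'  # no-break space, often introduced by copying from web pages

def replace_nbsp(source_text):
    # Swap every no-break space for an ordinary ASCII space.
    return source_text.replace(NBSP, ' ')

# Hypothetical pasted snippet whose indentation uses no-break spaces.
pasted = 'if x:\n\u00a0\u00a0\u00a0\u00a0print(x)\n'
cleaned = replace_nbsp(pasted)

# compile() raises SyntaxError if the cleaned text still is not valid Python.
compile(cleaned, '<pasted>', 'exec')
```

Note that this replaces no-break spaces inside string literals as well, so it is only appropriate when those are not intentional.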

指尖上的星空 2024-07-12 17:53:40

I used this:

import sys
import unicodedata

# the test string has embedded characters, \u2069 \u2068
test_string = """"ABC⁩.⁨ 6", "}"""
nonprintable = list((ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if
                        unicodedata.category(c) in ['Cc','Cf']))

translate_dict = {character: None for character in nonprintable}
print("Before translate, using repr()", repr(test_string))
print("After translate, using repr()", repr(test_string.translate(translate_dict)))
妄司 2024-07-12 17:53:40

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Remove non-printable characters from a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
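
For anyone who wants to check that claim on their own machine, a comparison along these lines should do. This is only a sketch: the function names, the sample text and the repetition count are arbitrary choices, and the timings will vary with the input and the Python version.

```python
import sys
import timeit

# Same table as above: non-printable code points map to None.
TABLE = {i: None for i in range(sys.maxunicode + 1) if not chr(i).isprintable()}

def via_translate(s):
    return s.translate(TABLE)

def via_join(s):
    return ''.join(c for c in s if c.isprintable())

# Arbitrary sample mixing printable text with a few control characters.
sample = 'Hello\x00 w\u00f6rld\x11\n' * 1000

if __name__ == '__main__':
    for fn in (via_translate, via_join):
        seconds = timeit.timeit(lambda: fn(sample), number=100)
        print(f'{fn.__name__}: {seconds:.4f}s')
```

Both functions return the same string, so the only difference being measured is speed. The one-time cost of building the table is not included in the timing.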

浮生未歇 2024-07-12 17:53:40

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python 2:

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all of those in the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601
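
These figures can be recomputed with unicodedata. The sketch below is not part of the original answer. The Cc, Cs and Co counts are fixed by the standard, while Cf and Cn depend on the Unicode version bundled with the Python build, so they may differ from the list above.

```python
import sys
import unicodedata
from collections import Counter

# Count how many code points fall into each general category.
counts = Counter(
    unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1)
)

for cat in ('Cc', 'Cf', 'Cs', 'Co', 'Cn'):
    print(cat, counts[cat])

# Shows which Unicode version the counts refer to.
print('Unicode version:', unicodedata.unidata_version)
```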

Edit: Added suggestions from the comments.

醉殇 2024-07-12 17:53:40

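As far as I know, the most Pythonic/efficient method would be the following (in Python 3, filter() returns an iterator, so the result is joined back into a string):

import string

filtered_string = ''.join(filter(lambda x: x in string.printable, myStr))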
莫言歌 2024-07-12 17:53:40

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
# categories to keep: uppercase (Lu) and lowercase (Ll) letters; add more as needed
printable = {'Lu', 'Ll'}
def filter_non_printable(s):
  return ''.join(c for c in s if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories.
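
With only 'Lu' and 'Ll' in the set, digits, punctuation and spaces are removed as well. A broader whitelist can be sketched by testing the first letter of the category code instead. The set of major classes and the keep_space parameter are my own illustrative choices, not part of the answer above.

```python
import unicodedata

# Major classes to keep: letters, marks, numbers, punctuation, symbols.
KEEP_MAJOR = {'L', 'M', 'N', 'P', 'S'}

def filter_by_major_category(text, keep_space=True):
    kept = []
    for ch in text:
        if keep_space and ch == ' ':
            # The ASCII space is category 'Zs', so it needs a special case.
            kept.append(ch)
        elif unicodedata.category(ch)[0] in KEEP_MAJOR:
            kept.append(ch)
    return ''.join(kept)

print(repr(filter_by_major_category('Caf\u00e9 no. 5\x00\n')))
```

Everything in the separator (Z) and other (C) classes is dropped apart from the ASCII space, which is close to what str.isprintable() treats as non-printable.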
