Stripping everything but alphanumeric characters from a string in Python
What is the best way to strip all non-alphanumeric characters from a string, using Python?
The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.
For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.
I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). The use of compiled '[\W_]+' and pattern.sub('', str) was found to be fastest.
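The timing code itself is not preserved here. A comparison along the lines described could be sketched as follows, assuming timeit and three candidate functions whose names (with_join, with_re_sub, with_compiled) are made up for this illustration. The actual numbers depend on the machine and Python version, so none are shown.

```python
import re
import string
import timeit

# Precompiled pattern: runs of non-word characters or underscores.
pattern = re.compile(r'[\W_]+')
data = string.printable  # the benchmark input mentioned above

def with_join(s):
    # Character-by-character filter using str.isalnum.
    return ''.join(ch for ch in s if ch.isalnum())

def with_re_sub(s):
    # Module-level re.sub; the pattern is looked up in re's cache on each call.
    return re.sub(r'[\W_]+', '', s)

def with_compiled(s):
    # Reuses the pattern object compiled once above.
    return pattern.sub('', s)

for func in (with_join, with_re_sub, with_compiled):
    seconds = timeit.timeit(lambda: func(data), number=20_000)
    print(f'{func.__name__:>14}: {seconds:.3f} s')
```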
Regular expressions to the rescue:
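The snippet that followed is missing from this copy. A minimal sketch of the regex approach, assuming the goal is to drop underscores as well (the helper name strip_non_alnum is hypothetical):

```python
import re

def strip_non_alnum(text):
    # \W matches anything that is not a word character. "_" is listed
    # explicitly because \w treats the underscore as a word character.
    return re.sub(r'[\W_]+', '', text)

print(strip_non_alnum('Hello, "World" [v2.0]_final!'))  # HelloWorldv20final
print(re.sub(r'\W+', '', 'snake_case!'))                # snake_case (underscore kept)
```

Note that on str input \w is Unicode-aware in Python 3, so accented letters survive unless re.ASCII is passed.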
Use the str.translate() method.
Presuming you will be doing this often:
Once, create a string containing all the characters you wish to delete:
Whenever you want to scrunch a string:
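The two snippets are not preserved here, and they were presumably written for Python 2, where str.translate accepted a second argument listing characters to delete. In Python 3 that delete-only form still exists on bytes objects, so the two steps could be sketched like this (DELCHARS and scrunch are names chosen for the illustration, not taken from the answer):

```python
# Step 1, done once: every byte value that is not an ASCII letter or digit.
DELCHARS = bytes(b for b in range(256) if not bytes([b]).isalnum())

def scrunch(data: bytes) -> bytes:
    # Step 2: table=None means "no remapping"; the second argument
    # lists the bytes to delete.
    return data.translate(None, DELCHARS)

print(scrunch(b'Hello, "World" (42)!'))  # b'HelloWorld42'
```

For str objects in Python 3, the equivalent is a table built with str.maketrans.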
The setup cost probably compares favourably with re.compile; the marginal cost is way lower:
Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch ... in typical data there would be more than one substitution to do:
Here's what happens if you give re.sub a bit more work to do:
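The original timings are not preserved. To illustrate the point about clustered versus scattered data, here is a small sketch that counts how many substitutions the pattern performs on each kind of input and then times both. The test strings (clustered, scattered) are assumptions made for this example, and no timing figures are claimed.

```python
import re
import string
import timeit

pattern = re.compile(r'[\W_]+')

clustered = string.printable                # non-alphanumerics form one run at the end
scattered = '.'.join(string.ascii_letters)  # a separator between every pair of letters

# subn returns (new_string, number_of_substitutions_made)
print(pattern.subn('', clustered)[1])  # 1
print(pattern.subn('', scattered)[1])  # 51

for name, data in (('clustered', clustered), ('scattered', scattered)):
    seconds = timeit.timeit(lambda: pattern.sub('', data), number=20_000)
    print(f'{name}: {seconds:.3f} s')
```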
You could try:
How about:
This works by using list comprehension to produce a list of the characters in InputString if they are present in the combined ascii_letters and digits strings. It then joins the list together into a string.
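The list-comprehension approach described above (characters of InputString kept only if they appear in ascii_letters plus digits) could be sketched as follows. The sample value of InputString is invented for the illustration.

```python
from string import ascii_letters, digits

InputString = 'Hello, "World" (42)!'

# Keep only characters found in the combined letters-and-digits string,
# then join the resulting list back into a single string.
result = ''.join([ch for ch in InputString if ch in ascii_letters + digits])
print(result)  # HelloWorld42
```

Because the check is against ASCII letters and digits only, accented or other non-ASCII letters are removed too.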
I checked the results with perfplot (a project of mine) and found that
is fastest. For short strings,
is also acceptable.
Code to reproduce the plot:
As a spin off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string's content to. In this case, I'm allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.
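The code for this answer is missing here. A sketch of the idea, assuming PERMITTED_CHARS is built from the string module constants and wrapped in a hypothetical helper named restrict:

```python
import string

# Alphanumerics plus dash and underscore; edit this to suit your use case.
PERMITTED_CHARS = string.ascii_letters + string.digits + '-_'

def restrict(text, permitted=PERMITTED_CHARS):
    allowed = set(permitted)  # set lookup avoids rescanning the string per character
    return ''.join(ch for ch in text if ch in allowed)

print(restrict('my file-name_v2 (final).txt'))  # myfile-name_v2finaltxt
```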
Timing with random strings of ASCII printables:
Result (Python 3.7):
str.maketrans & str.translate is fastest, but keeps any non-ASCII characters in the output. re.compile & pattern.sub is slower, but is somehow faster than ''.join & filter.
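The benchmark code is not preserved, so the following only illustrates the caveat about non-ASCII characters. It assumes the translation table was built from ASCII punctuation and whitespace; the function names are made up for the example.

```python
import re
import string

# Table that deletes ASCII punctuation and whitespace only.
table = str.maketrans('', '', string.punctuation + string.whitespace)
pattern = re.compile(r'[\W_]+')

def via_translate(s):
    return s.translate(table)

def via_pattern(s):
    return pattern.sub('', s)

def via_filter(s):
    return ''.join(filter(str.isalnum, s))

ascii_sample = 'a-b c!'
print(via_translate(ascii_sample), via_pattern(ascii_sample), via_filter(ascii_sample))
# abc abc abc

quoted = '\u00abx\u00bb'      # x wrapped in non-ASCII angle quotation marks
print(via_translate(quoted))  # unchanged: nothing in the ASCII-only table matches
print(via_pattern(quoted))    # x
print(via_filter(quoted))     # x
```

All three agree on ASCII input; only the ASCII-built table lets non-ASCII punctuation through.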
For a simple one-liner (Python 3.0):
For Python < 3.0:
Note: you could add other characters to the allowed characters list if desired (e.g. '0123456789abcdefghijklmnopqrstuvwxyz.,_').
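The one-liners themselves are missing from this copy. Based on the allowed-characters list quoted in the note, a Python 3 version might look like the sketch below; the variable names are assumptions.

```python
allowed = '0123456789abcdefghijklmnopqrstuvwxyz'
text = 'abc-123, DEF!'

# Python 3: filter() returns an iterator, so join the pieces back together.
stripped = ''.join(filter(lambda ch: ch in allowed, text))
print(stripped)  # abc123
```

In Python 2, filter applied to a str returned a str directly, so the join was unnecessary there. Note that this particular allowed list is lowercase only, so uppercase letters are removed as well.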
Python 3
Uses the same method as @John Machin's answer but updated for Python 3, where the character set is larger and translate works differently.
Python code is now assumed to be encoded in UTF-8 (source: PEP 3120)
This means the string containing all the characters you wish to delete gets much larger:
And the translate method now needs to consume a translation table which we can create with maketrans():
Now, as before, any string s you want to "scrunch":
Using the last timing example from @John Machin, we can see it still beats re by an order of magnitude:
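The snippets and timings for this answer are not preserved. The three steps described (a delete string covering the full Unicode range, a table from maketrans, then translate) could be sketched as follows. The names delchars, table and scrunch are illustrative, and no timing result is reproduced.

```python
import sys

# Every code point that is not alphanumeric; built once (this takes a moment).
delchars = ''.join(c for c in map(chr, range(sys.maxunicode + 1)) if not c.isalnum())

# The third argument of str.maketrans lists characters to map to None (delete).
table = str.maketrans('', '', delchars)

def scrunch(s):
    return s.translate(table)

print(scrunch('Hello, "World" (42)! \u00abna\u00efve\u00bb'))  # HelloWorld42naïve
```

The table holds roughly a million entries, so the one-off setup costs noticeable time and memory; each later call to translate is cheap.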
A simple solution because all answers here are complicated
If you'd like to preserve characters like áéíóúãẽĩõũ for example, use this:
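Neither snippet from this answer survives here, so the following is only one way to show the distinction being drawn: an ASCII-only filter versus one that keeps accented letters. Both helper names are hypothetical.

```python
import re

def keep_ascii_alnum(text):
    # Only A-Z, a-z and 0-9 survive; accented letters are dropped.
    return re.sub(r'[^A-Za-z0-9]+', '', text)

def keep_unicode_alnum(text):
    # str.isalnum is Unicode-aware, so letters such as á or ã are kept.
    return ''.join(ch for ch in text if ch.isalnum())

sample = 'Ol\u00e1, S\u00e3o Paulo!'  # "Olá, São Paulo!"
print(keep_ascii_alnum(sample))    # OlSoPaulo
print(keep_unicode_alnum(sample))  # OláSãoPaulo
```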
If I understood correctly, the easiest way is to use a regular expression, as it gives you a lot of flexibility. Another simple method is a for loop. The following code includes an example; I also counted the occurrences of each word and stored them in a dictionary.
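The code for this answer is not preserved. A sketch of the for-loop approach with a word count, assuming words are split on whitespace and using a hypothetical function name clean_and_count:

```python
def clean_and_count(text):
    counts = {}
    for word in text.split():
        # Rebuild the word keeping only letters and digits.
        cleaned = ''
        for ch in word:
            if ch.isalnum():
                cleaned += ch
        # Skip tokens that were pure punctuation.
        if cleaned:
            counts[cleaned] = counts.get(cleaned, 0) + 1
    return counts

print(clean_and_count('Hello, world! "Hello" (again).'))
# {'Hello': 2, 'world': 1, 'again': 1}
```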
Please rate this if this answer is useful!