Stripping everything but alphanumeric chars from a string in Python

Posted on 2024-08-02 02:06:58

What is the best way to strip all non-alphanumeric characters from a string, using Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.

For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.

Comments (16)

荒路情人 2024-08-09 02:06:58

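I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). Using a compiled '[\W_]+' pattern with pattern.sub('', str) turned out to be the fastest. Note that filter(str.isalnum, ...) returns a string only in Python 2; in Python 3 it returns an iterator and has to be wrapped in ''.join(...).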

$ python -m timeit -s \
     "import string" \
     "''.join(ch for ch in string.printable if ch.isalnum())" 
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"                 
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"                
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)" 
100000 loops, best of 3: 11.2 usec per loop
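For anyone who wants to rerun the comparison on Python 3, here is a rough sketch of the same variants as plain functions. The function names and the NON_ALNUM constant are my own labels, not part of the answer above, and no timings are claimed here.

```python
import re
import string

# Compiled once; NON_ALNUM is an illustrative name, not from the answer.
NON_ALNUM = re.compile(r"[\W_]+")


def strip_join(s):
    # Generator expression filtered by str.isalnum.
    return "".join(ch for ch in s if ch.isalnum())


def strip_filter(s):
    # In Python 3, filter() returns an iterator, so join it back into a str.
    return "".join(filter(str.isalnum, s))


def strip_regex(s):
    # Remove every run of non-word characters and underscores.
    return NON_ALNUM.sub("", s)


sample = string.printable
results = {f.__name__: f(sample) for f in (strip_join, strip_filter, strip_regex)}
print(results)
```

Each function can then be handed to timeit.timeit with whatever input is representative for you.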
夏日落 2024-08-09 02:06:58

Regular expressions to the rescue:

import re
re.sub(r'\W+', '', your_string)

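For ASCII matching, \W is equivalent to [^a-zA-Z0-9_], so this removes everything except letters, digits and the underscore. For str patterns in Python 3, \W is Unicode-aware unless re.ASCII is set, so accented and other non-ASCII letters are kept as well. Use [\W_] if underscores should go too.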
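A small sketch of how \W behaves with underscores and non-ASCII letters in Python 3; the sample string is made up for illustration.

```python
import re

text = "héllo_wörld! 42"

# \W alone leaves the underscore in place (and, for str patterns, accented letters).
print(re.sub(r"\W+", "", text))                    # héllo_wörld42

# Adding _ to the class removes the underscore as well.
print(re.sub(r"[\W_]+", "", text))                 # héllowörld42

# With re.ASCII, only a-z, A-Z and 0-9 survive.
print(re.sub(r"[\W_]+", "", text, flags=re.ASCII)) # hllowrld42
```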

画骨成沙 2024-08-09 02:06:58

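Use the str.translate() method. (The two-argument call shown here is the Python 2 form; Python 3 takes a single table built with str.maketrans.)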

Presuming you will be doing this often:

  1. Once, create a string containing all the characters you wish to delete:

    delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
    
  2. Whenever you want to scrunch a string:

    scrunched = s.translate(None, delchars)
    

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop

C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch ... in typical data there would be more than one substitution to do:

C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here's what happens if you give re.sub a bit more work to do:

C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
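To make the point about the benchmark data concrete, this sketch (Python 3, my own addition) uses re.subn to count how many substitutions the pattern performs on each input.

```python
import re
import string

pattern = re.compile(r"[\W_]+")

inputs = (
    ("string.printable", string.printable),
    ("'foo-' * 25", "foo-" * 25),
)

for label, text in inputs:
    # subn returns the cleaned string together with the number of replacements made.
    cleaned, count = pattern.subn("", text)
    print(f"{label}: {count} substitution(s), {len(cleaned)} characters kept")
```

string.printable needs a single substitution because all its non-alphanumeric characters sit in one run at the end, whereas 'foo-' * 25 needs 25.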
九厘米的零° 2024-08-09 02:06:58

You could try:

print(''.join(ch for ch in some_string if ch.isalnum()))
很酷不放纵 2024-08-09 02:06:58
>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile(r'\W')
>>> string = pattern.sub('', string)
>>> print(string)
Kl13
口干舌燥 2024-08-09 02:06:58

How about:

def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using list comprehension to produce a list of the characters in InputString if they are present in the combined ascii_letters and digits strings. It then joins the list together into a string.
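A possible variation on the same idea, sketched under my own naming (extract_alphanumeric and _ALLOWED are not from the answer): build the set of allowed characters once instead of concatenating the two strings for every character.

```python
from string import ascii_letters, digits

# Built once at import time; a frozenset gives constant-time membership tests.
_ALLOWED = frozenset(ascii_letters + digits)


def extract_alphanumeric(text):
    """Return text with everything except ASCII letters and digits removed."""
    return "".join(ch for ch in text if ch in _ALLOWED)


print(extract_alphanumeric("Kl13@£$%[};'\""))  # Kl13
```

Unlike str.isalnum, this keeps ASCII letters and digits only, so accented letters are dropped too.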

酒浓于脸红 2024-08-09 02:06:58

I checked the results with perfplot (a project of mine) and found that

pattern = re.compile(r"[\W_]")
pattern.sub("", s)

is fastest. For short strings,

"".join(filter(str.isalnum, s))

is also acceptable.

Code to reproduce the plot:

import perfplot
import random
import re
import string

pattern = re.compile(r"[\W_]")
pattern_plus = re.compile(r"[\W_]+")


def setup(n):
    return "".join(random.choices(string.ascii_letters + string.digits, k=n))


def string_alphanum(s):
    return "".join(ch for ch in s if ch.isalnum())


def filter_str(s):
    return "".join(filter(str.isalnum, s))


def re_sub(s):
    return re.sub(r"[\W_]", "", s)


def re_sub_pattern(s):
    return pattern.sub("", s)


def re_sub_plus(s):
    return re.sub(r"[\W_]+", "", s)


def re_sub_pattern_plus(s):
    return pattern_plus.sub("", s)


b = perfplot.bench(
    setup=setup,
    kernels=[
        string_alphanum,
        filter_str,
        re_sub,
        re_sub_pattern,
        re_sub_plus,
        re_sub_pattern_plus,
    ],
    n_range=[2**k for k in range(15)],
)
b.save("out.png")
b.show()
云淡月浅 2024-08-09 02:06:58
# Note: isalpha() keeps letters only, so digits are removed too; use isalnum() to keep them.
sent = "".join(e for e in sent if e.isalpha())
只有一腔孤勇 2024-08-09 02:06:58

As a spin off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string's content to. In this case, I'm allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.

PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-" 
someString = "".join(c for c in someString if c in PERMITTED_CHARS)
神也荒唐 2024-08-09 02:06:58

Timing with random strings of ASCII printables:

from inspect import getsource
from random import sample
import re
from string import printable
from timeit import timeit

pattern_single = re.compile(r'[\W]')
pattern_repeat = re.compile(r'[\W]+')
translation_tb = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))


def generate_test_string(length):
    return ''.join(sample(printable, length))


def main():
    for i in range(0, 60, 10):
        for test in [
            lambda: ''.join(c for c in generate_test_string(i) if c.isalnum()),
            lambda: ''.join(filter(str.isalnum, generate_test_string(i))),
            lambda: re.sub(r'[\W]', '', generate_test_string(i)),
            lambda: re.sub(r'[\W]+', '', generate_test_string(i)),
            lambda: pattern_single.sub('', generate_test_string(i)),
            lambda: pattern_repeat.sub('', generate_test_string(i)),
            lambda: generate_test_string(i).translate(translation_tb),

        ]:
            print(timeit(test), i, getsource(test).lstrip('            lambda: ').rstrip(',\n'), sep='\t')


if __name__ == '__main__':
    main()

Result (Python 3.7):

       Time       Length                           Code                           
6.3716264850008880  00  ''.join(c for c in generate_test_string(i) if c.isalnum())
5.7285426190064750  00  ''.join(filter(str.isalnum, generate_test_string(i)))
8.1875841680011940  00  re.sub(r'[\W]', '', generate_test_string(i))
8.0002205439959650  00  re.sub(r'[\W]+', '', generate_test_string(i))
5.5290945199958510  00  pattern_single.sub('', generate_test_string(i))
5.4417179649972240  00  pattern_repeat.sub('', generate_test_string(i))
4.6772285089973590  00  generate_test_string(i).translate(translation_tb)
23.574712151996210  10  ''.join(c for c in generate_test_string(i) if c.isalnum())
22.829975890002970  10  ''.join(filter(str.isalnum, generate_test_string(i)))
27.210196289997840  10  re.sub(r'[\W]', '', generate_test_string(i))
27.203713296003116  10  re.sub(r'[\W]+', '', generate_test_string(i))
24.008979928999906  10  pattern_single.sub('', generate_test_string(i))
23.945240008994006  10  pattern_repeat.sub('', generate_test_string(i))
21.830899796994345  10  generate_test_string(i).translate(translation_tb)
38.731336012999236  20  ''.join(c for c in generate_test_string(i) if c.isalnum())
37.942474347000825  20  ''.join(filter(str.isalnum, generate_test_string(i)))
42.169366310001350  20  re.sub(r'[\W]', '', generate_test_string(i))
41.933375883003464  20  re.sub(r'[\W]+', '', generate_test_string(i))
38.899814646996674  20  pattern_single.sub('', generate_test_string(i))
38.636144253003295  20  pattern_repeat.sub('', generate_test_string(i))
36.201238164998360  20  generate_test_string(i).translate(translation_tb)
49.377356811004574  30  ''.join(c for c in generate_test_string(i) if c.isalnum())
48.408927293996385  30  ''.join(filter(str.isalnum, generate_test_string(i)))
53.901889764994850  30  re.sub(r'[\W]', '', generate_test_string(i))
52.130339455994545  30  re.sub(r'[\W]+', '', generate_test_string(i))
50.061149017004940  30  pattern_single.sub('', generate_test_string(i))
49.366573111998150  30  pattern_repeat.sub('', generate_test_string(i))
46.649754120997386  30  generate_test_string(i).translate(translation_tb)
63.107938601999194  40  ''.join(c for c in generate_test_string(i) if c.isalnum())
65.116287978999030  40  ''.join(filter(str.isalnum, generate_test_string(i)))
71.477421126997800  40  re.sub(r'[\W]', '', generate_test_string(i))
66.027950693998720  40  re.sub(r'[\W]+', '', generate_test_string(i))
63.315361931003280  40  pattern_single.sub('', generate_test_string(i))
62.342320287003530  40  pattern_repeat.sub('', generate_test_string(i))
58.249303059004890  40  generate_test_string(i).translate(translation_tb)
73.810345625002810  50  ''.join(c for c in generate_test_string(i) if c.isalnum())
72.593953348005020  50  ''.join(filter(str.isalnum, generate_test_string(i)))
76.048324580995540  50  re.sub(r'[\W]', '', generate_test_string(i))
75.106637657001560  50  re.sub(r'[\W]+', '', generate_test_string(i))
74.681338128997600  50  pattern_single.sub('', generate_test_string(i))
72.430461594005460  50  pattern_repeat.sub('', generate_test_string(i))
69.394243567003290  50  generate_test_string(i).translate(translation_tb)

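str.maketrans & str.translate is fastest, but the table only covers code points 0-255, so any character above U+00FF is left in place.
re.compile & pattern.sub is slower than translate and, in this table, roughly on par with ''.join & filter.
Note that the patterns here use [\W] rather than [\W_], so underscores are kept, and that every timed lambda also calls generate_test_string(i), so the figures include the cost of building the test string.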
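A quick illustration of that caveat, using the same table construction as in the benchmark; the sample string is made up.

```python
# Same construction as translation_tb above: only code points 0-255 are considered.
table = str.maketrans("", "", "".join(c for c in map(chr, range(256)) if not c.isalnum()))

text = "a-b€c_d"

# '-' and '_' are in the table and get deleted; '€' (U+20AC) is outside it and stays.
print(text.translate(table))                    # ab€cd

# str.isalnum checks every character, so the euro sign is dropped here.
print("".join(c for c in text if c.isalnum()))  # abcd
```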

梦太阳 2024-08-09 02:06:58

For a simple one-liner (Python 3.0):

''.join(filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped ))

For Python < 3.0:

filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped )

Note: you could add other characters to the allowed characters list if desired (e.g. '0123456789abcdefghijklmnopqrstuvwxyz.,_').

谜泪 2024-08-09 02:06:58

Python 3

Uses the same method as @John Machin's answer but updated for Python 3:

  • larger character set
  • slight changes to how translate works.

Python code is now assumed to be encoded in UTF-8
(source: PEP 3120)

This means the string containing all the characters you wish to delete gets much larger:

    
del_chars = ''.join(c for c in map(chr, range(0x110000)) if not c.isalnum())  # 0x110000 == sys.maxunicode + 1, so U+10FFFF is covered too
    

And the translate method now needs to consume a translation table which we can create with maketrans():

    
del_map = str.maketrans('', '', del_chars)
    

Now, as before, any string s you want to "scrunch":

    
scrunched = s.translate(del_map)
    

Using the last timing example from @John Machin, we can see it still beats re by an order of magnitude:

    
> python -mtimeit -s"d=''.join(c for c in map(chr,range(1114111)) if not c.isalnum());m=str.maketrans('','',d);s='foo-'*25" "s.translate(m)"
    
1000000 loops, best of 5: 255 nsec per loop
    
> python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
    
50000 loops, best of 5: 4.8 usec per loop
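If building a deletion string of over a million characters up front is unappealing, one possible alternative (my own sketch, not part of the answer above) is a dict subclass that classifies each code point the first time str.translate asks for it. LazyAlnumTable is a hypothetical name, and I have not benchmarked it against the eager table.

```python
class LazyAlnumTable(dict):
    """Translation table that is filled in on demand (hypothetical helper)."""

    def __missing__(self, codepoint):
        # str.translate looks up each character by code point:
        # returning None deletes it, returning the code point keeps it unchanged.
        target = codepoint if chr(codepoint).isalnum() else None
        self[codepoint] = target  # cache so each code point is classified once
        return target


del_map = LazyAlnumTable()
print("foo-bar_baz 42 €!".translate(del_map))  # foobarbaz42
```

The table grows only with the distinct characters actually seen, so setup is essentially free.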
    
梦在深巷 2024-08-09 02:06:58
for char in my_string:
    if not char.isalnum():
        my_string = my_string.replace(char,"")
揽清风入怀 2024-08-09 02:06:58

A simple solution, since the other answers here are complicated:

filtered = ''
for c in unfiltered:
    if str.isalnum(c):
        filtered += c
    
print(filtered)
好倦 2024-08-09 02:06:58

If you'd like to preserve characters like áéíóúãẽĩõũ, use this (note that \d in the character class means digits are removed as well):

import re
re.sub(r'[\W\d_]+', '', your_string)
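A short sketch of what this pattern does in Python 3, next to the variant without \d; the sample text is made up.

```python
import re

text = "São Paulo, 2024: áéíóú!"

# With \d in the class, digits are stripped along with punctuation and spaces.
print(re.sub(r"[\W\d_]+", "", text))  # SãoPauloáéíóú

# Without \d, digits are kept and accented letters still survive.
print(re.sub(r"[\W_]+", "", text))    # SãoPaulo2024áéíóú
```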
无远思近则忧 2024-08-09 02:06:58

If I understood correctly, the easiest way is to use a regular expression, since it gives you a lot of flexibility. Another simple method is a for loop; the code below shows an example, in which I also count the occurrences of each word and store them in a dictionary.

s = """An... essay is, generally, a piece of writing that gives the author's own 
argument — but the definition is vague, 
overlapping with those of a paper, an article, a pamphlet, and a short story. Essays 
have traditionally been 
sub-classified as formal and informal. Formal essays are characterized by "serious 
purpose, dignity, logical 
organization, length," whereas the informal essay is characterized by "the personal 
element (self-revelation, 
individual tastes and experiences, confidential manner), humor, graceful style, 
rambling structure, unconventionality 
or novelty of theme," etc.[1]"""

d = {}  # maps each cleaned word to its number of occurrences
words = s.split()  # split the text on whitespace into a list of words
for word in words:
    new_word = ''
    for c in word:
        if c.isalnum():  # keep only alphanumeric characters
            new_word = new_word + c
    print(new_word, end=' ')
    if new_word:  # skip tokens that consisted of punctuation only
        d[new_word] = d.get(new_word, 0) + 1
print(d)
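The regular-expression route mentioned at the start of this answer could look roughly like this. It is a sketch on a shortened, made-up sample text, and the use of collections.Counter is my own choice rather than something from the answer.

```python
import re
from collections import Counter

text = 'An... essay is, generally, a piece of writing. Essays are "serious", an essay is short.'

# Strip non-alphanumerics from each whitespace-separated token and drop empty results.
words = [w for w in (re.sub(r"[\W_]+", "", token) for token in text.split()) if w]

# Counter tallies the occurrences of each cleaned word.
counts = Counter(words)
print(counts.most_common(3))
```

Counting is case-sensitive here ("An" and "an" are separate keys); lowercase the tokens first if that is not wanted.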

please rate this if this answer is useful!
