Split a string into a list, keeping accented characters and emoticons but removing punctuation

Posted on 2024-10-08 21:49:06

If I have the string:

"O João foi almoçar :) ."

how do I best split it into a list of words in Python, like so?

['O', 'João', 'foi', 'almoçar', ':)']

Thanks :)

Sofia

Comments (2)

末骤雨初歇 2024-10-15 21:49:06

If the punctuation falls into its own space-separated token as with your example, then it's easy:

>>> import string
>>> list(filter(lambda s: s not in string.punctuation, "O João foi almoçar :) .".split()))
['O', 'João', 'foi', 'almoçar', ':)']
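
To make the caveat concrete, here is a minimal sketch of the same idea wrapped in a function, showing what happens when punctuation is attached to a token. The name `split_drop_punct_tokens` is made up for this illustration, and it compares against a set of single punctuation characters, which is slightly stricter than the substring test above.

```python
import string

# Hypothetical helper for illustration; not part of the original answer.
PUNCT_CHARS = set(string.punctuation)

def split_drop_punct_tokens(text):
    # Split on whitespace and drop tokens that are exactly one punctuation character.
    return [tok for tok in text.split() if tok not in PUNCT_CHARS]

# Punctuation in its own token is dropped.
print(split_drop_punct_tokens("O João foi almoçar :) ."))
# Punctuation glued to the smiley survives, which is why a second approach is needed.
print(split_drop_punct_tokens("O João foi almoçar :)."))
```

In the second call the final token comes back as ':).', because the filter only ever looks at whole tokens.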

If this is not the case, you can define a dictionary of smileys like this (you'll need to add more):

d = {':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}

Then replace each smiley in the string s with its placeholder, which contains no punctuation (we'll treat < and > as non-punctuation):

for smiley, placeholder in d.items():
    s = s.replace(smiley, placeholder)

For s = "O João foi almoçar :) .", this gets us to "O João foi almoçar <HAPPY_SMILEY> .".

We then strip the punctuation:

s = ''.join(c for c in s if c not in '.,!')

Which gives us "O João foi almoçar <HAPPY_SMILEY> " (the trailing space disappears when we split).

We then restore the smileys:

for smiley, placeholder in d.items():
    s = s.replace(placeholder, smiley)

Finally, we split the string on whitespace:

s = s.split()

Giving us our final result: ['O', 'João', 'foi', 'almoçar', ':)'].

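Putting it all together into a function:

def split_special(s):
    # Smileys to protect, mapped to punctuation-free placeholders (extend as needed).
    d = {':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
    # Hide the smileys so the punctuation filter cannot break them apart.
    for smiley, placeholder in d.items():
        s = s.replace(smiley, placeholder)
    # Drop the punctuation characters we want to remove.
    s = ''.join(c for c in s if c not in '.,!')
    # Put the smileys back, then split on whitespace.
    for smiley, placeholder in d.items():
        s = s.replace(placeholder, smiley)
    return s.split()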
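
As a rough sketch of how the placeholder steps above could be generalized, the variant below takes the emoticons and the punctuation characters as parameters. The function name, the parameter names, and the `<EMOTICON_n>` placeholder format are assumptions made for this example. It also pads each placeholder with spaces, which the original function does not do, so that an emoticon glued to a word becomes its own token.

```python
# Hypothetical generalization of the placeholder approach; names are illustrative.
def split_keeping_emoticons(text, emoticons=(":)", ":("), punctuation=".,!"):
    # Build placeholders that contain none of the punctuation being stripped
    # (assumes `punctuation` does not include '<', '>', '_' or digits).
    placeholders = {emo: "<EMOTICON_%d>" % i for i, emo in enumerate(emoticons)}
    # Hide each emoticon, padding with spaces so it ends up as a separate token.
    for emo, placeholder in placeholders.items():
        text = text.replace(emo, " %s " % placeholder)
    # Remove the unwanted punctuation characters.
    text = "".join(ch for ch in text if ch not in punctuation)
    # Restore the emoticons and split on whitespace.
    for emo, placeholder in placeholders.items():
        text = text.replace(placeholder, emo)
    return text.split()

print(split_keeping_emoticons("O João foi almoçar:)."))
print(split_keeping_emoticons("Olá, Sofia! :("))
```

Whether to pad with spaces is a design choice: without the padding, "almoçar:)" would stay a single token, which may or may not be what you want.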

盛装女皇 2024-10-15 21:49:06

>>> import string
>>> s = "O João foi almoçar :) ."
>>> [i for i in s.split(' ') if i not in string.punctuation]
['O', 'João', 'foi', 'almoçar', ':)']
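
Neither answer uses regular expressions, but as an alternative technique the same result can be sketched with `re.findall`, which also copes with punctuation attached to words. This is only an illustration under assumptions: the function name `tokenize` and the default emoticon list are made up here, `emoticons` must be non-empty, and `\w` in Python 3 also matches digits and underscores.

```python
import re

# Hypothetical regex-based alternative; not taken from either answer.
def tokenize(text, emoticons=(":)", ":(")):
    # Escape each emoticon and try longer ones first so they are matched whole.
    alternatives = "|".join(
        re.escape(emo) for emo in sorted(emoticons, key=len, reverse=True)
    )
    # Match either a known emoticon or a run of word characters;
    # for str patterns in Python 3, \w covers accented letters such as 'ã' and 'ç'.
    pattern = re.compile(r"(?:%s)|\w+" % alternatives)
    return pattern.findall(text)

print(tokenize("O João foi almoçar :) ."))
print(tokenize("Olá, Sofia!:("))
```

Anything that is neither a listed emoticon nor a word character is simply skipped, so no separate punctuation list is needed.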