Split a string into a list, keeping accented characters and emoticons but removing punctuation

Posted 2024-10-08 21:49:06

If I have the string:

"O João foi almoçar :) ."

how do I best split it into a list of words in Python, like so:

['O','João', 'foi', 'almoçar', ':)']

Thanks :)

Sofia

末骤雨初歇 2024-10-15 21:49:06

If the punctuation falls into its own space-separated token as with your example, then it's easy:

>>> import string
>>> list(filter(lambda w: w not in string.punctuation, "O João foi almoçar :) .".split()))
['O', 'João', 'foi', 'almoçar', ':)']
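One caveat with the one-liner above: `w not in string.punctuation` is a substring test, so a token such as "..." or "!?" is kept because it does not occur verbatim inside `string.punctuation`. A minimal sketch of a stricter check, assuming a small whitelist of smileys (the names `SMILEYS`, `PUNCT` and `split_keep_smileys` are made up for illustration):

```python
import string

# Hypothetical whitelist of smileys to keep; extend as needed.
SMILEYS = {":)", ":(", ";)"}
PUNCT = set(string.punctuation)

def split_keep_smileys(text):
    """Split on whitespace and drop tokens made up entirely of ASCII punctuation,
    unless the token is a known smiley."""
    tokens = []
    for tok in text.split():
        if tok in SMILEYS or not all(ch in PUNCT for ch in tok):
            tokens.append(tok)
    return tokens

print(split_keep_smileys("O João foi almoçar :) ... !?"))
```

This still assumes punctuation is separated from words by whitespace; it only changes how a standalone token is judged.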

If this is not the case, you can define a dictionary of smileys like this (you'll need to add more):

d = { ':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}

and then replace each instance of the smiley with the place-holder that doesn't contain punctuation (we'll consider <> not to be punctuation):

for smiley, placeholder in d.items():
    s = s.replace(smiley, placeholder)

Which gets us to "O João foi almoçar <HAPPY_SMILEY> .".

We then strip punctuation:

s = ''.join(c for c in s if c not in '.,!')

Which gives us "O João foi almoçar <HAPPY_SMILEY> " (the trailing space is harmless, since splitting on whitespace ignores it).

We then restore the smileys:

for smiley, placeholder in d.items():
    s = s.replace(placeholder, smiley)

Which we then split:

s = s.split()

Giving us our final result: ['O', 'João', 'foi', 'almoçar', ':)'].

Putting it all together into a function:

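def split_special(s):
    # Map each smiley to a placeholder that contains none of the stripped characters.
    d = {':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
    # Protect the smileys before stripping punctuation.
    for smiley, placeholder in d.items():
        s = s.replace(smiley, placeholder)
    # Strip punctuation (extend '.,!' as needed).
    s = ''.join(c for c in s if c not in '.,!')
    # Restore the smileys, then split on whitespace.
    for smiley, placeholder in d.items():
        s = s.replace(placeholder, smiley)
    return s.split()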
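As an alternative to the placeholder round trip, the same idea can be sketched with a single regular expression that tries a smiley first and otherwise takes a run of characters that are neither whitespace nor ASCII punctuation. This is only a sketch: the smiley list and the names `SMILEYS`, `TOKEN_RE` and `tokenize` are assumptions for illustration, not part of the approach above.

```python
import re
import string

# Hypothetical smiley list; extend as needed.
SMILEYS = [":)", ":(", ";)"]

# Longest smileys first so a longer one is never shadowed by a shorter prefix.
_smiley_alt = "|".join(re.escape(s) for s in sorted(SMILEYS, key=len, reverse=True))
_punct = re.escape(string.punctuation)

# Either a known smiley, or a run of non-space, non-punctuation characters.
TOKEN_RE = re.compile(rf"(?:{_smiley_alt})|[^\s{_punct}]+")

def tokenize(text):
    """Return words and known smileys, discarding all other ASCII punctuation."""
    return TOKEN_RE.findall(text)

print(tokenize("Olá, mundo! O João foi almoçar:) ."))
```

Unlike the version that strips only '.,!', this drops every character in `string.punctuation`, so hyphenated words or apostrophes would be split; whether that is acceptable depends on the text.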


盛装女皇 2024-10-15 21:49:06

>>> import string
>>> s = "O João foi almoçar :) ."
>>> [i for i in s.split(' ') if i not in string.punctuation]
['O', 'João', 'foi', 'almoçar', ':)']
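Note that `string.punctuation` only covers ASCII, so standalone tokens such as "…" or "»" would pass through this filter. A rough sketch of a Unicode-aware variant, assuming a small smiley whitelist (`SMILEYS`, `is_punct` and `split_unicode` are hypothetical names), using the Unicode general category from the standard `unicodedata` module:

```python
import unicodedata

# Hypothetical whitelist; ':' and ')' are themselves punctuation in Unicode terms.
SMILEYS = {":)", ":("}

def is_punct(token):
    """True if every character has a Unicode punctuation category (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def split_unicode(text):
    """Split on whitespace, dropping pure-punctuation tokens except known smileys."""
    return [t for t in text.split() if t in SMILEYS or not is_punct(t)]

print(split_unicode("O João foi almoçar :) … » ."))
```

Symbols such as "<" or "+" fall under the Unicode symbol categories (S*), not punctuation, so they are kept by this check.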
