Split a string into a list, keeping accented characters and emoticons but removing punctuation
If I have the string:
"O João foi almoçar :) ."
how can I best split it into a list of words in Python, like so?
['O', 'João', 'foi', 'almoçar', ':)']
Thanks :)
Sofia
If the punctuation falls into its own space-separated token as with your example, then it's easy:
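A minimal sketch of that easy case, assuming Python 3 and a small, hypothetical set of emoticons to keep (`SMILEYS` and `is_punctuation` are illustrative names, not part of the original answer):

```python
import string

text = "O João foi almoçar :) ."

# Hypothetical set of emoticons to keep; extend as needed.
SMILEYS = {":)", ":(", ";)"}

def is_punctuation(token):
    """Return True if the token consists only of ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Split on whitespace, then drop pure-punctuation tokens
# unless they are known smileys.
words = [tok for tok in text.split()
         if tok in SMILEYS or not is_punctuation(tok)]
print(words)  # ['O', 'João', 'foi', 'almoçar', ':)']
```

This only works when every punctuation mark is already separated from the words by whitespace.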
If this is not the case, you can define a dictionary of smileys like this (you'll need to add more):
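Such a dictionary might look like the following sketch. Only the `<HAPPY_SMILEY>` placeholder appears in this answer; the `:(` entry and its placeholder name are assumptions added for illustration:

```python
# Maps each emoticon to a placeholder that contains no punctuation
# other than "<", ">" and "_".
smileys = {
    ":)": "<HAPPY_SMILEY>",
    ":(": "<SAD_SMILEY>",  # illustrative extra entry
}
```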
and then replace each instance of a smiley with a placeholder that doesn't contain punctuation (we'll consider <> not to be punctuation), which gets us to
"O João foi almoçar <HAPPY_SMILEY> ."
We then strip punctuation:
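The placeholder substitution and the punctuation stripping could be sketched as follows, assuming Python 3. Note that the placeholder contains an underscore, which `string.punctuation` includes, so this sketch keeps `_` as well as `<` and `>`; that choice is an assumption, not something the answer states:

```python
import string

# Illustrative entries; only <HAPPY_SMILEY> comes from the answer.
smileys = {":)": "<HAPPY_SMILEY>", ":(": "<SAD_SMILEY>"}

text = "O João foi almoçar :) ."

# Step 1: swap each smiley for its placeholder.
for smiley, placeholder in smileys.items():
    text = text.replace(smiley, placeholder)
replaced = text  # "O João foi almoçar <HAPPY_SMILEY> ."

# Step 2: delete ASCII punctuation, keeping the characters
# that the placeholders are built from.
to_delete = "".join(ch for ch in string.punctuation if ch not in "<>_")
stripped = replaced.translate(str.maketrans("", "", to_delete)).strip()
print(stripped)  # O João foi almoçar <HAPPY_SMILEY>
```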
Which gives us
"O João foi almoçar <HAPPY_SMILEY>"
We then revert the smileys:
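Reverting is the same replacement run in the opposite direction. A sketch, reusing the illustrative `smileys` mapping:

```python
smileys = {":)": "<HAPPY_SMILEY>", ":(": "<SAD_SMILEY>"}  # illustrative entries

stripped = "O João foi almoçar <HAPPY_SMILEY>"

# Put the original emoticon back wherever its placeholder appears.
restored = stripped
for smiley, placeholder in smileys.items():
    restored = restored.replace(placeholder, smiley)
print(restored)  # O João foi almoçar :)
```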
Which we then split:
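With no arguments, `str.split()` splits on any run of whitespace, so the step is a one-liner (a sketch, assuming Python 3):

```python
restored = "O João foi almoçar :)"

# Split on runs of whitespace; no empty strings are produced.
words = restored.split()
print(words)  # ['O', 'João', 'foi', 'almoçar', ':)']
```

Python 3 strings are Unicode, so the accented words display as 'João' and 'almoçar'. Byte escapes such as 'Jo\xc3\xa3o' are what Python 2 shows for UTF-8 encoded byte strings.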
Giving us our final result:
['O', 'Jo\xc3\xa3o', 'foi', 'almo\xc3\xa7ar', ':)']
Putting it all together into a function:
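One possible version of such a function, as a sketch for Python 3. The name `split_keeping_smileys`, the `SMILEYS` entries other than `<HAPPY_SMILEY>`, and the decision to keep `_` alongside `<` and `>` are all assumptions of this example:

```python
import string

# Illustrative mapping; add more emoticons as needed.
SMILEYS = {":)": "<HAPPY_SMILEY>", ":(": "<SAD_SMILEY>"}

# Delete ASCII punctuation except the characters used by the placeholders.
_TO_DELETE = "".join(ch for ch in string.punctuation if ch not in "<>_")
_STRIP_TABLE = str.maketrans("", "", _TO_DELETE)

def split_keeping_smileys(text, smileys=SMILEYS):
    """Split text into words, dropping punctuation but keeping smileys."""
    # Protect the smileys behind placeholders.
    for smiley, placeholder in smileys.items():
        text = text.replace(smiley, placeholder)
    # Strip the remaining punctuation.
    text = text.translate(_STRIP_TABLE)
    # Restore the smileys.
    for smiley, placeholder in smileys.items():
        text = text.replace(placeholder, smiley)
    return text.split()

print(split_keeping_smileys("O João foi almoçar :) ."))
# ['O', 'João', 'foi', 'almoçar', ':)']
print(split_keeping_smileys("Olá, Sofia! Tudo bem? :("))
# ['Olá', 'Sofia', 'Tudo', 'bem', ':(']
```

This only removes the ASCII characters in `string.punctuation`, and it will also leave any literal `<`, `>` or `_` in the input untouched.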