Converting a string to a list of words?
I'm trying to convert a string to a list of words using Python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert it to something like this:
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
I think this is the simplest way for anyone else stumbling on this post given the late response:
Try this:
How it works:
From the docs:
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
So in our case:
The pattern matches any non-alphanumeric character.
\w matches any alphanumeric character or underscore; for ASCII text it is equal to the character set [a-zA-Z0-9_] (a to z, A to Z, 0 to 9 and underscore). Its negation, \W or [^\w], matches everything else.
So we match every non-alphanumeric character and replace it with a space, and then split() splits the string on whitespace and returns a list.
So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().
Let me know if any doubts come up.
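The code for this answer did not survive on this page, so here is a minimal sketch of what the explanation describes. It assumes the pattern is [^\w]; the function name to_words is made up for illustration and is not from the original answer.

```python
import re

def to_words(text):
    # Replace every character that is not a letter, digit, or underscore
    # with a space, then split on runs of whitespace.
    return re.sub(r"[^\w]", " ", text).split()

print(to_words("This is a string, with words!"))
print(to_words("hello-world"))
```

Note that with Python 3 str patterns, \w also matches non-ASCII letters and digits unless you pass flags=re.ASCII.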
To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
The simplest way:
Using string.punctuation for completeness:
This handles newlines as well.
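The original snippet is missing here, so the following is only a sketch of one way to use string.punctuation, not necessarily what this answer did. It deletes punctuation with str.translate and then splits; split_words is a hypothetical name.

```python
import string

def split_words(text):
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # split() with no arguments splits on any whitespace, newlines included.
    return text.translate(table).split()

print(split_words("This is a string,\nwith words!"))
```

Because punctuation is deleted rather than replaced by a space, 'hello-world' comes out as 'helloworld' with this approach.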
Well, you could use
Note that list is the name of a builtin type and string is the name of a standard-library module, so you probably don't want to use either as a variable name.
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
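The snippet for this answer is also missing. As a rough sketch of the idea (stripping punctuation only at the edges of each word), assuming a str.strip-based approach rather than whatever the answer originally used, with words_stripped as a made-up name:

```python
import string

def words_stripped(text):
    # Split on whitespace, then strip punctuation from both ends of each token.
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that consisted of punctuation only.
    return [word for word in stripped if word]

print(words_stripped("I'm a well-known example, isn't it?"))
```

Apostrophes and hyphens inside a word are left alone; only leading and trailing punctuation is removed.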
Personally, I think this is slightly cleaner than the answers provided
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
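To make that concrete, here is one possible pattern, offered as an assumption rather than a recommendation: it treats a word as a run of ASCII letters or digits, optionally joined by single apostrophes or hyphens. The names WORD_RE and find_words are hypothetical.

```python
import re

# A word: letters/digits, optionally followed by groups of
# (one apostrophe or hyphen + more letters/digits).
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def find_words(text):
    # findall returns every non-overlapping match, in order.
    return WORD_RE.findall(text)

print(find_words("I'm here - it's a well-known fact!"))
```

Whether "well-known" should be one word or two, and whether non-ASCII letters count, are exactly the decisions you would adjust in the pattern.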
This way you eliminate every special char outside of the alphabet:
I'm not sure if this is fast or optimal or even the right way to program.
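The code for this answer is not shown on this page. A sketch of the same idea, assuming "special char" means anything that is neither a letter nor whitespace (letters_only_words is a made-up name):

```python
def letters_only_words(text):
    # Keep letters and whitespace, drop digits, punctuation, and other symbols.
    cleaned = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    return cleaned.split()

print(letters_only_words("This is a string, with words!"))
```

Note that str.isalpha() is true for non-ASCII letters too, and that digits are removed along with punctuation.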
This function will return the list of words of a given string.
In this case, if we call the function as follows,
The return output of the function would be
This is from my attempt on a coding challenge that can't use regex,
The role of apostrophe seems interesting.
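The original attempt is missing, so this is only a guess at the shape of a regex-free solution: walk the string one character at a time and treat apostrophes as part of a word. split_without_regex is a hypothetical name.

```python
def split_without_regex(text):
    words, current = [], []
    for ch in text:
        # Letters, digits, and apostrophes extend the current word.
        if ch.isalnum() or ch == "'":
            current.append(ch)
        elif current:
            # Any other character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        # Flush the last word if the text did not end with a separator.
        words.append("".join(current))
    return words

print(split_without_regex("I'm here, aren't I?"))
```

Keeping the apostrophe preserves contractions like "I'm", but it also means text in single quotes keeps its quote marks, which is probably why the apostrophe case is the interesting one.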
Probably not very elegant, but at least you know what's going on.
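text = 'This is a string, with words!'
# split() already returns a list, so no comprehension is needed.
# Note that punctuation stays attached ('string,' and 'words!').
words = text.split()
print(words)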
You can try and do this: