How do I split a string into a list of words?

Posted on 2024-07-17 04:19:22 · 14 views


How do I split a sentence and store each word in a list? e.g.

"these are words"   ⟶   ["these", "are", "words"]

To split on other delimiters, see Split a string by a delimiter in Python.

To split into individual characters, see How do I split a string into a list of characters?.


10 Answers

惜醉颜 2024-07-24 04:19:22


Given a string sentence, this stores each word in a list called words:

words = sentence.split()
猫弦 2024-07-24 04:19:22


To split the string text on any consecutive runs of whitespace:

words = text.split()      

To split the string text on a custom delimiter such as ",":

words = text.split(",")   

The words variable will be a list and contain the words from text split on the delimiter.
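As a quick illustration (a minimal sketch with made-up sample strings): a custom delimiter does not collapse consecutive separators the way the no-argument form does, and the optional maxsplit argument caps the number of splits:

```python
# Made-up sample strings to illustrate the two forms of str.split().
text = "one two  three"
print(text.split())        # runs of whitespace collapse -> ['one', 'two', 'three']

csv_line = "red,green,,blue"
print(csv_line.split(","))              # empty string kept -> ['red', 'green', '', 'blue']
print(csv_line.split(",", maxsplit=1))  # at most one split -> ['red', 'green,,blue']
```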

凉月流沐 2024-07-24 04:19:22


Use str.split():

Return a list of the words in the string, using sep as the delimiter
... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
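The behavior the quoted documentation describes can be seen by comparing the two forms on a string with extra whitespace (a small sketch with a made-up string):

```python
line = "  a sentence  with a few words  "
# sep=None: whitespace runs collapse, no empty strings at the edges
print(line.split())     # ['a', 'sentence', 'with', 'a', 'few', 'words']
# explicit sep=" ": every single space is a boundary, so empty strings appear
print(line.split(" "))  # ['', '', 'a', 'sentence', '', 'with', 'a', 'few', 'words', '', '']
```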
萌无敌 2024-07-24 04:19:22


Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:

import nltk
words = nltk.word_tokenize(raw_sentence)

This has the added benefit of splitting out punctuation.

Example:

>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 
'waking', 'it', '.']

This allows you to filter out any punctuation you don't want and use only words.
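For example, one way to do that filtering is to drop tokens made up entirely of punctuation (a sketch that reuses the token list from the example above, so nltk itself isn't required to run it):

```python
import string

# Token list as produced by nltk.word_tokenize in the example above.
tokens = ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping',
          'dog', ',', 'waking', 'it', '.']

# Keep a token unless every character in it is punctuation.
words_only = [t for t in tokens
              if not all(ch in string.punctuation for ch in t)]
print(words_only)
# ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it']
```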

Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.


护你周全 2024-07-24 04:19:22


How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
弃爱 2024-07-24 04:19:22


I want my python function to split a sentence (input) and store each word in a list

The str.split() method does this: it takes a string and splits it into a list:

>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3.0
独夜无伴 2024-07-24 04:19:22


If you want all the chars of a word/sentence in a list, do this:

print(list("word"))
#  ['w', 'o', 'r', 'd']


print(list("some sentence"))
#  ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
破晓 2024-07-24 04:19:22


shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:

>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']

NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
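If you do want the quote characters preserved, shlex.split() also accepts posix=False (a quick sketch showing both modes side by side):

```python
import shlex

cmd = "sudo echo 'foo && bar'"
print(shlex.split(cmd))               # POSIX mode strips quotes: ['sudo', 'echo', 'foo && bar']
print(shlex.split(cmd, posix=False))  # non-POSIX mode keeps them: ['sudo', 'echo', "'foo && bar'"]
```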

尐籹人 2024-07-24 04:19:22


If you want to split a string into a list of words and if the string has punctuations, it's probably advisable to remove them. For example, str.split() the following string as

s = "Hi, these are words; these're, also, words."
words = s.split()
# ['Hi,', 'these', 'are', 'words;', "these're,", 'also,', 'words.']

where Hi,, words;, also, etc. have punctuation attached to them. Python has a built-in string module that has a string of punctuations as an attribute (string.punctuation). One way to get rid of the punctuations is to simply strip them from each word:

import string
words = [w.strip(string.punctuation) for w in s.split()]
# ['Hi', 'these', 'are', 'words', "these're", 'also', 'words']

Another approach is to build a translation table of the characters to remove:

table = str.maketrans('', '', string.punctuation)
words = s.translate(table).split() 
# ['Hi', 'these', 'are', 'words', 'thesere', 'also', 'words']

The translation-table approach doesn't handle words like these're well (the apostrophe is stripped, leaving thesere). To handle that case, nltk.word_tokenize could be used, as tgray suggested; then filter out the tokens that consist entirely of punctuation:

import nltk
words = [w for w in nltk.word_tokenize(s) if w not in string.punctuation]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']
够运 2024-07-24 04:19:22


Split the words without harming apostrophes inside them. See input_1 (which contains Moore's law) and input_2 below:

def split_into_words(line):
    import re
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)

#Example 1

input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)

# output 
['computational', 'power', 'see', "Moore's", 'law', 'and']

#Example 2

input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""

split_into_words(input_2)
#output
['Oh',
 'you',
 "can't",
 'help',
 'that',
 'said',
 'the',
 'Cat',
 "we're",
 'all',
 'mad',
 'here',
 "I'm",
 'mad',
 "You're",
 'mad']