Converting a string to a list of words?

Posted on 2024-11-10 13:24:35

I'm trying to convert a string to a list of words using Python. I want to take something like the following:

string = 'This is a string, with words!'

Then convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

Comments (16)

好听的两个字的网名 2024-11-17 13:24:36

I think this is the simplest way for anyone else stumbling on this post given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
情魔剑神 2024-11-17 13:24:36

Try this:

import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()  # raw string so \w is not treated as a string escape

How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

So in our case:

The pattern matches any character that is not a word character.

\w matches any word character. For ASCII text (or with the re.ASCII flag) it is equivalent to the character set
[a-zA-Z0-9_]

that is, a to z, A to Z, 0 to 9 and the underscore. In Python 3, \w also matches Unicode letters and digits by default.

So we match every non-word character and replace it with a space.

Then split() splits the string on whitespace and returns a list.

So 'hello-world' becomes 'hello world' after re.sub, and then ['hello', 'world'] after split().

Let me know if any doubts come up.
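The two steps described above (substitute, then split) can be sketched end to end as follows. The helper name `to_words` and its `ascii_only` parameter are illustrative assumptions, not part of the answer; the optional flag only shows how `re.ASCII` narrows `\w` to `[a-zA-Z0-9_]`.

```python
import re

def to_words(text, ascii_only=False):
    """Replace every non-word character with a space, then split on whitespace."""
    # re.ASCII restricts \w to [a-zA-Z0-9_]; without it \w is Unicode-aware
    flags = re.ASCII if ascii_only else 0
    spaced = re.sub(r"[^\w]", " ", text, flags=flags)
    # split() with no argument collapses runs of whitespace
    return spaced.split()

print(to_words("This is a string, with words!"))
print(to_words("hello-world"))
```

With `ascii_only=True`, accented letters are treated as separators, so the choice depends on whether the input is expected to contain non-ASCII words.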

抱着落日 2024-11-17 13:24:36

Doing this properly is quite complex. The problem is known as word tokenization, which is the term to search for. Look at NLTK if you want to see what others have done rather than starting from scratch:

>>> import nltk
>>> paragraph = "Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']
['And', 'this', 'is', 'my', 'second', '.']
南风几经秋 2024-11-17 13:24:36

The simplest way:

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
星星的軌跡 2024-11-17 13:24:36

Using string.punctuation for completeness:

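import re
import string
# s is the input string; re.escape keeps characters such as ], ^ and \ literal inside the class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()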

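This handles newlines as well, because split() splits on any whitespace. Note that punctuation is deleted rather than replaced by a space, so 'hello-world' becomes 'helloworld'.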
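As a variation on the same idea, the punctuation can be deleted without a regular expression by using `str.translate`. This is a sketch of an alternative technique, not the answer's own method, and the names `_DELETE_PUNCT` and `strip_punctuation_words` are made up for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete it)
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation_words(s):
    """Delete ASCII punctuation, then split on any whitespace (including newlines)."""
    return s.translate(_DELETE_PUNCT).split()

print(strip_punctuation_words("This is a string,\nwith words!"))
```

Because characters are deleted rather than replaced, hyphenated or apostrophised words are fused together, which may or may not be what you want.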

暮光沉寂 2024-11-17 13:24:36

Well, you could use

import re
list = re.sub(r'[.!,;?]', ' ', string).split()

Note that list is the name of a built-in type and string is the name of a standard-library module, so you probably don't want to use those as your variable names.

笨死的猪 2024-11-17 13:24:36

Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:

import re
import string

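def extract_words(s):
    # Strip punctuation only at the start or end of each whitespace-separated token
    edges = '^[{0}]+|[{0}]+$'.format(re.escape(string.punctuation))
    return [re.sub(edges, '', w) for w in s.split()]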

>>> str = 'This is a string, with words!'
>>> extract_words(str)
['This', 'is', 'a', 'string', 'with', 'words']

>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(str)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
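Trimming punctuation only at the edges of each token can also be sketched with `str.strip`, which accepts a set of characters to remove from both ends. The function name `extract_words_strip` is hypothetical, and dropping tokens that consist only of punctuation is an added assumption.

```python
import string

def extract_words_strip(s):
    """Trim ASCII punctuation from both ends of each whitespace-separated token."""
    stripped = (w.strip(string.punctuation) for w in s.split())
    # Tokens made only of punctuation become empty strings; drop them
    return [w for w in stripped if w]

print(extract_words_strip('''I'm a custom-built sentence with "tricky" words.'''))
```

Punctuation inside a token, such as the apostrophe in "I'm" or the hyphen in "custom-built", is left untouched.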
不再见 2024-11-17 13:24:36

Personally, I think this is slightly cleaner than the answers provided

import re

def split_to_words(sentence):
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))  # Use sentence.lower(), if needed
原野 2024-11-17 13:24:36

A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
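One possible pattern that keeps internal apostrophes and hyphens is sketched below. The pattern, the name `WORD_RE` and the helper `find_words` are assumptions chosen for illustration; a real application may need different rules (for example for Unicode letters or curly apostrophes).

```python
import re

# One or more ASCII letters/digits, optionally followed by groups of
# a single apostrophe or hyphen plus more letters/digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def find_words(text):
    """Return words, keeping apostrophes and hyphens that sit inside a word."""
    return WORD_RE.findall(text)

print(find_words("I'm a custom-built sentence, with 'quoted' words!"))
```

Quotes and hyphens at the edge of a word are not matched because the pattern requires a letter or digit on both sides of them.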

笑脸一如从前 2024-11-17 13:24:36
word_list = mystr.split(" ")  # splits on single spaces; punctuation stays attached to the words
九局 2024-11-17 13:24:36

This way you drop every character that is not an ASCII letter:

def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure if this is fast or optimal or even the right way to program.
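The same filtering can be written more compactly. This is a sketch that assumes the goal is unchanged (keep only ASCII letters in each token); the name `words_to_list` is illustrative.

```python
import string

def words_to_list(text):
    """Keep only ASCII letters in each whitespace-separated token."""
    letters = set(string.ascii_letters)  # set membership is O(1) per character
    cleaned = ("".join(c for c in token if c in letters) for token in text.split())
    # Tokens with no letters at all collapse to '' and are skipped
    return [word for word in cleaned if word]

print(words_to_list('She loves you, yea yea yea! '))
```

Like the loop version, this removes apostrophes and digits too, so "I'm" becomes "Im".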

以酷 2024-11-17 13:24:36
def split_string(string):
    return string.split()

This function will return the list of words of a given string.
In this case, if we call the function as follows,

string = 'This is a string, with words!'
split_string(string)

The function would return the following (note that punctuation stays attached to the words):

['This', 'is', 'a', 'string,', 'with', 'words!']
ζ澈沫 2024-11-17 13:24:36

This is from my attempt at a coding challenge where regex was not allowed:

outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split()  # split() avoids empty strings from repeated spaces

The role of apostrophe seems interesting.

清泪尽 2024-11-17 13:24:36

Probably not very elegant, but at least you know what's going on.

my_str = "Simple sample, test! is, only".lower()
punctuation = [',', '.', '!', '(', ')', ':', ';', '-']
min_length = 3  # keep only words longer than this many characters
my_lst = []
temp = ""
for ch in my_str:
    if ch in punctuation:
        continue  # drop punctuation
    if ch == ' ':
        if len(temp) > min_length:
            my_lst.append(temp)
        temp = ""  # reset even when the word was too short
    else:
        temp += ch
if len(temp) > min_length:  # flush the last word
    my_lst.append(temp)
print(my_lst)
ま昔日黯然 2024-11-17 13:24:36

string = 'This is a string, with words!'

list = [word for word in string.split()]

print(list)

['This', 'is', 'a', 'string,', 'with', 'words!']

〆凄凉。 2024-11-17 13:24:36

You can try and do this:

tryTrans = str.maketrans(",!", "  ")  # Python 3; in Python 2 this was string.maketrans
text = "This is a string, with words!"
text = text.translate(tryTrans)  # replace each ',' and '!' with a space
listOfWords = text.split()