How to remove a list of words from a list of strings

Posted on 2024-09-14 16:51:52

Sorry if the question is a bit confusing. This is similar to this question

I think the question above is close to what I want, but it is in Clojure.

There is another question

I need something like this, but instead of '[br]' in that question, there is a list of strings that need to be searched for and removed.

Hope I made myself clear.

I think that this is due to the fact that strings in Python are immutable.

I have a list of noise words that need to be removed from a list of strings.

If I use a list comprehension, I end up searching the same string again and again. So only "of" gets removed and not "the". My modified list looks like this:

places = ['New York', 'the New York City', 'at Moscow']  # and many more

noise_words_list = ['of', 'the', 'in', 'for', 'at']

for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

I would like to know what mistake I'm making.

Comments (4)

硬不硬你别怂 2024-09-21 16:51:52

Without regular expressions, you could do it like this:

places = ['of New York', 'of the New York']

noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places
         ]
print(stuff)  # ['New York', 'New York']
眼泪也成诗 2024-09-21 16:51:52

Here is my stab at it. This uses regular expressions.

import re
pattern = re.compile(r"(of|the|in|for|at)\W", re.I)  # raw string keeps \W intact
phrases = ['of New York', 'of the New York']
list(map(lambda phrase: pattern.sub("", phrase), phrases))  # ['New York', 'New York']

Sans lambda:

[pattern.sub("", phrase) for phrase in phrases]

Update

Fix for the bug pointed out by gnibbler (thanks!):

pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']

@prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify, run both versions of the regular expression against the phrase "Spain has rain".
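As a quick sketch of that check, the snippet below runs both patterns on the same phrase. The names `loose` and `bounded` are just labels introduced here for the two versions of the pattern.

```python
import re

# Labels introduced for illustration: the first pattern and the \b fix.
loose = re.compile(r"(of|the|in|for|at)\W", re.I)
bounded = re.compile(r"\b(of|the|in|for|at)\W", re.I)

phrase = "Spain has rain"

print(loose.sub("", phrase))    # 'Spahas rain': the "in " inside "Spain " is cut
print(bounded.sub("", phrase))  # 'Spain has rain': left unchanged
```

Note that both patterns require a non-word character after the noise word, so a noise word at the very end of a phrase is left in place.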

杀お生予夺 2024-09-21 16:51:52
>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']
生来就爱笑 2024-09-21 16:51:52

Since you would like to know what you are doing wrong, this line:

stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

takes the current place and loops over the noise words. First it checks for "of": your place (e.g. "of the New York") is tested to see whether it starts with "of", and if so it is transformed (by the calls to replace and strip) and added to the result list. The crucial point is that this result is never examined again. Each noise word that matches adds a new, separate result to the list, always computed from the original place. The next word is "the", and your place ("of the New York") doesn't start with "the", so no new result is added.
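To see this concretely, here is a small sketch that runs the comprehension for a single place, using the values from the question:

```python
noise_words_list = ['of', 'the', 'in', 'for', 'at']
place = 'of the New York'

# Every noise word is tested against the same, unmodified `place`,
# so only the word the string actually starts with produces an entry.
stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
print(stuff)  # ['the New York']
```

Only "of" passes the startswith test, so the list holds one entry and "the" is still there.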

I assume the result you eventually got is the concatenation of your place variables. A procedural version that is easier to read and understand would be (untested):

results = []
for place in places:
    # Reassign place so later noise words see the already-stripped string.
    for word in noise_words_list:
        if place.startswith(word):
            place = place.replace(word, "").strip()
    results.append(place)

Keep in mind that replace() removes the word everywhere in the string, even where it occurs only as a substring of another word. You can avoid this by using a regular expression with a pattern something like ^the\b.
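One way to sketch that suggestion is to build an anchored pattern from the noise words and strip only the leading matches. The function name `strip_leading_noise` and the exact pattern are assumptions made for illustration, not something the question or the other answers specify.

```python
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']

# ^ anchors at the start of the string; the outer (?:...)+ consumes several
# leading noise words in a row; \b stops "at" from matching inside "Athens".
leading_noise = re.compile(
    r'^(?:(?:%s)\b\s*)+' % '|'.join(map(re.escape, noise_words_list)),
    re.I,
)

def strip_leading_noise(place):
    """Remove noise words from the front of `place` only (hypothetical helper)."""
    return leading_noise.sub('', place).strip()

places = ['New York', 'of the New York', 'at Moscow', 'Athens', 'Bank of England']
print([strip_leading_noise(p) for p in places])
# ['New York', 'New York', 'Moscow', 'Athens', 'Bank of England']
```

Because the pattern is anchored, noise words in the middle of a name (as in "Bank of England") are kept. If those should go too, the split-and-filter approach from the first answer is probably the simpler choice.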
