Python nltk

Compare strings and find the parts present in every string

Posted on 2025-02-11 06:17:27

How do I compare several rows and find the words, or combinations of words, that are present in every row? Using pure Python, NLTK, or anything else.

few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')
# some magic
result = 'foo bar'

我只土不豪 2025-02-18 06:17:27

Split each string at whitespaces and save the resulting words into sets. Then, compute the intersection of the three sets:

few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')
sets = [set(s.split()) for s in few_strings]
common_words = sets[0].intersection(*sets[1:])
print(common_words)

Output:

{'bar', 'foo'}
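
The split-based version treats `Foo` and `foo,` as different words. Below is a minimal sketch of one way to normalize the input first, assuming matching should be case-insensitive and punctuation should be ignored. The helper name `common_words` and the regular expression are illustrative choices, not part of the answer above:

```python
import re

def common_words(strings):
    """Return the set of words present in every string (hypothetical helper)."""
    # Assumption: a "word" is a run of letters, digits, underscores, or
    # apostrophes, and words are compared case-insensitively.
    word_sets = [set(re.findall(r"[\w']+", s.lower())) for s in strings]
    if not word_sets:
        return set()
    return set.intersection(*word_sets)

few_strings = ('This is foo bar.', 'this is not a Foo bar', 'some other foo bar, here')
print(sorted(common_words(few_strings)))  # ['bar', 'foo']
```

The result is still a set, so it says which words are shared but nothing about their order or adjacency.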

拒绝两难 2025-02-18 06:17:27

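You might want to use the standard-library module difflib for sequence comparisons, including finding common substrings:

from difflib import SequenceMatcher

list_of_str = ['this is foo bar', 'this is not a foo bar', 'some other foo bar here']

# Narrow the candidate down by matching it against each remaining string.
# Calling find_longest_match() with no arguments requires Python 3.9+.
result = list_of_str[0]
for next_string in list_of_str[1:]:
    match = SequenceMatcher(None, result, next_string).find_longest_match()
    result = result[match.a:match.a + match.size]

# Intended result: 'foo bar'. Caveat: only one block survives each step, and
# equal-length ties go to the block that starts earliest in `result`
# (here 'this is ' and ' foo bar' are both 8 characters), so check the output.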

See the difflib documentation for details. A two-string example:

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

match = SequenceMatcher(None, string1, string2).find_longest_match()

print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a:match.a + match.size])  # -> apple pie
print(string2[match.b:match.b + match.size])  # -> apple pie
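
`find_longest_match()` compares characters and keeps a single block per comparison, so chaining it across several strings depends on how ties are broken. As an alternative, here is a rough sketch that works on whole words and tests every contiguous run of the shortest sentence against all the others. The helper name `longest_common_phrase` is hypothetical, and the brute-force search is only meant for short inputs like the ones in the question:

```python
def longest_common_phrase(strings):
    """Longest run of consecutive words found in every string (hypothetical helper)."""
    token_lists = [s.split() for s in strings]
    if not token_lists:
        return ''
    shortest = min(token_lists, key=len)

    def contains(tokens, phrase):
        # True if `phrase` appears in `tokens` as a contiguous slice.
        n = len(phrase)
        return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

    # Try the longest candidates first, so the first hit is a longest phrase.
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            phrase = shortest[start:start + size]
            if all(contains(tokens, phrase) for tokens in token_lists):
                return ' '.join(phrase)
    return ''

few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')
print(longest_common_phrase(few_strings))  # foo bar
```

If several phrases share the maximum length, this sketch returns the one that starts earliest in the shortest sentence.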

何以畏孤独 2025-02-18 06:17:27

few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')

1. Create a set of words for each sentence by splitting on spaces (" ").
2. Start the result with the first set.
3. Loop over the sets and update the result with the intersection of the current result and each set.

# 1.
sets = [set(s.split(" ")) for s in few_strings]
# 2.
result = sets[0]
# 3.
for i in range(len(sets)):
    result = result.intersection(sets[i])

Now you have a Python set of the words that occur in all sentences.
You can convert the set to a list with:

result = list(result)

or to a string (sets are unordered, so the word order is not guaranteed) with:

result = " ".join(result)
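
Because a set has no defined order, joining it directly can produce either 'foo bar' or 'bar foo'. One possible way to keep the words in sentence order, sketched here under the assumption that the order of the first sentence is the one that matters, is to filter that sentence against the common set:

```python
few_strings = ('this is foo bar', 'this is not a foo bar', 'some other foo bar here')

sets = [set(s.split()) for s in few_strings]
common = set.intersection(*sets)

# Walk the first sentence so the words come out in their original order;
# dict.fromkeys drops repeats while keeping first-seen order (Python 3.7+).
ordered = list(dict.fromkeys(w for w in few_strings[0].split() if w in common))
result = " ".join(ordered)
print(result)  # foo bar
```

This restores the order but does not check that the words are adjacent in every sentence.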

零崎曲识 2025-02-18 06:17:27

You can do it without using libraries too:

few_strings = ('this is foo bar', 'some other foo bar here', 'this is not a foo bar')
strings = [s.split() for s in few_strings]
strings.sort(key=len)
print(strings)
result = ''

for word in strings[0]:
    count = 0
    for string in strings:
        if word not in string:
            break
        else:
            count += 1
    if count == len(strings):
        result += word + ' '

print(result)
