Compare two .txt files in Python and save exact and similar matches to a .txt file

Posted 2024-11-19 07:46:33

What I need is:

text_file_1.txt:
apple
orange
ice
icecream

text_file_2.txt:
apple
pear
ice

When I use "set", the output is:

apple
ice

("equivalent of re.match")

but I want to get:

apple
ice
icecream

("equivalent of re.search")

Is there any way to do this? The files are large, so I can't just iterate over them and apply a regex.

与酒说心事 2024-11-26 07:46:33

You may want to take a look at difflib.

风轻花落早 2024-11-26 07:46:33

If all you want is to extract from the files the words where one is a substring of the other (including identical words), you could do:

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}
# using sets removes duplicate words, so no pair is checked twice

result = set()
for wone in fone:
    for wtwo in ftwo:
        # keep both words when either one contains the other
        if wtwo in wone or wone in wtwo:
            result.add(wone)
            result.add(wtwo)
for w in result:
    print(w)
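
Since the question mentions large files and saving the result, here is a rough sketch of the same substring idea that reads the two files and writes the matches out. It is not the loop above: instead of comparing every pair, it looks up each substring of a word from the first file in a set built from the second file, which is a different technique I am swapping in to avoid the nested loop. It only checks one direction (words of the first file that contain a word of the second file), which is what the question's expected output suggests. The function names and the output file are my own, hypothetical choices, and I am assuming one word per line.

```python
from pathlib import Path


def load_words(path):
    """Read one word per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def words_containing(words, needles):
    """Return the words that contain at least one needle as a substring."""
    if not needles:
        return set()
    # Only substring lengths that actually occur among the needles matter.
    lengths = {len(needle) for needle in needles}
    matches = set()
    for word in words:
        # Slice out every substring of a relevant length and test it
        # against the needle set (a hash lookup, not a scan).
        if any(
            word[start:start + size] in needles
            for size in lengths
            if size <= len(word)
            for start in range(len(word) - size + 1)
        ):
            matches.add(word)
    return matches


def save_matches(first_path, second_path, out_path):
    """Write the words of first_path that contain any word of second_path."""
    matches = words_containing(load_words(first_path), load_words(second_path))
    text = "".join(word + "\n" for word in sorted(matches))
    Path(out_path).write_text(text, encoding="utf-8")
    return matches
```

With the files from the question, `save_matches("text_file_1.txt", "text_file_2.txt", "matches.txt")` should write apple, ice and icecream to `matches.txt` (a file name I made up). The work per word depends on the word's length and on how many distinct word lengths the second file has, not on how many words the second file has, and both word sets still have to fit in memory.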

Alternatively, if you want a similarity measure based on how closely the letter sequences of two strings match, you could use one of the classes provided by difflib, as Paul suggested in his answer:

import difflib as dl

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}

result = set()
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        # 0.6 is the default cutoff used by difflib.get_close_matches
        if s.ratio() > 0.6:
            result.add(wone)
            result.add(wtwo)
for w in result:
    print(w)

I have not timed either sample, but I would guess the second runs much slower, because a SequenceMatcher object has to be instantiated for every pair.
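
For the difflib route, a shorter sketch is possible with `difflib.get_close_matches`, which applies the same ratio test internally and returns the best candidates for one word. The helper name `close_matches` and its return shape are hypothetical, just for illustration. One thing to keep in mind: the ratio for 'ice' against 'icecream' is 2*3/11, about 0.55, so with a 0.6 cutoff 'icecream' is not reported, and you would need a lower cutoff to catch it.

```python
import difflib


def close_matches(words, candidates, cutoff=0.6):
    """Map each word to the candidates whose similarity ratio is >= cutoff."""
    candidates = list(candidates)
    result = {}
    for word in words:
        # get_close_matches returns at most n candidates, best match first.
        found = difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)
        if found:
            result[word] = found
    return result


fone = ['apple', 'orange', 'ice', 'icecream']
ftwo = ['apple', 'pear', 'ice']

# Default cutoff: only the identical words are close enough.
print(close_matches(fone, ftwo))
# Lower cutoff: 'icecream' is now paired with 'ice' as well.
print(close_matches(fone, ftwo, cutoff=0.5))
```

This still compares every word of the first list against every candidate, so it does not make the large-file case any cheaper. It only saves writing the loop over `SequenceMatcher` by hand.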
