Compare two .txt files in Python and save exact and similar matches to a .txt file

Posted on 2024-11-19 07:46:33

What I need is:

text_file_1.txt:
apple
orange
ice
icecream

text_file_2.txt:
apple
pear
ice

When I use "set", the output will be:

apple
ice

("equivalent of re.match")

but I want to get:

apple
ice
icecream

("equivalent of re.search")

Is there any way to do this? The files are large, so I can't just iterate over them and use a regex.

Comments (2)

与酒说心事 2024-11-26 07:46:33

You might want to check out difflib.
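
As a rough illustration of this suggestion, the sketch below uses `difflib.get_close_matches` on the sample words from the question. The helper name `close_matches` and the `n=3` limit are assumptions made for this example, not part of the original answer. Note that "icecream" and "ice" score 6/11 (about 0.545), so the default cutoff of 0.6 does not pair them; a lower cutoff such as 0.5 does.

```python
import difflib

words_one = ["apple", "orange", "ice", "icecream"]
words_two = ["apple", "pear", "ice"]


def close_matches(words, candidates, cutoff=0.6):
    """Map each word to the candidates that difflib rates as close to it.

    Hypothetical helper for illustration; cutoff=0.6 mirrors difflib's default.
    """
    found = {}
    for word in words:
        hits = difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)
        if hits:
            found[word] = hits
    return found


# With the default cutoff only the identical words are paired.
matches = close_matches(words_one, words_two)

# Lowering the cutoff to 0.5 also pairs "icecream" with "ice".
loose_matches = close_matches(words_one, words_two, cutoff=0.5)
```

Whether a ratio-based cutoff is appropriate depends on the data: it measures overall similarity, not containment, so long words that merely contain a short word can fall below the threshold.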

风轻花落早 2024-11-26 07:46:33

If all you want is to extract the words where one is a substring of the other (including identical words), you could do:

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}
# Using sets avoids checking the same combination twice.

result = set()
for wone in fone:
    for wtwo in ftwo:
        # Keep both words when either one contains the other.
        if wone in wtwo or wtwo in wone:
            result.add(wone)
            result.add(wtwo)
for w in sorted(result):
    print(w)
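
Since the question asks about files and about saving the result, here is a minimal sketch of the same substring check wired to files. The function names (`read_words`, `substring_matches`, `save_matches`) and the output file name `matches.txt` are assumptions for illustration; the demo writes the sample data into a temporary folder so the block runs on its own.

```python
import tempfile
from pathlib import Path


def read_words(path):
    """Return the set of non-empty, stripped lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def substring_matches(words_one, words_two):
    """Return words from either set that contain, or are contained in,
    a word from the other set (identical words included)."""
    matched = set()
    for one in words_one:
        for two in words_two:
            if one in two or two in one:
                matched.update((one, two))
    return matched


def save_matches(matches, path):
    """Write the matches to a text file, one word per line, sorted."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in sorted(matches):
            handle.write(word + "\n")


# Demo with the sample data from the question, using hypothetical file names.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder, "text_file_1.txt")
    second = Path(folder, "text_file_2.txt")
    output = Path(folder, "matches.txt")
    first.write_text("apple\norange\nice\nicecream\n", encoding="utf-8")
    second.write_text("apple\npear\nice\n", encoding="utf-8")

    save_matches(substring_matches(read_words(first), read_words(second)), output)
    saved = output.read_text(encoding="utf-8").splitlines()
```

This still compares every pair of words, so the running time grows with the product of the two file sizes.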

Alternatively, if you want a similarity measure based on how closely the letter sequences of two strings match, you could use one of the classes provided by difflib, as Paul suggested in his answer:

import difflib as dl

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}

result = set()
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        # 0.6 is the default cutoff that difflib.get_close_matches uses for "close matches".
        if s.ratio() > 0.6:
            result.add(wone)
            result.add(wtwo)
for w in sorted(result):
    print(w)
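
One way to reduce the per-pair cost is to reuse a single `SequenceMatcher`, which caches its analysis of the second sequence, and to call the cheap upper bounds `real_quick_ratio()` and `quick_ratio()` before the full `ratio()`. The sketch below is a hedged variation on the loop above; the function name `similar_pairs` and its `threshold` parameter are assumptions for illustration.

```python
import difflib


def similar_pairs(words_one, words_two, threshold=0.6):
    """Return (word_one, word_two, ratio) for pairs scoring above threshold.

    Hypothetical helper: reuses one SequenceMatcher instead of creating
    a new object for every pair.
    """
    matcher = difflib.SequenceMatcher(None)
    pairs = []
    for two in sorted(words_two):
        matcher.set_seq2(two)  # seq2 is analysed once and cached
        for one in sorted(words_one):
            matcher.set_seq1(one)
            # Both quick ratios are upper bounds on ratio(), so a pair that
            # fails either check cannot pass the full comparison.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold):
                score = matcher.ratio()
                if score > threshold:
                    pairs.append((one, two, score))
    return pairs


pairs = similar_pairs({"apple", "orange", "ice", "icecream"},
                      {"apple", "pear", "ice"})
loose_pairs = similar_pairs({"apple", "orange", "ice", "icecream"},
                            {"apple", "pear", "ice"}, threshold=0.5)
```

With the 0.6 threshold only the identical words pass, because "icecream" against "ice" scores 6/11; the 0.5 run shows that pair being picked up.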

I have not timed either sample, but I would guess the second runs much slower, since it has to instantiate an object for every pair.
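
For large files, the pairwise loops may be the real bottleneck. A different technique, not taken from the answers above, is to enumerate every substring of each word and look it up in the other file's set, so the cost depends on word lengths rather than on the number of pairs. This is only a sketch under the assumption that each line is a fairly short word; the function name `substring_matches_indexed` is made up for this example.

```python
def substring_matches_indexed(words_one, words_two):
    """Find words related by containment without comparing every pair.

    Assumes words are short: each word of length n yields about n*n/2
    substrings, each checked with a constant-time set lookup.
    """
    def contained(words, lookup):
        hits = set()
        for word in words:
            length = len(word)
            for start in range(length):
                for end in range(start + 1, length + 1):
                    piece = word[start:end]
                    if piece in lookup:
                        # word contains piece, and piece is a word in the other set
                        hits.add(word)
                        hits.add(piece)
        return hits

    set_one, set_two = set(words_one), set(words_two)
    # Check containment in both directions.
    return contained(set_one, set_two) | contained(set_two, set_one)


indexed = substring_matches_indexed(
    ["apple", "orange", "ice", "icecream"],
    ["apple", "pear", "ice"],
)
```

This only covers the substring notion of similarity; ratio-based matching with difflib does not reduce to set lookups in the same way.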
