Improving a fuzzy matching algorithm in Python
Task: Take two text files and output 100% matches and 75% matches.
Solution:
import difflib
import csv

# Read both files; each line is one name (the trailing newline is kept)
with open("H:/comm.names.txt", 'r') as fileA:
    setA = fileA.readlines()
with open("H:/acad.names.txt", 'r') as fileB:
    setB = fileB.readlines()

# 100% Match
setMatch100 = set(setA).intersection(setB)
with open("H:/Match100.txt", 'w') as Match100:
    for item in setMatch100:
        Match100.write(item)

# Remove 100% matches from the two lists
setA_LeftOver = set(setA).difference(setMatch100)
setB_LeftOver = set(setB).difference(setMatch100)

# For each item in setA_LeftOver, write the best match in setB_LeftOver
# that is at least 75% similar.
# newline='' is what the csv module expects for output files in Python 3
with open("H:/Match75.csv", 'w', newline='') as fMatch75:
    Match75 = csv.writer(fMatch75)
    Match75.writerow(['File A', 'File B'])
    for item in setA_LeftOver:
        match = difflib.get_close_matches(item, setB_LeftOver, 1, 0.75)
        if len(match) > 0:
            row = [item.rstrip(), match[0].rstrip()]
            Match75.writerow(row)
Problem: This works, but the results aren't very good. Here is an example of a match it produces:
Fovea Pharmaceuticals SA, Kobe Pharmaceutical Univ
I can't raise the difflib cutoff by much, because I need to be able to match "Univ" with "University". Also, I can't simply require that the first words match, because some strings start with "The" and need to be matched with strings that omit "The". Can anyone point me toward a way to throw out matches that are technically 75% similar but that a human would not consider similar at all?
Comments (2)
I ended up writing a script to find the most common words, and then I removed those words from the names. As @e-satis suggested in his comment, this significantly improved my results. However, difflib gave me better results than pylevenshtein, so I can't mark his answer as accepted.
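The common-word approach described here could look roughly like the sketch below. It is not the poster's actual script: the tokenization rule, the helper names (tokenize, most_common_words, strip_common, best_matches) and the top_n default are all assumptions made for illustration. The idea is to count how many names contain each word, drop the most frequent ones (such as "The" or "Univ"), and run difflib on what remains.

```python
import difflib
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())


def most_common_words(names, top_n):
    """Return the top_n words, ranked by how many names contain them."""
    counts = Counter()
    for name in names:
        counts.update(set(tokenize(name)))
    return {word for word, _ in counts.most_common(top_n)}


def strip_common(name, common):
    """Rebuild a name from the words that are not in `common`."""
    return " ".join(w for w in tokenize(name) if w not in common)


def best_matches(names_a, names_b, top_n=2, cutoff=0.75):
    """Pair each name in names_a with its closest name in names_b, if any.

    top_n and cutoff are illustrative defaults, not tuned values.
    """
    common = most_common_words(list(names_a) + list(names_b), top_n)
    # Map each stripped key back to one original name from names_b.
    keys_b = {}
    for name in names_b:
        key = strip_common(name, common)
        if key:
            keys_b.setdefault(key, name)
    pairs = []
    for name in names_a:
        key = strip_common(name, common)
        if not key:
            continue  # nothing distinctive left to compare
        hits = difflib.get_close_matches(key, list(keys_b), 1, cutoff)
        if hits:
            pairs.append((name, keys_b[hits[0]]))
    return pairs
```

Comparing only the distinctive words means a shared "Pharmaceutical" or "Univ" no longer inflates the similarity score. How many words to drop is a judgment call that depends on the data, and if two names reduce to the same key, this sketch keeps only the first one.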
I would try comparing the strings with a tool such as pylevenshtein, which supports fuzzy string comparison.
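To show what a Levenshtein-based comparison measures, here is a small pure-Python sketch of edit distance. It is a stand-in written for illustration and does not use pylevenshtein itself. The function names are assumptions, and the similarity scaling (1 minus distance over the longer length) is one common choice that may not match the formula the library uses.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to the range 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this scaling, similarity("Univ", "University") is only 0.4, because six characters have to be inserted. Abbreviations therefore score poorly under plain edit distance, so a cutoff tuned for typos will not catch them on its own.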