Matching fuzzy strings using Python

Posted on 2025-01-22 00:04:42

I have a training dataset, for example:

Letter    Word
A         Apple
B         Bat
C         Cat
D         Dog
E         Elephant

and I need to check a DataFrame such as:

AD    Apple Dog
AE    Applet Elephant
DC    Dog Cow
EB    Elephant Bag
AED   Apple Elephant Dog  
D     Door                
ABC   All Bat Cat

The instances AD, AE and EB are almost accurate (Apple and Applet are considered close to each other, and likewise Bat and Bag), but DC doesn't match.

Output Required:

Letters    Words               Status
AD         Apple Dog           Accept
AE         Applet Elephant     Accept
DC         Dog Cow             Reject
EB         Elephant Bag        Accept
AED        Apple Elephant Dog  Accept
D          Door                Reject
ABC        All Bat Cat         Accept

ABC is accepted because 2 of its 3 words match.

An accepted word needs to match at 70% (fuzzy match), although that threshold is subject to change.
How can I find these matches using Python?

緦唸λ蓇 2025-01-29 00:04:42

You can use thefuzz to solve your problem:

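# Python env: pip install thefuzz
# Conda env: conda install thefuzz
import numpy as np
from thefuzz import fuzz

# df1 holds the reference table (columns Letter, Word);
# df2 holds the rows to check (columns Letters, Words).
THRESHOLD = 70

# Split each code into letters, look up each letter's reference word in df1,
# then join the words back into one string per original row.
df2['Others'] = (df2['Letters'].apply(list).explode().reset_index()
                     .merge(df1, left_on='Letters', right_on='Letter')
                     .groupby('index')['Word'].agg(' '.join))

# Score the whole phrase (0-100) and compare it against the threshold.
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')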

Output:

>>> df2
  Letters               Words              Others  Ratio  Status
0      AD           Apple Dog           Apple Dog    100  Accept
1      AE     Applet Elephant      Apple Elephant     97  Accept
2      DC             Dog Cow             Dog Cat     71  Accept
3      EB        Elephant Bag        Elephant Bat     92  Accept
4     AED  Apple Elephant Dog  Apple Dog Elephant     78  Accept
5       D                Door                 Dog     57  Reject
6     ABC         All Bat Cat       Apple Cat Bat     67  Reject
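
Note that scoring the whole phrase accepts DC (71) and rejects ABC (67), which differs from the Status column requested in the question. A possible alternative is to score each word against the reference word for its letter, by position, and then decide per row. The sketch below is only an illustration under some assumptions: it swaps in difflib.SequenceMatcher from the standard library in place of thefuzz (so the scores may differ slightly from fuzz.ratio), it compares case-insensitively, and the names REFERENCE, ROWS, word_scores, status_by_majority and status_by_mean are made up for this example.

```python
from difflib import SequenceMatcher

# Reference table from the question (Letter -> Word).
REFERENCE = {"A": "Apple", "B": "Bat", "C": "Cat", "D": "Dog", "E": "Elephant"}

# Rows to check, copied from the question (Letters, Words).
ROWS = [
    ("AD", "Apple Dog"),
    ("AE", "Applet Elephant"),
    ("DC", "Dog Cow"),
    ("EB", "Elephant Bag"),
    ("AED", "Apple Elephant Dog"),
    ("D", "Door"),
    ("ABC", "All Bat Cat"),
]


def word_scores(letters, words, reference=REFERENCE):
    """Score each word (0-100) against the reference word for the letter
    in the same position. Unknown letters score 0."""
    return [
        100 * SequenceMatcher(None, reference.get(letter, "").lower(),
                              word.lower()).ratio()
        for letter, word in zip(letters, words.split())
    ]


def status_by_majority(letters, words, threshold=70):
    """Accept when more than half of the words score at least `threshold`."""
    scores = word_scores(letters, words)
    hits = sum(score >= threshold for score in scores)
    return "Accept" if hits * 2 > len(scores) else "Reject"


def status_by_mean(letters, words, threshold=70):
    """Accept when the average per-word score is at least `threshold`."""
    scores = word_scores(letters, words)
    if not scores:
        return "Reject"
    return "Accept" if sum(scores) / len(scores) >= threshold else "Reject"


if __name__ == "__main__":
    for letters, words in ROWS:
        print(letters, words, status_by_majority(letters, words, threshold=65),
              status_by_mean(letters, words))
```

With this scorer, Bag against Bat comes out at about 67, so the majority rule at a threshold of 70 rejects EB. Lowering the threshold to 65 reproduces the Status column from the question, and so does the mean rule at 70. Which rule fits better depends on how "2 of 3 words match" should generalise to two-word rows.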
