Compare two columns and return the most similar column (Python)

Posted 2025-01-10 16:22:17

I have two data frames.

df1 looks like below.

List1
[apple, banana]
[carrots]
[for, spinach, mushrooms, the]

df2 looks like below.

List2
[apple, garden]
[spinach, smoothie]
[garlic, carrots]
[carrots]
[mushroom, the]

I want to match the lists in df1 to the lists in df2 and produce a similarity score.

So desired output looks something like below.

List1                              List2              Sim_Score
[apple, banana]                    [apple, garden]    0.52
[carrots]                          [carrots]          1.0
[for, spinach, mushrooms, the]     [mushrooms, the]   0.49

I can handle the similarity score part. My question is how can I find the best match for every row in List1 using List2?

謸气贵蔟 2025-01-17 16:22:17

Your question is how to find the best match for every row in List1 using List2.

To answer it, I will mock up a similarity score (a multiset overlap in the style of a Dice coefficient) purely so that different pairs get different numbers. Substitute your own scoring function for it.

Here is some code that finds, for each row in List1, the entry in List2 with the "best" (namely highest) similarity score:

        from collections import Counter

        List1 = [
            ['apple', 'banana'],
            ['carrots'],
            ['for', 'spinach', 'mushrooms', 'the']
        ]
        List2 = [
            ['apple', 'garden'],
            ['spinach', 'smoothie'],
            ['garlic', 'carrots'],
            ['carrots'],
            ['mushroom', 'the']
        ]

        def get_sim_score(a, b):
            # Mock score: items shared by a and b (as multisets) * 2 / combined length.
            # Counter.total() requires Python 3.10 or later.
            return (Counter(a) & Counter(b)).total() * 2 / (len(a) + len(b))

        # For each row of List1, keep the (candidate, score) pair with the highest score.
        Sim_Score = [max(((b, get_sim_score(a, b)) for b in List2), key=lambda x: x[1]) for a in List1]

        print(f"{'List1':<40}{'List2':<25}Sim_Score")
        for a, (b, score) in zip(List1, Sim_Score):
            print(f"{str(a):<40}{str(b):<25}{score}")

Output is:

List1                                   List2                    Sim_Score
['apple', 'banana']                     ['apple', 'garden']      0.5
['carrots']                             ['carrots']              1.0
['for', 'spinach', 'mushrooms', 'the']  ['spinach', 'smoothie']  0.3333333333333333
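One thing worth noticing in the last row: with this mock score, `['spinach', 'smoothie']` and `['mushroom', 'the']` tie, and `max()` keeps whichever comes first in List2. A minimal sketch of that tie, using the same formula but with `sum(...values())` swapped in for `Counter.total()` so it also runs on Python versions before 3.10 (the names `row`, `candidates`, `scores` and `best` are only illustrative):

```python
from collections import Counter

def get_sim_score(a, b):
    # Same mock formula as above: shared items * 2 / combined length.
    # sum(...values()) stands in for Counter.total(), which needs Python 3.10+.
    return sum((Counter(a) & Counter(b)).values()) * 2 / (len(a) + len(b))

row = ['for', 'spinach', 'mushrooms', 'the']
candidates = [['spinach', 'smoothie'], ['mushroom', 'the']]

# Each candidate shares exactly one word with the row, so the scores are equal.
scores = [get_sim_score(row, b) for b in candidates]

# max() returns the first item with the highest key, so the earlier candidate wins.
best = max(candidates, key=lambda b: get_sim_score(row, b))

print(scores)
print(best)
```

If ties matter for your data, a real similarity function (or a secondary sort key) would have to break them explicitly.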

The key line of code can be expanded (for easier understanding) as follows:

        #Sim_Score = [max(((b, get_sim_score(a, b)) for b in List2), key=lambda x:x[1]) for a in List1]

        Sim_Score = []
        for a in List1:
            cur_b, cur_max = None, 0
            for b in List2:
                cur_score = get_sim_score(a, b)
                if cur_b is None or cur_score > cur_max:
                    cur_b, cur_max = b, cur_score
            Sim_Score.append((cur_b, cur_max))
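Since you already have your own similarity score, the same loop can be wrapped in a small helper that takes the scoring function as a parameter. This is only a sketch: `best_matches` and `dice_score` are hypothetical names, and `dice_score` is just the mock formula from above standing in for your real scorer.

```python
from collections import Counter

def dice_score(a, b):
    # Stand-in scorer (same formula as the mock above); replace with your own.
    return sum((Counter(a) & Counter(b)).values()) * 2 / (len(a) + len(b))

def best_matches(rows, candidates, score=dice_score):
    """Return a (row, best_candidate, best_score) tuple for each row."""
    results = []
    for a in rows:
        # Pick the candidate with the highest score; the first one wins ties.
        best = max(candidates, key=lambda b: score(a, b))
        results.append((a, best, score(a, best)))
    return results

list1 = [['apple', 'banana'], ['carrots']]
list2 = [['apple', 'garden'], ['garlic', 'carrots'], ['carrots']]

matches = best_matches(list1, list2)
for a, b, s in matches:
    print(a, b, s)
```

This compares every row against every candidate, which is fine for small inputs, and it raises `ValueError` if `candidates` is empty. If the lists live in DataFrame columns, something like `df1['List1'].tolist()` should give the plain Python lists the helper expects.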
