Rank rows by sentence similarity using Python or SQL?

Posted on 2025-01-12 15:49:14

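How can I rank the rows of a data frame by the similarity of their text values? One column contains text, and I want to assign a rank so that similar sentences end up next to each other. Below is a sample of the dataset; the original dataset contains around 100,000 records. Please refer to the linked question on sorting sentences by matching.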

[Image: sample dataset]

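Some methods for clustering similar sentences have been tested on small datasets (see the linked question above). The output we need is a clustering of similar sentences, irrespective of their length.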
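To illustrate what "irrespective of length" could mean in practice, here is a minimal sketch of a word-level overlap coefficient, which scores a short name as fully similar to a longer name that contains all of its words. This is an assumed alternative measure, not something used in the question or the answer; the function name `token_overlap` is hypothetical, and the example strings are taken from the sample data in the answer below.

```python
def token_overlap(a: str, b: str) -> float:
    """Overlap coefficient on word sets: |A & B| / min(|A|, |B|)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0  # an empty string shares nothing with anything
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


short_name = "middle east llc"
long_name = "irca middle east llc"
unrelated = "giga real estate"

# Every word of the shorter name occurs in the longer one.
print(token_overlap(short_name, long_name))  # 1.0
# No shared words at all.
print(token_overlap(short_name, unrelated))  # 0.0
```

Because the denominator is the size of the smaller word set, extra words in the longer sentence do not lower the score. The trade-off is that it ignores spelling variations inside a word, which character-based measures such as Jaro-Winkler do pick up.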

Example: matching sentences using Python, thanks to laurent. The code below works well when the sentences are short.

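import textdistance

# For each row, collect the positions of all rows whose text has a
# Jaro-Winkler similarity of at least 0.9 (this includes the row itself),
# then sort by that list so rows with the same matches sit together.
df = (
    df
    .assign(
        match=df["text"].map(
            lambda x: [
                i
                for i, text in enumerate(df["text"])
                if textdistance.jaro_winkler(x, text) >= 0.9
            ]
        )
    )
    .sort_values(by="match")
    .drop(columns="match")
)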
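Since the stated goal is to cluster similar sentences, one possible way to turn pairwise threshold matches into explicit cluster labels is sketched below. It is an illustration under assumptions, not the method from the linked question: the helper names `similarity` and `cluster_labels` are hypothetical, the threshold of 0.8 is arbitrary, and `difflib.SequenceMatcher` from the standard library stands in for `textdistance.jaro_winkler` so the snippet runs without extra packages.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in for textdistance.jaro_winkler; returns a ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def cluster_labels(texts, threshold=0.8):
    """Give each text a cluster id. Texts connected by a chain of pairwise
    similarities >= threshold share the id of the first text in the chain."""
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similarity(texts[i], texts[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                # Attach the larger root to the smaller one, so the root
                # is always the smallest index in the cluster.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(len(texts))]


texts = [
    "irca middle east llc",
    "giga real estate",
    "middle east llc",
    "grand star contracting llc",
]
labels = cluster_labels(texts)
print(labels)  # [0, 1, 0, 3]

# Sorting by label places members of the same cluster next to each other.
for label, text in sorted(zip(labels, texts)):
    print(label, text)
```

The double loop still compares every pair, so at roughly 100,000 records it would need some form of candidate pre-filtering before it becomes practical.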

Comments (1)

远山浅 2025-01-19 15:49:14

You could try this:

import pandas as pd
import textdistance

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "name": [
            "alexander szelle",
            "al futtain stroes llic",
            "irca middle east llc",
            "giga real estate",
            "mr marwan mohad ibrahim al abdulla",
            "knowledge management human resource consul",
            "yaaqoub hamdan foodstuff trading co llc",
            "grand star contracting llc",
            "middle east llc",
            "marwan mohad ibrahim",
        ],
    }
)

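# For each name, take its highest Jaro-Winkler score against every name in
# the column. A score of exactly 1 is treated as 0 so that a row's comparison
# with itself is ignored (this also ignores exact duplicates of the row).
df = (
    df.assign(
        match=df["name"].map(
            lambda x: max(
                [textdistance.jaro_winkler(x, text) for text in df["name"]],
                key=lambda score: score if score != 1 else 0,
            )
        )
    )
    .sort_values(by="match")
    .reset_index(drop=True)
)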

print(df)
# Output
   id                                        name     match
0   6  knowledge management human resource consul  0.615140
1   4                            giga real estate  0.638258
2   1                            alexander szelle  0.654924
3   7     yaaqoub hamdan foodstuff trading co llc  0.660684
4   8                  grand star contracting llc  0.660684
5   2                      al futtain stroes llic  0.670047
6   5          mr marwan mohad ibrahim al abdulla  0.741471
7  10                        marwan mohad ibrahim  0.741471
8   3                        irca middle east llc  0.805556
9   9                             middle east llc  0.805556
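One caveat with the code above is that it skips the self-comparison by looking for a score of exactly 1, which also discards genuine duplicate rows. A possible variation, sketched here under assumptions, skips the row by position instead and also reports which row was the best match. The function name `best_matches` is hypothetical, and `difflib.SequenceMatcher` is used as a standard-library stand-in for `textdistance.jaro_winkler`, so the scores differ from the output shown above.

```python
from difflib import SequenceMatcher


def best_matches(texts):
    """For each text, return (index of the most similar other text, score).
    Requires at least two texts."""
    results = []
    for i, a in enumerate(texts):
        candidates = [
            (SequenceMatcher(None, a, b).ratio(), j)
            for j, b in enumerate(texts)
            if j != i  # skip the row itself by position, not by score
        ]
        score, j = max(candidates)  # ties go to the larger index
        results.append((j, score))
    return results


names = [
    "irca middle east llc",
    "middle east llc",
    "middle east llc",  # exact duplicate of the previous row
    "giga real estate",
]
for name, (partner, score) in zip(names, best_matches(names)):
    print(f"{name!r} -> row {partner} ({score:.3f})")
```

With this approach the two identical rows point at each other with a score of 1.0 instead of falling back to their next-best match.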
