Comparing two pandas Series whose elements are comma-separated strings, using vectorized operations

Published 2025-02-05 11:58:37


I am creating a custom comparison algorithm for the recordlinkage Python library. My function takes two pandas Series as arguments, where each element of the Series is a comma-separated string of one or more phone numbers. So an example of the Series would look like this:

series1 = pd.Series([
    "1234567890,0987654321",
    "0987654321"
])

series2 = pd.Series([
    "0987654321",
    "1234567890,0987654321"
])

Printed, the two Series look like this:

0    1234567890,0987654321
1    0987654321
dtype: object

0    0987654321
1    1234567890,0987654321
dtype: object

Then I am passing the series to the following function which performs a lambda function operation on the resultant concatenated DataFrame:

def _compute_vectorized(self, ph1, ph2):
    """
    Applies compare_phones to corresponding elements of the two equal-sized Series.

    :param ph1: First series where each element is a comma-separated string of phone numbers.
    :param ph2: Second series where each element is a comma-separated string of phone numbers.
    :return sim: Series of similarity coefficients calculated between both input series.
    """

    ph_df = pd.concat([ph1, ph2], axis=1)
    ph_df.columns = ["ph1", "ph2"]
    sim = ph_df.apply(lambda x: self.compare_phones(x["ph1"], x["ph2"]), axis=1)

    return sim
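For context, the concat-and-apply pattern above can be exercised outside the class. This is a sketch only, with a trivial stand-in comparator (`share`, hypothetical) injected in place of `compare_phones`:

```python
import pandas as pd

def compute_vectorized(ph1, ph2, compare):
    # Same pattern as _compute_vectorized above, with the comparator injected.
    ph_df = pd.concat([ph1, ph2], axis=1)
    ph_df.columns = ["ph1", "ph2"]
    return ph_df.apply(lambda x: compare(x["ph1"], x["ph2"]), axis=1)

s1 = pd.Series(["1234567890,0987654321", "0987654321"])
s2 = pd.Series(["0987654321", "1234567890,0987654321"])

# Trivial stand-in comparator: 1 if the rows share any phone number, else 0.
def share(a, b):
    return int(bool(set(a.split(',')) & set(b.split(','))))

sim = compute_vectorized(s1, s2, share)  # both rows share a number here
```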

Here is the comparison function applied row-wise to the DataFrame:

from strsimpy.normalized_levenshtein import NormalizedLevenshtein

nl = NormalizedLevenshtein()

def compare_phones(self, ph_str_1, ph_str_2):
    """
    Compare comma-separated strings of customer's phone numbers. If any phone numbers match between the sets,
    return a similarity value of 1. Otherwise, compute the normalized Levenshtein distance between the two
    comma-separated strings.

    :param ph_str_1: First comma-separated string of phone numbers.
    :param ph_str_2: Second comma-separated string of phone numbers.
    :return sim: Float similarity coefficient between the two comma-separated strings of phone numbers.
    """

    if len([ph for ph in ph_str_1.split(',') if ph in ph_str_2.split(',')]) > 0:
        sim = 1
    else:
        sim = nl.distance(ph_str_1, ph_str_2)

    return sim

Essentially, if any phone numbers match between the two columns, we get a similarity coefficient of 1. Otherwise, the function returns the normalized Levenshtein distance between the two comma-separated strings, computed with the strsimpy library (note that strsimpy's `distance` is 0 for identical strings, so this branch yields a distance rather than a similarity).
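One micro-optimization worth noting before any vectorization: the membership check re-splits `ph_str_2` for every number in `ph_str_1`, whereas building a `set` once reduces the overlap test to a single intersection. A self-contained sketch of that idea, with a pure-Python `normalized_levenshtein` standing in for strsimpy (it is not the library's actual implementation):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    # Stand-in for strsimpy's NormalizedLevenshtein().distance: 0.0 = identical.
    return levenshtein(a, b) / max(len(a), len(b)) if (a or b) else 0.0

def compare_phones(ph_str_1, ph_str_2):
    # Set intersection is truthy as soon as any phone number is shared.
    if set(ph_str_1.split(',')) & set(ph_str_2.split(',')):
        return 1
    return normalized_levenshtein(ph_str_1, ph_str_2)
```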

This operation has significantly slowed down my total comparison logic, but it is necessary to perform this custom algorithm on phone numbers. My question is: is there a way I can perform this as a vector operation on the input Series (or the concatenated DataFrame of the two Series)? I know this would be faster, but I can't wrap my head around exactly how to do it.

Thank you so much in advance!


Comments (1)

一袭水袖舞倾城 (2025-02-12 11:58:37):


Look at my code below; I've simplified it as much as possible :)

I'm using a different module than you to find the Levenshtein distance, but it's tiny and fast.

import pandas as pd

df = pd.DataFrame({'A': ["1234567890", "5555555555", "0987654321"],
                   'B': ["1237654890", "4444444444", "0984567321"]})

list1 = df.A.tolist()
list2 = df.B.tolist()

# !pip install fuzzywuzzy
from fuzzywuzzy import fuzz

# list1 = ["1234567890","5555555555","0987654321"]
# list2 = ["1237654890","4444444444","0984567321"]

for i in list1:
  for j in list2:
    Ratio = fuzz.ratio(i,j)
    print(f'Between {i} and {j} the ratio is {Ratio}.')

Output:

Between 1234567890 and 1237654890 the ratio is 70.
Between 1234567890 and 4444444444 the ratio is 10.
Between 1234567890 and 0984567321 the ratio is 40.
Between 5555555555 and 1237654890 the ratio is 10.
Between 5555555555 and 4444444444 the ratio is 0.
Between 5555555555 and 0984567321 the ratio is 10.
Between 0987654321 and 1237654890 the ratio is 40.
Between 0987654321 and 4444444444 the ratio is 10.
Between 0987654321 and 0984567321 the ratio is 70.

FuzzyWuzzy documentation here.
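If the goal is row-wise scores (row i of column A against row i of column B, as in the question) rather than the full cross product above, zipping the two columns gives that directly. A sketch using `difflib.SequenceMatcher` from the standard library as a dependency-free stand-in for `fuzz.ratio` (the two scorers are similar but not identical in general):

```python
import difflib
import pandas as pd

df = pd.DataFrame({'A': ["1234567890", "5555555555", "0987654321"],
                   'B': ["1237654890", "4444444444", "0984567321"]})

def ratio(a, b):
    # 0-100 similarity score, in the same spirit as fuzz.ratio.
    return round(difflib.SequenceMatcher(None, a, b).ratio() * 100)

# Row-wise comparison: element i of A vs element i of B.
df['score'] = [ratio(a, b) for a, b in zip(df['A'], df['B'])]
```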
