Compare two pandas Series whose elements are comma-separated strings with a vectorized operation
I am creating a custom comparison algorithm for the recordlinkage Python library. My function takes two pandas Series as arguments, where each element of the series is a comma-separated string of one or more phone numbers. An example of the series looks like this:
import pandas as pd

series1 = pd.Series([
    "1234567890,0987654321",
    "0987654321"
])
series2 = pd.Series([
    "0987654321",
    "1234567890,0987654321"
])
0    1234567890,0987654321
1               0987654321
dtype: object
0               0987654321
1    1234567890,0987654321
dtype: object
Then I am passing the series to the following function which performs a lambda function operation on the resultant concatenated DataFrame:
def _compute_vectorized(self, ph1, ph2):
    """
    Applies lambda function compare_phones to all elements of the two equal-sized Series.
    :param ph1: First series where each element is a comma-separated string of phone numbers.
    :param ph2: Second series where each element is a comma-separated string of phone numbers.
    :return sim: Series of similarity coefficients calculated between both input series.
    """
    ph_df = pd.concat([ph1, ph2], axis=1)
    ph_df.columns = ["ph1", "ph2"]
    sim = ph_df.apply(lambda x: self.compare_phones(x["ph1"], x["ph2"]), axis=1)
    return sim
Here is the lambda function being performed on the DataFrame:
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

nl = NormalizedLevenshtein()

def compare_phones(self, ph_str_1, ph_str_2):
    """
    Compare comma-separated strings of customer's phone numbers. If any phone numbers match between the sets,
    return a similarity value of 1. Otherwise, compute the normalized Levenshtein distance between the two
    comma-separated strings.
    :param ph_str_1: First comma-separated string of phone numbers.
    :param ph_str_2: Second comma-separated string of phone numbers.
    :return sim: Float similarity coefficient between the two comma-separated strings of phone numbers.
    """
    if len([ph for ph in ph_str_1.split(',') if ph in ph_str_2.split(',')]) > 0:
        sim = 1
    else:
        sim = nl.distance(ph_str_1, ph_str_2)
    return sim
Essentially, if any phone numbers match between the two columns, we get a similarity coefficient of 1. Otherwise, the function computes the normalized Levenshtein distance between the comma-separated strings of phone numbers using the strsimpy library.
This operation has significantly slowed down my total comparison logic, but it is necessary to perform this custom algorithm on phone numbers. My question is: is there a way I can perform this as a vector operation on the input series (or on the concatenated df of the two series)? I know this would be faster, but I cannot wrap my head around how to do it exactly.
Thank you so much in advance!
Comments (1)
Look at my code below, I've simplified it to the maximum :)
I'm using a different module than you to find the Levenshtein distance, but it's tiny and fast.
Output:
FuzzyWuzzy documentation here.
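For reference, here is a minimal sketch of one way to reduce the per-row overhead. It is not necessarily the code this answer refers to. It assumes that most of the cost comes from the row-wise `DataFrame.apply(..., axis=1)` call and from repeated string pairs, so it iterates over the two columns with `zip` and caches results with `functools.lru_cache`. The names `normalized_levenshtein` and `compare_phone_columns` are hypothetical, and the pure-Python distance is only a stand-in so the example runs without third-party packages; strsimpy's `nl.distance` could be called in its place. The rule is kept as written in the question, including returning the distance (not a similarity) in the fallback branch.

```python
from functools import lru_cache


def normalized_levenshtein(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]: edit distance divided by the longer length."""
    if a == b:
        return 0.0
    if not a or not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))


@lru_cache(maxsize=None)  # unbounded cache: repeated string pairs are computed once
def compare_phones(ph_str_1: str, ph_str_2: str) -> float:
    """Return 1.0 if the strings share any phone number, else the fallback distance."""
    if not set(ph_str_1.split(",")).isdisjoint(ph_str_2.split(",")):
        return 1.0
    # Same fallback as in the question: a distance, not a similarity.
    return normalized_levenshtein(ph_str_1, ph_str_2)


def compare_phone_columns(ph1, ph2):
    """Compare two equal-length sequences (lists or pandas Series) pairwise."""
    return [compare_phones(a, b) for a, b in zip(ph1, ph2)]


series1 = ["1234567890,0987654321", "0987654321"]
series2 = ["0987654321", "1234567890,0987654321"]
result = compare_phone_columns(series1, series2)
print(result)
```

Because iterating a pandas Series yields its values, the two Series from the question can be passed in directly, and the returned list can be wrapped as `pd.Series(result, index=ph1.index)` inside `_compute_vectorized`. This is still a Python-level loop, not a true vectorized operation, but it usually avoids the cost of building one row object per comparison.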