Compare two text columns to measure their similarity in a DataFrame in Python

Posted 2025-01-25 18:29:29, word count 370, views 3, comments 0

I want to compare column A with column C, and column B with column C, measure the similarity of each pair, and then report whichever of A or B is more similar to C.

import pandas as pd

df = pd.DataFrame([['JAMES LIKEN', 'LINDEN R. EVANS', 'LINDEN R. EVANS'], ['HENRY THEISEN', 'SCOTT ULLEM', 'Henry J. Theisen']])
df.columns = ['A', 'B', 'C']

The result should have three columns. The first two should contain the similarity ratios (A vs. C and B vs. C), and the third should contain the value from column A or column B, whichever is more similar to C. I tried fuzz.partial_ratio and SequenceMatcher, using apply with a lambda to call the function on each row, but it raised an error.
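
For reference, here is a minimal sketch of the result I am after, using difflib.SequenceMatcher from the standard library. The helper name similarity, the output column names A_vs_C, B_vs_C and best_match, and the choice to lowercase both strings before comparing are my own assumptions for illustration. Since the error message is not shown here, the sketch only illustrates one possible row-wise approach; note that apply needs axis=1 to pass rows rather than columns to the lambda.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(left: str, right: str) -> float:
    """Return a 0-1 similarity ratio; case is ignored (an assumption)."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


df = pd.DataFrame(
    [
        ["JAMES LIKEN", "LINDEN R. EVANS", "LINDEN R. EVANS"],
        ["HENRY THEISEN", "SCOTT ULLEM", "Henry J. Theisen"],
    ],
    columns=["A", "B", "C"],
)

# axis=1 hands each row (not each column) to the lambda.
df["A_vs_C"] = df.apply(lambda row: similarity(row["A"], row["C"]), axis=1)
df["B_vs_C"] = df.apply(lambda row: similarity(row["B"], row["C"]), axis=1)

# Keep the value from A where A scored at least as high as B, else take B.
df["best_match"] = df["A"].where(df["A_vs_C"] >= df["B_vs_C"], df["B"])

# Three-column result: two ratios plus the more similar of A or B.
result = df[["A_vs_C", "B_vs_C", "best_match"]]
print(result)
```

Ties go to column A here; that tie-breaking rule is arbitrary and could be changed by flipping the comparison.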
