Comparing two text columns in a DataFrame to measure their similarity in Python
I want to compare column A with column C, and column B with column C, measure the similarity of each pair, and then report whichever of A or B is more similar to C.
import pandas as pd
df = pd.DataFrame([['JAMES LIKEN', 'LINDEN R. EVANS', 'LINDEN R. EVANS'], ['HENRY THEISEN', 'SCOTT ULLEM', 'Henry J. Theisen']])
df.columns = ['A', 'B', 'C']
The result should have three columns. The first two contain the similarity ratios (A vs. C and B vs. C), and the third should contain either column A or B, whichever is more similar to C. I tried fuzz.partial_ratio and SequenceMatcher, using apply with a lambda to call the function on each row, but this led to an error.
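For reference, here is a minimal sketch of what I am trying to achieve, using only difflib.SequenceMatcher from the standard library. The helper name similarity, the output column names A_vs_C, B_vs_C and best_match, the case-insensitive comparison, and the tie-breaking rule are all assumptions made for illustration. The original error message is not shown above, so this is not necessarily the fix for it; one common cause of errors with row-wise apply is omitting axis=1, which the sketch includes.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio (hypothetical helper; ignores case)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


df = pd.DataFrame(
    [
        ["JAMES LIKEN", "LINDEN R. EVANS", "LINDEN R. EVANS"],
        ["HENRY THEISEN", "SCOTT ULLEM", "Henry J. Theisen"],
    ],
    columns=["A", "B", "C"],
)

# axis=1 makes apply pass each row (not each column) to the lambda.
df["A_vs_C"] = df.apply(lambda row: similarity(row["A"], row["C"]), axis=1)
df["B_vs_C"] = df.apply(lambda row: similarity(row["B"], row["C"]), axis=1)

# Keep the value from whichever column scored higher.
# Assumption: ties go to column A.
df["best_match"] = df["A"].where(df["A_vs_C"] >= df["B_vs_C"], df["B"])

print(df[["A_vs_C", "B_vs_C", "best_match"]])
```

If the third column should hold the column name rather than the value, the last assignment could store the labels "A" and "B" instead. fuzz.partial_ratio could be swapped in for the helper, keeping in mind that it returns a score from 0 to 100 rather than 0 to 1.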