Using the Jaccard index to find the best match between required skills and teachers

Posted on 2025-01-26 01:58:11

I have a set of students, each with a list of skills they want to learn, and a set of teachers, each with a list of skills they are ready to teach.

Based on this information I have the two tables below, one for the students and one for the teachers. A '1' marks a skill the student wants to learn (or the teacher is willing to teach); a '0' means the opposite.

|  Students  |  Skill 1  |  Skill 2  |  Skill 3 |  Skill 4 |  Skill 5  |
|------------|-----------|-----------|----------|----------|-----------|
|      A     |      1    |      0    |     0    |     1    |     0     |
|      B     |      1    |      1    |     0    |     0    |     1     |
|      C     |      0    |      0    |     1    |     1    |     0     |
|      D     |      1    |      1    |     0    |     1    |     1     |
|      E     |      0    |      1    |     1    |     0    |     1     |


|  Teachers  |  Skill 1  |  Skill 2  |  Skill 3 |  Skill 4 |  Skill 5  |
|------------|-----------|-----------|----------|----------|-----------|
|      F     |      1    |      1    |     1    |     1    |     1     |
|      G     |      0    |      1    |     0    |     0    |     0     |
|      H     |      0    |      0    |     1    |     1    |     1     |
|      I     |      1    |      1    |     0    |     0    |     0     |
|      J     |      0    |      0    |     1    |     0    |     1     |

I am trying to match the teachers with the appropriate students, and one suggestion I have seen is to use the Jaccard index. However, I am not sure whether the Jaccard index works correctly on binary data.

I tried it on a small dataset, as shown below, but I am not getting the correct result.

import numpy as np

a = [0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 1, 0, 0]

#define Jaccard Similarity function

def jaccard(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union

#find Jaccard Similarity between the two sets 

jaccard(a, b)

The output is 0.16666 even though the two binary lists are identical.

Any suggestions on how to correctly use the Jaccard Index in this case or any other way to match the teachers to the students? Thanks!

Comments (1)

请持续率性 2025-02-02 01:58:11

If I understand correctly, you want to compute the maximum skill overlap using the Jaccard index and assign the "best" teacher to each student.

The first step is to compute a matrix of Jaccard indices:

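import pandas as pd
from itertools import product

# df1 and df2 are the student and teacher tables from the question

# set of wanted skills per student
S = (df1.melt(id_vars='Students')
        .query('value==1')
        .groupby('Students')['variable']
        .agg(frozenset)
     )
# set of offered skills per teacher
T = (df2.melt(id_vars='Teachers')
        .query('value==1')
        .groupby('Teachers')['variable']
        .agg(frozenset)
     )

def jaccard(s1, s2):
    # size of the intersection divided by size of the union
    return len(s1 & s2) / len(s1 | s2)

# one row per student, one column per teacher
df = (pd
   .Series({(s,t): jaccard(S[s], T[t]) for s,t in product(S.index, T.index)})
   .unstack()
   .rename_axis(index='student', columns='teacher')
)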

# df
teacher    F         G         H         I         J
student                                             
A        0.4  0.000000  0.250000  0.333333  0.000000
B        0.6  0.333333  0.200000  0.666667  0.250000
C        0.4  0.000000  0.666667  0.000000  0.333333
D        0.8  0.250000  0.400000  0.500000  0.200000
E        0.6  0.333333  0.500000  0.250000  0.666667
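
The snippet above assumes that `df1` and `df2` already hold the two tables. As a rough sketch of how that could look for the data in the question, the block below builds both frames by hand and recomputes the same matrix with NumPy boolean operations instead of `frozenset`s. The names `skills`, `s`, `t`, `inter`, `union` and `jac` are only illustrative, and the division assumes every student/teacher pair has at least one skill between them.

```python
import numpy as np
import pandas as pd

skills = ['Skill 1', 'Skill 2', 'Skill 3', 'Skill 4', 'Skill 5']

# Student table from the question (1 = wants to learn the skill)
df1 = pd.DataFrame(
    [[1, 0, 0, 1, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [1, 1, 0, 1, 1],
     [0, 1, 1, 0, 1]],
    columns=skills,
)
df1.insert(0, 'Students', list('ABCDE'))

# Teacher table from the question (1 = willing to teach the skill)
df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1],
     [0, 1, 0, 0, 0],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 0, 0],
     [0, 0, 1, 0, 1]],
    columns=skills,
)
df2.insert(0, 'Teachers', list('FGHIJ'))

# Boolean matrices: rows are people, columns are skills
s = df1.set_index('Students').to_numpy(dtype=bool)
t = df2.set_index('Teachers').to_numpy(dtype=bool)

# Broadcast to (students, teachers, skills) and count shared / combined skills
inter = (s[:, None, :] & t[None, :, :]).sum(axis=2)
union = (s[:, None, :] | t[None, :, :]).sum(axis=2)

jac = pd.DataFrame(
    inter / union,
    index=pd.Index(df1['Students'], name='student'),
    columns=pd.Index(df2['Teachers'], name='teacher'),
)
print(jac.round(6))
```

This should reproduce the table above, with the students as rows and the teachers F to J as columns.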

Then, we can solve the assignment problem using scipy.optimize.linear_sum_assignment:

from scipy.optimize import linear_sum_assignment

x, y = linear_sum_assignment(df, maximize=True)

# rows of df are students (x), columns are teachers (y)
out = pd.DataFrame({'student': df.index[x], 'teacher': df.columns[y]})

# out
  student teacher
0       A       G
1       B       I
2       C       H
3       D       F
4       E       J
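
One thing worth checking on this data is the total score of the assignment, because the optimum is not unique here: pairing A with G and B with I gives the same total as pairing A with I and B with G. The following sketch types the Jaccard matrix in as exact fractions (rather than recomputing it) and reports the per-pair scores and their sum; `scores`, `pairs` and `total` are names made up for this example.

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Jaccard matrix from above, written as fractions (rows: students, columns: teachers)
scores = pd.DataFrame(
    [[2/5, 0,   1/4, 1/3, 0],
     [3/5, 1/3, 1/5, 2/3, 1/4],
     [2/5, 0,   2/3, 0,   1/3],
     [4/5, 1/4, 2/5, 1/2, 1/5],
     [3/5, 1/3, 1/2, 1/4, 2/3]],
    index=pd.Index(list('ABCDE'), name='student'),
    columns=pd.Index(list('FGHIJ'), name='teacher'),
)

# One teacher per student, maximizing the summed Jaccard index
rows, cols = linear_sum_assignment(scores.to_numpy(), maximize=True)

pairs = pd.DataFrame({
    'student': scores.index[rows],
    'teacher': scores.columns[cols],
    'jaccard': scores.to_numpy()[rows, cols],
})
total = pairs['jaccard'].sum()

print(pairs)
print('total:', round(total, 4))
```

Which of the tied solutions comes back may depend on floating-point details, so it is probably safer to rely on the total than on the exact pairing of A and B.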

Alternatively, if you just want the best teacher for each student, even if this means potentially having teachers without students and others with many students, use idxmax:

df.idxmax(axis=1)

student
A    F
B    I
C    H
D    F
E    J
dtype: object
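
As for the 0.16666 in the question: `set(list1)` collapses a 0/1 list to its distinct values, `{0, 1}`, so the "intersection" has size 2 whenever both lists contain a 0 and a 1, and the union term becomes 7 + 7 - 2 = 12, giving 2/12 ≈ 0.1667 no matter where the ones sit. The Jaccard index itself works fine on binary data as long as the ones are compared position by position. A minimal sketch of that follows; the name `jaccard_binary` and the choice to return 1.0 for two all-zero vectors are assumptions made for this example.

```python
def jaccard_binary(u, v):
    """Jaccard index of two equal-length 0/1 vectors, compared position by position."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # positions where both vectors have a 1
    both = sum(1 for x, y in zip(u, v) if x and y)
    # positions where at least one vector has a 1
    either = sum(1 for x, y in zip(u, v) if x or y)
    # assumption: two all-zero vectors are treated as identical
    return both / either if either else 1.0


a = [0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 1, 0, 0]
print(jaccard_binary(a, b))  # 1.0 for identical vectors

# student A vs teacher F from the tables: 2 shared skills out of 5
print(jaccard_binary([1, 0, 0, 1, 0], [1, 1, 1, 1, 1]))  # 0.4
```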