Euclidean distance between posts based on tags
I am playing with the Euclidean distance example from the book Programming Collective Intelligence:
# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # if they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in prefs[person1] if item in prefs[person2]])
This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on their tags. I build a map such as:
url1 -> tag1 tag2
url2 -> tag1 tag3
But if I apply this to the function,
pow(prefs[person1][item]-prefs[person2][item],2)
this becomes 0 cause tags don't have weight same tags has ranking 1. I modified the code to manually create a difference to test,
pow(prefs[1,2)
then I got a lot of 0.5 similarities, but the similarity of a post to itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation.
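To make the symptom concrete, here is a minimal reproduction. It assumes the function ends with the book's `return 1/(1+sum_of_squares)` line, which is missing from the snippet above, and it uses a hypothetical `prefs` map in which every tag a post has is given weight 1.

```python
def sim_distance(prefs, person1, person2):
    # Items (here: tags) present in both entries
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    # Assumed final line, taken from the book's version of the function
    return 1 / (1 + sum_of_squares)


# Hypothetical data: each tag a post has gets weight 1
prefs = {
    "url1": {"tag1": 1, "tag2": 1},
    "url2": {"tag1": 1, "tag3": 1},
    "url3": {"tag4": 1},
}

print(sim_distance(prefs, "url1", "url2"))  # 1.0: the only shared tag differs by 0
print(sim_distance(prefs, "url1", "url1"))  # 1.0: identical post
print(sim_distance(prefs, "url1", "url3"))  # 0: no shared tags
```

Because only shared tags are compared and every shared tag has the same weight, any two posts with at least one tag in common score 1.0, so the score cannot rank them.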
Comments (2)
Okay, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:
Next, your post needs a bit of editing for clarity. I don't know what this means: "this becomes 0 cause tags don't have weight same tags has ranking 1."
Lastly, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could tell what you are getting and what you expect to get.
Edit: based on my comment below, I would use code like this:
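One possible sketch of applying Euclidean distance to tagged posts, not necessarily the code this answer had in mind: treat each post as a 0/1 vector over the union of both posts' tags, so a tag that only one post has contributes a difference of 1. The name `sim_distance_tags` and the set-valued `posts` map are assumptions for illustration.

```python
def sim_distance_tags(posts, post1, post2):
    """Similarity in (0, 1] from the squared Euclidean distance of 0/1 tag vectors."""
    tags1, tags2 = set(posts[post1]), set(posts[post2])
    # Over the union of tags, each tag held by exactly one post differs by 1,
    # so the sum of squared differences is the size of the symmetric difference.
    sum_of_squares = len(tags1 ^ tags2)
    return 1 / (1 + sum_of_squares)


# Hypothetical data: each post maps to the set of its tags
posts = {
    "url1": {"tag1", "tag2"},
    "url2": {"tag1", "tag3"},
}

print(sim_distance_tags(posts, "url1", "url1"))  # 1.0: a post matches itself
print(sim_distance_tags(posts, "url1", "url2"))  # 1/3: tag2 and tag3 are unshared
```

With this encoding a post always scores 1.0 against itself, and each unshared tag lowers the score. Posts with no tags in common still get a small positive score rather than 0.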
Basically, tags don't have weights and can't be represented by numerical values. So you can't define a distance between two tags.
If you want to find the similarity between two posts using their tags, I would suggest using the ratio of shared tags. For example, if you have
then you have 2 similar tags, giving 2 (similar tags) / 4 (total tags) = 0.5. I think this would be a good measure of similarity, as long as you have more than 2 tags per post.
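The ratio described above matches the Jaccard index of the two tag sets, if "total tags" is read as the number of distinct tags across both posts (an assumption). A minimal sketch, with a made-up function name `tag_similarity` and sample tags chosen so that 2 of the 4 distinct tags are shared:

```python
def tag_similarity(tags1, tags2):
    """Shared tags divided by distinct tags across both posts (Jaccard index)."""
    tags1, tags2 = set(tags1), set(tags2)
    total = tags1 | tags2
    if not total:
        # Neither post has tags; scoring this as 0.0 is a choice, not a rule
        return 0.0
    return len(tags1 & tags2) / len(total)


# Hypothetical posts with three tags each, two of them shared
post_a = {"python", "math", "distance"}
post_b = {"python", "math", "tags"}

print(tag_similarity(post_a, post_b))  # 0.5: 2 shared / 4 distinct
print(tag_similarity(post_a, post_a))  # 1.0: a post matches itself
```

Unlike the shared-items-only distance, this score drops as posts accumulate tags the other one lacks, and it needs no numeric weight per tag.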