Euclidean distance between posts based on tags

Posted on 2024-08-13 16:13:16, 953 characters, 9 views, 0 comments


I am playing with the Euclidean distance example from the book Programming Collective Intelligence:


# Returns a distance-based similarity score for person1 and person2 
def sim_distance(prefs,person1,person2): 
  # Get the list of shared_items 
  si={} 
  for item in prefs[person1]: 
    if item in prefs[person2]: 
       si[item]=1 
  # if they have no ratings in common, return 0 
  if len(si)==0: return 0 
  # Add up the squares of all the differences 
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]])

This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on tags, so I built a map such as:

url1 -> tag1 tag2
url2 -> tag1 tag3

But if I apply this to the function,

pow(prefs[person1][item]-prefs[person2][item],2)

this becomes 0 cause tags don't have weight same tags has ranking 1. I modified the code to manually create a difference to test:

pow(prefs[1,2)

Then I got a lot of 0.5 similarity scores, but the similarity of a post to itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation.


掀纱窥君容 2024-08-20 16:13:16

Okay, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:

def sim_distance(prefs, person1, person2): 
  # Get the list of shared_items
  p1, p2 = prefs[person1], prefs[person2]
  si = set(p1).intersection(set(p2))

  # Add up the squares of all the differences 
  matches = (p1[item] - p2[item] for item in si)
  return sum(a * a for a in matches)

Next, your post needs a bit of editing for clarity. I don't know what this means: "this becomes 0 cause tags don't have weight same tags has ranking 1."

Lastly, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could tell what you are getting and what you expect to get.
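For illustration, here is a minimal sketch of what such sample data might look like. It assumes prefs maps each URL to a dict whose keys are tags and whose values are all 1; the URLs, the tags, and the helper name shared_sum_of_squares are hypothetical and not taken from the question. Under that assumption, the sum of squares over shared tags is always 0:

```python
# Hypothetical sample data: every tag that is present gets the same "rating" of 1.
prefs = {
    "url1": {"tag1": 1, "tag2": 1},
    "url2": {"tag1": 1, "tag3": 1},
}


def shared_sum_of_squares(prefs, person1, person2):
    """Sum of squared rating differences over the items both entries share."""
    p1, p2 = prefs[person1], prefs[person2]
    shared = set(p1) & set(p2)
    return sum((p1[item] - p2[item]) ** 2 for item in shared)


print(shared_sum_of_squares(prefs, "url1", "url2"))  # 0: the only shared tag is rated 1 on both sides
print(shared_sum_of_squares(prefs, "url1", "url1"))  # 0: a post compared with itself
```

Every shared tag contributes (1 - 1)^2 = 0, so with this encoding the result cannot tell posts that share one tag apart from posts that share all of them.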

Edit: based on my comment below, I would use code like this:

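def sim_distance(prefs, person1, person2):
    # Ratio of shared tags to all distinct tags (the Jaccard index)
    p1, p2 = prefs[person1], prefs[person2]
    s, t = set(p1), set(p2)
    union = s.union(t)
    # Guard against two empty tag sets; float() avoids integer division on Python 2
    return len(s.intersection(t)) / float(len(union)) if union else 0.0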
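
As a usage sketch, assuming Python 3 and that each post's tags are stored as a set: the ratio above is the Jaccard index of the two tag sets. The name tag_similarity and the sample data are illustrative, and returning 0.0 for two untagged posts is an assumption rather than part of the answer.

```python
def tag_similarity(tags, post1, post2):
    """Jaccard index: shared tags divided by all distinct tags of both posts."""
    s, t = set(tags[post1]), set(tags[post2])
    union = s | t
    if not union:
        return 0.0  # assumption: two untagged posts count as dissimilar
    return len(s & t) / len(union)


# Hypothetical sample data in the spirit of the question's url -> tags map.
tags = {
    "url1": {"tag1", "tag2"},
    "url2": {"tag1", "tag3"},
    "url3": {"tag4"},
}

print(tag_similarity(tags, "url1", "url1"))  # 1.0: identical tag sets
print(tag_similarity(tags, "url1", "url2"))  # 1 shared tag out of 3 distinct tags
print(tag_similarity(tags, "url1", "url3"))  # 0.0: nothing in common
```

Unlike the sum of squared differences, this score is 1.0 for a post compared with itself and falls toward 0 as the tag sets diverge.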


疏忽 2024-08-20 16:13:16

Basically, tags don't have weights and can't be represented by numerical values. So you can't define a distance between two tags.

If you want to find the similarity between two posts using their tags, I would suggest using the ratio of shared tags. For example, if you have

url1 -> tag1 tag2 tag3 tag4
url2 -> tag1 tag4 tag5 tag6

then you have 2 shared tags, which gives 2 (shared tags) / 4 (tags per post) = 0.5. I think this would be a good measure of similarity, as long as you have more than 2 tags per post.
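
A minimal sketch of this ratio in Python 3 follows. The answer does not say what to divide by when two posts have different numbers of tags, so dividing by the larger tag count is an assumption made here, and the function name shared_tag_ratio is hypothetical.

```python
def shared_tag_ratio(tags1, tags2):
    """Shared tags divided by the larger tag count (an assumed choice of denominator)."""
    s, t = set(tags1), set(tags2)
    largest = max(len(s), len(t))
    if largest == 0:
        return 0.0  # assumption: no tags means no measurable similarity
    return len(s & t) / largest


url1 = ["tag1", "tag2", "tag3", "tag4"]
url2 = ["tag1", "tag4", "tag5", "tag6"]

print(shared_tag_ratio(url1, url2))  # 0.5: 2 shared tags / 4 tags per post
print(shared_tag_ratio(url1, url1))  # 1.0: a post always matches itself fully
```

Note that dividing by the number of distinct tags across both posts instead (6 in this example) would give 2 / 6, which is a different measure from the 0.5 computed here.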
