How do you balance the number of ratings against the ratings themselves?

Posted on 2024-08-25 17:39:55

For a school project, we'll have to implement a ranking system. However, we figured that a dumb rank average would suck: something that one user ranked 5 stars would have a better average than something 188 users ranked 4 stars, and that's just stupid.

So I'm wondering if any of you have an example algorithm for "smart" ranking. It only needs to take into account the rankings given and the number of rankings.

Thanks!

Comments (5)

烟沫凡尘 2024-09-01 17:39:56

I appreciated the top answer at the time of posting, so here it is codified as JavaScript:

const defaultR = 2; // prior ("default") rating assumed before any votes are cast
const defaultW = 3; // weight of the prior, in virtual votes; should not exceed the typical
                    // number of ratings per item; 0 is equivalent to using only the plain average

function getSortAlgoValue(ratings) {
  const ratingSum = ratings.reduce((sum, r) => sum + r, 0);
  return (defaultR * defaultW + ratingSum) / (defaultW + ratings.length);
}

Only listed as a separate answer because the formatting of the code block as a reply wasn't very good.
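
A quick usage sketch (the item names and ratings are made up for illustration): sorting by this value puts a well-supported 4-star item above an item with a single 5-star vote.

const items = [
  { name: 'A', ratings: [5] },               // one 5-star vote
  { name: 'B', ratings: Array(10).fill(4) }, // ten 4-star votes
];

items.sort((a, b) => getSortAlgoValue(b.ratings) - getSortAlgoValue(a.ratings));
// B scores (3*2 + 10*4) / (3 + 10) ≈ 3.54, A scores (3*2 + 5) / (3 + 1) = 2.75,
// so B is ranked first despite its lower maximum rating.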

七七 2024-09-01 17:39:56

Since you've stated that the machine would only be given the rankings and the number of rankings, I would argue that it may be negligent to attempt a calculated weighting method.

First, there are too many unknowns to confirm the proposition that, in enough circumstances, a larger quantity of ratings is a better indication of quality than a smaller number. One example: how long have rankings been collected? Has an equal collection duration (equal attention) been given to different items ranked with this same method? Others are: which markets have had access to the item and, of course, who specifically ranked it?

Secondly, you've stated in a comment below the question that this is not for front-end use but rather "the ratings are generated by machines, for machines," as a response to my comment that "it's not necessarily only statistical. One person might consider 50 ratings enough, where that might not be enough for another. And some raters' profiles might look more reliable to one person than to another. When that's transparent, it lets the user make a more informed assessment."

Why would that be any different for machines? :)

In any case, if this is about machine-to-machine rankings, the question needs greater detail in order for us to understand how different machines might generate and use the rankings.

Can a ranking generated by a machine be flawed (so as to suggest that more rankings may somehow compensate for those "flawed" rankings)? What would that even mean - a machine error? Or that the item has no use for this particular machine? There are many issues we might first want to unpack here, including whether we have access to how the machines generate their rankings; on some level we may already know what the item means to a given machine, making the aggregated ranking superfluous.

客…行舟 2024-09-01 17:39:56

What you can find on different platforms is the blanking of ratings that do not have enough votes: "This item does not have enough votes."
The problem is that you can't capture this in one simple ranking formula.

I would suggest hiding the rating of items with fewer than a minimum number of votes, but internally calculating a moving average. I always prefer a moving average over a total average, because it favors recent votes over very old ones, which may have been given under completely different circumstances.
Additionally, you do not need to keep a list of all votes: you just store the calculated average, and each new vote simply updates that value.

newAverage = weight * newVoting + (1-weight) * oldAverage

with a weight of about 0.05 to emphasize roughly the last 20 values (just experiment with this weight).

Additionally, I would start with these conditions:

  • no votes = mid-range value (1-5 stars => start with 3 stars)
  • the average is not shown if fewer than 10 votes have been given
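
A minimal sketch of that running update (all names are illustrative), assuming a 1-5 star scale and the starting value and display threshold suggested above:

const WEIGHT = 0.05;  // roughly emphasizes the last ~20 votes
const MIN_VOTES = 10; // below this, hide the average from users

function createItemRating() {
  return { average: 3, count: 0 }; // no votes yet: start in the middle of a 1-5 scale
}

function addVote(item, vote) {
  item.average = WEIGHT * vote + (1 - WEIGHT) * item.average; // moving average update
  item.count += 1;
}

function displayedRating(item) {
  return item.count < MIN_VOTES ? null : item.average; // null = "not enough votes"
}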

吾家有女初长成 2024-09-01 17:39:56

A simple solution might be a straight average of the votes:

sum(votes) / number_of_votes

That way, three people voting 1 star and one person voting 5 stars would give an average of (1+1+1+5)/4 = 2 stars.

Simple, effective, and probably sufficient for your purposes.

霞映澄塘 2024-09-01 17:39:55

You can use a method inspired by Bayesian probability. The gist of the approach is to have an initial belief about the true rating of an item, and use users' ratings to update your belief.

This approach requires two parameters:

  1. What do you think is the true "default" rating of an item, if you have no ratings at all for the item? Call this number R, the "initial belief".
  2. How much weight do you give to the initial belief, compared to the user ratings? Call this W, where the initial belief is "worth" W user ratings of that value.

With the parameters R and W, computing the new rating is simple: assume you have W ratings of value R along with any user ratings, and compute the average. For example, if R = 2 and W = 3, we compute the final score for various scenarios below:

  • 100 (user) ratings of 4: (3*2 + 100*4) / (3 + 100) = 3.94
  • 3 ratings of 5 and 1 rating of 4: (3*2 + 3*5 + 1*4) / (3 + 3 + 1) = 3.57
  • 10 ratings of 4: (3*2 + 10*4) / (3 + 10) = 3.54
  • 1 rating of 5: (3*2 + 1*5) / (3 + 1) = 2.75
  • No user ratings: (3*2 + 0) / (3 + 0) = 2
  • 1 rating of 1: (3*2 + 1*1) / (3 + 1) = 1.75

This computation takes into consideration the number of user ratings, and the values of those ratings. As a result, the final score roughly corresponds to how happy one can expect to be about a particular item, given the data.
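
As a quick sketch of that computation (the function and parameter names are mine, not part of the answer):

// Treat the prior belief R as W extra ratings of value R, then take the average.
function smartScore(ratings, R = 2, W = 3) {
  const sum = ratings.reduce((total, r) => total + r, 0);
  return (W * R + sum) / (W + ratings.length);
}

smartScore(Array(100).fill(4)); // ≈ 3.94, matching the first scenario above
smartScore([5]);                // 2.75
smartScore([]);                 // 2 (no user ratings: the prior alone)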

Choosing R

When you choose R, think about what value you would be comfortable assuming for an item with no ratings. Is the typical no-rating item actually 2.4 out of 5, if you were to instantly have everyone rate it? If so, R = 2.4 would be a reasonable choice.

You should not use the minimum value on the rating scale for this parameter, since an item rated extremely poorly by users should end up "worse" than a default item with no ratings.

If you want to pick R using data rather than just intuition, you can use the following method:

  • Consider all items with at least some threshold of user ratings (so you can be confident that the average user rating is reasonably accurate).
  • For each item, assume its "true score" is the average user rating.
  • Choose R to be the median of those scores.

If you want to be slightly more optimistic or pessimistic about a no-rating item, you can choose R to be a different percentile of the scores, for instance the 60th percentile (optimistic) or 40th percentile (pessimistic).
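
A rough sketch of that data-driven choice (the rating threshold and the percentile are assumptions to experiment with):

// Estimate R as a percentile (default: the median) of per-item average ratings,
// using only items with enough ratings for their averages to be trustworthy.
function estimateR(items, minRatings = 30, percentile = 0.5) {
  const averages = items
    .filter(item => item.ratings.length >= minRatings)
    .map(item => item.ratings.reduce((s, r) => s + r, 0) / item.ratings.length)
    .sort((a, b) => a - b);
  const index = Math.min(averages.length - 1, Math.floor(percentile * averages.length));
  return averages[index];
}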

Choosing W

The choice of W should depend on how many ratings a typical item has, and how consistent ratings are. W can be higher if items naturally obtain many ratings, and W should be higher if you have less confidence in user ratings (e.g., if you have high spammer activity). Note that W does not have to be an integer, and can be less than 1.

Choosing W is a more subjective matter than choosing R. However, here are some guidelines:

  • If a typical item obtains C ratings, then W should not exceed C, or else the final score will be more dependent on R than on the actual user ratings. Instead, W should be close to a fraction of C, perhaps between C/20 and C/5 (depending on how noisy or "spammy" ratings are).
  • If historical ratings are usually consistent (for an individual item), then W should be relatively small. On the other hand, if ratings for an item vary wildly, then W should be relatively large. You can think of this algorithm as "absorbing" W ratings that are abnormally high or low, turning those ratings into more moderate ones.
  • In the extreme, setting W = 0 is equivalent to using only the average of user ratings. Setting W = infinity is equivalent to proclaiming that every item has a true rating of R, regardless of the user ratings. Clearly, neither of these extremes is appropriate.
  • Setting W too large can have the effect of favoring an item with many moderately-high ratings over an item with slightly fewer exceptionally-high ratings.
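
As an illustration of that last point, using the smartScore sketch from above with made-up numbers, an oversized W can flip the ordering of two items:

const manyModerate = Array(50).fill(4); // 50 ratings of 4 stars
const fewExcellent = Array(10).fill(5); // 10 ratings of 5 stars

smartScore(fewExcellent, 2, 3);  // ≈ 4.31 -> with a small W, the excellent item wins
smartScore(manyModerate, 2, 3);  // ≈ 3.89
smartScore(fewExcellent, 2, 30); // 2.75   -> with W far above the typical rating count,
smartScore(manyModerate, 2, 30); // 3.25      the moderately-rated item wins instead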