I need to provide a weighted sort on 2+ factors, ordered by "relevancy". However, the factors aren't completely isolated, in that I want one or more of the factors to affect the "urgency" (weight) of the others.
Example: contributed content (articles) can be up-/down-voted, and thus have a rating; they have a post date, and they're also tagged with categories. Users write the articles and can vote, and may or may not have some kind of ranking themselves (expert, etc). Probably similar to StackOverflow, right?
I want to provide each user with a list of articles grouped by tag but sorted by "relevancy", where relevancy is calculated based on the rating and age of the article, and possibly affected by the ranking of the author. I.e. a highly-ranked article that was written several years ago may not necessarily be as relevant as a medium-ranked article written yesterday. And maybe if an article was written by an expert it would be treated as more relevant than one written by "Joe Schmoe".
Another good example would be assigning hotels a "meta score" comprised of price, rating, and attractions.
My question is, what is the best algorithm for multiple factor sorting? This may be a duplicate of that question, but I'm interested in a generic algorithm for any number of factors (a more reasonable expectation is 2 - 4 factors), preferably a "fully-automatic" function that I don't have to tweak or require user input, and I can't parse linear algebra and eigenvector wackiness.
Possibilities I've found so far:
Note: S is the "sorting score".
- "Linearly weighted" - use a function like:
S = (w1 * F1) + (w2 * F2) + (w3 * F3)
, where wx
are arbitrarily assigned weights, and Fx
are the values of the factors. You'd also want to normalize F
(i.e. Fx_n = Fx / Fmax
). I think this is kinda how Lucene search works.
- "Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher:
S = 1000 * F1 + 100 * F2 + 10 * F3 ...
.
- Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article) and F2 is the "significance modifying" factor ("visits" in the article).
- Bayesian Estimate - looks really similar to ETV; this is how IMDb calculates its rating. See this StackOverflow post for an explanation; equation: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg, where Fx are the same as in #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 or #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but the problem is how do you do this for more than two factors?
I also came across the SQL implementation for a two-factor weighting algorithm, which is basically what I'll need to write eventually.
As mentioned in the comments, I would suggest what's called the 'compromise solution' to anyone with a similar problem who is more concerned with not having to set weights than with making one criterion more heavily weighted than the others.
Basically, you consider each of your criterion as a coordinate (after normalization, of course). Based on your judgement, you choose the absolute optimal point, e.g. in this case, the highest rank author, the newest article, etc. Once you choose the optimal solution, each other 'solution' is rated based on its distance from that optimal. A sample formula would be the inverse of the Euclidean distance for each article's score: S = 1/(sqrt((rank - rank_ideal)^2 + (age - age_ideal)^2 + ... + (xn - xn_ideal)^2)).
This treats all criteria as equal, so keep that in mind.
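A minimal sketch of that idea, assuming all criteria are already normalized to [0, 1] so the distances are comparable (the field names are hypothetical; I also add 1 to the distance before inverting, so an item sitting exactly at the ideal point doesn't divide by zero):

```python
import math

def compromise_score(item, ideal):
    """Inverse Euclidean distance from the ideal point; higher is better.
    `item` and `ideal` are dicts of normalized criteria."""
    dist = math.sqrt(sum((item[k] - ideal[k]) ** 2 for k in ideal))
    return 1.0 / (1.0 + dist)  # +1 guards against division by zero at the ideal

# Hypothetical articles with normalized criteria; the ideal is the best
# possible value on every axis.
ideal = {"author_rank": 1.0, "freshness": 1.0, "rating": 1.0}
articles = [
    {"author_rank": 0.9, "freshness": 0.2, "rating": 0.8},
    {"author_rank": 0.5, "freshness": 1.0, "rating": 0.7},
]
ranked = sorted(articles, key=lambda a: compromise_score(a, ideal), reverse=True)
```

Here the second article wins despite its lower author rank, because it's much closer to the ideal on the freshness axis.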
The solution pointed out above by @gankoji is a simplification of the TOPSIS method.
In TOPSIS the compromise solution can be regarded as choosing the solution with the shortest Euclidean distance from the ideal solution and the farthest Euclidean distance from the negative ideal solution.
This class of problems falls under the term MCDM - Multiple Criteria Decision Making.
The Python packages scikit-criteria and mcdm provide implementations of the most popular methods. The package docs link to the respective algorithm papers.
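To show what those packages do under the hood, here's a minimal hand-rolled TOPSIS sketch. It assumes every criterion is a benefit criterion (higher is better) and uses equal weights; real uses would invert cost criteria like price first:

```python
import math

def topsis(matrix, weights):
    """Score rows of `matrix` (items x criteria, higher = better) by
    TOPSIS closeness to the ideal solution. Returns one score per row,
    in [0, 1]; a dominated item scores 0, a dominating item 1."""
    n_cols = len(matrix[0])
    # Vector-normalize each criterion column, then apply its weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    v = [[row[j] / norms[j] * weights[j] for j in range(n_cols)] for row in matrix]
    # Ideal = best value per criterion; anti-ideal = worst.
    ideal = [max(col) for col in zip(*v)]
    anti = [min(col) for col in zip(*v)]
    scores = []
    for row in v:
        d_pos = math.sqrt(sum((x - i) ** 2 for x, i in zip(row, ideal)))
        d_neg = math.sqrt(sum((x - a) ** 2 for x, a in zip(row, anti)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Three articles scored on (rating, freshness, author_rank), equal weights.
scores = topsis([[4.0, 0.2, 0.9],
                 [3.5, 1.0, 0.5],
                 [1.0, 0.1, 0.1]], [1/3, 1/3, 1/3])
```

The third article is worst on every criterion, so it coincides with the anti-ideal point and scores exactly 0.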
Consider chaining the weights. E.g. you have 3 factors: X, Y and Z.
You can calculate ETVyz as W = (Z/Zmax * Y) + (1 - Z/Zmax) * Yavg for each record, and then calculate ETVxw as S = (W/Wmax * X) + (1 - W/Wmax) * Xavg. You can chain more factors similarly.
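The chaining above can be sketched like this (X is the most important factor, Z the least; the tuple layout is just an illustrative assumption):

```python
def etv(primary, significance, sig_max, primary_avg):
    """One ETV step: `significance` pulls `primary` toward its average."""
    s = significance / sig_max if sig_max else 0.0
    return s * primary + (1 - s) * primary_avg

def chained_etv(records):
    """`records` is a list of (X, Y, Z) tuples, X most important.
    First fold Z into Y to get W, then fold W into X to get S."""
    ys = [y for _, y, _ in records]
    zs = [z for _, _, z in records]
    z_max, y_avg = max(zs), sum(ys) / len(ys)
    ws = [etv(y, z, z_max, y_avg) for _, y, z in records]

    xs = [x for x, _, _ in records]
    w_max, x_avg = max(ws), sum(xs) / len(xs)
    return [etv(x, w, w_max, x_avg) for x, w in zip(xs, ws)]
```

Each step reuses the same ETV formula, so extending to a fourth factor just means one more fold at the front of the chain.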