Where do mathematical algorithms, such as Reddit's ranking, come from?
Recently I was looking at Reddit's algorithm for determining what makes a post a "hot" topic and which content is suitable for the Reddit homepage.
The article I was reading is here: http://amix.dk/blog/post/19588
I noticed that they use logarithms and have created some kind of mathematical function to determine the hotness/relevance of a post.
In the formulas used, where does each of the mathematical components come from, and how did they know to use them?
Thank you!
-- Bakz
EDIT: Just to clarify, I just graduated from high school, and I apologize if the answer to this question seems pretty obvious. Thanks again!
Comments (2)
I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly. I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:
(1) Scores shouldn't need to be recomputed unless the number of votes changes. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good.)
(2) If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)
(3) The more upvotes a story gets, the longer it should remain near the top of the ranking.
(4) Old stories shouldn't stay at the top of the rankings forever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)
(5) Stories with more downvotes than upvotes should not appear in the rankings at all.
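Now let's look at the formula, log z + yt / 45000, and see how it satisfies these requirements. (In the linked article, log is base 10; z is the absolute value of upvotes minus downvotes, with a minimum of 1; y is +1, 0 or −1 according to the sign of that difference; and t is the submission time in seconds, measured from a fixed reference date.)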
If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).
If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).
The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).
Logarithm is a function that grows more slowly as it gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).
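If the story has more downvotes than upvotes, then y = −1, so the time term −t / 45000 is negative. Since t / 45000 is far larger than log z for any realistic vote count, the score is negative. This satisfies requirement (5).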
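The checks above can be tried out in a short Python sketch. This is only an illustration of the formula as described in the linked article, not Reddit's actual code: the function name `hot`, the reference offset `REFERENCE_EPOCH` (the value quoted in that article) and the sample vote counts and dates are all assumptions made for the example.

```python
from datetime import datetime, timezone
from math import log10

# Assumption: reference offset in seconds, as quoted in the linked article.
REFERENCE_EPOCH = 1134028003
# Scale factor from the formula: seconds of age per factor of ten in votes.
SCALE = 45000


def hot(ups: int, downs: int, submitted: datetime) -> float:
    """Sketch of log10(z) + y * t / 45000 for a single story."""
    x = ups - downs                # net votes
    z = max(abs(x), 1)             # magnitude of net votes, at least 1
    y = (x > 0) - (x < 0)          # sign of net votes: +1, 0 or -1
    t = submitted.timestamp() - REFERENCE_EPOCH  # fixed at submission time
    return log10(z) + y * t / SCALE


# Hypothetical stories: three submitted together, one a day later.
posted = datetime(2010, 6, 1, 12, 0, tzinfo=timezone.utc)
next_day = datetime(2010, 6, 2, 12, 0, tzinfo=timezone.utc)

popular = hot(500, 20, posted)    # many net upvotes
modest = hot(50, 20, posted)      # same age, fewer net upvotes
disliked = hot(20, 50, posted)    # more downvotes than upvotes
fresh = hot(10, 0, next_day)      # few votes, but one day newer

print(popular > modest)   # requirement (2): more votes wins at equal age
print(fresh > popular)    # requirement (4): newness eventually wins
print(disliked < 0)       # requirement (5): net-negative stories sink
```

Note that nothing in `hot` depends on the current time, only on the vote counts and the submission time, which is what requirement (1) asks for.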
The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so a story submitted a day later has a t larger by this amount. Dividing 86,400 by 45,000 gives 1.92, which means that one day's relative newness is worth 10^1.92 ≈ 83 votes, and two days' relative newness is worth roughly 7,000 votes.
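The arithmetic behind those vote counts can be reproduced in a few lines of Python. The helper name `votes_equivalent` is made up for this illustration; it simply inverts the base-10 logarithm to ask how many net votes an age gap is worth.

```python
SECONDS_PER_DAY = 86_400
SCALE = 45_000


def votes_equivalent(days: float) -> float:
    """Net votes whose log10 equals the time term gained by being `days` newer."""
    return 10 ** (days * SECONDS_PER_DAY / SCALE)


one_day = votes_equivalent(1)    # 10 ** 1.92, roughly 83
two_days = votes_equivalent(2)   # 10 ** 3.84, roughly 7,000

print(f"one day of newness  ~ {one_day:.0f} net votes")
print(f"two days of newness ~ {two_days:.0f} net votes")
```

Because the exponent grows linearly with age, every extra day multiplies the required votes by about 83, which is why even very popular stories drop off within a day or two.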
They don't come from anywhere. There is no absolute truth to them, nor anything to prove. It's simply a way to quantify an attribute in whatever way seemed most sensible to the development team.
You would use log when you want something to be a factor, although a less important one (since large values do still grow, although very slowly). But by the same token, they could have chosen cube root.
The formulae are simply a representation of those factors which we can presume characteristically belong to something "hot", composed in a way that takes each into account in an appropriate proportion (for example, we'll square those values that have huge importance, and take the log of those that matter less).
Once they came up with the formula, they probably came up with 10 or 15 different types of posts, plugged the numbers in, saw that the results made sense all round, and stuck with it. In fact, their first few attempts probably didn't come out so well, and they arrived at that formula after a little fiddling with the numbers.
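To get a feel for that choice, here is a small Python comparison of the two damping functions. The vote counts are arbitrary sample values and `cube_root` is a helper defined just for this illustration.

```python
from math import log10


def cube_root(v: float) -> float:
    """Cube root of a non-negative number."""
    return v ** (1 / 3)


# Arbitrary sample vote counts, each ten times the previous one.
samples = (10, 100, 1_000, 10_000)

rows = [(votes, log10(votes), cube_root(votes)) for votes in samples]
for votes, as_log, as_cbrt in rows:
    print(f"{votes:>6} votes  log10 = {as_log:.2f}  cube root = {as_cbrt:.2f}")
```

Both functions flatten large vote counts, but the log adds a fixed amount for every tenfold increase while the cube root keeps multiplying, so the cube root would let very popular posts pull further ahead.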