Efficiently determining the probability of a user clicking a hyperlink
So I have a bunch of hyperlinks on a web page. From past observation I know the probabilities that a user will click on each of these hyperlinks. I can therefore calculate the mean and standard deviation of these probabilities.
I now add a new hyperlink to this page. After a short amount of testing I find that of the 20 users that see this hyperlink, 5 click on it.
Taking into account the known mean and standard deviation of the click-through probabilities on other hyperlinks (this forms a "prior expectation"), how can I efficiently estimate the probability of a user clicking on the new hyperlink?
A naive solution would be to ignore the other probabilities, in which case my estimate is just 5/20, or 0.25. However, this throws away relevant information, namely our prior expectation of what the click-through probability is.
So I'm looking for a function that looks something like this:
double estimate(double priorMean,
double priorStandardDeviation,
int clicks, int views);
Since I'm more familiar with code than with mathematical notation, I'd ask that any answers use code or pseudocode in preference to math.
I made this a new answer since it's fundamentally different.
This is based on Christopher Bishop, Pattern Recognition and Machine Learning, Chapter 2 "Probability Distributions", p. 71 ff., and http://en.wikipedia.org/wiki/Beta_distribution.
First we fit a beta distribution to the given mean and variance in order to build a distribution over the parameter. Then we return the mode of that distribution, which serves as the estimate of the parameter of a Bernoulli variable.
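As a rough sketch of the two steps just described (not necessarily the code this answer originally carried), the fit can be done by matching moments, after which the observed clicks and views update the beta parameters. The names `fit_beta`, `prior_std_dev` and the fallback to the posterior mean are assumptions made for illustration; the function `estimate` mirrors the signature from the question.

```python
def fit_beta(prior_mean, prior_std_dev):
    """Method-of-moments fit: return (alpha, beta) of a Beta distribution
    whose mean and standard deviation match the given values."""
    variance = prior_std_dev ** 2
    max_variance = prior_mean * (1.0 - prior_mean)
    if not 0.0 < prior_mean < 1.0:
        raise ValueError("prior_mean must lie strictly between 0 and 1")
    if not 0.0 < variance < max_variance:
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    common = max_variance / variance - 1.0  # equals alpha + beta
    return prior_mean * common, (1.0 - prior_mean) * common


def estimate(prior_mean, prior_std_dev, clicks, views):
    """Estimate the click probability of a new link from prior and data."""
    alpha, beta = fit_beta(prior_mean, prior_std_dev)
    # Conjugate update: clicks count as successes, non-clicks as failures.
    post_alpha = alpha + clicks
    post_beta = beta + (views - clicks)
    if post_alpha > 1.0 and post_beta > 1.0:
        # Mode of Beta(post_alpha, post_beta); only defined in this range.
        return (post_alpha - 1.0) / (post_alpha + post_beta - 2.0)
    # Assumption: fall back to the posterior mean when the mode is undefined.
    return post_alpha / (post_alpha + post_beta)


if __name__ == "__main__":
    # Example numbers are made up: other links average 10% with std dev 5%.
    print(estimate(0.10, 0.05, clicks=5, views=20))
```

With these made-up prior values the result is roughly 0.14, between the prior mean of 0.10 and the naive 5/20 = 0.25, which is the behaviour the question asks for.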
However, I am quite positive that the prior mean/variance will not work well for you, since you throw away the information about how many samples you have and thus how reliable your prior is.
Instead: Given a set of (webpage, link_clicked) pairs, you can calculate the number of pages a specific link was clicked on. Let that be m. Let the number of times that link was not clicked be l.
Now let a be the number of clicks on your new link and b the number of visits to the site. Then the probability for your new link is
This looks pretty trivial but actually has a valid probabilistic foundation. From an implementation perspective, you can keep m and l as global counters.
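One reading that is consistent with the definitions of m, l, a and b above is to pool the prior counts with the new observations, which is the posterior mean of a Beta(m, l) prior after seeing a clicks in b visits. The sketch below assumes that reading; the function name `pooled_estimate` is hypothetical and the formula is an assumption, not a quote from the answer.

```python
def pooled_estimate(m, l, a, b):
    """Pool prior counts with new observations.

    m: prior number of clicks, l: prior number of non-clicks,
    a: clicks on the new link, b: visits in which the new link was shown.
    Assumed formula: (m + a) / (m + l + b).
    """
    if min(m, l, a, b) < 0 or a > b:
        raise ValueError("counts must be non-negative and a must not exceed b")
    total = m + l + b
    if total == 0:
        raise ValueError("no observations at all")
    return (m + a) / total


if __name__ == "__main__":
    # Made-up counts: 10 prior clicks, 90 prior non-clicks, 5 of 20 new visits.
    print(pooled_estimate(10, 90, 5, 20))  # 15 / 120 = 0.125
```

Because m and l are raw counts, a prior built from many observations automatically outweighs a short test, which is the information a bare mean and variance would discard.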
P/N is actually correct from a frequentist perspective.
You could also use a Bayesian approach to incorporate prior knowledge, but since you don't seem to have that knowledge, I guess P/N is the way to go.
If you want, you can also use Laplace's rule, which iirc comes down to a uniform prior. Just give each link on the page a start of 1 instead of 0. (So if you count the number of times a link was clicked, give each link a +1 bonus and reflect that in your N.)
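A minimal sketch of that +1 rule, assuming N is the total number of clicks across all links on the page, so that N grows by one for every link. The function name `laplace_smoothed_shares` and the example link names are hypothetical.

```python
def laplace_smoothed_shares(click_counts):
    """Add-one smoothing: every link starts at 1 click instead of 0.

    click_counts maps a link name to its observed number of clicks.
    Returns the smoothed share of clicks for each link.
    """
    if not click_counts:
        raise ValueError("need at least one link")
    # N is increased by one per link to account for the +1 bonuses.
    smoothed_total = sum(click_counts.values()) + len(click_counts)
    return {link: (count + 1) / smoothed_total
            for link, count in click_counts.items()}


if __name__ == "__main__":
    # Made-up counts; the brand-new link has no clicks yet.
    shares = laplace_smoothed_shares({"home": 7, "about": 3, "new": 0})
    print(shares)  # "new" gets 1/13 rather than 0
```

The point of the bonus is that a link with no observations yet receives a small non-zero estimate instead of exactly 0.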
[UPDATE] Here is a Bayesian approach:
Let p(W) be the probability that a person is in a specific group W. Let p(L) be the probability that a specific link is clicked. Then the probability you are looking for is p(L|W). By Bayes' theorem, you can calculate this by
p(L|W) = p(W|L) * p(L) / p(W)
You can estimate p(L) by the fraction of users who clicked L, p(W) by the size of that group relative to all users, and p(W|L) = p(W and L) / p(L) by the number of persons in group W who clicked L divided by the total number of persons who clicked L.
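The estimates above can be sketched from four raw counts. The function and parameter names are hypothetical, and the sketch assumes each user either clicked L or did not.

```python
def click_probability_given_group(n_users, n_in_group, n_clicked, n_both):
    """Estimate p(L|W) via Bayes' theorem from raw counts.

    n_users: all users, n_in_group: users in group W,
    n_clicked: users who clicked link L,
    n_both: users in group W who clicked L.
    """
    if n_users <= 0 or n_in_group <= 0:
        raise ValueError("need at least one user and one group member")
    if n_clicked == 0:
        return 0.0  # nobody clicked L, so nobody in W did either
    p_l = n_clicked / n_users        # p(L)
    p_w = n_in_group / n_users       # p(W)
    p_w_given_l = n_both / n_clicked  # p(W|L)
    return p_w_given_l * p_l / p_w


if __name__ == "__main__":
    # Made-up counts: 1000 users, 200 in W, 100 clicked L, 40 of those in W.
    print(click_probability_given_group(1000, 200, 100, 40))
```

With plain counts the n_users and n_clicked terms cancel and the result reduces to n_both / n_in_group; the Bayes form is mainly useful when p(L), p(W) and p(W|L) come from different sources.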
Bayes' Theorem Proof:
since,
And substituting (2) with (1),
thus (Bayes' Theorem),
Consequences,
and the definition of independence is,
Note that it is easy to manipulate the probability to your liking by changing the priors and the way the problem is framed; take a look at this discussion of the Anthropic Principle and Bayes' Theorem.
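For reference, the standard derivation that the proof steps above follow can be written out as below. The event names A and B, the equation numbering, and the particular consequence shown are assumptions; only the theorem itself and the definition of independence are standard.

```latex
% Requires the amsmath package.
\begin{align}
  P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \tag{1} \\
  P(A \cap B) &= P(B \mid A)\, P(A) \tag{2}
\end{align}
Substituting (2) into (1) gives Bayes' theorem:
\begin{equation*}
  P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\end{equation*}
Independence of $A$ and $B$ is defined as
\begin{equation*}
  P(A \cap B) = P(A)\, P(B),
\end{equation*}
so for independent events (with $P(B) > 0$) equation (1) reduces to
$P(A \mid B) = P(A)$.
```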
You need to know how strongly X is correlated with W.
If you want to develop a big website, you will most likely also want a more complex mathematical model.
If you run a website like digg, you have a lot of prior knowledge that you have to factor into your calculation.
That leads to multivariate statistics.