在事件尚未发生时确定事件发生的可能性
用户在时间 t 访问我的网站,他们可能会也可能不会点击我关心的特定链接,如果他们点击了该链接,我会记录他们点击该链接的事实,以及自 < 以来的持续时间如果他们点击了它,则称其为d。
我需要一个允许我创建这样的类的算法:
class ClickProbabilityEstimate {
public void reportImpression(long id);
public void reportClick(long id);
public double estimateClickProbability(long id);
}
每个展示都会获得一个唯一的id,并且在报告点击时使用它来指示该点击属于哪个展示。
我需要一种算法,该算法将根据报告展示后经过的时间返回一个概率,根据之前需要的点击时间,该展示将获得点击。显然,如果仍然没有点击,人们会期望这一概率会随着时间的推移而降低。
如果有必要,我们可以设置一个上限,超过该上限我们认为点击概率为 0(例如,如果展示发生已经过了一个小时,我们可以非常确定不会有点击)。
该算法应该具有空间和时间效率,并希望在优雅的同时做出尽可能少的假设。易于实施也很好。有什么想法吗?
A user visits my website at time t, and they may or may not click on a particular link I care about, if they do I record the fact that they clicked the link, and also the duration since t that they clicked it, call this d.
I need an algorithm that allows me to create a class like this:
class ClickProbabilityEstimate {
public void reportImpression(long id);
public void reportClick(long id);
public double estimateClickProbability(long id);
}
Every impression gets a unique id, and this is used when reporting a click to indicate which impression the click belongs to.
I need an algorithm that will return a probability, based on how much time has past since an impression was reported, that the impression will receive a click, based on how long previous clicks required. Clearly one would expect that this probability will decrease over time if there is still no click.
If necessary, we can set an upper-bound, beyond which we consider the click probability to be 0 (eg. if its been an hour since the impression occurred, we can be pretty sure there won't be a click).
The algorithm should be both space and time efficient, and hopefully make as few assumptions as possible, while being elegant. Ease of implementation would also be nice. Any ideas?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
假设您保留有关过去展示次数和点击次数的数据,这很简单:假设您有一次展示,并且自该展示以来已经过了 d' 时间。您可以将数据分为三组:
次数当前印象不在组 (1) 中,因此将其消除。您需要计算它位于组 (2) 中的概率,其中
N2
是组 2 中的展示次数,N3
也类似。就实际实现而言,我的第一个想法是保留过去确实收到点击的展示次数 d 的有序列表,以及从未收到点击的展示次数的计数,然后在该列表中对 d' 进行二分搜索。您找到的位置将为您提供
N1
,然后N2
就是列表的长度减去N1
。如果您不需要完美的粒度,您可以将过去的时间存储为直方图,即一个列表,其中每个元素
list[n]
中包含在过去的时间之后收到点击的展示次数至少n
但少于n+1
分钟。 (或秒,或您喜欢的任何时间间隔)在这种情况下,您可能希望将点击总数保留为单独的变量,以便您可以轻松计算N2
。(顺便说一句,这是我编出来的,我不知道是否有针对这种事情的标准算法可能会更好)
Assuming you keep data on past impressions and clicks, it's easy: let's say that you have an impression, and a time d' has passed since that impression. You can divide your data into three groups:
Clearly the current impression is not in group (1), so eliminate that. You want the probability it is in group (2), which is then
where
N2
is the number of impressions in group 2, and similarly forN3
.As far as actual implementation, my first thought would be to keep an ordered list of the times d for past impressions which did receive clicks, along with a count of the number of impressions which never received a click, and just do a binary search for d' in that list. The position you find will give you
N1
, and thenN2
is the length of the list minusN1
.If you don't need perfect granularity, you can store the past times as a histogram instead, i.e. a list that contains, in each element
list[n]
, the number of impressions that received a click after at leastn
but less thann+1
minutes. (Or seconds, or whatever time interval you like) In that case you'd probably want to keep the total number of clicks as a separate variable so you can easily computeN2
.(By the way, I just made this up, I don't know if there are standard algorithms for this sort of thing that may be better)
我建议假设一个到达过程(每分钟点击次数)并尝试使用现有数据来适应该到达过程的分布。我敢打赌,结果是负二项式,如果均值具有伽玛分布,则当您具有非平稳均值的泊松到达过程时,您会得到这个结果。倒数(每次点击的分钟数)给出了到达间隔过程的分布。不知道是否有一个以此命名的发行版,但您可以创建一个经验发行版。
希望这有帮助。
I would suggest hypothesizing an arrival process (clicks per minute) and trying to fit a distribution to that arrival process using your existing data. I'll bet the result is negative binomial which is what you get when you have a poisson arrival process with a non-stationary mean if the mean has a gamma distribution. The inverse (minutes per click) gives you the distribution of the interarrival process. Don't know if there's a distribution named for that, but you can create an empirical one.
Hope this helps.