Determining the likelihood of an event when the event has not yet occurred

Posted on 2024-08-31 07:38:41

A user visits my website at time t and may or may not click on a particular link I care about. If they do, I record the fact that they clicked the link, along with the time elapsed between t and the click; call this d.

I need an algorithm that allows me to create a class like this:

class ClickProbabilityEstimate {
    public void reportImpression(long id);
    public void reportClick(long id);

    public double estimateClickProbability(long id);
}

Every impression gets a unique id, and this is used when reporting a click to indicate which impression the click belongs to.

I need an algorithm that, given how much time has passed since an impression was reported, returns the probability that the impression will still receive a click, based on how long previous clicks took to arrive. Clearly, one would expect this probability to decrease over time if there is still no click.

If necessary, we can set an upper bound beyond which we consider the click probability to be 0 (e.g. if it's been an hour since the impression occurred, we can be pretty sure there won't be a click).

The algorithm should be efficient in both space and time and, ideally, make as few assumptions as possible while staying elegant. Ease of implementation would also be nice. Any ideas?


奶气 2024-09-07 07:38:41

Assuming you keep data on past impressions and clicks, it's easy: let's say that you have an impression, and a time d' has passed since that impression. You can divide your data into three groups:

1. Impressions which received a click in less than d'
2. Impressions which received a click after more than d'
3. Impressions which never received a click

The current impression has already gone d' without a click, so it cannot be in group (1); eliminate that group. You want the probability that it is in group (2), which is then

P = N2 / (N2 + N3)

where N2 is the number of impressions in group 2, and similarly for N3.
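As a quick sanity check of the formula, here is a worked example with purely hypothetical counts (the numbers are made up for illustration, not taken from any real data): 1000 past impressions, of which 60 were clicked within d', 40 were clicked later than d', and 900 were never clicked.

```latex
N_1 = 60, \qquad N_2 = 40, \qquad N_3 = 900

P = \frac{N_2}{N_2 + N_3} = \frac{40}{40 + 900} = \frac{40}{940} \approx 0.043
```

For comparison, at d' = 0 every click still lies ahead, so the same data gives (60 + 40) / 1000 = 0.1. The estimate drops as time passes without a click, which matches the behaviour the question expects.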

As for the actual implementation, my first thought would be to keep a sorted list of the delays d for past impressions that did receive clicks, along with a count of the impressions that never received a click (this is N3), and do a binary search for d' in that list. The position you find gives you N1, and N2 is then the length of the list minus N1.
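The sorted-list idea could be sketched roughly as follows. This is only an illustration under some assumptions: the class name SortedDelayEstimator and its methods are hypothetical, delays are taken to be in milliseconds, and a past click whose delay equals d' exactly is counted in group (1), which the description above leaves open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: past click delays in a sorted list, queried by binary search. */
class SortedDelayEstimator {
    // Delays (ms) of past impressions that were clicked, kept in ascending order.
    private final List<Long> clickDelays = new ArrayList<>();
    // N3: number of past impressions that never received a click.
    private long neverClicked = 0;

    /** Record a past impression that was clicked delayMs after it was shown. */
    void addClick(long delayMs) {
        int pos = Collections.binarySearch(clickDelays, delayMs);
        if (pos < 0) {
            pos = -pos - 1; // binarySearch encodes the insertion point this way
        }
        clickDelays.add(pos, delayMs); // keeps the list sorted
    }

    /** Record a past impression that never received a click. */
    void addNoClick() {
        neverClicked++;
    }

    /** Estimate N2 / (N2 + N3) for an impression that is still unclicked after elapsedMs. */
    double estimate(long elapsedMs) {
        int n1 = upperBound(elapsedMs);     // N1: clicks with delay <= elapsedMs
        long n2 = clickDelays.size() - n1;  // N2: clicks that took longer
        long denominator = n2 + neverClicked;
        return denominator == 0 ? 0.0 : (double) n2 / denominator;
    }

    /** Index of the first stored delay strictly greater than key. */
    private int upperBound(long key) {
        int lo = 0;
        int hi = clickDelays.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (clickDelays.get(mid) <= key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Queries are O(log n), but inserting into the middle of an ArrayList is O(n) and the list grows with every click, which is part of why a coarser structure may be preferable.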

If you don't need perfect granularity, you can store the past delays as a histogram instead, i.e. a list in which each element list[n] holds the number of impressions that received a click after at least n but less than n+1 minutes (or seconds, or whatever time interval you like). In that case you'd probably want to keep the total number of clicks in a separate variable so you can easily compute N2.
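One way the histogram variant might be wired into the class from the question is sketched below. Several things here are assumptions rather than part of the answer: the constructor parameters (bucketMs, buckets, clock) are hypothetical, the clock is assumed to be monotonic, ids are assumed unique, and the cutoff from the question is taken to be bucketMs * buckets, after which an unclicked impression is counted in N3.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the question's API on top of a fixed-width delay histogram. */
class ClickProbabilityEstimate {
    private final long bucketMs;         // histogram resolution in ms (assumed parameter)
    private final long[] clicksByBucket; // [n]: clicks whose delay fell in bucket n
    private final LongSupplier clock;    // current time in ms, injectable for testing
    // Impressions still waiting for a click: id -> time shown, oldest first.
    private final Map<Long, Long> pending = new LinkedHashMap<>();
    private long totalClicks = 0;        // N1 + N2
    private long neverClicked = 0;       // N3: impressions that passed the cutoff unclicked

    ClickProbabilityEstimate(long bucketMs, int buckets, LongSupplier clock) {
        this.bucketMs = bucketMs;
        this.clicksByBucket = new long[buckets];
        this.clock = clock;
    }

    public void reportImpression(long id) {
        expireOld();
        pending.put(id, clock.getAsLong());
    }

    public void reportClick(long id) {
        expireOld();
        Long shownAt = pending.remove(id);
        if (shownAt == null) {
            return; // unknown id, or the click arrived after the cutoff
        }
        clicksByBucket[bucketOf(shownAt)]++;
        totalClicks++;
    }

    public double estimateClickProbability(long id) {
        expireOld();
        Long shownAt = pending.get(id);
        if (shownAt == null) {
            return 0.0; // already clicked, past the cutoff, or unknown
        }
        long n1 = 0; // clicks that arrived in buckets strictly before the current one
        for (int i = 0, current = bucketOf(shownAt); i < current; i++) {
            n1 += clicksByBucket[i];
        }
        long n2 = totalClicks - n1;
        long denominator = n2 + neverClicked;
        return denominator == 0 ? 0.0 : (double) n2 / denominator;
    }

    private int bucketOf(long shownAt) {
        return (int) ((clock.getAsLong() - shownAt) / bucketMs);
    }

    /** Move impressions older than the cutoff from pending into the never-clicked count. */
    private void expireOld() {
        long now = clock.getAsLong();
        long cutoffMs = bucketMs * clicksByBucket.length;
        Iterator<Map.Entry<Long, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            if (now - it.next().getValue() < cutoffMs) {
                break; // entries are in time order, so the rest are newer
            }
            it.remove();
            neverClicked++;
        }
    }
}
```

Memory is bounded by the number of buckets plus the impressions still inside the cutoff window. Clicks that fall in the same bucket as the current elapsed time are counted in N2, which is the granularity being traded away. The per-query loop over buckets could be replaced by running prefix sums if it ever matters.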

(By the way, I just made this up; I don't know whether there are standard algorithms for this sort of thing that might be better.)
