Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should give less weight to topics that are always hot. Topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know that the article above mentions some kind of algorithm. I've tried to code it in PHP, but I don't think it will work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case the z-score is calculated by the following formula, where the trend would be a rate such as views per day:

z-score = ([current trend] - [average of historic trends]) / [standard deviation of historic trends]
When a z-score is used, the further the z-score is from zero, the more abnormal the trend: for example, if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the highest 10 z-scores will relate to the most abnormally increasing trends.
Please see Wikipedia for more information about z-scores.
Code
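A minimal sketch, assuming the historic trends are handed in as a plain list of daily rates (function and variable names are mine):

```python
from math import sqrt

def zscore(obs, pop):
    """z-score of observation `obs` against the historic population `pop`."""
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of the population.
    std = sqrt(sum((c - avg) ** 2 for c in pop) / number)
    # How many standard deviations the observation sits from the average.
    return (obs - avg) / std
```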
Sample Output
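Using made-up view counts whose population has mean 5 and standard deviation 2:

```
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(5, [2, 4, 4, 4, 5, 5, 7, 9])
0.0
```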
Notes
You can use this method with a sliding window (i.e. the last 30 days) if you wish not to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as the change in views from one day to the next to locate abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views-per-day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them, and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
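A sketch of that bookkeeping (the class name is mine):

```python
from math import sqrt

class RunningStats:
    """Keeps only n, sum(x) and sum(x^2); enough to compute a z-score."""
    def __init__(self, n=0.0, total=0.0, total_sq=0.0):
        self.n, self.total, self.total_sq = n, total, total_sq

    def update(self, x):
        # Fold one day's value into the three running numbers.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def zscore(self, obs):
        avg = self.total / self.n
        # Var(X) = E[X^2] - E[X]^2, guarded against float round-off.
        # A zero variance here is the corner case discussed in the update below.
        var = max(self.total_sq / self.n - avg * avg, 0.0)
        return (obs - avg) / sqrt(var)
```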
Using this method your workflow would be as follows. For each topic, tag, or page, create a floating-point field in your database for the total number of days, the sum of views, and the sum of views squared. If you have historic data, initialize these fields using that data; otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the 3 fields with the day's value and repeat the process the next day.
New Addition
Normal z-scores as discussed above do not take into account the order of the data, and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding the most current data should have more weight than older data, and hence we want the '1' observation to have a larger-magnitude score than the '9' observation. To achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See the code for details:
Code
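A sketch of the floating-average z-score (the class name fazscore matches the update below; decay is between 0 and 1, with values closer to 1 meaning a longer memory):

```python
from math import sqrt

class fazscore:
    """Floating-average z-score: recent data outweighs older data."""
    def __init__(self, decay, pop=()):
        self.decay = decay   # between 0 and 1; closer to 1 = longer memory
        self.avg = None      # floating average of the values
        self.sqr_avg = None  # floating average of the squared values
        for x in pop:
            self.update(x)

    def update(self, value):
        if self.avg is None:
            # Seed both floating averages with the first observation.
            self.avg = float(value)
            self.sqr_avg = float(value ** 2)
        else:
            # Decay the old averages toward the newest value.
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqr_avg = self.sqr_avg * self.decay + value ** 2 * (1 - self.decay)
        return self

    def std(self):
        # Somewhat ad hoc: variance from the two floating averages.
        return sqrt(max(self.sqr_avg - self.avg ** 2, 0))

    def score(self, obs):
        if self.std() == 0:
            return (obs - self.avg) * float("infinity")
        return (obs - self.avg) / self.std()
```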
Sample IO
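Against the sequence from the paragraph above, the '1' observation indeed gets a larger-magnitude score than the '9':

```
>>> round(fazscore(0.8, [1, 1, 1, 1, 9, 9, 9, 9]).score(1), 4)
-1.2006
>>> round(fazscore(0.8, [1, 1, 1, 1, 9, 9, 9, 9]).score(9), 4)
0.8329
```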
Update
As David Kemp correctly pointed out, if you are given a series of constant values and then request the z-score for an observed value which differs from the other values, the result should probably be non-zero. In fact, the value returned should be infinity. So I changed this line,
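```python
# Reconstructed from the description above: a constant history previously
# made every new observation score as unremarkable.
if self.std() == 0: return 0
```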
to:
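```python
# An observation that breaks a perfectly constant history is treated as
# infinitely abnormal; this is the behaviour in the sketch above.
if self.std() == 0: return (obs - self.avg) * float("infinity")
```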
This change is reflected in the fazscore solution code above. If one does not want to deal with infinite values, an acceptable solution could be to instead change the line to:
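```python
# One possible stand-in (my assumption, not from the original discussion):
# fall back to the raw deviation from the floating average.
if self.std() == 0: return obs - self.avg
```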
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One technique you'll need to use is normalizing all your data. For each topic you are following, keep a very low-pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized: subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be orders of magnitude larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic: take two consecutive points and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
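A minimal sketch of both steps, assuming the baseline is kept as a slow exponential average (every name here is an illustrative choice, not a prescribed API):

```python
class TopicSignal:
    """Tracks one topic: a slow baseline plus the last normalized point."""
    def __init__(self, alpha=0.05, baseline=1.0):
        self.alpha = alpha    # small alpha = very low-pass baseline
        self.baseline = baseline
        self.prev = None      # previous normalized value

    def add(self, count):
        """Feed one raw data point; return the normalized slope, or None."""
        # Low-pass filter: the baseline drifts slowly toward new counts.
        self.baseline += self.alpha * (count - self.baseline)
        # Normalize: divide by the baseline so a spike is relative to 'normal'.
        normalized = count / self.baseline
        slope = None if self.prev is None else normalized - self.prev
        self.prev = normalized
        return slope
```

Feeding each topic's daily counts through add() and ranking topics by their latest slope surfaces the ones rising fastest relative to their own baselines.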
This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.
As Nixuz points out, this is also referred to as a Z-score or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update it once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:

a_n = b * a_(n-1) + (1 - b) * c_n
where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is that if you perform this update at the end of day n, you can flush c_n and a_(n-1). The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.

EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9. Let's say the new values are 5, 0, 0, 1, 4:
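Applying a_n = b * a_(n-1) + (1 - b) * c_n step by step gives:

```
a_1 = 0.9 * 1      + 0.1 * 5 = 1.4
a_2 = 0.9 * 1.4    + 0.1 * 0 = 1.26
a_3 = 0.9 * 1.26   + 0.1 * 0 = 1.134
a_4 = 0.9 * 1.134  + 0.1 * 1 = 1.1206
a_5 = 0.9 * 1.1206 + 0.1 * 4 = 1.40854
```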
Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our very first input was 5. What's going on? If you expand out the math, what you get is:

a_5 = (1 - b) * c_5 + b * (1 - b) * c_4 + b^2 * (1 - b) * c_3 + b^3 * (1 - b) * c_2 + b^4 * (1 - b) * c_1 + b^5 * a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the expansion could go on forever, then all weights would sum to 1. But if n is relatively small, a good amount of weight is left on the original input a_0 - here b^5 = 0.9^5 ≈ 0.59.
If you study the above formula, you should realize a few things about this usage:

- recent days always carry more weight than older days;
- a topic that is always popular builds up a high baseline average, so its routine traffic no longer registers as a spike;
- the influence of whatever you chose for the initial value decays geometrically as days pass.
I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a sketch of a Python implementation (minus all the database interaction; names are illustrative):
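```python
class TopicAverage:
    """Remembers a single exponentially weighted daily average per topic.

    b close to 1 means a long memory; a0 is the (somewhat arbitrary)
    starting value the average is initially sensitive to.
    """
    def __init__(self, b=0.9, a0=0.0):
        self.b = b
        self.avg = a0  # this is the only historical state we keep

    def end_of_day(self, hits):
        # a_n = b * a_(n-1) + (1 - b) * c_n; afterwards c_n can be flushed.
        self.avg = self.b * self.avg + (1 - self.b) * hits
        return self.avg


if __name__ == "__main__":
    topic = TopicAverage(b=0.9, a0=1.0)
    for hits in [5, 0, 0, 1, 4]:
        print(round(topic.end_of_day(hits), 5))  # 1.4 1.26 1.134 1.1206 1.40854
```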
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
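A sketch of that division-plus-threshold idea (the function name and threshold value are made up):

```python
def hot_ratio(current, historical, noise=2.0):
    """Current volume relative to history; returns None below the noise floor."""
    ratio = current / max(historical, 1.0)  # guard against an empty history
    return ratio if ratio >= noise else None
```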
I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.
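As a tiny sketch of that comparison, using the numbers above:

```python
def percent_increase(today, usual):
    """Relative increase of today's searches over the usual rate."""
    return (today - usual) / usual

print(percent_increase(110000, 100000))  # Britney: 0.1 (10% hotter than usual)
print(percent_increase(60000, 50000))    # Paris:   0.2 (20% hotter than usual)
```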
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
I was wondering if it is at all possible to use regular physics acceleration formula in such a case?
We can consider v1 to be the initial likes/votes/count of comments per hour, and v2 to be the current "velocity" per hour over the last 24 hours?
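As a literal sketch of that reading (all names are mine):

```python
def acceleration(v1, v2, hours=24.0):
    """Change in engagement velocity (e.g. likes/hour) per hour of window,
    by a literal reading of a = (v2 - v1) / t."""
    return (v2 - v1) / hours
```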
This is more like a question than an answer, but it seems it may just work. Any content with the highest acceleration will be the trending topic...
I am sure this may not solve Britney Spears problem :-)
Probably a simple gradient of topic frequency would work - a large positive gradient = growing quickly in popularity.
The easiest way would be to bin the number of searches each day, so you have something like:
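```python
# Hypothetical daily search counts for one topic (made-up numbers):
searches = [50, 47, 52, 130, 180]
```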
and then find out how much it changed from day to day:
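```python
# Day-over-day change of the series above:
deltas = [b - a for a, b in zip(searches, searches[1:])]
print(deltas)  # [-3, 5, 78, 50]
```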
Then just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. You could make this far more complicated if you'd like, too. Rather than the absolute difference, you can take the relative difference, so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. Or use a more complicated gradient that takes into account trends over more than just one day to the next.
I worked on a project where my aim was finding trending topics from the live Twitter stream, and also doing sentiment analysis on the trending topics (finding whether a trending topic is talked about positively or negatively). I've used Storm for handling the Twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is a bit generic, and in the discussion section I've mentioned how the system can be extended for non-Twitter applications.
Hope the information helps.
If you simply look at tweets or status messages to get your topics, you're going to encounter a lot of noise, even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the titles of those web pages. And make sure you apply POS tagging to get nouns and noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page is usually correlated with sharing news that is breaking (i.e., if a celebrity like Michael Jackson died, you're going to get a lot of people sharing articles about his death).
I've run experiments where I only take popular keywords from titles and then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies, and you're halfway there.
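A sketch of that simple ordering (it assumes keyword extraction and POS filtering already happened upstream; all names are mine):

```python
from collections import Counter

def rank_keywords(title_keywords, status_messages):
    """Rank title-derived keywords by how often they appear in messages."""
    counts = Counter()
    for msg in status_messages:
        words = set(msg.lower().split())
        for kw in title_keywords:
            if kw.lower() in words:
                counts[kw] += 1
    return counts.most_common()  # most frequent keyword first
```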
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
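A minimal Python sketch of that idea (TermBag here is a thin wrapper over collections.Counter; the concrete statistic, Dunning's log-likelihood G², is my choice of formula):

```python
import math
from collections import Counter

class TermBag:
    """An unordered collection of words; counts occurrences."""
    def __init__(self, words):
        self.counts = Counter(w.lower() for w in words)  # toLowerCase-style normalization
        self.total = sum(self.counts.values())

    def occurrences(self, word):
        return self.counts[word.lower()]

    def size(self):
        return self.total

def log_l(p, k, n):
    """Log-likelihood of k hits in n draws under hit-rate p."""
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr(term, today, history):
    """How surprisingly often `term` occurs today versus in the history.

    Assumes both bags are non-empty.
    """
    k1, n1 = today.occurrences(term), today.size()
    k2, n2 = history.occurrences(term), history.size()
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
                  - log_l(p, k1, n1) - log_l(p, k2, n2))
```

Build one TermBag from today's queries and one from last year's, score every term seen today with llr, sort descending, and keep the top ten.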
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow; typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that exceed a certain threshold, track each one, and when it jumps to some multiple (say, almost double) of its historical value, it is a new hot trend.