Algorithm to calculate page importance based on views/comments

Posted 2024-09-02 18:19:24, word count 564, views 5, comments 0

I need an algorithm that lets me determine an appropriate <priority> field for my website's sitemap based on each page's view and comment counts.

For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.

The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:

GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082

Comments (6)

感悟人生的甜 2024-09-09 18:19:24

You mentioned doing this in an SQL query, so I'll give samples in that.

If you have a table/view Pages, something like this

Pages
-----
page_id:int
views:int  - indexed
comments:int - indexed

Then you can order them by writing

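SELECT * FROM Pages
ORDER BY
    -- weights multiply each log-scaled term: 0.3 for views, 0.7 for comments
    (0.3*LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages))) +
    (0.7*LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages))) DESC -- highest priority first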

I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting for views and comments is that the ranking becomes a self-fulfilling prophecy - a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the top of the list, and it's visited more often, and it gets more points... Putting more weight on the comments reflects that these take real effort and show real interest.

The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.

  0.3*(score for views/comments today) - live data
  0.3*(score for views/comments in the last week)
  0.25*(score for views/comments in the last month)
  0.15*(score for all views/comments, all time)

This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures, so that the database doesn't have to aggregate a large number of comment/view stats on every query. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
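To make the blending concrete, here is a minimal Python sketch of the weighted combination above. The window names and the `blended_score` function are made up for illustration, and it assumes each per-window score has already been computed (for example with the log formula) and lies between 0 and 1.

```python
# Weights copied from the list above; they sum to 1.0.
WINDOW_WEIGHTS = {"today": 0.3, "week": 0.3, "month": 0.25, "all_time": 0.15}


def blended_score(window_scores):
    """Combine per-window scores (each assumed to be in [0, 1]) into one score."""
    return sum(WINDOW_WEIGHTS[name] * window_scores[name] for name in WINDOW_WEIGHTS)


# Hypothetical pages: one active recently, one that was popular long ago.
hot_page = {"today": 0.9, "week": 0.8, "month": 0.5, "all_time": 0.4}
old_page = {"today": 0.1, "week": 0.2, "month": 0.5, "all_time": 0.9}

print(blended_score(hot_page) > blended_score(old_page))
```

Because the weights sum to 1, the blended score stays in the same 0 to 1 range as its inputs.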

EDIT: To get a strict range from 0.1 to 1.0, you would modify the formula like this. But I stress - this will only add overhead and is unnecessary - the absolute values of priority are not important - only their values relative to other URLs. The search engine uses these to answer the question: is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.

// unnormalized - x is some page id
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
0.7*log(comments(x)+10)/log(10+maxComments())
// the original formula (now in pseudo code)

The maximum will be 1.0. The minimum starts at 1.0 (when no page has any views or comments yet) and moves downwards as more views/comments are made.

We define un(0) as the minimum value, i.e. the value of the above formula when views(x) and comments(x) are both 0.

To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x

                  (1.0-un(x)) * (un(0)-0.1)
  n(x) = un(x) -  -------------------------    when un(0) != 1.0
                          1.0-un(0)

       = 0.1 otherwise.
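
For anyone who prefers code to pseudocode, this is a rough Python sketch of un(x) and n(x) as defined above. The function names and the `floor` parameter are my own choices, and the maximums are passed in as arguments rather than queried from a table.

```python
import math

VIEW_WEIGHT = 0.3      # weights from the formula above
COMMENT_WEIGHT = 0.7


def unnormalized(views, comments, max_views, max_comments):
    """un(x): log-scaled weighted score; 1.0 for a page holding both maximums."""
    return (VIEW_WEIGHT * math.log(views + 10) / math.log(max_views + 10)
            + COMMENT_WEIGHT * math.log(comments + 10) / math.log(max_comments + 10))


def normalized(views, comments, max_views, max_comments, floor=0.1):
    """n(x): rescale un(x) so that results run from `floor` up to 1.0."""
    un_x = unnormalized(views, comments, max_views, max_comments)
    un_0 = unnormalized(0, 0, max_views, max_comments)
    if math.isclose(un_0, 1.0):  # no page has any views or comments
        return floor
    return un_x - (1.0 - un_x) * (un_0 - floor) / (1.0 - un_0)


print(normalized(0, 0, 1000, 50), normalized(1000, 50, 1000, 50))
```

A page with no activity lands on the floor value and the page holding both maximums lands on 1.0; everything else falls in between.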
世界等同你 2024-09-09 18:19:24

Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles
with W1+W2=1

Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
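
A quick Python sketch of both variants, to show why the log version may be preferable. The function names are invented for the example, and the linear version assumes the site-wide maximums are greater than zero.

```python
import math


def linear_priority(views, comments, max_views, max_comments, w1=0.5, w2=0.5):
    """Linear version; assumes w1 + w2 == 1 and both maximums are > 0."""
    return w1 * views / max_views + w2 * comments / max_comments


def log_priority(views, comments, max_views, max_comments):
    """Log-scaled version with equal weights, as in the formula above."""
    return (0.5 * math.log10(10 + views) / math.log10(10 + max_views)
            + 0.5 * math.log10(10 + comments) / math.log10(10 + max_comments))


# A quiet page on a site whose busiest page has 100000 views and comments.
print(linear_priority(3, 2, 100000, 100000))
print(log_priority(3, 2, 100000, 100000))
```

With the linear version, almost every page on a site with one very popular page ends up near 0; the log version spreads low-traffic pages over a more useful part of the range.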

不气馁 2024-09-09 18:19:24

What you're looking for here is not an algorithm, but a formula.

Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.

Instead, let's try to walk through the problem together.

You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.

You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked, but it resulted in values between 0 and N, we could just divide the results by N, so this constraint isn't really relevant.

Now, the first thing we need to decide is the relative weight of Comments vs Views.

If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.

If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like

 Priority = 5 * Comments + Views

Obviously, this can be generalized to

Priority = A * Comments + B * Views

Where A and B are relative weights.

But, sometimes we want our weights to be exponential instead of linear, like

 Priority = Comments ^ A + Views ^ B

which will give a very different curve than the earlier formula.

Similarly,

 Priority = Comments ^ A * Views ^ B

will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
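
If you'd rather experiment in code than in a spreadsheet, here is a small Python sketch of the three formula shapes. The function names are mine, and the default weights of 1 are only there to illustrate the "equal weights" case.

```python
def weighted_sum(comments, views, a=1.0, b=1.0):
    """Priority = A * Comments + B * Views"""
    return a * comments + b * views


def power_sum(comments, views, a=1.0, b=1.0):
    """Priority = Comments ^ A + Views ^ B"""
    return comments ** a + views ** b


def power_product(comments, views, a=1.0, b=1.0):
    """Priority = Comments ^ A * Views ^ B"""
    return comments ** a * views ** b


balanced = (20, 20)   # (comments, views)
lopsided = (1, 40)

for formula in (weighted_sum, power_sum, power_product):
    print(formula.__name__, formula(*balanced), formula(*lopsided))
```

With equal weights the sum puts the lopsided page slightly ahead (41 against 40), while the product strongly favors the balanced page (400 against 40).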

So, to summarize:

You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.

We can't do it for you, because we don't know how you want to value things.

我做我的改变 2024-09-09 18:19:24

I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.

When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis

There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Furthermore, you may want to give different weight to these criteria based on whatever business rules you come up with.

In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model

The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.

Lastly, you multiply the results together to come up with a rank. If the rank is greater than 1, then the numerator page has a higher rank than the denominator page.

Each page would be compared against every other page by doing something like:

  • p1C = page 1 comments
  • p1V = page 1 views
  • p2C = page 2 comments
  • p2V = page 2 views
  • wC = comment weight
  • wV = view weight

rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)

The end result is a sorted list of pages according to their rank.

I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
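
My C# code isn't shown here, but a rough Python equivalent of the pairwise comparison might look like the sketch below. The weights, the dictionary keys, and the +1 smoothing (to avoid dividing by zero when a page has no comments or views) are assumptions for the example, not part of the weighted product model itself.

```python
from functools import cmp_to_key

COMMENT_WEIGHT = 0.7  # hypothetical weights; they should sum to 1
VIEW_WEIGHT = 0.3


def wpm_ratio(p1, p2):
    """rank = (p1C/p2C)^wC * (p1V/p2V)^wV; a value > 1 means p1 outranks p2."""
    # The +1 on every count is an assumption to avoid division by zero.
    comment_ratio = (p1["comments"] + 1) / (p2["comments"] + 1)
    view_ratio = (p1["views"] + 1) / (p2["views"] + 1)
    return comment_ratio ** COMMENT_WEIGHT * view_ratio ** VIEW_WEIGHT


def compare(p1, p2):
    """Comparison function: the higher-ranked page sorts first."""
    ratio = wpm_ratio(p1, p2)
    if ratio > 1:
        return -1
    if ratio < 1:
        return 1
    return 0


pages = [
    {"id": "a", "views": 1000, "comments": 5},
    {"id": "b", "views": 200, "comments": 40},
    {"id": "c", "views": 50, "comments": 1},
]
ranked = sorted(pages, key=cmp_to_key(compare))
print([page["id"] for page in ranked])
```

Since the ratio factors into a per-page score divided by another per-page score, you could also sort on that score directly and skip the pairwise comparison.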

云仙小弟 2024-09-09 18:19:24

What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.

This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.

A quick summary of how to apply it to your problem is:

  1. Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
  2. Use the line parameters to derive your priority function for the view and comment count parameters.

Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.

几度春秋 2024-09-09 18:19:24

The most naive approach would be the following:

Let v[i] be the number of views of page i and c[i] the number of comments on page i; then define the relative view weight for page i to be

r_v(i) = v[i]/(sum_j v[j])

where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be

r_c(i) = c[i]/(sum_j c[j]).

Now you want some constant parameter p, with 0 <= p <= 1, which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.

Then set the priority to be

p*r_v(i) + (1-p)*r_c(i)

This might be over-simplistic, but it's probably the best starting point.
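
As a minimal sketch, the formulas above could be written in Python like this. The `naive_priorities` name, the dictionary layout, and the fallback to 0 when a total is zero are assumptions for illustration.

```python
def naive_priorities(pages, p=0.5):
    """pages maps page id -> (views, comments); returns page id -> priority."""
    total_views = sum(views for views, _ in pages.values())
    total_comments = sum(comments for _, comments in pages.values())
    priorities = {}
    for page_id, (views, comments) in pages.items():
        # r_v(i) and r_c(i); fall back to 0 if the site has no views/comments yet
        r_v = views / total_views if total_views else 0.0
        r_c = comments / total_comments if total_comments else 0.0
        priorities[page_id] = p * r_v + (1 - p) * r_c
    return priorities


site = {"home": (500, 10), "about": (100, 0), "post": (400, 30)}
print(naive_priorities(site))
```

Note that with this definition the priorities of all pages add up to 1, so on a site with many pages each individual value will be small; only the relative order is meaningful.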
