Power-law curve fitting for social network queries
Twitter recently announced that you can approximate the rank of any given Twitter user with high accuracy by plugging their follower count into the following formula:
exp($a + $b * log(follower_count))
where $a=21 and $b=-1.1
This is obviously a lot more efficient than sorting the entire list of users by follower count just to find the rank of one user.
If you have a similar data set from a different social site, how could you derive the values of $a and $b that fit that data set? Basically, the input is a list of frequencies whose distribution is assumed to follow a power law.
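For reference, the quoted formula can be evaluated directly. The following is a minimal sketch, assuming the logarithm is the natural log (which matches the use of exp); the function name `estimate_rank` is made up for illustration and is not part of any Twitter API.

```python
import math


def estimate_rank(follower_count, a=21.0, b=-1.1):
    """Approximate a user's rank as exp(a + b * ln(follower_count)).

    The defaults are the values quoted in the question; natural log is an
    assumption, chosen because the formula uses exp().
    """
    return math.exp(a + b * math.log(follower_count))


if __name__ == "__main__":
    # More followers should give a smaller (better) estimated rank.
    for followers in (100, 10_000, 1_000_000):
        print(followers, round(estimate_rank(followers)))
```

Since exp(a + b * ln(n)) equals exp(a) * n^b, this is an ordinary power law in the follower count.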
Comments (2)
You have the following model:
rank = exp(a + b * log(follower_count))
which is equivalent to:
log(rank) = a + b * log(follower_count)
Therefore, if you take logs of your data set, you end up with a linear model, so you can then use linear regression to determine the best-fit values of a and b.
However, this all sounds pretty meaningless to me. Who's to say that a given networking site determines user rank using this sort of relationship?
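The log-log regression described above could look roughly like the sketch below. It is written under a few assumptions that are not in the original post: every follower count is positive, ranks are assigned 1, 2, 3, ... after sorting by follower count in descending order (ties simply get consecutive ranks), and ordinary least squares is computed by hand so that only the standard library is needed. The name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(follower_counts):
    """Fit log(rank) = a + b * log(followers) by ordinary least squares.

    Returns (a, b). Assumes all counts are > 0 and at least two are distinct.
    """
    counts = sorted(follower_counts, reverse=True)  # rank 1 = most followers
    xs = [math.log(c) for c in counts]
    ys = [math.log(rank) for rank in range(1, len(counts) + 1)]

    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Standard closed-form solution for simple linear regression.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


if __name__ == "__main__":
    # Synthetic data that follows rank = 1000 / followers exactly.
    sample = [1000.0 / rank for rank in range(1, 21)]
    print(fit_power_law(sample))
```

Note that this minimizes the error in log(rank), not in rank itself, so poorly ranked users carry relatively less weight than they would in a fit on the raw ranks.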
You could use the Microsoft Excel add-in named "Solver". It is included with Excel, but not always installed by default. Look for "add-in" and "solver" in your version of Excel and load it.
After installing the add-in, do the following:
Create a new worksheet. In column A, put the ID of each individual (optional).
In column B, put the number of followers.
If the data is not sorted, sort it by column B in descending order, so that the user with the most followers comes first.
In column C, put the ranking (you know, 1, 2, 3, etc.).
Put the value 21 in cell D1 and -1.1 in cell E1. Those are the Twitter values for $a and $b. They are our starting values; Solver will change them.
In cell D2, put a formula like this: =EXP($D$1+$E$1*LN(B2)). Note that Excel's LOG function defaults to base 10, so use LN for the natural logarithm.
Copy the formula in D2 down to the end of the data.
In cell E2, put a formula that compares the actual ranking with the result of the formula (i.e., the squared error), e.g., =(C2-D2)^2. The closer the actual and predicted values are, the closer this value will be to 0.
Copy cell E2 down to the end of the data.
At the bottom of the data, in column E, sum the per-row error values. For example, if your data has 10,000 values in rows 2 through 10001, enter =SUM(E2:E10001) in cell E10002.
Go to the Data menu and look for the "Solver" command. The location may vary depending on your version of Excel. Use the "Help" facility to search for Solver.
Follow the instructions (I have to go now) in Help to use the Solver add-in. Obviously, the changing cells are D1 and E1, and the goal is to make E10002 (the sum of the errors) as small as possible.
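For readers without Excel, the same idea (adjust a and b until the summed squared difference between actual and predicted rank is as small as possible) can be sketched in Python. This is only an illustration under stated assumptions: it uses a plain Gauss-Newton iteration with step halving, which is a stand-in chosen for brevity and not Solver's own algorithm, it needs a reasonable starting guess, and the names `sum_squared_error` and `fit_in_rank_space` are invented here.

```python
import math


def sum_squared_error(a, b, log_counts):
    """Sum of (actual rank - predicted rank)^2, ranks being 1, 2, 3, ..."""
    return sum((rank - math.exp(a + b * x)) ** 2
               for rank, x in enumerate(log_counts, start=1))


def fit_in_rank_space(follower_counts, a, b, max_iterations=50):
    """Refine starting guesses a, b by Gauss-Newton; returns (a, b, error)."""
    # Sort descending so the most-followed user gets rank 1.
    log_counts = [math.log(c) for c in sorted(follower_counts, reverse=True)]
    error = sum_squared_error(a, b, log_counts)

    for _ in range(max_iterations):
        # Accumulate the 2x2 normal equations (J^T J) * delta = J^T * residual.
        saa = sab = sbb = ga = gb = 0.0
        for rank, x in enumerate(log_counts, start=1):
            p = math.exp(a + b * x)   # predicted rank
            r = rank - p              # residual
            saa += p * p
            sab += p * p * x
            sbb += p * p * x * x
            ga += p * r
            gb += p * x * r
        det = saa * sbb - sab * sab
        if det == 0.0:
            break
        da = (ga * sbb - gb * sab) / det
        db = (gb * saa - ga * sab) / det

        # Halve the step until the error no longer increases.
        step = 1.0
        while step > 1e-8:
            trial = sum_squared_error(a + step * da, b + step * db, log_counts)
            if trial <= error:
                a, b, error = a + step * da, b + step * db, trial
                break
            step /= 2
        else:
            break  # no improving step found; stop iterating
    return a, b, error


if __name__ == "__main__":
    sample = [1000.0 / rank for rank in range(1, 21)]
    print(fit_in_rank_space(sample, 7.0, -1.05))
```

Because the error is measured on the raw ranks, the fit is dominated by the users with the largest rank numbers; fitting in log space instead weights all users more evenly.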