Correct way to standardize/scale/normalize multiple variables following a power law distribution for use in a linear combination

Posted 2024-07-16 08:11:51

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:

in_degree + betweenness_centrality = informal_power_index

The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs. 0-35000, and follow a power law distribution (or at least definitely not a normal distribution).

Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?

Three obvious approaches are:

  • Standardizing the variables (subtract the mean and divide by stddev). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
  • Re-scaling variables to the range [0,1] by subtracting min(variable) and dividing by max(variable) - min(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.
  • Equalizing the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
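
For reference, the three approaches above can be sketched in plain Python. This is only an illustration: the sample values are made up to look heavy-tailed, and the helper names (`standardize`, `min_max`, `divide_by_mean`) are hypothetical, not from any library.

```python
from statistics import mean, pstdev

def standardize(values):
    """z-score: subtract the mean, divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def min_max(values):
    """Rescale to [0, 1]: subtract the minimum, divide by (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def divide_by_mean(values):
    """Equalize the means: every rescaled variable has mean 1."""
    m = mean(values)
    return [v / m for v in values]

# Made-up heavy-tailed sample data, for illustration only.
in_degree = [0, 1, 1, 2, 3, 15]
betweenness = [0, 10, 20, 50, 400, 35000]

for rescale in (standardize, min_max, divide_by_mean):
    # informal_power_index as the plain sum of the two rescaled metrics
    index = [a + b for a, b in zip(rescale(in_degree), rescale(betweenness))]
    print(rescale.__name__, [round(x, 3) for x in index])
```

All three are linear maps of each variable, so none of them changes the shape of either distribution; they differ only in which summary (spread, range, or mean) is made equal before adding.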

Any other ideas?

Comments (4)

方觉久 2024-07-23 08:11:51

You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it falls in the 0-10th percentile, the 10-20th percentile, ..., or the 90-100th percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
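
A minimal sketch of the 10-point scale, assuming the empirical distribution of the observed values stands in for the histogram. The function names `percentile_rank` and `decile_score` are hypothetical, and the sample data is made up.

```python
from bisect import bisect_right

def percentile_rank(values):
    """Number of observations <= each value (its rank in the empirical CDF)."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) for v in values]

def decile_score(values):
    """Map each value to 1..10 according to the decile its rank falls in."""
    n = len(values)
    # Integer ceiling of rank * 10 / n, which avoids floating-point edge cases.
    return [(rank * 10 + n - 1) // n for rank in percentile_rank(values)]

# Made-up sample data, for illustration only.
in_degree = [0, 1, 1, 2, 3, 15]
betweenness = [0, 10, 20, 50, 400, 35000]

# Both metrics are now on the same 1..10 scale and can simply be added.
index = [a + b for a, b in zip(decile_score(in_degree), decile_score(betweenness))]
print(index)
```

Tied values get the same score, so a metric with many ties (in-degree usually has them) ends up only approximately uniform on 1..10.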

清风夜微凉 2024-07-23 08:11:51

You could translate each to a percentage and then apply each to a known quantity. Then use the sum of the new values.

((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
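
As a quick check of the arithmetic, here is the formula above as a small Python function. The name `scaled_sum` and its parameter names are hypothetical; the maxima 15 and 35000 and the quantity 2000 are the ones used in the formula.

```python
def scaled_sum(in_degree, betweenness_centrality,
               max_in_degree=15, max_betweenness=35000, quantity=2000):
    """Sum of the two metrics after converting each to a share of `quantity`."""
    a = (1 - in_degree / max_in_degree) * quantity
    b = (1 - betweenness_centrality / max_betweenness) * quantity
    return a + b

print(scaled_sum(0, 0))        # 4000.0
print(scaled_sum(15, 35000))   # 0.0
```

Note that because of the `1 -` term, nodes with higher raw values get lower scores, so the ranking has to be read in ascending order (or the `1 -` dropped) if a larger index is meant to signal more power.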

陌若浮生 2024-07-23 08:11:51

Very interesting question. Could something like this work?

Let's assume that we want to scale both variables to a range of [-1,1].
Take the example of betweenness_centrality, which has a range of 0-35000.

  1. Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.
  2. Create 25,000 bins in the original range [0,35000] and 25,000 bins in the new range [-1,1].
  3. For each number x_i, find the bin it falls into in the original range. Let this be B_i.
  4. Find the range of bin B_i in [-1,1].
  5. Use either the max or the min of B_i's range in [-1,1] as the scaled version of x_i.

This preserves the power law distribution while also scaling it down to [-1,1], and it does not have the problem experienced by (x-mean)/sd.
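
The steps above might look like this in Python. This is a sketch under assumptions: equal-width bins in both ranges, the lower edge of the target bin as the scaled value, and a hypothetical function name `bin_rescale`.

```python
def bin_rescale(x, lo=0.0, hi=35000.0, n_bins=25000, new_lo=-1.0, new_hi=1.0):
    """Map x from [lo, hi] to [new_lo, new_hi] via matching equal-width bins."""
    # Step 3: index of the bin that x falls into in the original range.
    width = (hi - lo) / n_bins
    b = min(int((x - lo) / width), n_bins - 1)  # clamp x == hi into the last bin
    # Steps 4-5: locate the same bin in the new range and use its lower edge.
    new_width = (new_hi - new_lo) / n_bins
    return new_lo + b * new_width

# Made-up betweenness values, for illustration only.
for x in (0, 10, 400, 35000):
    print(x, bin_rescale(x))
```

With equal-width bins on both sides, this behaves like a discretized min-max rescaling: order and relative spacing are kept up to one bin width, which is why the shape of the distribution survives.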

落日海湾 2024-07-23 08:11:51

Normalizing to [0,1] would be my short-answer recommendation for combining the two values, as it will maintain the distribution shape as you mentioned and should solve the problem of combining the values.

If the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after, which is a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in the given distribution the value lies. This could be done in many ways, one of which would be to determine how many standard deviations away from the mean the given value is. You could then combine these two values in some way to get your index (addition may no longer be sufficient).

You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.
