T-SQL 中的幂律分布

发布于 2024-10-04 02:23:39 字数 634 浏览 12 评论 0原文

我想从人口普查中一次提取一个姓氏提供了姓名表。我想要得到与人口中的分布大致相同的分布。该表有 88,799 个名字，按出现频率排列。 “Smith”排名第 1，出现频率为 1.006%，“Alderink”排名第 88,799，出现频率为 1.7 x 10^-6。 “Sanders”排名 75，出现频率为 0.100%。

曲线根本不必精确拟合。只需给我大约 1% 的“Smith”和大约百万分之一的“Alderink”即可

。这就是我到目前为止所拥有的。

SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)

但这当然会产生均匀分布。

我保证，当一个更聪明的人做出回应时，我仍然会尝试自己解决这个问题。

原文

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.

I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.

The curve doesn't have to fit precisely at all. Just give me about 1% "Smith" and about 1 in a million "Alderink"

Here's what I have so far.

SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)

But this of course yields a uniform distribution.

I promise I'll still be trying to figure this out myself by the time a smarter person responds.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

慈悲佛祖 2024-10-11 02:23:39

当您可以从实际分布中得出结论时，为什么要满足幂律分布呢？

我建议您更改 LastNames 表以包含一个数字列，该列包含一个数值，表示具有更常见名称的个人的实际数量。您可能需要一个较小但成比例的数字，例如，每个百分比的代表性可能为 10,000。

该列表将类似于：
（除了问题中提到的 3 个名字之外，我猜测是 White、Johnson 等人）

Smith          0   
White     10,060
Johnson   19,123
Williams  28,456
...
Sanders  200,987
..
Alderink 999,997

并且名称选择将是：

SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(100000 * RAND(), 0)
ORDER BY [number_described_above] DESC

选择第一个名称，其数字不超过 [均匀分布] 随机数。请注意查询如何使用小于并以 desc 结尾的顺序进行排序；这将保证第一个条目（史密斯）被选中。另一种选择是以 10,060 而不是零开始史密斯系列赛，并丢弃小于该值的随机抽签。

除了上面提到的边界管理问题（从 0 开始而不是 10,060 开始）之外，此解决方案以及迄今为止的其他两个响应与 dmckee 的回答中建议的解决方案相同这个问题中提到的问题。本质上，这个想法是使用 CDF（累积分布函数）。

编辑：
如果您坚持使用数学函数而不是实际分布，下面应该提供一个幂律函数，它可以以某种方式传达真实分布的“长尾”形状。您可能想要调整 @PwrCoef 值（顺便说一句，该值不必是整数），本质上来说，系数越大，函数就越偏向列表的开头。

DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)

备注：
- 上面函数中额外的“.0”对于强制 SQL 执行浮点运算而不是整数运算非常重要。
- 我们从 88799 中减去幂计算的原因是，计算的分布是这样的：越接近我们的刻度末端的数字，就越有可能被抽取。姓氏列表按相反顺序排序（最有可能的名字在前），我们需要这个减法。

查询将类似于

SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
     = 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)

假设幂为 3，那么除了最后一行之外，

Which is the query from the questions。 重新编辑：
在查看实际分布时，如人口普查数据所示，曲线极其陡峭，需要非常大的幂系数，这反过来会导致溢出和/或极端舍入误差公式如上所示。
更明智的方法可能是在多个层中进行操作，即在累积分布的三分之三（或四分之四或......）的每一层中执行相同数量的抽奖；在每个零件列表中，我们将使用幂律函数进行绘制，可能具有相同的系数，但范围不同。
例如
假设为三分之一，列表划分如下：

第一个三分之一 = 425 个名字，从 Smith 到 Alvarado
第二个三分之一 = 6,277 个名字，从到 Gainer
最后一个 = 82,097 个名字，从 Frisby 到最后

如果我们需要 1,000 个名字，我们将从列表的前三分之一抽取 334，从第二个三分之一抽取 333，从最后三分之一抽取 333。
对于每一个三分之一，我们都会使用类似的公式，也许前三分之一具有更大的幂系数（我们确实有兴趣支持列表中较早的名称，并且还有相对的频率更具统计相关性）。三个选择查询可能如下所示：

-- Random Drawing of a single Name in top third
--   Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
     =  425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)

-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
     =  (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)

-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
     =  (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)

Why settle for the power-law distribution when you can draw from the actual distribution ?

I suggest you alter the LastNames table to include a numeric column which would contain a numeric value representing the actual number of indivuduals with a name that is more common. You'll probably want a number on a smaller but proportional scale, say, maybe 10,000 for each percent of representation.

The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing about White, Johnson et al)

Smith          0   
White     10,060
Johnson   19,123
Williams  28,456
...
Sanders  200,987
..
Alderink 999,997

And the name selection would be

SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(100000 * RAND(), 0)
ORDER BY [number_described_above] DESC

That's picking the first name which number does not exceed the [uniform distribution] random number. Note how the query, uses less than and ordering in desc-ending order; this will guaranty that the very first entry (Smith) gets picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard the random draws smaller than this value.

Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, along with the two other responses so far, are the same as the one suggested in dmckee's answer to the question referenced in this question. Essentially the idea is to use the CDF (Cumulative Distribution function).

Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power law function which would somehow convey the "long tail" shape of the real distribution. You may wan to tweak the @PwrCoef value (which BTW needn't be a integer), essentially the bigger the coeficient, the more skewed to the beginning of the list the function is.

DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)

Notes:
- the extra ".0" in the function above are important to force SQL to perform float operations rather than integer operations.
- the reason why we subtract the power calculation from 88799 is that the calculation's distribution is such that the closer a number is closer to the end of our scale, the more likely it is to be drawn. The List of family names being sorted in the reverse order (most likely names first), we need this substraction.

Assuming a power of, say, 3 the query would then look something like

SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
     = 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)

Which is the query from the question except for the last line.

Re-Edit:
In looking at the actual distribution, as apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers i.e. to perform an equal number of draws in each of the, say, three thirds (or four quarters or...) of the cumulative distribution; within each of these parts list, we would draw using a power law function, possibly with the same coeficient, but with different ranges.
For example
Assuming thirds, the list divides as follow:

First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, from to Gainer
Last third = 82,097 names, from Frisby to the end

If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coeficient for the first third (were were are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:

-- Random Drawing of a single Name in top third
--   Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
     =  425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)

-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
     =  (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)

-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
     =  (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)

回复收藏 0 原文

记忆之渊 2024-10-11 02:23:39

不要将 pdf 存储为排名，而是存储 CDF（从 Aldekirk 开始，直到该名称的所有频率的总和）。

然后修改您的选择以检索排名大于公式结果的第一个逻辑节点。

回复收藏 0 原文

诗笺 2024-10-11 02:23:39

我将这个问题读为“我需要获取一系列姓名，以反映 1990 年美国人口普查中姓氏的频率”，

我对这个问题的解读可能与其他建议略有不同，尽管答案已被接受，并且这是一个非常彻底的答案，我将贡献我在人口普查姓氏方面的经验。

我从 1990 年人口普查中下载了相同的数据。我的目标是在医疗记录应用程序的性能测试期间生成大量要提交用于搜索测试的姓名。我将姓氏和频率百分比插入到表中。我添加了一列并用一个整数填充它，该整数是“所需名称总数 * 频率”的乘积。人口普查的频率数据加起来并不完全是100%，所以我的名字总数也有点低于要求。我能够通过从列表中选择随机名称并增加其数量来纠正该数字，直到我完全获得所需的数量为止，随机添加的数量从未超过 1000 万总数的 0.05%。

我生成了 1 到 88799 范围内的 1000 万个随机数。对于每个随机数，我都会从列表中选择该名称并减少该名称的计数器。我的方法是模拟处理一副牌，但我的牌组有更多不同的牌，并且每张牌的数量各不相同。

回复收藏 0 原文

鹿港巷口少年归 2024-10-11 02:23:39

您是否将实际频率与排名一起存储？

如果您知道 n 使用什么值，那么将代数从接受的答案转换为 MySQL 并不麻烦。 y 将是您当前拥有的 ROUND(88799 * RAND(), 0) 和 x0,x1 = 1,88799 我认为，我可能会误解它。从 T-SQL 角度来看，涉及的唯一非标准数学运算符是 ^，即 POWER(x,y) == x^y。