The Zipf probability distribution is often used to model file size distributions or item access distributions in P2P systems (e.g. "Web Caching and Zipf-like Distributions: Evidence and Implications"), but neither Boost nor the GSL (GNU Scientific Library) provides an implementation to generate random numbers from this distribution. I have not found a (trustworthy) implementation using the common search engines.
How can random numbers distributed according to the Zipf distribution be generated using a U(0,1) random generator, e.g. the Mersenne Twister?
Here's a Python Zipf-like distribution generator for n items with parameter alpha >= 0:
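The answer's own listing is not present in this copy. As a stand-in, here is a minimal sketch of one common way to build such a generator, assuming inverse-transform sampling over a precomputed CDF. The class name ZipfGenerator and its method names are hypothetical and are not taken from the original answer.

```python
import bisect
import random


class ZipfGenerator:
    """Hypothetical helper: draws ranks 1..n with P(k) proportional to 1 / k**alpha."""

    def __init__(self, n, alpha, rng=None):
        if n < 1 or alpha < 0:
            raise ValueError("need n >= 1 and alpha >= 0")
        self.rng = rng if rng is not None else random.Random()
        # Running sums of the unnormalised weights 1 / k**alpha.
        total = 0.0
        cumulative = []
        for k in range(1, n + 1):
            total += 1.0 / k ** alpha
            cumulative.append(total)
        # Normalise to a CDF; pin the last entry to 1.0 against rounding.
        self.cdf = [c / total for c in cumulative]
        self.cdf[-1] = 1.0

    def sample(self):
        u = self.rng.random()  # U(0,1) draw in [0, 1)
        # Index of the first CDF entry strictly greater than u, shifted to a 1-based rank.
        return bisect.bisect_right(self.cdf, u) + 1


gen = ZipfGenerator(1000, 1.0, random.Random(42))  # seed chosen arbitrarily
print([gen.sample() for _ in range(10)])
```

Setup costs O(n) time and memory, and each draw is a binary search, so O(log n). Any U(0,1) source works here; Python's random module happens to use the Mersenne Twister.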
zipfR is a free and open source library implemented with R. VGAM is another R package that also implements Zipf.
It's also worth noting that the GNU Scientific Library has an implementation of the Pareto distribution, which is effectively the continuous analogue of the discrete Zipf distribution.
Also, the Zeta distribution is equivalent to Zipf for infinite N. The GSL has an implementation of the Riemann zeta function, so you could use that to construct the distribution yourself.
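To illustrate the Zipf/Zeta relationship numerically, the sketch below computes the Zipf normalising constant for growing N and compares it with the known value zeta(2) = pi^2/6. The function names are hypothetical, and the exponent s = 2 is just an example; the limit only exists for s > 1.

```python
import math


def generalized_harmonic(n, s):
    """Normalising constant of Zipf on 1..n: the sum of k**-s for k = 1..n."""
    return math.fsum(k ** -s for k in range(1, n + 1))


def zipf_pmf(k, n, s):
    """P(X = k) for a Zipf distribution over ranks 1..n with exponent s."""
    return k ** -s / generalized_harmonic(n, s)


s = 2.0
for n in (10, 1000, 100000):
    # The finite-N constant approaches zeta(s) as n grows.
    print(n, generalized_harmonic(n, s))
print("zeta(2) =", math.pi ** 2 / 6)
```

For s <= 1 the sum diverges as N grows, so the Zeta distribution is not defined there, while the finite-N Zipf distribution still is.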
numpy.random.zipf generates Zipf samples in Python.
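A short usage sketch, assuming NumPy is installed. Note that NumPy's sampler requires an exponent a > 1 and draws from unbounded support (the Zeta distribution), so it does not restrict samples to a fixed number of items N. The seed and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12345)  # seed chosen arbitrarily
a = 2.0                             # exponent; NumPy requires a > 1
samples = rng.zipf(a, size=10000)   # integer samples >= 1

print(samples[:10])
print("share of 1s:", np.mean(samples == 1))
```

With a = 2 the share of 1s should be close to 1/zeta(2) = 6/pi^2, roughly 0.61.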
A very efficient algorithm for generating Zipf-distributed random variates was recently developed for the next versions (>= 3.6) of the Apache Commons Math library (see code here). It uses rejection-inversion sampling and also works for exponents less than 1. It does not require precomputing the CDF and keeping it in memory. Furthermore, the cost of generating one sample is constant and does not increase with the number of items.
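For readers who want to see the idea in Python, here is a sketch of a rejection-inversion sampler written from the description above. It is not the Apache Commons Math source; the class and helper names are hypothetical, and it assumes h(x) = x^(-exponent) as the hat function with its integral inverted in closed form.

```python
import math
import random


def _log1p_over_x(x):
    """log(1 + x) / x, with a short series near 0 to avoid 0/0."""
    if abs(x) > 1e-8:
        return math.log1p(x) / x
    return 1.0 - x * (0.5 - x * (1.0 / 3.0 - 0.25 * x))


def _expm1_over_x(x):
    """(exp(x) - 1) / x, with a short series near 0 to avoid 0/0."""
    if abs(x) > 1e-8:
        return math.expm1(x) / x
    return 1.0 + x * 0.5 * (1.0 + x / 3.0 * (1.0 + 0.25 * x))


class ZipfRejectionInversion:
    """Hypothetical sampler for ranks 1..n with P(k) proportional to k**-exponent."""

    def __init__(self, n, exponent, rng=None):
        if n < 1 or exponent < 0:
            raise ValueError("need n >= 1 and exponent >= 0")
        self.n = n
        self.e = exponent
        self.rng = rng if rng is not None else random.Random()
        # Bounds of the interval from which u is drawn.
        self.h_integral_x1 = self._h_integral(1.5) - 1.0
        self.h_integral_n = self._h_integral(n + 0.5)
        # Threshold for the cheap acceptance test.
        self.s = 2.0 - self._h_integral_inverse(self._h_integral(2.5) - self._h(2.0))

    def _h(self, x):
        # h(x) = x**-exponent
        return math.exp(-self.e * math.log(x))

    def _h_integral(self, x):
        # H(x) = (x**(1 - e) - 1) / (1 - e), which becomes log(x) when e == 1.
        log_x = math.log(x)
        return _expm1_over_x((1.0 - self.e) * log_x) * log_x

    def _h_integral_inverse(self, x):
        t = x * (1.0 - self.e)
        if t <= -1.0:
            return math.inf  # rounding pushed t out of the domain; treat as beyond n
        return math.exp(_log1p_over_x(t) * x)

    def sample(self):
        while True:
            # u is uniform between H(1.5) - h(1) and H(n + 0.5).
            u = self.h_integral_n + self.rng.random() * (self.h_integral_x1 - self.h_integral_n)
            x = self._h_integral_inverse(u)
            # Round to the nearest rank and clamp into 1..n.
            k = self.n if x >= self.n else max(int(x + 0.5), 1)
            # Cheap test first, then the exact acceptance condition.
            if k - x <= self.s or u >= self._h_integral(k + 0.5) - self._h(k):
                return k


sampler = ZipfRejectionInversion(1_000_000, 0.8, random.Random(7))  # arbitrary seed
print([sampler.sample() for _ in range(10)])
```

No table of size n is built, so setup is constant time, and the expected number of loop iterations per sample stays bounded regardless of n.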
We were discussing the answer of @stanga in this thread. There are some nice optimizations suggested for his algorithm.