How do I select a value from a list with non-uniform probabilities?

Posted on 2024-12-22 09:29:23

I am looking at the k-means++ initialization algorithm. The following two steps of the algorithm give rise to non-uniform probabilities:

For each data point x, compute D(x), the distance between x and the
nearest center that has already been chosen.

Choose one new data point at random as a new center, using a weighted
probability distribution where a point x is chosen with probability
proportional to D(x)^2.

How can I select with this stated weighted probability distribution in C++?

Comments (4)

半城柳色半声笛 2024-12-29 09:29:23

Discrete distributions are a lot easier to handle in C++11 with the <random> header and std::discrete_distribution. Here is an example:

#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::discrete_distribution<> d({20,30,40,10});
    std::map<int, int> m;
    for(int n=0; n<10000; ++n) {
        ++m[d(gen)];
    }
    for(auto p : m) {
        std::cout << p.first << " generated " << p.second << " times\n";
    }
}

And this is a sample of the output (the exact counts vary from run to run):

0 generated 2003 times
1 generated 3014 times
2 generated 4021 times
3 generated 962 times
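
For the k-means++ case in the question, the weights would be the D(x)^2 values rather than fixed constants. A minimal sketch, assuming the squared distances have already been computed into a vector (the helper name pick_next_center and the dist_sq parameter are made up for illustration, not part of any library):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns an index i with probability proportional
// to dist_sq[i]. Assumes dist_sq[i] already holds D(x_i)^2 for point i,
// that no weight is negative, and that at least one weight is positive.
std::size_t pick_next_center(const std::vector<double>& dist_sq,
                             std::mt19937& gen)
{
    assert(!dist_sq.empty());
    // The iterator-range constructor takes raw weights and normalizes
    // them internally, so they do not have to sum to 1.
    std::discrete_distribution<std::size_t> d(dist_sq.begin(), dist_sq.end());
    return d(gen);
}
```

Since the distances change after every new center is chosen, the distribution would have to be rebuilt from the updated weights before each pick.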

她如夕阳 2024-12-29 09:29:23

With a finite set of individual data points X, this calls for a discrete probability distribution.

The easiest way to do this is to enumerate the points X in order, and calculate an array representing their cumulative probability distribution function: (pseudocode follows)

/* 
 * xset is an array of points X,
 * cdf is a preallocated array of the same size
 */
function prepare_cdf(X[] xset, float[] cdf)
{
   float S = 0;
   int N = length(xset);   // number of points (sizeof would give a size in bytes)
   for i = 0:N-1
   {
      float weight = /* calculate D(xset[i])^2 here */
      // create cumulative sums and write to the element in cdf array
      S += weight;
      cdf[i] = S;
   }

   // now normalize so the CDF runs from 0 to 1
   for i = 0:N-1
   {
      cdf[i] /= S;
   }
}

function select_point(X[] xset, float[] cdf, Randomizer r)
{
   // generate a random floating point number from a 
   // uniform distribution from 0 to 1
   float p = r.nextFloatUniformPDF();
   // find the lowest index i such that p < cdf[i]
   int i = binarySearch(cdf, p);

   return xset[i];
}

You call prepare_cdf once, and then call select_point as many times as you need to generate random points.
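
In C++ the pseudocode above could look roughly like the sketch below. It is one possible translation, not the only one: it works on a vector of weights (assumed to hold D(x)^2 per point) and returns an index instead of a point, it uses std::upper_bound for the binary search, and it skips the normalization pass by drawing p from [0, total) rather than [0, 1). The name select_index is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Build the running sums of the weights (an unnormalized CDF).
// weights[i] is assumed to hold D(x_i)^2 for point i.
std::vector<double> prepare_cdf(const std::vector<double>& weights)
{
    std::vector<double> cdf(weights.size());
    double running = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        running += weights[i];
        cdf[i] = running;
    }
    return cdf;
}

// Return the lowest index i such that p < cdf[i], with p drawn
// uniformly from [0, cdf.back()).
std::size_t select_index(const std::vector<double>& cdf, std::mt19937& gen)
{
    assert(!cdf.empty() && cdf.back() > 0.0);
    std::uniform_real_distribution<double> uniform(0.0, cdf.back());
    const double p = uniform(gen);
    // upper_bound finds the first element strictly greater than p.
    auto it = std::upper_bound(cdf.begin(), cdf.end(), p);
    // Guard against rounding: fall back to the first index that
    // reaches the total, i.e. the last point with a positive weight.
    if (it == cdf.end())
        it = std::lower_bound(cdf.begin(), cdf.end(), cdf.back());
    return static_cast<std::size_t>(it - cdf.begin());
}
```

Each selection costs O(log N) after the O(N) setup, which pays off when many points are drawn from the same weights. For k-means++ the weights change after every new center, so the CDF has to be rebuilt each round anyway.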

最终幸福 2024-12-29 09:29:23

I'd take the following approach:

  • iterate over the data-points, storing their D-squared's in a double distance_squareds[] or std::vector<double> distance_squareds or whatnot, and storing the sum of their D-squared's in a double sum_distance_squareds.
  • use the drand48 function to choose a random number in [0.0, 1.0), and multiply it by sum_distance_squareds; store the result in random_number.
  • iterate over distance_squareds again, keeping a running total of the values, and as soon as the running total exceeds random_number, return the data-point corresponding to the D-squared that you'd just added. (A strict comparison means a point whose D-squared is 0 can never be returned.)
  • due to round-off error, it's remotely possible that you'll finish the loop without having returned; if so, just return the first data-point, or the last one, or whatever. (But don't worry, this should be a very rare edge case.)
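
The steps above can be sketched in C++ as follows. This is only an illustration under a few assumptions: the function name pick_by_linear_scan is made up, the squared distances are assumed to be precomputed, std::uniform_real_distribution stands in for drand48 (which is POSIX rather than standard C++), and the function returns an index rather than the data-point itself.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: distance_squareds[i] is assumed to hold D(x_i)^2.
std::size_t pick_by_linear_scan(const std::vector<double>& distance_squareds,
                                std::mt19937& gen)
{
    assert(!distance_squareds.empty());

    // First pass: sum of all squared distances.
    double sum_distance_squareds = 0.0;
    for (double d2 : distance_squareds)
        sum_distance_squareds += d2;

    // Random number in [0, sum); replaces drand48() * sum.
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double random_number = uniform(gen) * sum_distance_squareds;

    // Second pass: stop at the first index whose running total
    // exceeds random_number.
    double running_total = 0.0;
    for (std::size_t i = 0; i < distance_squareds.size(); ++i) {
        running_total += distance_squareds[i];
        if (running_total > random_number)
            return i;
    }

    // Round-off fallback: the loop finished without returning.
    return distance_squareds.size() - 1;
}
```

This needs no extra array, at the cost of an O(N) scan per selection.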

音盲 2024-12-29 09:29:23

Here is something that may help you.
Given a (numbers..) array and a matching probability distribution (prob..), it draws from numbers with those probabilities (here it just counts how often each one is drawn).

#include <iostream>
#include <cmath>
#include <time.h>
#include <stdlib.h>
#include <map>
#include <vector>
using namespace std;
#define ARRAY_SIZE(array) (sizeof(array)/sizeof(array[0]))

int checkDistribution(double random, const map<double, vector<int> > &distribution_map)
{
    int index = 0;
    map<double, vector<int> >::const_iterator it = distribution_map.begin();
    for (; it!=distribution_map.end(); ++it)
    {
        if (random < (*it).first)
        {
                int randomInternal = 0;
                if ((*it).second.size() > 1)
                    randomInternal = rand() % ((*it).second.size());
                index = (*it).second.at(randomInternal);
                break;
        }
    }
    return index;
}

void nextNum(int* results, const map<double, vector<int> > &distribution_map)
{
    double random  = (double) rand() / ((double) RAND_MAX + 1.0);  // uniform in [0, 1); dividing by RAND_MAX alone can yield exactly 1.0
    int index = checkDistribution(random,distribution_map);
    results[index]+=1;
}

int main() {

    srand (time(NULL));
    int results [] = {0,0,0,0,0};
    int numbers [] = {-1,0,1,2,3};
    double prob [] =  {0.01, 0.3, 0.58, 0.1, 0.01};
    int size = ARRAY_SIZE(numbers);
    // Building Distribution
    map<double, vector<int> > distribution_map;
    map<double, vector<int> >::iterator it;
    for (int i = 0; i < size; i++)
    {
        it = distribution_map.find(prob[i]);
        if (it!=distribution_map.end())
            it->second.push_back(i);
        else
        {
            vector<int> vec;
            vec.push_back(i);
            distribution_map[prob[i]] = vec;
        }
    }
    // PDF to CDF transform
    map<double, vector<int> > cumulative_distribution_map;
    map<double, vector<int> >::iterator iter_cumulative;
    double cumulative_distribution = 0.0;
    for (it=distribution_map.begin();it!=distribution_map.end();++it)
    {
        cumulative_distribution += ((*it).second.size() * (*it).first);
        cumulative_distribution_map[cumulative_distribution] = (*it).second;
    }

    for (int i = 0; i<100; i++)
    {
        nextNum(results, cumulative_distribution_map);
    }
    for (int j = 0; j<size; j++)
        cout<<" "<<results[j]<<" ";
    return 0;
}