MATLAB/General CS:从多个集合中进行无需替换的采样(+跟踪未采样的情况)

发布于 2024-10-22 01:35:10 字数 1270 浏览 7 评论 0原文

我目前正在实现一种优化算法,该算法要求我从多个集合中进行无替换的采样。虽然我是在 MATLAB 中编码,但这本质上是一道 CS 问题。

情况如下:

我有有限数量的集合(ABC),每个集合都有有限但可能不同数量的元素(a1,a2...a8b1,b2...b10c1、c2...c25)。我还有每个集合的概率向量,其中列出了该集合中每个元素的概率(即对于集合 A,P_A = [p_a1 p_a2... p_a8] 其中 sum(P_A) = 1 )。我通常使用它们为每个集合创建一个概率生成函数,给定 0 到 1 之间的统一数字,可以从该集合中吐出一个元素(即函数 P_A(u),给定 u = 0.25,将选择a2)。

我希望从A、BC集合中无替换进行采样。每个“完整样本”都是来自每个不同集合的元素序列,即(a1、b3、c2)。请注意,完整样本的空间是A、BC中元素的所有排列的集合。在上面的示例中,该空间为 (a1,a2...a8) x (b1,b2...b10) x (c1, c2. ..c25),我的空间中有 8*10*25 = 2000 个独特的“完整样本”。

不使用此设置进行替换的采样的烦人部分是,如果我的第一个样本是 (a1, b3, c2),那么这并不意味着我无法再次对元素 a1 进行采样- 这只是意味着我无法再次对完整序列(a1、b3、c2)进行采样。另一个烦人的部分是我正在使用的算法要求我对我尚未采样的元素的所有排列进行函数评估。

我现在可以采取的最好方法是跟踪样本案例。这有点低效,因为我的采样器被迫拒绝之前采样过的任何案例(因为我在不进行替换的情况下进行采样)。然后,我通过使用嵌套 for 循环遍历每个排列 (ax, by, cz) 来对未采样的情况进行函数评估,并且仅在 (ax, by, cz) 不包含在抽样案例中。同样,这有点低效,因为我必须“检查”每个排列(ax、by、cz)是否已被采样。

对于这个问题的任何建议,我将不胜感激。特别是,我正在寻找一种无需替换的采样方法并且跟踪未采样的案例,这些案例没有明确列出完整的样本空间(我通常使用 10 个集合,每个集合有 10 个元素,因此列出了完整的样本空间需要 10^10 x 10 矩阵)。我意识到这可能是不可能的,尽管找到有效的方法可以让我证明算法的真正局限性。

I currently implementing an optimization algorithm that requires me to sample without replacement from several sets. Although I am coding in MATLAB, this is essentially a CS question.

The situation is as follows:

I have a finite number of sets (A, B, C) each with a finite but possibly different number of elements (a1,a2...a8, b1,b2...b10, c1, c2...c25). I also have a vector of probabilities for each set which lists a probability for each element in that set (i.e. for set A, P_A = [p_a1 p_a2... p_a8] where sum(P_A) = 1). I normally use these to create a probability generating function for each set, which given a uniform number between 0 to 1, can spit out one of the elements from that set (i.e. a function P_A(u), which given u = 0.25, will select a2).

I am looking to sample without replacement from the sets A, B, and C. Each "full sample" is a sequence of elements from each of the different sets i.e. (a1, b3, c2). Note that the space of full samples is the set of all permutations of the elements in A, B, and C. In the example above, this space is (a1,a2...a8) x (b1,b2...b10) x (c1, c2...c25) and there are 8*10*25 = 2000 unique "full samples" in my space.

The annoying part of sampling without replacement with this setup is that if my first sample is (a1, b3, c2) then that does not mean I cannot sample the element a1 again - it just means that I cannot sample the full sequence (a1, b3, c2) again. Another annoying part is that the algorithm I am working with requires me do a function evaluation for all permutations of elements that I have not sampled.

The best method at my disposal right now is to keep track the sampled cases. This is a little inefficient since my sampler is forced to reject any case that has been sampled before (since I'm sampling without replacement). I then do the function evaluations for the unsampled cases, by going through each permutation (ax, by, cz) using nested for loops and only doing the function evaluation if that combination of (ax, by, cz) is not included in the sampled cases. Again, this is a little inefficient since I have to "check" whether each permutation (ax, by, cz) has already been sampled.

I would appreciate any advice in regards to this problem. In particular, I am looking a method to sample without replacement and keep track of unsampled cases that does not explicity list out the full sample space (I usually work with 10 sets with 10 elements each so listing out the full sample space would require a 10^10 x 10 matrix). I realize that this may be impossible, though finding efficient way to do it will allow me to demonstrate the true limits of the algorithm.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

妄想挽回 2024-10-29 01:35:10

真的需要跟踪所有未抽样的案例吗?即使您有一个存储 逻辑值 true 或 false 指示该排列是否已被采样,这仍然需要大约 10 GB 的存储空间,并且 MATLAB 可能会抛出 “内存不足”错误,或者如果您尝试创建变量,则会导致整个计算机急速停止那个尺寸的。

可以考虑的另一种选择是存储您的排列的指标的稀疏向量 已经采样了。让我们考虑一个较小的示例:

A = 1:8;
B = 1:10;
C = 1:25;
nA = numel(A);
nB = numel(B);
nC = numel(C);
beenSampled = sparse(1,nA*nB*nC);

1×2000 稀疏矩阵 beenSampled 一开始是空的(即它包含所有零),我们将为每个采样排列在给定索引处添加一个 1。我们可以使用函数 RANDI 为我们提供索引来获得新的样本排列转换为 ABC 以获得新的值集:

indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);

然后我们可以将这三个索引转换为单个唯一线性索引,即 < code>beenSampled 使用函数SUB2IND

index = sub2ind([nA nB nC],indexA,indexB,indexC);

现在我们可以测试 beenSampled 中的索引元素,看看它的值是否为 1(即我们已经对其进行了采样)或 0(即它是一个新样本)。如果已经采样,我们将重复上面查找一组新索引的过程。一旦我们有了尚未采样的排列,我们就可以处理它:

while beenSampled(index)
  indexA = randi(nA);
  indexB = randi(nB);
  indexC = randi(nC);
  index = sub2ind([nA nB nC],indexA,indexB,indexC);
end
beenSampled(index) = 1;
newSample = [A(indexA) B(indexB) C(indexC)];
%# ...do your subsequent processing...

如果您最终只采样一小部分,那么使用稀疏数组将为您节省大量空间所有可能的排列。对于较小的排列总数,就像上面的例子一样,我可能只使用逻辑向量而不是稀疏向量。

Do you really need to keep track of all of the unsampled cases? Even if you had a 1-by-1010 vector that stored a logical value of true or false indicating if that permutation had been sampled or not, that would still require about 10 GB of storage, and MATLAB is likely to either throw an "Out of Memory" error or bring your entire machine to a screeching halt if you try to create a variable of that size.

An alternative to consider is storing a sparse vector of indicators for the permutations you've already sampled. Let's consider your smaller example:

A = 1:8;
B = 1:10;
C = 1:25;
nA = numel(A);
nB = numel(B);
nC = numel(C);
beenSampled = sparse(1,nA*nB*nC);

The 1-by-2000 sparse matrix beenSampled is empty to start (i.e. it contains all zeroes) and we will add a one at a given index for each sampled permutation. We can get a new sample permutation using the function RANDI to give us indices into A, B, and C for the new set of values:

indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);

We can then convert these three indices into a single unique linear index into beenSampled using the function SUB2IND:

index = sub2ind([nA nB nC],indexA,indexB,indexC);

Now we can test the indexed element in beenSampled to see if it has a value of 1 (i.e. we sampled it already) or 0 (i.e. it is a new sample). If it has been sampled already, we repeat the process of finding a new set of indices above. Once we have a permutation we haven't sampled yet, we can process it:

while beenSampled(index)
  indexA = randi(nA);
  indexB = randi(nB);
  indexC = randi(nC);
  index = sub2ind([nA nB nC],indexA,indexB,indexC);
end
beenSampled(index) = 1;
newSample = [A(indexA) B(indexB) C(indexC)];
%# ...do your subsequent processing...

The use of a sparse array will save you a lot of space if you're only going to end up sampling a small portion of all of the possible permutations. For smaller total numbers of permutations, like in the above example, I would probably just use a logical vector instead of a sparse vector.

画尸师 2024-10-29 01:35:10

查看 matlab 文档中的 randi 函数;您只需将其与 length 函数结合使用即可从每个向量中选择随机条目。跟踪每个采样向量应该像将其连接到矩阵一样简单;

current_values = [5 89 45];  % lets say this is your current sample set
used_values = [used_values; current_values];
% wash, rinse, repeat

Check the matlab documentation for the randi function; you'll just want to use that in conjunction with the length function to choose random entries from each vector. Keeping track of each sampled vector should be as simple as just concatenating it to a matrix;

current_values = [5 89 45];  % lets say this is your current sample set
used_values = [used_values; current_values];
% wash, rinse, repeat
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文