子集计数算法

发布于 2024-11-29 18:37:20 字数 501 浏览 1 评论 0原文

我有一个想要有效解决的以下问题。我得到了一组布尔值的 k 元组，其中我事先知道每个 k 元组中的每个值的某些部分是 true。例如，我可能有以下 4 元组，其中每个元组至少有 60% 的布尔值设置为 true：

(1, 0, 1, 0)
(1, 1, 0, 1)
(0, 0, 1, 0)

我有兴趣查找具有特定属性的索引集：如果我查看每个值在指示索引处的元组中，至少这些元组的给定部分具有相应的位集。例如，在上面的 4 元组集合中，我可以考虑集合 {0}，因为如果查看上述每个元组的第 0 个元素，其中三分之二是 1，并且 2/3 ~= 66%> 60%。出于同样的原因，我也可以考虑集合 {2}。但是，我无法考虑 {1}，因为在索引 1 处，只有三分之一的元组具有 1，并且 1/3 小于 60%。同样，我不能使用 {0, 2} 作为集合，因为至少 60% 的元组同时设置了位 0 和 2 是不正确的。

我的目标是找到该属性所包含的所有集合。有人有解决这个问题的好算法吗？

谢谢。

原文

I have a following problem I want to solve efficiently. I am given a set of k-tuples of Boolean values where I know in advance that some fraction of each of the values in each of the k-tuples is true. For example, I might have the following 4-tuples, where each tuple has at least 60% of it's Boolean values set to true:

(1, 0, 1, 0)
(1, 1, 0, 1)
(0, 0, 1, 0)

I am interested in finding sets of indices that have a particular property: if I look at each of the values in the tuples at the indicated indices, at least the given fraction of those tuples have the corresponding bit set. For example, in the above set of 4-tuples, I could consider the set {0}, since if you look at the zeroth element of each of the above tuples, two-thirds of them are 1, and 2/3 ~= 66% > 60%. I could also consider the set {2} for the same reason. However, I could not consider {1}, since at index 1 only one third of the tuples have a 1 and 1/3 is less than 60%. Similarly, I could not use {0, 2} as a set, because it is not true that at least 60% of the tuples have both bits 0 and 2 set.

My goal is to find all sets for which this property holds. Does anyone have a good algorithm for solving this?

Thank you.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

桃气十足 2024-12-06 18:37:20

正如您所写，可以假设架构是 x86_64 并且您正在寻找实现性能，导致渐近复杂性（因为它不会陷入线性 - 根据问题的定义；）），我建议以下算法（< em>类似 C++ 的伪代码）：

/* N=16 -> int16; N=8 -> int8 etc. Select N according to input sizes. (maybe N=24 ;) ) */
count_occurences_intN(vector<intN> t, vector<long> &result_counters){
   intN counters[2^N]={};
   //first, count bit combinations
   for_each(v in t)
       ++counters[v];
   //second, count bit occurrences, using aggregated data 
   for(column=0; column<N; ++column){
      mask = 1 << column;
      long *result_counter_ptr = &(result_counters[column]);
      for(v=0; v<2^16; ++v)
         if( v & mask )
            ++(*result_counter_ptr);
   }
}

然后，将输入的 k 位向量拆分为 N 位向量，然后应用上述函数。

根据输入大小，您可以选择 N=8、N=16、N=24 或应用朴素方法来提高性能。

正如您所写，您不能在客户端进行任何假设，只需实现 N={8,16,24} 并天真地根据输入的大小从四种实现中选择一种。

As you've wrote, that can be assumed that architecture is x86_64 and you are looking for implementation performance, cause asymptotic complexity (as it is not going to go under linear - by definition of problem ;) ), I propose following algorithm (C++ like pseudocode):

/* N=16 -> int16; N=8 -> int8 etc. Select N according to input sizes. (maybe N=24 ;) ) */
count_occurences_intN(vector<intN> t, vector<long> &result_counters){
   intN counters[2^N]={};
   //first, count bit combinations
   for_each(v in t)
       ++counters[v];
   //second, count bit occurrences, using aggregated data 
   for(column=0; column<N; ++column){
      mask = 1 << column;
      long *result_counter_ptr = &(result_counters[column]);
      for(v=0; v<2^16; ++v)
         if( v & mask )
            ++(*result_counter_ptr);
   }
}

Than, split your input k-bit vectors into N-bit vectors, and apply above function.

Depending on input size you might improve performance you choosing N=8, N=16, N=24 or applying naive approach.

As you've wrote, you can not assume anything on client side, just implement N={8,16,24} and naive and select one from four implementations depending on size of input.

回复收藏 0 原文