Ideas for an algorithm? Randomly sort a list, emphasizing variety
I have a table of items with [ID, ATTR1, ATTR2, ATTR3]. I'd like to select about half of the items, but try to get a random result set that is NOT clustered. In other words, there's a fairly even spread of ATTR1 values, ATTR2 values, and ATTR3 values. This does NOT necessarily represent the data as a whole; the total table may be generally concentrated on certain attribute values, but I'd like to select a subset with more variety. The attributes are not inter-related, so there's not really a correlation between ATTR1 and ATTR2.
As an example, imagine ATTR1 = "State". I'd like each line item in my subset to be from a different state, even if in the whole set, most of my data is concentrated on a few states. And for this to simultaneously be true of the other 2 attributes, too. (I realize that some tables might not make this possible, but there's enough data that it's unlikely to have no solution)
Any ideas for an efficient algorithm? Thanks! I don't really even know how to search for this :)
(by the way, it's OK if this requires pre-calculation or -indexing on the whole set, so long as I can draw out random varied subsets quickly)
Comments (8)
Interesting problem. Since you want about half of the list, how about this:-
Create a list of half the values chosen entirely at random. Compute histograms for the value of ATTR1, ATTR2, ATTR3 for each of the chosen items.
:loop
Now randomly pick an item that's in the current list and an item that isn't.
If the new item increases the 'entropy' of the number of unique attributes in the histograms, keep it and update the histograms to reflect the change you just made.
Repeat N/2 times, or more depending on how much you want to force it to move towards covering every value rather than being random. You could also use 'simulated annealing' and gradually change the probability of accepting the swap - starting with 'sometimes allow a swap even if it makes it worse' down to 'only swap if it increases variety'.
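A minimal sketch of the swap approach above (the function and variable names are mine, not the answer's). It scores a subset by summing the Shannon entropy of each attribute's histogram, then keeps random swaps that raise that score:

```python
import math
import random
from collections import Counter

def entropy(hist, total):
    """Shannon entropy of a histogram given as a Counter."""
    return -sum((c / total) * math.log(c / total) for c in hist.values())

def diversity(subset):
    """Sum of per-attribute histogram entropies (ID at index 0 is skipped)."""
    n = len(subset)
    return sum(entropy(Counter(item[k] for item in subset), n)
               for k in range(1, len(subset[0])))

def diversify(items, swaps=None):
    """Start from a random half-size subset, keep swaps that raise diversity."""
    pool = list(items)
    random.shuffle(pool)
    half = len(pool) // 2
    chosen, rest = pool[:half], pool[half:]
    for _ in range(swaps if swaps is not None else half):
        i = random.randrange(len(chosen))
        j = random.randrange(len(rest))
        # Try swapping one chosen item for one unchosen item.
        candidate = chosen[:i] + chosen[i + 1:] + [rest[j]]
        if diversity(candidate) > diversity(chosen):
            chosen[i], rest[j] = rest[j], chosen[i]
    return chosen
```

This is the "only swap if it increases variety" end of the annealing schedule; accepting bad swaps with some decaying probability would complete the simulated-annealing variant.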
I don't know (and I hope someone who does will answer). Here's what comes to mind: make up a distribution for MCMC putting the most weight on the subsets with 'variety'.
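One hedged way to make this concrete (my construction, not the answer's): a Metropolis chain over half-size subsets whose stationary weight grows with the number of distinct attribute values the subset covers, a simple stand-in for 'variety':

```python
import math
import random

def variety(subset):
    """Count distinct values per attribute (ignoring the leading ID)."""
    return sum(len({item[k] for item in subset})
               for k in range(1, len(subset[0])))

def mcmc_sample(items, steps=1000, beta=1.0, rng=random):
    """Metropolis chain with stationary weight proportional to exp(beta * variety)."""
    pool = list(items)
    rng.shuffle(pool)
    half = len(pool) // 2
    inside, outside = pool[:half], pool[half:]
    for _ in range(steps):
        i, j = rng.randrange(half), rng.randrange(len(outside))
        old = variety(inside)
        inside[i], outside[j] = outside[j], inside[i]      # propose a swap
        accept_prob = math.exp(min(0.0, beta * (variety(inside) - old)))
        if rng.random() >= accept_prob:
            inside[i], outside[j] = outside[j], inside[i]  # reject: swap back
    return inside
```

Larger `beta` concentrates the chain on the most varied subsets; `beta = 0` degenerates to a plain random half.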
Assuming the items in your table are indexed by some form of id, I would, in a loop, iterate through half of the items in your table, using a random number generator to get each number.
IMHO
So we can generate a variety of combinations and
then search the table for records with those combinations.
If the table is sorted then searching also becomes easy.
Sample python code:
Output:
But assuming your table is very big (otherwise why would you need an algorithm ;) and the data is fairly uniformly distributed, there will be more hits in a real scenario. In this dummy case there are too many misses, which makes the algorithm look inefficient.
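The answer's sample code and output did not survive extraction; here is a hedged reconstruction of the idea as I read it (my names throughout): enumerate attribute-value combinations in shuffled order and look each one up in the table, counting combinations with no matching record as misses:

```python
import random
from itertools import product

def varied_subset(rows, want, rng=random):
    """rows are (id, attr1, attr2, attr3) tuples; returns up to `want` rows."""
    # Index rows by their attribute combination so each lookup is O(1).
    by_combo = {}
    for row in rows:
        by_combo.setdefault(row[1:], []).append(row)
    # All combinations of observed values, in random order.
    values = [sorted({row[k] for row in rows}) for k in (1, 2, 3)]
    combos = list(product(*values))
    rng.shuffle(combos)
    picked = []
    for combo in combos:            # a combo with no record is a "miss"
        bucket = by_combo.get(combo)
        if bucket:
            picked.append(bucket.pop())
        if len(picked) == want:
            break
    return picked
```

On sparse data most combinations are misses, which is exactly the inefficiency the answer concedes; on dense, uniformly distributed data most lookups hit.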
Let's assume that ATTR1, ATTR2, and ATTR3 are independent random variables (over a uniform random item). (If ATTR1, ATTR2, and ATTR3 are only approximately independent, then this sample should be approximately uniform in each attribute.) To sample one item (VAL1, VAL2, VAL3) whose attributes are uniformly distributed: choose VAL1 uniformly at random from the set of values for ATTR1; choose VAL2 uniformly at random from the set of values for ATTR2 over items with ATTR1 = VAL1; choose VAL3 uniformly at random from the set of values for ATTR3 over items with ATTR1 = VAL1 and ATTR2 = VAL2.
To get a sample of distinct items, apply the above procedure repeatedly, deleting each item after it is chosen. Probably the best way to implement this would be a tree. For example, if we have
then the tree is, in JavaScript object notation,
Deletion is accomplished recursively. If we sample id 4, then we delete it from its list at the leaf level. This list empties, so we delete the entry "f": [] from tree["a"]["d"]. If we now delete 3, then we delete 3 from its list, which empties, so we delete the entry "e": [] from tree["a"]["d"], which empties tree["a"]["d"], so we delete it in turn. In a good implementation, each item should take time O(# of attributes).
EDIT: For repeated use, reinsert the items into the tree after the whole sample is collected. This doesn't affect the asymptotic running time.
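The answer's example table and JavaScript-notation tree did not survive extraction, so here is a sketch using nested Python dicts keyed ATTR1 -> ATTR2 -> ATTR3 -> [ids], with the level-by-level sampling and recursive deletion described above (the example data is mine):

```python
import random

def build_tree(rows):
    """rows are (id, attr1, attr2, attr3) tuples."""
    tree = {}
    for id_, a1, a2, a3 in rows:
        tree.setdefault(a1, {}).setdefault(a2, {}).setdefault(a3, []).append(id_)
    return tree

def sample_and_delete(tree, rng=random):
    """Pick VAL1, VAL2, VAL3 uniformly level by level, then delete the item."""
    a1 = rng.choice(list(tree))
    a2 = rng.choice(list(tree[a1]))
    a3 = rng.choice(list(tree[a1][a2]))
    ids = tree[a1][a2][a3]
    id_ = ids.pop(rng.randrange(len(ids)))
    if not ids:                       # prune emptied branches recursively
        del tree[a1][a2][a3]
        if not tree[a1][a2]:
            del tree[a1][a2]
            if not tree[a1]:
                del tree[a1]
    return (id_, a1, a2, a3)
```

Each sample-plus-delete touches one path from root to leaf, i.e. O(# of attributes) per item as the answer claims; reinsert the sampled items afterwards to reuse the tree.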
Idea #2.
Compute histograms for each attribute on the original table.
For each item compute its uniqueness score = p(ATTR1) x p(ATTR2) x p(ATTR3) (multiply the probabilities for each attribute it has).
Sort by uniqueness.
Choose a probability distribution curve for your random numbers, ranging from picking only values in the first half of the set (a step function) to picking values evenly over the entire set (a flat line). Maybe a 1/x curve might work well for you in this case.
Pick values from the sorted list using your chosen probability curve.
This allows you to bias it towards more unique values or towards more evenness just by adjusting the probability curve you use to generate the random numbers.
Taking your example, assign every possible 'State' a numeric value (say, between 1 and 9). Do the same for the other attributes.
Now, assuming you don't have more than 10 possible values for each attribute, multiply the value for ATTR3 by 100, ATTR2 by 1,000, and ATTR1 by 10,000. Add up the results, and you will end up with what can resemble a vague hash of the item. Something like
10,000 * |ATTR1| + 1000 * |ATTR2| + 100 * |ATTR3|
The advantage here is that you know values between 10,000 and 19,000 have the same 'State' value; in other words, the first digit represents ATTR1. The same goes for ATTR2 and the other attributes.
You can sort all the values and, using something like bucket sort, pick one for each type, checking that the digit you're considering hasn't been picked already.
An example: if your end values are
A: 15,700 = 10,000 * 1 + 1,000 * 5 + 100 * 7
B: 13,400 = 10,000 * 1 + 1,000 * 3 + 100 * 4
C: 13,200 = ...
D: 12,300
E: 11,400
F: 10,900
you know that all your values have the same ATTR1; 2 have the same ATTR2 (that being B and C); and 2 have the same ATTR3 (B, E).
This, of course, assumes I understood correctly what you want to do. It's Saturday night, after all.
ps: yes, I could have used '10' as the first multiplier, but the example would have been messier; and yes, it's clearly a naive example and there are lots of possible optimizations here, which are left as an exercise for the reader.
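A sketch of the composite-value scheme above (my code; it assumes attribute values have already been mapped to single digits as the answer suggests):

```python
def composite(a1, a2, a3):
    """10,000 * ATTR1 + 1,000 * ATTR2 + 100 * ATTR3."""
    return 10_000 * a1 + 1_000 * a2 + 100 * a3

def attr_digits(value):
    """Recover the (ATTR1, ATTR2, ATTR3) digits from a composite value."""
    return value // 10_000, (value // 1_000) % 10, (value // 100) % 10

def bucket_pick(values):
    """Greedy bucket-sort-style pass over sorted composites: keep a value
    only if none of its attribute digits has been picked before."""
    seen = (set(), set(), set())
    picked = []
    for v in sorted(values):
        digits = attr_digits(v)
        if all(d not in s for d, s in zip(digits, seen)):
            for d, s in zip(digits, seen):
                s.add(d)
            picked.append(v)
    return picked
```

On the answer's six example values this keeps only one item, since all six share the same ATTR1 digit, which is exactly the clustering the scheme makes visible.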
It's a very interesting problem, for which I can see a number of applications. Notably for testing software: you get many 'main-flow' transactions, but only one is necessary to test that it works, and when selecting you would prefer to get an extremely varied sample.
I don't think you really need a full histogram structure; at most a binary one (absent/present).
This is used in fact to generate a list of predicates:
This list will contain, say, N elements. Now you wish to select K elements from this list. If K is less than N it's fine; otherwise we will duplicate the list i times, so that K <= N*i, with i minimal of course, so i = ceil(K/N) (note that this works even if K <= N, with i == 1). And finally, pick a predicate there, and look for an element that satisfies it... that's where randomness actually hits and I am less than adequate here.
Two remarks:
- If K > N you may be willing to actually select each predicate i-1 times and then select randomly from the list of predicates only to top off your selection, thus ensuring the over-representation of even the least common elements.
- An element may satisfy several predicates at once, for example you may get the tuple (1,2,3) by selecting on the third element being 3, so perhaps a refinement would be to group some related attributes together, though it would probably increase the number of predicates generated.
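The answer's own predicate-list code did not survive extraction; this is a hedged sketch of how I read it: one predicate per (attribute, value) pair present in the table, the list duplicated ceil(K/N) times, and each predicate satisfied by picking a matching row at random:

```python
import math
import random

def make_predicates(rows):
    """One predicate per (attribute index, value) pair present in the table."""
    return [(k, v) for k in (1, 2, 3) for v in sorted({row[k] for row in rows})]

def select_varied(rows, k_wanted, rng=random):
    """rows are (id, attr1, attr2, attr3) tuples."""
    preds = make_predicates(rows)
    i = math.ceil(k_wanted / len(preds))   # duplicate the list i times
    schedule = preds * i
    rng.shuffle(schedule)
    remaining = list(rows)
    picked = []
    for attr, value in schedule:
        matches = [row for row in remaining if row[attr] == value]
        if matches:                        # skip predicates no row satisfies
            row = rng.choice(matches)
            remaining.remove(row)
            picked.append(row)
        if len(picked) == k_wanted:
            break
    return picked
```

Because every present value appears in the schedule at least once per cycle, even the rarest attribute values get a chance to be represented before the common ones dominate.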