从加权组产生长度L的第一n组组合,按重量顺序
我有一组带有权重的字母,这给出了在字符串中出现的概率:
a - 0.7
b - 0.1
c - 0.3
...
z - 0.01
因此,单词aaaa
具有0.7*0.7*0.7*0.7*0.7*0.7*0.7*0.7 = 0.24 = 0.24 。单词
AAAC
将具有概率0.7*0.7*0.7*0.3 = 0.10
。同一单词的所有排列都具有相同的概率,因此我们只需要担心组合。
我想生成第一个唯一n
长度l
的字符串(例如,在这里,有4个字母和长度4,应该是aaaaa在
CCCC
等)。
假设在这里不可能地生成所有组合与它们的概率和按重量排序的蛮力方法。该算法(如果存在)必须能够适用于任何设置的大小和任何长度的字符串(例如,所有具有加权概率的256个字节,1024长的字符串,生成首万亿。)
I have a set of letters with weights, which gives their probability of appearing in a string:
a - 0.7
b - 0.1
c - 0.3
...
z - 0.01
As such, the word aaaa
has a probability of 0.7*0.7*0.7*0.7 = 0.24
. The word aaac
would have probability 0.7*0.7*0.7*0.3 = 0.10
. All permutations of the same word have the same probability, so we only need to worry about combinations.
I would like to generate the first unique N
strings of length L
in order of probability (for example here, with 4 letters and length 4, that should be aaaa
, aaac
, aacc
, aaab
, accc
, aabc
, cccc
, etc).
Assume that the brute force approach of generating all combinations with their probabilities and sorting by weight to not be possible here. The algorithm, should it exist, must be able to work for any set size and any length of string (e.g. all 256 bytes with weighted probabilities, 1024 length string, generate first trillion.)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
以下是一些使用堆的枚举代码。实施原则与用户3386109在评论中提出的原则略有不同。
通过降低概率来订购符号。在s符号的长度-L组合与l-1零的长度s + l-1的二进制字符串之间有建设性的一对一对应关系(用l-1中的l-1分界符中的每个符号计算出每个符号)。我们可以对后者的可能性进行一次列举。
使我们免于必须列举每个组合的部分是,对于每个二进制前缀,可以通过重复仍然可用的最可能的字母来找到最可能的单词。通过将前缀存储在堆中,我们只能打开出现在顶部N中的那些。
请注意,这使用与枚举组合数成比例的内存。这可能仍然太多了,在这种情况下,您可能想要迭代深度深度搜索之类的东西。
输出:
Below is some enumeration code that uses a heap. The implementation principle is slightly different from what user3386109 proposes in their comment.
Order the symbols by decreasing probability. There’s a constructive one-to-one correspondence between length-L combinations of S symbols, and binary strings of length S + L − 1 with L − 1 zeros (counting out each symbol in unary with L − 1 delimiters). We can do bit-at-a-time enumeration of the possibilities for the latter.
The part that saves us from having to enumerate every single combination is that, for each binary prefix, the most probable word can be found by repeating the most probable letter still available. By storing the prefixes in a heap, we can open only the ones that appear in the top N.
Note that this uses memory proportional to the number of enumerated combinations. This may still be too much, in which case you probably want something like iterative deepening depth-first search.
Output:
该答案提供了 @user3386109评论的实现:
我编写了一个生成器函数,遵循以下确切步骤:
prio(word)
返回单词的优先级(作为负数为负号,因为python heaps是min-heaps);next_letter.get(Letter)
是Letter
之后的下一个更高优先级字母,如果有;hapq
带有第一个字aaaa
的队列及其相应的优先级;forts
通过通过下一个概率字母替换字母来推动获得的单词;由于它是一个生成器函数,因此很懒惰:您可以在不计算所有单词的情况下获得第一个
n
单词。但是,仍将计算一些额外的单词,因为优先级队列的整个想法是我们不知道预先确切的顺序。This answer provides an implementation of @user3386109's comment:
I wrote a generator function, following these exact steps:
prio(word)
which returns the priority of a word (as a negative number because python heaps are min-heaps);next_letter.get(letter)
is the next higher-priority letter afterletter
, if any;heapq
queue with first wordaaaa
, and its corresponding priority;yield
this word, and push the words obtained by replacing a letter by the next probability letter;Since it is a generator function, it is lazy: you can get the first
N
words without computing all words. However, some extra words will still be computed, since the whole idea of the priority queue is that we don't know the exact order in advance.首先,我指出您的概率并不总计为1,这立即使我觉得有些不对劲。
我想分享我的版本(归功于Stef,David和user3386109),这在太空中更容易被证明是O(nl),而O(n*max(l,log(n)))及时更容易。因为您需要o(nl)来存储/打印结果,并为prio-Queue添加日志(n),因此您很难进一步改进。我看不到这里的人们在复杂性方面没有足够有效的效率。 PS,使其成为生成器无济于事,因为该空间主要由堆本身引起。 PS2对于大L,N的大小可以是天文学的。
复杂性分析:堆始于1个元素,每个POP(这是顶部N)最多可将2个元素推入。因此,总空间为O(N 2 项目大小) = O(NL)。时间成本为n*max(l,log(n)),因为按下/弹出每个成本log(n),而内部循环中的许多步骤成本o(l)。
您可以看到它打印出与Stef相同的35个结果。现在,该算法的核心部分是如何选择在每个弹出声之后推动的两个元素。首先考虑相反的问题,如何找到每个孩子的父母,所以(1)所有可能的连击形成部分顺序(最大元素是'aaaa',不,我们在这里不需要Zorn的引理),(2)我们赢了“错过了比孩子大的任何东西,但比父母小(2个字母共享相同概率的说明不是完全正确的陈述,但实际上我发现它没关系),(3)没有重复,所以我们没有管理缓存访问的字符串。我选择父母的方式总是会降低字符串前面的字母,直到达到“ a”。例如,“ cbbz”的父母将是“ abbz”而不是“ ccbz”。相反,我们找到了一个给定父母的孩子,很高兴看到我们有多达2个孩子:在这种情况下,提高了第一封非 - ''的信,直到它等于下一个字母('acbb'=> 'abbb',而'accb'没有这个方向的孩子)或提高最后一个“ a”('acbb'=>'ccbb','accb'=>'cccb')
First pardon me for pointing out your probabilities do not sum to 1 which instantly makes me feel something is wrong.
I would like to share my version (credit to Stef, David, and user3386109), which is easier to prove to be O(NL) in space and O(N*max(L,log(N))) in time. Because you need O(NL) to store/print out the result, and add a log(N) for prio-queue, you can see it is hard to further make improvement. I don't see the folks here have proved efficient enough on the complexity-wise. P.S, making it a generator won't help since the space is mostly caused by the heap itself. P.S.2 for large L, the size of N can be astronomical.
Complexity analysis: The heap starts with 1 element and each pop, which is one of the top-N, pushes up to 2 elements back in. Therefore, the total space is O(N2item size)=O(NL). Time cost is N*max(L,log(N)) because push/pop each costs log(N) while many steps in the inner loop costs O(L).
You can see it print the same 35 results as Stef's. Now the core part of this algo is how to choose the 2 elements to push after each pop. First consider the opposite question, how to find the parent of each child so (1) all possible combos form a partial order (maximal element is 'aaaa', and No we don't need Zorn's lemma here), (2) we won't miss anything larger than the child but smaller than the parent (not a entirely correct statement for 2 letters sharing the same probability, but in actuality I found it does not matter), and (3) no repeats so we don't have to manage caching the visited strings. My way of choosing the parent is always lower the letters on the front of the string until it reaches 'a'. E.g, parent of 'cbbz' will be 'abbz' instead of 'ccbz'. Conversely, we find the children of a given parent, and it is nice to see we have up to 2 children in this case: raise the first non-'a' letter until it equals the next letter, ('acbb'=>'abbb', while 'accb' has no child in this direction) or raise the last 'a' ('acbb'=>'ccbb', 'accb'=>'cccb')