从加权组产生长度L的第一n组组合，按重量顺序

发布于 2025-01-26 16:07:48 字数 522 浏览 3 评论 0原文

我有一组带有权重的字母，这给出了在字符串中出现的概率：

a - 0.7
b - 0.1
c - 0.3
...
z - 0.01

因此，单词aaaa具有0.7*0.7*0.7*0.7*0.7*0.7*0.7*0.7 = 0.24 = 0.24 。单词AAAC将具有概率0.7*0.7*0.7*0.3 = 0.10。同一单词的所有排列都具有相同的概率，因此我们只需要担心组合。

我想生成第一个唯一n长度l的字符串（例如，在这里，有4个字母和长度4，应该是aaaaa在CCCC等）。

假设在这里不可能地生成所有组合与它们的概率和按重量排序的蛮力方法。该算法（如果存在）必须能够适用于任何设置的大小和任何长度的字符串（例如，所有具有加权概率的256个字节，1024长的字符串，生成首万亿。）

原文

I have a set of letters with weights, which gives their probability of appearing in a string:

a - 0.7
b - 0.1
c - 0.3
...
z - 0.01

As such, the word aaaa has a probability of 0.7*0.7*0.7*0.7 = 0.24. The word aaac would have probability 0.7*0.7*0.7*0.3 = 0.10. All permutations of the same word have the same probability, so we only need to worry about combinations.

I would like to generate the first unique N strings of length L in order of probability (for example here, with 4 letters and length 4, that should be aaaa, aaac, aacc, aaab, accc, aabc, cccc, etc).

Assume that the brute force approach of generating all combinations with their probabilities and sorting by weight to not be possible here. The algorithm, should it exist, must be able to work for any set size and any length of string (e.g. all 256 bytes with weighted probabilities, 1024 length string, generate first trillion.)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

迷雾森÷林ヴ 2025-02-02 16:07:48

以下是一些使用堆的枚举代码。实施原则与用户3386109在评论中提出的原则略有不同。

通过降低概率来订购符号。在s符号的长度-L组合与l-1零的长度s + l-1的二进制字符串之间有建设性的一对一对应关系（用l-1中的l-1分界符中的每个符号计算出每个符号）。我们可以对后者的可能性进行一次列举。

使我们免于必须列举每个组合的部分是，对于每个二进制前缀，可以通过重复仍然可用的最可能的字母来找到最可能的单词。通过将前缀存储在堆中，我们只能打开出现在顶部N中的那些。

请注意，这使用与枚举组合数成比例的内存。这可能仍然太多了，在这种情况下，您可能想要迭代深度深度搜索之类的东西。

symbol_probability_dict = {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}
L = 4

import heapq
import math

loss_symbol_pairs = [(-math.log(p), c) for (c, p) in symbol_probability_dict.items()]
loss_symbol_pairs.sort()

heap = [(0, 0, "")]
while heap:
    min_loss, i, s = heapq.heappop(heap)
    if len(s) < L:
        heapq.heappush(heap, (min_loss, i, s + loss_symbol_pairs[i][1]))
        if i + 1 < len(loss_symbol_pairs):
            heapq.heappush(
                heap,
                (
                    min_loss
                    + (L - len(s))
                    * (loss_symbol_pairs[i + 1][0] - loss_symbol_pairs[i][0]),
                    i + 1,
                    s,
                ),
            )
    else:
        print(s)

输出：

aaaa
aaac
aacc
aaab
accc
aacb
cccc
accb
aabb
aaaz
cccb
acbb
aacz
ccbb
abbb
accz
aabz
cbbb
cccz
acbz
bbbb
ccbz
abbz
aazz
cbbz
aczz
bbbz
cczz
abzz
cbzz
bbzz
azzz
czzz
bzzz
zzzz

Below is some enumeration code that uses a heap. The implementation principle is slightly different from what user3386109 proposes in their comment.

Order the symbols by decreasing probability. There’s a constructive one-to-one correspondence between length-L combinations of S symbols, and binary strings of length S + L − 1 with L − 1 zeros (counting out each symbol in unary with L − 1 delimiters). We can do bit-at-a-time enumeration of the possibilities for the latter.

The part that saves us from having to enumerate every single combination is that, for each binary prefix, the most probable word can be found by repeating the most probable letter still available. By storing the prefixes in a heap, we can open only the ones that appear in the top N.

Note that this uses memory proportional to the number of enumerated combinations. This may still be too much, in which case you probably want something like iterative deepening depth-first search.

symbol_probability_dict = {"a": 0.7, "b": 0.1, "c": 0.3, "z": 0.01}
L = 4

import heapq
import math

loss_symbol_pairs = [(-math.log(p), c) for (c, p) in symbol_probability_dict.items()]
loss_symbol_pairs.sort()

heap = [(0, 0, "")]
while heap:
    min_loss, i, s = heapq.heappop(heap)
    if len(s) < L:
        heapq.heappush(heap, (min_loss, i, s + loss_symbol_pairs[i][1]))
        if i + 1 < len(loss_symbol_pairs):
            heapq.heappush(
                heap,
                (
                    min_loss
                    + (L - len(s))
                    * (loss_symbol_pairs[i + 1][0] - loss_symbol_pairs[i][0]),
                    i + 1,
                    s,
                ),
            )
    else:
        print(s)

Output:

aaaa
aaac
aacc
aaab
accc
aacb
cccc
accb
aabb
aaaz
cccb
acbb
aacz
ccbb
abbb
accz
aabz
cbbb
cccz
acbz
bbbb
ccbz
abbz
aazz
cbbz
aczz
bbbz
cczz
abzz
cbzz
bbzz
azzz
czzz
bzzz
zzzz

回复收藏 0 原文

笑着哭最痛 2025-02-02 16:07:48

该答案提供了 @user3386109评论的实现：

我认为解决方案是优先队列。队列的初始输入是一个带有最高概率字母的字符串（aaaa）。从队列中读取字符串后，将字母替换为下一个较低的概率字母，然后将新字符串添加到队列中。因此，阅读AAAA后，写AAAC。阅读AAAC后，写AACC（将A更改为c）和aaab （将C更改为b）。

我编写了一个生成器函数，遵循以下确切步骤：

定义助手函数：prio（word）返回单词的优先级（作为负数为负号，因为python heaps是min-heaps）;
定义助手dict：next_letter.get（Letter）是Letter之后的下一个更高优先级字母，如果有；
初始化a hapq带有第一个字aaaa的队列及其相应的优先级；
当从队列中弹出单词时，通过将当前单词与上一个单词进行比较来避免可能重复。
从队列中弹出一个单词后，如果它不是重复的话，则forts 通过通过下一个概率字母替换字母来推动获得的单词；

由于它是一个生成器函数，因此很懒惰：您可以在不计算所有单词的情况下获得第一个n单词。但是，仍将计算一些额外的单词，因为优先级队列的整个想法是我们不知道预先确切的顺序。

import heapq
from math import prod
from itertools import pairwise

def gen_combs(weights, r = 4):
    next_letter = dict(pairwise(sorted(weights, key=weights.get, reverse=True)))
    def prio(word, weights=weights):
        return -prod(map(weights.get, word))
    first_word = max(weights, key=weights.get) * r
    queue = [(prio(first_word), first_word)]
    prev_word = None
    while queue:
        w, word = heapq.heappop(queue)
        if word != prev_word:
            yield word
            prev_word = word
            seen_letters = set()
            for i, letter in enumerate(word):
                if letter not in seen_letters:
                    new_letter = next_letter.get(letter)
                    if new_letter:
                        new_word = ''.join((word[:i], new_letter, word[i+1:]))
                        heapq.heappush(queue, (prio(new_word), new_word))
                        seen_letters.add(letter)

# TESTING

weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}

# print all words
print(list(gen_combs(weights)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc',
#  'bcca', 'bbaa', 'zaaa', 'bccc', 'bbca', 'zcaa', 'bbcc',
#  'bbba', 'zcca', 'zbaa', 'bbbc', 'zccc', 'zbca', 'bbbb',
#  'zbcc', 'zbba', 'zzaa', 'zbbc', 'zzca', 'zbbb', 'zzcc',
#  'zzba', 'zzbc', 'zzbb', 'zzza', 'zzzc', 'zzzb', 'zzzz']

n = 10

# print first n words, one per line
for _,word in zip(range(n), gen_combs(weights)):
    print(word)

# print first n words
from itertools import islice
print(list(islice(gen_combs(weights), n)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc', 'bcca', 'bbaa', 'zaaa']

This answer provides an implementation of @user3386109's comment:

I think the solution is a priority queue. The initial input to the queue is a string with the highest probability letter (aaaa). After reading a string from the queue, replace a letter with the next lower probability letter, and add that new string to the queue. So after reading aaaa, write aaac. After reading aaac, write aacc (changing an a to c) and aaab (changing a c to a b).

I wrote a generator function, following these exact steps:

Define a helper functions: prio(word) which returns the priority of a word (as a negative number because python heaps are min-heaps);
Define a helper dict: next_letter.get(letter) is the next higher-priority letter after letter, if any;
Initialise a heapq queue with first word aaaa, and its corresponding priority;
When popping words from the queue, avoid possible duplicates by comparing the current word with the previous word;
After popping a word from the queue, if it is not a duplicate, then yield this word, and push the words obtained by replacing a letter by the next probability letter;

Since it is a generator function, it is lazy: you can get the first N words without computing all words. However, some extra words will still be computed, since the whole idea of the priority queue is that we don't know the exact order in advance.

import heapq
from math import prod
from itertools import pairwise

def gen_combs(weights, r = 4):
    next_letter = dict(pairwise(sorted(weights, key=weights.get, reverse=True)))
    def prio(word, weights=weights):
        return -prod(map(weights.get, word))
    first_word = max(weights, key=weights.get) * r
    queue = [(prio(first_word), first_word)]
    prev_word = None
    while queue:
        w, word = heapq.heappop(queue)
        if word != prev_word:
            yield word
            prev_word = word
            seen_letters = set()
            for i, letter in enumerate(word):
                if letter not in seen_letters:
                    new_letter = next_letter.get(letter)
                    if new_letter:
                        new_word = ''.join((word[:i], new_letter, word[i+1:]))
                        heapq.heappush(queue, (prio(new_word), new_word))
                        seen_letters.add(letter)

# TESTING

weights = {'a': 0.7, 'b': 0.1, 'c': 0.3, 'z': 0.01}

# print all words
print(list(gen_combs(weights)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc',
#  'bcca', 'bbaa', 'zaaa', 'bccc', 'bbca', 'zcaa', 'bbcc',
#  'bbba', 'zcca', 'zbaa', 'bbbc', 'zccc', 'zbca', 'bbbb',
#  'zbcc', 'zbba', 'zzaa', 'zbbc', 'zzca', 'zbbb', 'zzcc',
#  'zzba', 'zzbc', 'zzbb', 'zzza', 'zzzc', 'zzzb', 'zzzz']

n = 10

# print first n words, one per line
for _,word in zip(range(n), gen_combs(weights)):
    print(word)

# print first n words
from itertools import islice
print(list(islice(gen_combs(weights), n)))
# ['aaaa', 'caaa', 'ccaa', 'baaa', 'ccca', 'bcaa', 'cccc', 'bcca', 'bbaa', 'zaaa']

回复收藏 0 原文

梦醒灬来后我 2025-02-02 16:07:48

首先，我指出您的概率并不总计为1，这立即使我觉得有些不对劲。

我想分享我的版本（归功于Stef，David和user3386109），这在太空中更容易被证明是O（nl），而O（n*max（l，log（n）））及时更容易。因为您需要o（nl）来存储/打印结果，并为prio-Queue添加日志（n），因此您很难进一步改进。我看不到这里的人们在复杂性方面没有足够有效的效率。 PS，使其成为生成器无济于事，因为该空间主要由堆本身引起。 PS2对于大L，N的大小可以是天文学的。

复杂性分析：堆始于1个元素，每个POP（这是顶部N）最多可将2个元素推入。因此，总空间为O（N 2 项目大小） = O（NL）。时间成本为n*max（l，log（n）），因为按下/弹出每个成本log（n），而内部循环中的许多步骤成本o（l）。

您可以看到它打印出与Stef相同的35个结果。现在，该算法的核心部分是如何选择在每个弹出声之后推动的两个元素。首先考虑相反的问题，如何找到每个孩子的父母，所以（1）所有可能的连击形成部分顺序（最大元素是'aaaa'，不，我们在这里不需要Zorn的引理），（2）我们赢了“错过了比孩子大的任何东西，但比父母小（2个字母共享相同概率的说明不是完全正确的陈述，但实际上我发现它没关系），（3）没有重复，所以我们没有管理缓存访问的字符串。我选择父母的方式总是会降低字符串前面的字母，直到达到“ a”。例如，“ cbbz”的父母将是“ abbz”而不是“ ccbz”。相反，我们找到了一个给定父母的孩子，很高兴看到我们有多达2个孩子：在这种情况下，提高了第一封非 - ''的信，直到它等于下一个字母（'acbb'=＆gt; 'abbb'，而'accb'没有这个方向的孩子）或提高最后一个“ a”（'acbb'=＆gt;'ccbb'，'accb'=＆gt;'cccb'）

import heapq, math

letters=[[-math.log(x[1]),x[0]] for x in zip('abcz',[0.7,0.1,0.3,0.01])]
N=35 #something big to force full print in testing. Here 35 is calculated by DP or Pascal triangle
L=4
letters.sort()
letters=[[i,x[0],x[1]] for i,x in enumerate(letters)]
heap=[[letters[0][1]*L,[0]*L,L-1]]

while heap and N>0:
    N-=1
    p,nextLargest,childPtr=heapq.heappop(heap)
    print(''.join([letters[x][2] for x in nextLargest]))
    if childPtr<L-1 and nextLargest[childPtr+1]<len(letters)-1:
        child=list(nextLargest)
        child[childPtr+1]+=1
        if childPtr+2==L or child[childPtr+1]<=child[childPtr+2]:
            prob=sum([letters[x][1] for x in child]) #this can be improved by using a delta, but won't change overall complexity
            heapq.heappush(heap,[prob,child,childPtr])
    if childPtr>=0:
        child=list(nextLargest)
        child[childPtr]+=1
        prob=sum([letters[x][1] for x in child])
        heapq.heappush(heap,[prob,child,childPtr-1])

First pardon me for pointing out your probabilities do not sum to 1 which instantly makes me feel something is wrong.

I would like to share my version (credit to Stef, David, and user3386109), which is easier to prove to be O(NL) in space and O(N*max(L,log(N))) in time. Because you need O(NL) to store/print out the result, and add a log(N) for prio-queue, you can see it is hard to further make improvement. I don't see the folks here have proved efficient enough on the complexity-wise. P.S, making it a generator won't help since the space is mostly caused by the heap itself. P.S.2 for large L, the size of N can be astronomical.

Complexity analysis: The heap starts with 1 element and each pop, which is one of the top-N, pushes up to 2 elements back in. Therefore, the total space is O(N2item size)=O(NL). Time cost is N*max(L,log(N)) because push/pop each costs log(N) while many steps in the inner loop costs O(L).

You can see it print the same 35 results as Stef's. Now the core part of this algo is how to choose the 2 elements to push after each pop. First consider the opposite question, how to find the parent of each child so (1) all possible combos form a partial order (maximal element is 'aaaa', and No we don't need Zorn's lemma here), (2) we won't miss anything larger than the child but smaller than the parent (not a entirely correct statement for 2 letters sharing the same probability, but in actuality I found it does not matter), and (3) no repeats so we don't have to manage caching the visited strings. My way of choosing the parent is always lower the letters on the front of the string until it reaches 'a'. E.g, parent of 'cbbz' will be 'abbz' instead of 'ccbz'. Conversely, we find the children of a given parent, and it is nice to see we have up to 2 children in this case: raise the first non-'a' letter until it equals the next letter, ('acbb'=>'abbb', while 'accb' has no child in this direction) or raise the last 'a' ('acbb'=>'ccbb', 'accb'=>'cccb')

import heapq, math

letters=[[-math.log(x[1]),x[0]] for x in zip('abcz',[0.7,0.1,0.3,0.01])]
N=35 #something big to force full print in testing. Here 35 is calculated by DP or Pascal triangle
L=4
letters.sort()
letters=[[i,x[0],x[1]] for i,x in enumerate(letters)]
heap=[[letters[0][1]*L,[0]*L,L-1]]

while heap and N>0:
    N-=1
    p,nextLargest,childPtr=heapq.heappop(heap)
    print(''.join([letters[x][2] for x in nextLargest]))
    if childPtr<L-1 and nextLargest[childPtr+1]<len(letters)-1:
        child=list(nextLargest)
        child[childPtr+1]+=1
        if childPtr+2==L or child[childPtr+1]<=child[childPtr+2]:
            prob=sum([letters[x][1] for x in child]) #this can be improved by using a delta, but won't change overall complexity
            heapq.heappush(heap,[prob,child,childPtr])
    if childPtr>=0:
        child=list(nextLargest)
        child[childPtr]+=1
        prob=sum([letters[x][1] for x in child])
        heapq.heappush(heap,[prob,child,childPtr-1])

回复收藏 0 原文

~没有更多了~