组合学:对字符进行分组挑战
我在工作中正在研究一些分组问题。 问题有点多,请耐心解答。 我觉得它们很有趣。 如果这里有人也对组合学感兴趣,请帮助我。
好吧,我们有一堆角色,我在这里接受了援助。
我们可以通过哪些方式对元素进行分组? 假设我们有 4 个字符 help 。 有效的组(保留顺序)将是:
援助
爱DS
一个id
援助
爱DS
一个 ID
援助
帮助
你如何枚举所有组? 你能告诉我任意 n 个字母有多少种组合吗?
2. 特殊情况
如果情况有所不同,例如 Ai sd 和 ai sd 是两组怎么办?
您需要多少时间来列举所有案例? 找到 4 个字母和 5 个字母的大小写之间的时间差是多少?
如果把“空格”当作一个字符。 枚举完之后,
如果您根据距离定义从一个单词到另一个单词的转换。 假设“ai ds”和“ai ds”的距离为 1,因为您应该将字母“i”移动一步。 你能找到任意单词两侧距离为 n 的单词吗?
编辑:
“艾滋病”是一个词。
“a ids”“aid s”是与原始单词“aids”相距 1 的两个单词。
“a id s”是与原始单词“aids”相距两倍距离的单词。
“aid s”是与该词相距三倍的词。
这个问题似乎是金矿。
奖励:两个单词之间的最小距离是多少。 就像“aids”与“a ids”的距离是一距离,也是两距离。 是否有一个“中点”单词可以让您以最小的距离到达枚举中的任何其他单词? 从一个词到另一个词有多少条路径?
I was working on some grouping problems at my work. There are quite a few questions, please bear with me. I find them quite interesting. If anyone here is also interested in combinatorics, please help me out.
Ok so we have a bunch of characters , here i have taken a i d s.
What are the ways we can group the elements ? Let us say we have 4 characters a i d s.
Valid groups(preserving the order) would be:a i d s
a i ds
a id s
ai d s
ai ds
a ids
aid s
aids
How do you enumerate all the groups? Could you tell me how many combinations are there for any n letters?
2 . Special Cases
What if the case made a difference like Ai sd and ai sd are two groups?
How much time would it take you to enumerate all the cases? What would be the time difference between finding the 4 letters and 5 letters case?
If you take the "blank space" as a character. After all the enumeration, how many characters would you have written?
If you define a transformation from a word to another word in terms of distance.
Say "ai ds" and "a i ds" has 1 distance because you should moved the letter "i" one step.
Could you find the words at n distance in either side of any word?
Edit:
"aids" is a single word.
"a ids" "aid s" are two words which are at 1 distance from the original word "aids".
"a id s" is a word which is two distance from the original word "aids".
"a i d s" is a word which is three distance from the word.
This problem seems to be gold mine.
Bonus:What is the smallest distance between two words. Like "aids" is one distance from "a ids" and also two distance. Is there a "midpoint" word from where you can reach any other word in the enumeration with least distance? How many paths are there from a word to another?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
计算组合的数量是微不足道的。 基本上,每两个字母之间都有一个字符。 它可能是“epsilon”(空)或空间。 因此,对于 n 个字母,您有 n-1 个这样的分隔符。 由于每个字符只能有两个值,这与 n-1 位的二进制数相同。 所以你可以有 2 的 n-1 次方组合。
现在,针对特殊情况。 如果每个字符可以是小写或大写,则为字符的 2 的 n 次方变体(只要它们是字母)。 现在,上面的每个组合根据大小写都有 2n 种变化,因此最终结果是 (2(n-1)) * (2**n),等于 2 ** (2 *n -1)。
对于每个附加字母,您可以将枚举它们所需的时间增加一倍或四倍(取决于大小写),这可以从公式中轻松理解。
字符总数有点困难,但足以注意到每个“空格”有一半的时间是“epsilon”。 所以我们有 (2 ** (n-1)) * n (字母)+ ((2 ** (n-1)) * (n-1)) / 2 (空格)。 在示例中:
最后,距离问题与Levenshtein 距离有关。 我考虑过使用 Burkhard-Keller 树,但我现在发现这根本没有必要,因为问题相当简单。
首先,使一个字符串等于另一个字符串所需的最小插入/删除/更改次数称为 Levenshtein 距离< /a>. 这直接适用于该问题:根据需要添加空格、删除空格、更改小写/大写。 通常,最好使用动态编程来解决这个问题,通常可以将其视为保存有关解决问题的小部分,然后重用计算其他部分/更大的部分。
但考虑到我们问题的特殊限制,我们只能比较表示空格/epsilon 的“二进制”数字。
假设函数 f(word) 将返回表示该单词中空格的二进制数。 例如,对于“ai ds”,它将返回“010”,对于“aid s”,它将返回“111”。 每个组合之间的变化次数是通过对 f 的结果进行异或 (010 xor 111 = 101) 给出的,然后计算等于 1 的位数。让我们在 Scala 中为此编写几个函数:
现在,比较小写/一旦我们丢弃空格,大写就很容易了:
因此,编辑距离是:
我们还知道有关编辑距离的以下属性。 假设 d(x,y) 是 x 和 y 之间的编辑距离。 然后我们知道:
最后一个标准称为三角不等式。 简而言之,从 x 到 z 的路径至少与从 x 到 y 的路径加上从 y 到 z 的路径一样小(想象一个具有顶点 x、y 和 z 的三角形)。
好的,让我们考虑一下奖励问题。
两个单词之间有多少条路径?好吧,如果编辑距离为 n,则意味着您有“n”个唯一修改,a1、a2、...、an。 对于这些修改的每个不同顺序,您都有一个路径。 “n”个元素的排列数是“n”(或n!)的阶乘:
是否存在“中心”组合?实际上,没有。 如果我们回过头来将组合视为表示空格/无空格和小写/大写的二进制数对,那么很明显,您可以简单地反转这些位以产生一个新的组合,其与所选组合的距离是最大可能。
或者换句话说,对于每一个 n 个字母的组合,都有一个且只有一个对应的组合,因此两个组合之间的编辑距离为 2*n - 1,即任意两个组合之间的最大可能距离。
我发现我忘记计算到 s 的(最小)距离为 n 的所有组合。 好吧,我们知道每个组合都可以表示为两个二进制数,分别表示每个字母的空格和大小写,第一个是 n-1 位长,第二个是 n 位长。
我们还知道,我们可以简单地反转这些“数字”中的任何一个来获得差异。 因此,如果我们得到一个 2*n - 1 位长的大二进制数,并且枚举它的所有数字,则与它的最小距离 d 处的组合由大小为 2*n-1 位的组中的 2*n-1 位数字的组合给出“d”不重复。 对于N=2*n-1,这样的组合的数量为N!/((Nd)! * d!)。
例如,对于距离2和“辅助”,n=4,d=2,N=2*4-1=7,组合数为7!/(5!*2!) = 7 * 6/ 2 = 21。
我们可以这样组成组合:
这将返回要更改的字母/空格列表。 我们通过要切换的位数来指示需要更改哪个字母或空格。 为了简化事情,我们假设大写的二进制数和空格/无空格的二进制数连接在一个二进制数中。
接下来,我们必须找到一种方法来根据这些信息产生变化。 大写/小写很简单,假设我们收到的单词不带空格:
切换空格更困难。 我们将使用上面定义的 spaceToBinary,将其转换为已设置的位数列表,切换请求的位并返回该值。 之后,我们使用不同的函数将空格插入到正确的位置:
现在我们必须计算空格差,删除空格,切换大小写,然后将空格添加回来。 让我们看看:
最后,我们计算所有组合,然后为每个组合生成一个修改后的字符串:
此代码均已在 Scala 2.8 上进行了测试(除了我刚刚进行的一些重命名)。 这是运行的结果:
Computing the number of combinations is trivial. Basically, you have a character between each two letters. It might be "epsilon" (empty), or space. So, for n letters you have n-1 such separator characters. As each character can only have two values, that's the same as a binary number of n-1 digits. So you can have 2 to the power of n-1 combinations.
Now, for the special cases. If each character can be either lowercase or uppercase, that's 2 to the power of n variations for the characters (as long as they are letters). Now, EACH combination above has 2n variations based on capitalization, so the end result is (2(n-1)) * (2**n), which is equal to 2 ** (2*n -1).
For each aditional letter you either double or quadruple (depending on the capitalization thing) the time taken to enumerate them, as can be easily understood from the formula.
The total number of characters is a bit more difficult, but suffice to notice that each "space" is "epsilon" half the time. So we have (2 ** (n-1)) * n (letters) + ((2 ** (n-1)) * (n-1)) / 2 (spaces). In the example:
Finally, the distance problem is related to the Levenshtein distance. I thought about using a Burkhard-Keller tree, but I see now this is not necessary at all, as the problem is rather simpler.
First, the minimum number of inserts/deletions/changes necessary to make one string equal to another is called the Levenshtein distance. This is directly applicable to the problem: you add spaces, delete spaces, change lowercase/uppercase as needed. Usually, this is best solved with Dynamic Programming, which can be generally thought of as keeping data about the solution to small parts of the problem, which then get reused computing other parts/bigger parts.
But given the particular restrictions of our problem, we can just compare the "binary" numbers representing spaces/epsilon.
Say a function f(word) will return the binary number representing spaces in that word. For example, it will return "010" for "ai ds" and "111" for "a i d s". The number of changes between each combination is given by XORing the results of f (010 xor 111 = 101), and then counting the number of bits equal to 1. Let's write a couple of functions in Scala for that:
Now, comparing lowercase/uppercase is quite easy, once we discard spaces:
So, the Levenshtein distance is:
We also know the following properties about the Levenshtein distance. Say d(x,y) is the Levenshtein distance between x and y. Then we know:
The last criteria i known as the triange inequality. Simply put, the path from x to z is at least as small as the path from x to y plus the path from y to z (think a triangle with vertices x, y and z).
Ok, so let's think about the bonus questions.
How many paths there are between two words? Well, if the Levenshtein distance is n, that means you have "n" unique modifications, a1, a2, ..., an. For each different ordering of these modifications you have a path. The number of permutations of "n" elements is the factorial of "n" (or n!):
Is there a "central" combination? Actually, no. If we go back and think of combinations as pairs of binary numbers representing the spaces/no-spaces and lower/uppercases, then it should be obvious that you can simply invert the bits to produce a new combination whose distance to the one chosen is the maximum possible.
Or, in other words, for every combination of n letters, there is one, and only one, corresponding combination, so that the Levenshtein distance between the two combinations is 2*n - 1, the maximum possible distance between any two combinations.
I see I forgot to compute all combinations whose (minimum) distance to s is n. Well, we know each combination can be represented as two binary numbers, representing the spaces and the capitalization of each letter, the first being n-1 digits long, and the second n digits long.
We also know that we can simply invert any of these "digits" to get a difference. So, if we get one big binary number 2*n - 1 digits long, and we enumerate all its digits, the combinations at a minimum distance d from it are given by the combination of the 2*n-1 digits in groups of size "d" with no repetitions. For N=2*n-1, the number of such combination is N!/((N-d)! * d!).
For example, for distance 2 and "aids", n=4, d=2, N=2*4-1=7, and the number of combinations is 7!/(5!*2!) = 7 * 6 / 2 = 21.
We can compose the combinations this way:
This will return lists of lists of letter/spaces to change. We indicate which letter or space needs changing through the number of the bit we want to toggle. To simplify things, we assume the binary number for the capitalization and the binary number for the space/no-space are concatenated in a single binary number.
Next, we have to find a way to produce the variations from this information. The upper/lowercase is simple, assuming we receive the word without spaces:
Toggling spaces is more difficult. We'll use spacesToBinary, defined above, convert that into a list of bit numbers which are set, toggle the requested bits and return that. Afterwards, we use a different function to insert the spaces in the proper places:
Now we have to compute the space difference, remove the spaces, toggle cases, and then add the spaces back. Let's see:
Finally, we compute all combinations, and then produce a modified string for each one:
This code has all been tested on Scala 2.8 (except for some renaming I just made). Here is the result of a run:
正如其他答案已经说过的,第 1 点有 2^(n-1) 种可能性。关于一些特殊情况(第 2 点):
好吧,在这种情况下,您有 2^n 种不同的情况组合,因此您有 2^n * 2^(n-1) = 2^(2 * n - 1) 种可能性。
这是一个更有趣的问题。 您有 1 种可能不放置空格、3 种可能放置 1 个空格、3 种可能放置 2 个空格以及 1 种可能放置 3 个空格。 如果我没记错的话,这是一个二项分布,并且有计算数字的公式。 您还可以使用帕斯卡三角形来实现此目的:
获得这些数字后,计算总字符数像这样:
As the other answers already said, there are 2^(n-1) possiblities for point 1. About some of the special cases (point 2):
Well, in that case you had 2^n different case combinations, so you had 2^n * 2^(n-1) = 2^(2 * n - 1) possibilities at all.
That's a more interesting question. You have 1 possibility to place no space, 3 possibilities to place 1 space, 3 possibilites to place 2 spaces and 1 possibility to place 3 spaces. This is a binomial distribution, if I recall correctly, and there are formulas to calculate the numbers. You can also use Pascal's triangle for this:
After you got these numbers, calculate the total character count like this:
http://www-cs-faculty.stanford.edu/~ knuth/fasc3b.ps.gz(如果看不到 postscript,请下载 Ghostscript/Ghostview)详细讨论了分区。
对于长度为 n 的序列,有 2^(n-1) 个分区。 在每对连续的项目之间考虑一点。 如果设置了该位,则它们将被分隔(如您列出的那样,用空格分隔)。 “aids”(长度为 4)有 2^3 个可能的分区。
回答您的其他问题:
枚举时间: O(n*2^n) - 输出长度不变。 不仅项目数量随着输入长度的增加而增加,而且每个项目中的字符数量也会增加。
写入的字符数:我们不要计算换行符(如果计算的话,请添加另外 2^(n-1) 个字符)。 那么你就有了 n*2^(n-1) 个非空格字符,加上所有唯一的 n-1 位字符串中 1 的数量。 k 位比特串写出来有 k*2^k 位,其中一半是 1,所以字符总数为 [n+(n-1)/2]*2^(n-1) ),不计算换行符。 在“aids”的 8 个变体列表中,有 32 个非空格字符和 12 个空格 - 分别为 4*2^3 和 (3/2)*2^3。
编辑距离:您必须更精确地了解转换及其成本。 我认为“单词”指的是单个分区(8 个示例行之一)。 如果编辑是删除或添加单个空格,那么您讨论的是 n-1 位字符串上的汉明距离。
http://www-cs-faculty.stanford.edu/~knuth/fasc3b.ps.gz (download Ghostscript/Ghostview if you can't view postscript) discusses partitions in detail.
For a sequence of length n, there are 2^(n-1) partitions. Think of a bit between each pair of consecutive items. If the bit is set, then they're separated (by a space, as you listed them). "aids" (length 4) has 2^3 possible partitions.
In reply to your other questions:
Time to enumerate: O(n*2^n) - constant in the length of the output. Not only does the number of items increase with input length, but the number of characters in each item increases as well.
Number of characters written: let's not count newlines (if you do, add another 2^(n-1) characters). Then you have n*2^(n-1) nonspace characters, plus the number of 1s in all the unique n-1 digit bit strings. The k digit bit strings, when written out, have k*2^k bits, and half of those are 1. So the total number of characters is [n+(n-1)/2]*2^(n-1)), not counting newlines. In your list of the 8 variations on "aids", there are 32 nonspace characters, and 12 spaces - 4*2^3, and (3/2)*2^3 respectively.
Edit distance: you'll have to be more precise about the transformations and their cost. By "word" I assume you mean a single partition (one of your 8 example lines). If an edit is the removal or addition of a single space, then you're talking about Hamming distance on n-1 digit bit strings.
访问距离 k 或更小的每个单词的简单算法:使用哈希表仅访问每个位串一次(或 2^(n-1) 位的数组,但这可能太大),递归访问每个相邻单次编辑差异的数量(假设汉明距离:对于来自 1..(n-1) 的 i,与源位串进行 XOR 2^i,切换第 i 位)。
执行此操作的深度为 k(深度随递归一起传递),您将访问距离 k 内的所有编辑。 当然,如果您只想要深度为 k 的那些,您将需要使用广度优先搜索:而不是立即访问每个邻居,而是将它们保留在要访问的队列中。 在访问给定一代 (j) 的项目的队列时(所有项目都具有相同的最佳编辑距离),将未来的项目放入下一代 (j+1) 的不同队列中。 这样,您可以使用尽可能少的编辑次数首先访问每个字符串(当每个转换具有相同的成本时,宽度优先 = 最佳优先)。
如果您不想进行广度优先搜索,那么您始终可以计算 k 或更少内的单词集以及 k-1 或更少内的单词集,并取差值(您将使用两个单独的表)。 这实际上是“迭代深化搜索”。
BK 树在这里不合适,除非您正在考虑一组非结构化的单词(通用词典)。 我们已经确切地知道单个单词的分区结构。
A simple algorithm for visiting each of the words within distance k or less: using a hash table to visit each bitstring just once (or an array of 2^(n-1) bits, but that may be too big), recursively visit each of the adjacent single-edit differences (assuming Hamming distance: for i from 1..(n-1), XOR 2^i with the source bitstring, toggling the ith bit).
Do this to a depth of k (the depth is passed along with your recursion), and you'll have visited all edits within distance k. Of course, if you want just those of exactly depth k, you'll want to use breadth first search: instead of immediately visiting each neighbor, keep them in a queue to be visited. While visiting the queue for items of a given generation (j) (all having the same best edit distance), queue future items in a different queue for the next generation (j+1). This way you visit each string first using the least possible number of edits (breadth first = best first when each transition has the same cost).
If you don't want to do breadth first search, then you can always compute the set of words within k or less, and the set within k-1 or less, and take the difference (you'd use two separate tables). This is effectively "iterative deepening search".
B-K trees aren't appropriate here unless you're considering an unstructured set of words (a general dictionary). We know exactly the structure of the partitions for a single word already.
计数论据是正确的。
我有一个通用的方法来解决这样的问题,即使用分支定界。 这是一个示例。
基本上,您无需编写循环来扫描字符串,而是编写一个递归函数,并跟踪成本作为其参数之一。 然后,在每一步中,您可以 1) 逐步降低字符串,额外成本为零,然后 2) 对字符串进行一些小更改,为成本添加增量,然后向前迈进,3) 重复 2您想要考虑多种不同类型的更改。
然后制定总体成本预算,拒绝任何成本超出预算的分支机构。
最后,作为外循环,以 0 的预算执行整个操作一次。如果没有产生任何匹配项,则以 1 的成本再次执行一次,依此类推,直到获得一个或多个匹配项。
The counting arguments are right.
There's a general way I program problems like this, using branch-and-bound. Here's an example.
Basically, rather than write a loop to scan the string, you write a recursive function, and keep track of cost as one of its arguments. Then at each step, you can 1) step down the string, for an additional cost of zero, then 2) make some small change to the string, add an increment to the cost, and then step forward, and 3) repeat 2 for as many different kinds of changes you want to consider.
Then have an overall cost budget, and refuse to take any branch where the cost would exceed the budget.
Finally, as an outer loop, do the whole thing once with a budget of 0. If that doesn't produce any matches, do it again with a cost of 1, and so forth until you get one or more matches.