当前位置：文江博客话题详情

组合学：对字符进行分组挑战

发布于 2024-07-26 22:04:47 字数 911 浏览 4 评论 0原文

我在工作中正在研究一些分组问题。问题有点多，请耐心解答。我觉得它们很有趣。如果这里有人也对组合学感兴趣，请帮助我。

好吧，我们有一堆角色，我在这里接受了援助。

我们可以通过哪些方式对元素进行分组？假设我们有 4 个字符 help 。有效的组（保留顺序）将是：
援助
爱DS
一个id
援助
爱DS
一个 ID
援助
帮助

你如何枚举所有组？你能告诉我任意 n 个字母有多少种组合吗？

2. 特殊情况

如果情况有所不同，例如 Ai sd 和 ai sd 是两组怎么办？
您需要多少时间来列举所有案例？找到 4 个字母和 5 个字母的大小写之间的时间差是多少？
如果把“空格”当作一个字符。枚举完之后，
如果您根据距离定义从一个单词到另一个单词的转换。假设“ai ds”和“ai ds”的距离为 1，因为您应该将字母“i”移动一步。你能找到任意单词两侧距离为 n 的单词吗？

编辑：
“艾滋病”是一个词。
“a ids”“aid s”是与原始单词“aids”相距 1 的两个单词。
“a id s”是与原始单词“aids”相距两倍距离的单词。
“aid s”是与该词相距三倍的词。

这个问题似乎是金矿。

奖励：两个单词之间的最小距离是多少。就像“aids”与“a ids”的距离是一距离，也是两距离。是否有一个“中点”单词可以让您以最小的距离到达枚举中的任何其他单词？从一个词到另一个词有多少条路径？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

め可乐爱微笑 2024-08-02 22:04:47

计算组合的数量是微不足道的。基本上，每两个字母之间都有一个字符。它可能是“epsilon”（空）或空间。因此，对于 n 个字母，您有 n-1 个这样的分隔符。由于每个字符只能有两个值，这与 n-1 位的二进制数相同。所以你可以有 2 的 n-1 次方组合。

aids = 4 characters (n = 4)
n - 1 = 3
2 ** (n-1) = 8 combinations

现在，针对特殊情况。如果每个字符可以是小写或大写，则为字符的 2 的 n 次方变体（只要它们是字母）。现在，上面的每个组合根据大小写都有 2n 种变化，因此最终结果是 (2(n-1)) * (2**n)，等于 2 ** (2 *n -1)。

对于每个附加字母，您可以将枚举它们所需的时间增加一倍或四倍（取决于大小写），这可以从公式中轻松理解。

字符总数有点困难，但足以注意到每个“空格”有一半的时间是“epsilon”。所以我们有 (2 ** (n-1)) * n （字母）+ ((2 ** (n-1)) * (n-1)) / 2 （空格）。在示例中：

(2 ** 3) * 4 = 32 characters
(2 ** 3) * 3 / 2 = 12 spaces
Total = 44 characters

最后，距离问题与Levenshtein 距离有关。我考虑过使用 Burkhard-Keller 树，但我现在发现这根本没有必要，因为问题相当简单。

首先，使一个字符串等于另一个字符串所需的最小插入/删除/更改次数称为 Levenshtein 距离< /a>. 这直接适用于该问题：根据需要添加空格、删除空格、更改小写/大写。通常，最好使用动态编程来解决这个问题，通常可以将其视为保存有关解决问题的小部分，然后重用计算其他部分/更大的部分。

但考虑到我们问题的特殊限制，我们只能比较表示空格/epsilon 的“二进制”数字。

假设函数 f(word) 将返回表示该单词中空格的二进制数。例如，对于“ai ds”，它将返回“010”，对于“aid s”，它将返回“111”。每个组合之间的变化次数是通过对 f 的结果进行异或 (010 xor 111 = 101) 给出的，然后计算等于 1 的位数。让我们在 Scala 中为此编写几个函数：

def spacesToBinary(s: String) = {
  (s zip s.drop(1)) // transform "ai ds" into ((a,i), (i, ), ( ,d), (d, s))
  .foldLeft(0) { (binary, pair) => pair match {
      case (' ', _) => binary            // A space is always followed by a letter,
                                         // so we ignore it
      case (_, ' ') => (binary << 1) | 1 // If a letter is followed by a space, bit = 1
      case _ => binary << 1              // If not, bit = 0
  }}
}

def numberOfOnes(d: Int) = {
  var bits = 0
  var n = d
  while(n != 0) {
    if ((n & 1) == 1)
      bits += 1
    n >>= 1
  }
  bits
} 

def spaceDistance(s1: String, s2: String) = 
  numberOfOnes(spacesToBinary(s1) ^ spacesToBinary(s2))

现在，比较小写/一旦我们丢弃空格，大写就很容易了：

def caseDistance(s1: String, s2: String) = {
  val ns1 = s1 filter (_ != ' ')
  val ns2 = s2 filter (_ != ' ')
  (ns1 zip ns2).foldLeft(0){(acc, pair) => acc + (if (pair._1 == pair._2) 0 else 1)}
}

因此，编辑距离是：

def levenshtein(s1: String, s2: String) = spaceDistance(s1, s2) + caseDistance(s1, s2)

我们还知道有关编辑距离的以下属性。假设 d(x,y) 是 x 和 y 之间的编辑距离。然后我们知道：

d(x, y) = 0 if, and only if, x = y
d(x, y) = d(y, x)
d(x, y) + d(y, z) >= d(x, z)

最后一个标准称为三角不等式。简而言之，从 x 到 z 的路径至少与从 x 到 y 的路径加上从 y 到 z 的路径一样小（想象一个具有顶点 x、y 和 z 的三角形）。

好的，让我们考虑一下奖励问题。

两个单词之间有多少条路径？好吧，如果编辑距离为 n，则意味着您有“n”个唯一修改，a1、a2、...、an。对于这些修改的每个不同顺序，您都有一个路径。 “n”个元素的排列数是“n”（或n！）的阶乘：

def factorial(n: Int): Int = n match {
  case 0 => 1
  case _ => n * factorial(n-1)
}

2! = 2
3! = 6
4! = 24
5! = 120
etc

是否存在“中心”组合？实际上，没有。如果我们回过头来将组合视为表示空格/无空格和小写/大写的二进制数对，那么很明显，您可以简单地反转这些位以产生一个新的组合，其与所选组合的距离是最大可能。

或者换句话说，对于每一个 n 个字母的组合，都有一个且只有一个对应的组合，因此两个组合之间的编辑距离为 2*n - 1，即任意两个组合之间的最大可能距离。

我发现我忘记计算到 s 的（最小）距离为 n 的所有组合。好吧，我们知道每个组合都可以表示为两个二进制数，分别表示每个字母的空格和大小写，第一个是 n-1 位长，第二个是 n 位长。

我们还知道，我们可以简单地反转这些“数字”中的任何一个来获得差异。因此，如果我们得到一个 2*n - 1 位长的大二进制数，并且枚举它的所有数字，则与它的最小距离 d 处的组合由大小为 2*n-1 位的组中的 2*n-1 位数字的组合给出“d”不重复。对于N=2*n-1，这样的组合的数量为N!/((Nd)! * d!)。

例如，对于距离2和“辅助”，n=4，d=2，N=2*4-1=7，组合数为7!/(5!*2!) = 7 * 6/ 2 = 21。

我们可以这样组成组合：

def computeCombinations(n: Int, d: Int): List[List[Int]] = {
  def gen(d: Int, l: List[Int]): List[List[Int]] = d match {
    case 0 => List(List())
    case _ => for(el <- l;
                  sl <- gen(d-1, l.dropWhile(_ != el).tail))
              yield el :: sl
  }

  gen(d, (0 until n).toList)
}

这将返回要更改的字母/空格列表。我们通过要切换的位数来指示需要更改哪个字母或空格。为了简化事情，我们假设大写的二进制数和空格/无空格的二进制数连接在一个二进制数中。

接下来，我们必须找到一种方法来根据这些信息产生变化。大写/小写很简单，假设我们收到的单词不带空格：

def toggleCharCase(c: Char) = (c ^ ('A' ^ 'a')).toChar
def toggleCases(s: String, bits: List[Int]) = (
  s.zipWithIndex
  map (t => if (Set(bits: _*) contains t._2) (toggleCharCase(t._1), t._2) else t)
  map (_._1)
)

切换空格更困难。我们将使用上面定义的 spaceToBinary，将其转换为已设置的位数列表，切换请求的位并返回该值。之后，我们使用不同的函数将空格插入到正确的位置：

def toggleSpaces(s: String, bits: List[Int]): List[Int] = {
  val before = spacesToBinary(s)
  val beforeBits = Set() ++ 
                   (for(i <- 0 to 30; // Assuming 32 bits, 1 for sign
                        if (scala.Math.pow(2, i).toInt & before) != 0)
                    yield i)
  val afterBits = (beforeBits union Set(bits: _*)) diff 
                  (beforeBits intersect Set(bits : _*))
  afterBits.toList
}

def addSpaces(s: String, bits: List[Int]): String = (
  s.filter(_ != ' ').zipWithIndex 
  map (t => t._1.toString + (if (bits contains t._2) " " else ""))
  mkString
)

现在我们必须计算空格差，删除空格，切换大小写，然后将空格添加回来。让我们看看：

def changeCombination(s: String, n: Int, bits: List[Int]) = {
  // Split the binary number with all changes into case changes and space changes
  val spaceBits = bits filter (_ >= n) map (_ - n)
  val caseBits = bits filter (_ < n)

  // Compute the set of spaces after changing them
  val newSpaces = toggleSpaces(s, spaceBits)

  // Remove spaces and toggle case
  val noSpaces = s filter (_ != ' ')
  val caseChanged = toggleCases(noSpaces, caseBits).mkString

  // Now add the spaces as computed before
  val spacesAdded = addSpaces(caseChanged, newSpaces).mkString
  spacesAdded
}

最后，我们计算所有组合，然后为每个组合生成一个修改后的字符串：

def makeCombinations(s: String, d: Int) = {
  val n = (s filter (_ != ' ')).length
  for(combination <- computeCombinations(n*2-1, d))
  yield changeCombination(s, n, combination)
}

此代码均已在 Scala 2.8 上进行了测试（除了我刚刚进行的一些重命名）。这是运行的结果：

scala> makeCombinations("ai ds", 2) foreach println
AI ds
Ai Ds
Ai dS
A i ds
Aids
Ai d s
aI Ds
aI dS
a I ds
aIds
aI d s
ai DS
a i Ds
aiDs
ai D s
a i dS
aidS
ai d S
a ids
a i d s
aid s

Computing the number of combinations is trivial. Basically, you have a character between each two letters. It might be "epsilon" (empty), or space. So, for n letters you have n-1 such separator characters. As each character can only have two values, that's the same as a binary number of n-1 digits. So you can have 2 to the power of n-1 combinations.

aids = 4 characters (n = 4)
n - 1 = 3
2 ** (n-1) = 8 combinations

Now, for the special cases. If each character can be either lowercase or uppercase, that's 2 to the power of n variations for the characters (as long as they are letters). Now, EACH combination above has 2n variations based on capitalization, so the end result is (2(n-1)) * (2**n), which is equal to 2 ** (2*n -1).

For each aditional letter you either double or quadruple (depending on the capitalization thing) the time taken to enumerate them, as can be easily understood from the formula.

The total number of characters is a bit more difficult, but suffice to notice that each "space" is "epsilon" half the time. So we have (2 ** (n-1)) * n (letters) + ((2 ** (n-1)) * (n-1)) / 2 (spaces). In the example:

(2 ** 3) * 4 = 32 characters
(2 ** 3) * 3 / 2 = 12 spaces
Total = 44 characters

Finally, the distance problem is related to the Levenshtein distance. I thought about using a Burkhard-Keller tree, but I see now this is not necessary at all, as the problem is rather simpler.

First, the minimum number of inserts/deletions/changes necessary to make one string equal to another is called the Levenshtein distance. This is directly applicable to the problem: you add spaces, delete spaces, change lowercase/uppercase as needed. Usually, this is best solved with Dynamic Programming, which can be generally thought of as keeping data about the solution to small parts of the problem, which then get reused computing other parts/bigger parts.

But given the particular restrictions of our problem, we can just compare the "binary" numbers representing spaces/epsilon.

Say a function f(word) will return the binary number representing spaces in that word. For example, it will return "010" for "ai ds" and "111" for "a i d s". The number of changes between each combination is given by XORing the results of f (010 xor 111 = 101), and then counting the number of bits equal to 1. Let's write a couple of functions in Scala for that:

def spacesToBinary(s: String) = {
  (s zip s.drop(1)) // transform "ai ds" into ((a,i), (i, ), ( ,d), (d, s))
  .foldLeft(0) { (binary, pair) => pair match {
      case (' ', _) => binary            // A space is always followed by a letter,
                                         // so we ignore it
      case (_, ' ') => (binary << 1) | 1 // If a letter is followed by a space, bit = 1
      case _ => binary << 1              // If not, bit = 0
  }}
}

def numberOfOnes(d: Int) = {
  var bits = 0
  var n = d
  while(n != 0) {
    if ((n & 1) == 1)
      bits += 1
    n >>= 1
  }
  bits
} 

def spaceDistance(s1: String, s2: String) = 
  numberOfOnes(spacesToBinary(s1) ^ spacesToBinary(s2))

Now, comparing lowercase/uppercase is quite easy, once we discard spaces:

def caseDistance(s1: String, s2: String) = {
  val ns1 = s1 filter (_ != ' ')
  val ns2 = s2 filter (_ != ' ')
  (ns1 zip ns2).foldLeft(0){(acc, pair) => acc + (if (pair._1 == pair._2) 0 else 1)}
}

So, the Levenshtein distance is:

def levenshtein(s1: String, s2: String) = spaceDistance(s1, s2) + caseDistance(s1, s2)

We also know the following properties about the Levenshtein distance. Say d(x,y) is the Levenshtein distance between x and y. Then we know:

d(x, y) = 0 if, and only if, x = y
d(x, y) = d(y, x)
d(x, y) + d(y, z) >= d(x, z)

The last criteria i known as the triange inequality. Simply put, the path from x to z is at least as small as the path from x to y plus the path from y to z (think a triangle with vertices x, y and z).

Ok, so let's think about the bonus questions.

How many paths there are between two words? Well, if the Levenshtein distance is n, that means you have "n" unique modifications, a1, a2, ..., an. For each different ordering of these modifications you have a path. The number of permutations of "n" elements is the factorial of "n" (or n!):

def factorial(n: Int): Int = n match {
  case 0 => 1
  case _ => n * factorial(n-1)
}

2! = 2
3! = 6
4! = 24
5! = 120
etc

Is there a "central" combination? Actually, no. If we go back and think of combinations as pairs of binary numbers representing the spaces/no-spaces and lower/uppercases, then it should be obvious that you can simply invert the bits to produce a new combination whose distance to the one chosen is the maximum possible.

Or, in other words, for every combination of n letters, there is one, and only one, corresponding combination, so that the Levenshtein distance between the two combinations is 2*n - 1, the maximum possible distance between any two combinations.

I see I forgot to compute all combinations whose (minimum) distance to s is n. Well, we know each combination can be represented as two binary numbers, representing the spaces and the capitalization of each letter, the first being n-1 digits long, and the second n digits long.

We also know that we can simply invert any of these "digits" to get a difference. So, if we get one big binary number 2*n - 1 digits long, and we enumerate all its digits, the combinations at a minimum distance d from it are given by the combination of the 2*n-1 digits in groups of size "d" with no repetitions. For N=2*n-1, the number of such combination is N!/((N-d)! * d!).

For example, for distance 2 and "aids", n=4, d=2, N=2*4-1=7, and the number of combinations is 7!/(5!*2!) = 7 * 6 / 2 = 21.

We can compose the combinations this way:

def computeCombinations(n: Int, d: Int): List[List[Int]] = {
  def gen(d: Int, l: List[Int]): List[List[Int]] = d match {
    case 0 => List(List())
    case _ => for(el <- l;
                  sl <- gen(d-1, l.dropWhile(_ != el).tail))
              yield el :: sl
  }

  gen(d, (0 until n).toList)
}

This will return lists of lists of letter/spaces to change. We indicate which letter or space needs changing through the number of the bit we want to toggle. To simplify things, we assume the binary number for the capitalization and the binary number for the space/no-space are concatenated in a single binary number.

Next, we have to find a way to produce the variations from this information. The upper/lowercase is simple, assuming we receive the word without spaces:

def toggleCharCase(c: Char) = (c ^ ('A' ^ 'a')).toChar
def toggleCases(s: String, bits: List[Int]) = (
  s.zipWithIndex
  map (t => if (Set(bits: _*) contains t._2) (toggleCharCase(t._1), t._2) else t)
  map (_._1)
)

Toggling spaces is more difficult. We'll use spacesToBinary, defined above, convert that into a list of bit numbers which are set, toggle the requested bits and return that. Afterwards, we use a different function to insert the spaces in the proper places:

def toggleSpaces(s: String, bits: List[Int]): List[Int] = {
  val before = spacesToBinary(s)
  val beforeBits = Set() ++ 
                   (for(i <- 0 to 30; // Assuming 32 bits, 1 for sign
                        if (scala.Math.pow(2, i).toInt & before) != 0)
                    yield i)
  val afterBits = (beforeBits union Set(bits: _*)) diff 
                  (beforeBits intersect Set(bits : _*))
  afterBits.toList
}

def addSpaces(s: String, bits: List[Int]): String = (
  s.filter(_ != ' ').zipWithIndex 
  map (t => t._1.toString + (if (bits contains t._2) " " else ""))
  mkString
)

Now we have to compute the space difference, remove the spaces, toggle cases, and then add the spaces back. Let's see:

def changeCombination(s: String, n: Int, bits: List[Int]) = {
  // Split the binary number with all changes into case changes and space changes
  val spaceBits = bits filter (_ >= n) map (_ - n)
  val caseBits = bits filter (_ < n)

  // Compute the set of spaces after changing them
  val newSpaces = toggleSpaces(s, spaceBits)

  // Remove spaces and toggle case
  val noSpaces = s filter (_ != ' ')
  val caseChanged = toggleCases(noSpaces, caseBits).mkString

  // Now add the spaces as computed before
  val spacesAdded = addSpaces(caseChanged, newSpaces).mkString
  spacesAdded
}

Finally, we compute all combinations, and then produce a modified string for each one:

def makeCombinations(s: String, d: Int) = {
  val n = (s filter (_ != ' ')).length
  for(combination <- computeCombinations(n*2-1, d))
  yield changeCombination(s, n, combination)
}

This code has all been tested on Scala 2.8 (except for some renaming I just made). Here is the result of a run:

scala> makeCombinations("ai ds", 2) foreach println
AI ds
Ai Ds
Ai dS
A i ds
Aids
Ai d s
aI Ds
aI dS
a I ds
aIds
aI d s
ai DS
a i Ds
aiDs
ai D s
a i dS
aidS
ai d S
a ids
a i d s
aid s

回复收藏 0 原文

不疑不惑不回忆 2024-08-02 22:04:47

正如其他答案已经说过的，第 1 点有 2^(n-1) 种可能性。关于一些特殊情况（第 2 点）：

如果像 Ai sd 和 ai sd 是两组这样的情况产生差异怎么办？

好吧，在这种情况下，您有 2^n 种不同的情况组合，因此您有 2^n * 2^(n-1) = 2^(2 * n - 1) 种可能性。

如果把“空格”当作一个字符。列举完之后，你会写多少个字符？

这是一个更有趣的问题。您有 1 种可能不放置空格、3 种可能放置 1 个空格、3 种可能放置 2 个空格以及 1 种可能放置 3 个空格。如果我没记错的话，这是一个二项分布，并且有计算数字的公式。您还可以使用帕斯卡三角形来实现此目的：

   1
  1 1
 1 2 1
1 3 3 1 <- your case (4th row)
  ...

获得这些数字后，计算总字符数像这样：

1*4 + 3*5 + 3*6 + 1*7 = 44

As the other answers already said, there are 2^(n-1) possiblities for point 1. About some of the special cases (point 2):

What if the case made a difference like Ai sd and ai sd are two groups?

Well, in that case you had 2^n different case combinations, so you had 2^n * 2^(n-1) = 2^(2 * n - 1) possibilities at all.

If you take the "blank space" as a character. After all the enumeration, how many characters would you have written?

That's a more interesting question. You have 1 possibility to place no space, 3 possibilities to place 1 space, 3 possibilites to place 2 spaces and 1 possibility to place 3 spaces. This is a binomial distribution, if I recall correctly, and there are formulas to calculate the numbers. You can also use Pascal's triangle for this:

   1
  1 1
 1 2 1
1 3 3 1 <- your case (4th row)
  ...

After you got these numbers, calculate the total character count like this:

1*4 + 3*5 + 3*6 + 1*7 = 44

回复收藏 0 原文

在风中等你 2024-08-02 22:04:47

http://www-cs-faculty.stanford.edu/~ knuth/fasc3b.ps.gz（如果看不到 postscript，请下载 Ghostscript/Ghostview）详细讨论了分区。

对于长度为 n 的序列，有 2^(n-1) 个分区。在每对连续的项目之间考虑一点。如果设置了该位，则它们将被分隔（如您列出的那样，用空格分隔）。 “aids”（长度为 4）有 2^3 个可能的分区。

回答您的其他问题：

枚举时间： O(n*2^n) - 输出长度不变。不仅项目数量随着输入长度的增加而增加，而且每个项目中的字符数量也会增加。

写入的字符数：我们不要计算换行符（如果计算的话，请添加另外 2^(n-1) 个字符）。那么你就有了 n*2^(n-1) 个非空格字符，加上所有唯一的 n-1 位字符串中 1 的数量。 k 位比特串写出来有 k*2^k 位，其中一半是 1，所以字符总数为 [n+(n-1)/2]*2^(n-1) )，不计算换行符。在“aids”的 8 个变体列表中，有 32 个非空格字符和 12 个空格 - 分别为 4*2^3 和 (3/2)*2^3。

编辑距离：您必须更精确地了解转换及其成本。我认为“单词”指的是单个分区（8 个示例行之一）。如果编辑是删除或添加单个空格，那么您讨论的是 n-1 位字符串上的汉明距离。

回复收藏 0 原文

小苏打饼 2024-08-02 22:04:47

访问距离 k 或更小的每个单词的简单算法：使用哈希表仅访问每个位串一次（或 2^(n-1) 位的数组，但这可能太大），递归访问每个相邻单次编辑差异的数量（假设汉明距离：对于来自 1..(n-1) 的 i，与源位串进行 XOR 2^i，切换第 i 位）。

执行此操作的深度为 k（深度随递归一起传递），您将访问距离 k 内的所有编辑。当然，如果您只想要深度为 k 的那些，您将需要使用广度优先搜索：而不是立即访问每个邻居，而是将它们保留在要访问的队列中。在访问给定一代 (j) 的项目的队列时（所有项目都具有相同的最佳编辑距离），将未来的项目放入下一代 (j+1) 的不同队列中。这样，您可以使用尽可能少的编辑次数首先访问每个字符串（当每个转换具有相同的成本时，宽度优先 = 最佳优先）。

如果您不想进行广度优先搜索，那么您始终可以计算 k 或更少内的单词集以及 k-1 或更少内的单词集，并取差值（您将使用两个单独的表）。这实际上是“迭代深化搜索”。

BK 树在这里不合适，除非您正在考虑一组非结构化的单词（通用词典）。我们已经确切地知道单个单词的分区结构。

回复收藏 0 原文