如何加速这个 Anagram 算法

发布于 2024-11-18 12:48:21 字数 1489 浏览 4 评论 0原文

我正在制作一个移动应用程序来查找字谜和部分匹配。移动设备很重要，因为它没有太多的计算能力，而效率是关键。

该算法采用任意数量的字母（包括重复字母），并找到由其字母组成的最长单词，每个字母仅使用一次。我也有兴趣快速找到顶部结果，只要满足 N，我并不真正关心底部（较短的结果）。例如：

STACK => stack, tacks, acts, cask, cast, cats…

我进行了一些谷歌搜索，发现了一些算法，我想出了一种我认为有效的算法，但没有我想要的那么有效。

我有一个预先制作的查找字典，它将排序的键映射到生成该键的真实单词。

"aelpp" => ["apple", "appel", "pepla"]

我根据密钥的长度进一步将每个字典分成不同的字典。因此，5 个字母长的键在一本字典中，6 个字母长的键在另一本字典中。每个字典都位于一个数组中，其中索引是在字典中找到的键的长度。

anagramArray[5] => dictionary5
dictionary5["aelpp"] => ["apple", "appel", "pepla"]

我的算法首先获取输入单词“lappe”，然后对其进行排序：

"lappe" => "aelpp"

现在，对于每本最多包含 5 个字母的字典，我会进行比较以将其取出。这是伪代码：

word = input.sort
for (i = word.length; i > 0; i--)
    dictionaryN = array[i]
    for (key in dictionaryN)
        if word matches key
            add to returnArray
        end
    end
    if returnArray count > N
      break
    end
end

returnArray.sort by longest word, alphabetize

字典中只有大约 170,000 个单词，但搜索 12 个字母输入最多需要 20 秒。我的 match 方法从密钥中生成一个正则表达式：

"ackst" => /a.*c.*k.*s.*t.*/

例如，4 个字母的密钥（如 acst (acts)）将匹配 ackst （堆栈）因为：

"ackst" matches /a.*c.*s.*t.*/

我已经看到其他应用程序在更短的时间内完成相同的事情，我想知道我的方法是否是垃圾或者只需要一些调整。

如何获得最大计算效率，从一个单词生成按最大长度排序的前 N 个字谜？

原文

I am making a mobile app to find anagrams and partial matches. Mobile is important because there is not a whole lot of computational power, and efficiency is key.

The algorithm takes any number of letters, including repeats, and finds the longest words made up from its letters, using every letter only once. I am also interested in finding the top results quickly, and am not really concerned with the bottoms (shorter ones) as long as N is met. For example:

STACK => stack, tacks, acts, cask, cast, cats…

I have done some googling and have found a few algorithms, and I came up with one which I thought would be efficient, but is not as efficient as I would like.

I have a lookup dictionary pre-made that maps a sorted key to the real words that generate that key.

"aelpp" => ["apple", "appel", "pepla"]

I have further split each dictionary into different ones based on the length of the key. So keys that are 5 letters long are in one dictionary, keys that are 6 are in another. Each of these dictionaries are in an array in which the index is the length of the keys that are found in the dictionary.

anagramArray[5] => dictionary5
dictionary5["aelpp"] => ["apple", "appel", "pepla"]

My algorithm starts by taking an input word "lappe", and it sorts it:

"lappe" => "aelpp"

Now, for each dictionary that has contents of at most 5 letters, I do a comparison to pull it out. Here is pseudocode:

word = input.sort
for (i = word.length; i > 0; i--)
    dictionaryN = array[i]
    for (key in dictionaryN)
        if word matches key
            add to returnArray
        end
    end
    if returnArray count > N
      break
    end
end

returnArray.sort by longest word, alphabetize

The dictionary only has about 170,000 words in it, but searches are taking up to 20 seconds for 12 letter inputs. My match method makes a regex out of the key:

"ackst" => /a.*c.*k.*s.*t.*/

such that, for example, a 4 letter key such as acst (acts), will match ackst (stack) because:

"ackst" matches /a.*c.*s.*t.*/

I have seen other apps do the same thing in much less time, and I am wondering if my approach is garbage or just needs some tweaking.

How can I get the maximum computational efficiency for generating the top N anagrams from a word, sorted by max length?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

弱骨蛰伏 2024-11-25 12:48:21

如果您将字典视为（甚至可能将其表示为）字母树，则可以避免查看大量节点。如果“stack”在字典中，那么将有一条从根到标记为 ackst 的叶子的路径。如果输入单词是“attacks”，则对其进行排序以获得 aackstt。您可以编写一个递归例程来从根开始跟踪链接，同时消耗来自 aackstt 的字母。当您到达 ack 时，字符串中将剩下 stt，因此您可以跟随 s 到达 ackst，但您可以排除跟随 u 到达 acku 及其后代，v 到达 ackv 及其后代，依此类推。

事实上，通过这种方案，您可以仅使用一棵树来保存任意数量字母的单词，这应该可以节省您进行多次搜索，每个目标长度一次搜索。

回复收藏 0 原文

泡沫很甜 2024-11-25 12:48:21

生成正则表达式有点昂贵，因此您可能不想在循环内执行此操作。

我想到的一个选项（不一定非常高效，但在这种特殊情况下似乎很有用）是，不要搜索字典中的所有单词，而是尝试删除各种组合中的字母并检查结果字符串是否在你的字典。这将在 2^n 次迭代时达到最大值（其中 n 是单词中的字母数），对于 n < 170k 来说，这比 170k 更好。 18. 请注意，这种特殊方法不适用于长输入，但否则应该非常快。

回复收藏 0 原文

狼亦尘 2024-11-25 12:48:21

按如下方式构建您的字典：

 For each word W in the English language (or whatever word set you have)

     Sort the characters in W by alphabetical order (e.g. "apple" -> "aelpp") into a new string called W'

     Compute Hash H into W' using any fast hash algorithm (e.g CRC32.  You could likely invent anything yourself that has a low number of collisions)

     Store W and H as an element in the dictionary array
     That is:
        Word.original = W;
        Word.hash = Hash(W');
        Dictionary.append(Word);

  Sort the dictionary by hash values.

现在找到所有字谜词或搜索词 S

  Sort the characters in S by alphabetical order (e.g. "apple" -> "aelpp") into a new string called S'

  Compute Hash H of S' using the same fast hash algorithm above

  Now do a binary search on the dictionary for H.  The binary search should return an index F into Dictionary

  If the binary search fails to return an index into the Dictionary array, exit and return nothing

  I = F

  // Scan forward in the dictionary array looking for matches
  // a matching hash value is not a guarantee of an anagram match
  while (I < Dictionary.size) && (Dictionary[I].hash == H)
       if (IsAnagram(Dictonary[I], S)
           ResultSet.append(Dictionary[I].original)

  // Scan backwards in the dictionary array looking for matches
  I = F-1;
  while (I >= 0) && (Dictionary[I].hash == H)
       if (IsAnagram(Dictonary[I], S)
           ResultSet.append(Dictionary[I].original)


  return ResultSet

现在我没有介绍如何处理“子字符串”搜索（搜索长度小于搜索词的字谜词。我有点困惑如果这是一个要求，则您的指示意味着生成的字谜集应具有与搜索词完全相同的字符集，但是您可以枚举搜索字符串的所有子字符串并通过搜索算法运行每个子字符串。如上所述。

Build your dictionary as follows:

 For each word W in the English language (or whatever word set you have)

     Sort the characters in W by alphabetical order (e.g. "apple" -> "aelpp") into a new string called W'

     Compute Hash H into W' using any fast hash algorithm (e.g CRC32.  You could likely invent anything yourself that has a low number of collisions)

     Store W and H as an element in the dictionary array
     That is:
        Word.original = W;
        Word.hash = Hash(W');
        Dictionary.append(Word);

  Sort the dictionary by hash values.

And now to find all the Anagrams or search word S

  Sort the characters in S by alphabetical order (e.g. "apple" -> "aelpp") into a new string called S'

  Compute Hash H of S' using the same fast hash algorithm above

  Now do a binary search on the dictionary for H.  The binary search should return an index F into Dictionary

  If the binary search fails to return an index into the Dictionary array, exit and return nothing

  I = F

  // Scan forward in the dictionary array looking for matches
  // a matching hash value is not a guarantee of an anagram match
  while (I < Dictionary.size) && (Dictionary[I].hash == H)
       if (IsAnagram(Dictonary[I], S)
           ResultSet.append(Dictionary[I].original)

  // Scan backwards in the dictionary array looking for matches
  I = F-1;
  while (I >= 0) && (Dictionary[I].hash == H)
       if (IsAnagram(Dictonary[I], S)
           ResultSet.append(Dictionary[I].original)


  return ResultSet

Now I didn't cover how to handle "substring" searches (searching for anagram words that are smaller in length that the search word. I was a bit confused if this was a requirement. Your instructions imply that that resuling set of anagrams should have the exact same set of characters as the search word. But you can likely enumerate through all the substrings of your search string and run each substring through the Search algorithm described above.

回复收藏 0 原文

水溶 2024-11-25 12:48:21

这只是一个想法，但也许这就是您正在寻找的。您只有一种可迭代的结构，所有大小的单词都在其中。在每个迭代步骤中，您都会再引入一个字母，并且您将搜索范围缩小到不具有比已引入的字母“更大”的字母的单词。例如，如果您引入 M，您就不能再引入 NZ 范围内的任何内容。

该结构可以是一棵二叉树，其中引入一个字母将使您进一步进入几个树级别。每个节点都有一个字母，分支到其余较小的字母，分支到其余较大的字母，分支到下一个缩小搜索的根，以及一个指向完全由字母构建的单词列表的指针到目前为止已经介绍过了。如果该搜索子空间中没有可能的单词，则分支可能为 null，但不能同时为 3 个分支设置为 null，并且为指向单词列表的指针设置为 null。（好吧，你可以，作为一种优化，这现在无关紧要）。您可以使用一个标志来表示具有给定字母的单词的存在，而不是指向单词列表的指针，但这些单词可以存储在其他字典中。

假设我们有字母 ACKST。从结构的根开始，您可以循环搜索所有这些字母，但在 K 之后，您只能继续搜索 A 和 C（因为 S 和 T 位于 K 之上）。因为我们对最大的单词最感兴趣，所以我们应该从最大的字母（在本例中为 T）开始搜索，并继续搜索下一个最大的字母。对于单词 CAT，我们只能以特定的顺序搜索字母 T、C、A。一旦我们到达 A，就会有一个指向以下单词列表的指针：ACT、CAT。

回复收藏 0 原文

一个人练习一个人 2024-11-25 12:48:21

检查 2 个字符串是否是字谜的 O(N) 时间和 O(1) 解决方案

bool Anagram( const  char *s1, const char *s2)
{
    unsigned int sum=0;

    if ( s1 == NULL || s2 == NULL)
        return false;

    while ( *s1 != '\0' && s2 != '\0')
    {
                   sum ^= *s1;
                   sum ^= *s2;
                   s1++;
                   s2++;
    }

    if ( s1 != '\0' || s2 != '\0')
        return false;

    if (sum) return false;

    return true;
}

如果对两个相等的数字进行异或..结果是 0。（因此该算法）

O(N) Time and O(1) solution to check if 2 strings are anagrams

bool Anagram( const  char *s1, const char *s2)
{
    unsigned int sum=0;

    if ( s1 == NULL || s2 == NULL)
        return false;

    while ( *s1 != '\0' && s2 != '\0')
    {
                   sum ^= *s1;
                   sum ^= *s2;
                   s1++;
                   s2++;
    }

    if ( s1 != '\0' || s2 != '\0')
        return false;

    if (sum) return false;

    return true;
}

If you xor two equal numbers..your result is 0. (hence the algorithm)

回复收藏 0 原文

~没有更多了~