Algorithm to find the longest anagram

Posted on 2024-08-06 10:20:17

Let's say that we have a dictionary of about 250,000 words. The algorithm should take in 12 letters as an array or a string and find the arrangement of those letters (using some or all of them) that matches the longest word in the dictionary.

Of course, one can always brute-force it, but I wonder what would be the most elegant way to do this?

Answers using languages other than PHP will also be accepted if they do not use any language-specific functions as a shortcut for the main problem.

Note: The words are stored in a database, but I could pull them into memory for speed. That said, I'm not sure whether indexing in PHP would be any better than that of a MySQL database.

Comments (5)

长途伴 2024-08-13 10:20:17

You should calculate the signature of every word. You do this only once and save it in your database along with the word.

The table should be something like this:

   word varchar(12), 
   a int,
   b int, 
   c int,
    ...
   y int,
   z int;

The fields a to z hold the number of times each letter occurs in the word.
For example, anagram would have a record like:

word,    a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
anagram, 3,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0

Once you have the twelve letters, calculate the signature of that set and use it
to build a select like this:

select word, length(word) as wordlen
from dictionary
where
a <= 4 and
b <= 0 and
c <= 1 and
d <= 2 and
e <= 0 and
f <= 0 and
 ....
z <= 0
order by wordlen desc;

This returns all the words that can be built from the letter set you have, longest first.

No permutations, no combinations, and the tough work (compiling the dictionary) is done
only once and offline.

One more hint: strip from the database all the words that are longer than twelve characters.
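The signature table and the generated select can be tried end to end with an in-memory SQLite database. This is only a sketch under assumptions that are not in the answer: the count columns are named `c_a` to `c_z` (a hypothetical naming, chosen to keep column names unambiguous), the helper names `build_dictionary` and `longest_words` are made up for illustration, and only lowercase a-z words are stored.

```python
import sqlite3
import string
from collections import Counter

LETTERS = string.ascii_lowercase

def build_dictionary(conn, words):
    """Offline step: store each word with one letter-count column per letter."""
    cols = ", ".join(f"c_{ch} INTEGER NOT NULL" for ch in LETTERS)
    conn.execute(f"CREATE TABLE dictionary (word TEXT PRIMARY KEY, {cols})")
    placeholders = ", ".join("?" for _ in range(len(LETTERS) + 1))
    for word in words:
        # Assumption: skip anything that is not plain lowercase a-z.
        if not all(ch in LETTERS for ch in word):
            continue
        counts = Counter(word)
        row = [word] + [counts.get(ch, 0) for ch in LETTERS]
        conn.execute(f"INSERT INTO dictionary VALUES ({placeholders})", row)

def longest_words(conn, letters):
    """Return every stored word buildable from `letters`, longest first."""
    available = Counter(letters)
    # One "count <= available" condition per letter, as in the select above.
    where = " AND ".join(f"c_{ch} <= ?" for ch in LETTERS)
    params = [available.get(ch, 0) for ch in LETTERS]
    sql = (f"SELECT word FROM dictionary WHERE {where} "
           "ORDER BY length(word) DESC, word")
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
build_dictionary(conn, ["anagram", "nag", "ram", "banana", "grammar"])
print(longest_words(conn, "aaagmnrxyzqq"))
```

Here "banana" is rejected because no b is available, and "grammar" because it needs three r's while only one is on hand.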

岁吢 2024-08-13 10:20:17

I'd go with a slightly modified version of the answer to the anagram question here.

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor".

Start with your complete input, alphabetically sorted. If it's not found, remove one letter and search again. Do this for every letter. Then remove two letters, and so on.

Worst case: no "anagram" is found at all. You will have to test every possible subset of the input, which gives around 2^n lookups, where n is the number of input characters (12 in your example, so about 4096 lookups).
However, the run-time speed of the lookups does not depend on the size of the dictionary (the one-time preprocessing that sorts each word's letters does, of course), which in my opinion is the most important thing here.
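The search described above might look like the following sketch. It assumes the sorted-letter keys live in an in-memory hash map rather than in the database, and the names `build_index` and `longest_match` are hypothetical.

```python
from itertools import combinations

def build_index(words):
    """One-time step: map each sorted-letter key to its words ('abfoor' -> ['foobar'])."""
    index = {}
    for word in words:
        index.setdefault("".join(sorted(word)), []).append(word)
    return index

def longest_match(index, letters):
    """Try the full input first, then every way of dropping 1, 2, ... letters."""
    pool = sorted(letters)
    for size in range(len(pool), 0, -1):
        seen = set()  # repeated input letters produce duplicate keys; skip them
        for combo in combinations(pool, size):
            key = "".join(combo)  # already sorted because `pool` is sorted
            if key in seen:
                continue
            seen.add(key)
            if key in index:
                return index[key]
    return []

index = build_index(["foobar", "boar", "bar", "of"])
print(longest_match(index, "rabofoxz"))
```

Because sizes are tried from largest to smallest, the first hit is guaranteed to be a longest match, and the loop can stop there.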

余罪 2024-08-13 10:20:17

Eric Lippert has written an informative blog post about anagram searching. The examples all use C#, but the techniques are usable in any language.

The trick to efficiently searching for anagrams in a dictionary is to realize that all anagrams have the same letters, just in a different order. If you "canonicalize" each word so that its letters are uppercase and in alphabetical order, then checking whether one word is an anagram of another is as simple as comparing their canonical forms.

With this technique, you can easily look up anagrams from a hash table or balanced tree.
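As a rough illustration of canonical forms (not code from the blog post, which uses C#), the sketch below uppercases and sorts each word and groups words by the result. The function names are made up for this example.

```python
def canonical(word):
    """Uppercase the word and sort its letters: 'Listen' -> 'EILNST'."""
    return "".join(sorted(word.upper()))

def are_anagrams(a, b):
    """Two words are anagrams exactly when their canonical forms match."""
    return canonical(a) == canonical(b)

def group_by_canonical(words):
    """Hash table from canonical form to all words sharing it."""
    groups = {}
    for word in words:
        groups.setdefault(canonical(word), []).append(word)
    return groups

groups = group_by_canonical(["listen", "silent", "enlist", "google"])
print(groups[canonical("tinsel")])
```

A plain dict plays the role of the hash table here; a sorted list with binary search would stand in for the balanced tree.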

情魔剑神 2024-08-13 10:20:17

If you are trying to find the longest matching word, I would start by sorting the dictionary by word length, so you can focus the most effort on the longest words.
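One way to read this suggestion is a linear scan over the length-sorted dictionary that stops at the first word the input letters can build. The sketch below assumes that reading; `fits` and `longest_by_scan` are hypothetical names, and in practice the sort would be done once up front rather than per query.

```python
from collections import Counter

def fits(word, available):
    """True if `available` holds at least as many of each letter as `word` needs."""
    needed = Counter(word)
    return all(available[ch] >= n for ch, n in needed.items())

def longest_by_scan(words, letters):
    """Scan words from longest to shortest and return the first that fits."""
    available = Counter(letters)
    by_length = sorted(words, key=len, reverse=True)
    for word in by_length:
        if len(word) <= len(letters) and fits(word, available):
            return word  # first hit is a longest match
    return None

print(longest_by_scan(["banana", "anagram", "nag", "ram"], "aaagmnrxyzqq"))
```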

度的依靠╰つ 2024-08-13 10:20:17

My idea:

pseudocode:

int_32 letter_mask
int_32 permutation_match_mask
if(((letter_mask XOR permutation_match_mask) AND letter_mask)  == 0)
        YOU_HAVE_HIT;

This works when there are no repeated letters in letter_mask. If letters repeat (as they probably do), you can extend letter_mask and permutation_match_mask to cover the extra occurrences.
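A small sketch of the mask test, under the assumption that letter_mask holds one bit per distinct letter of the candidate word and permutation_match_mask holds one bit per distinct input letter (the pseudocode does not say which is which). Instead of widening the masks for repeated letters, this version uses the mask as a cheap prefilter and falls back to letter counts, which is a substitution on my part rather than the answer's method.

```python
from collections import Counter

def letter_mask(text):
    """Set bit 0 for 'a', bit 1 for 'b', ... (lowercase a-z assumed)."""
    mask = 0
    for ch in text:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask

def may_fit(word_mask, available_mask):
    """Same test as the pseudocode: no letter of the word is missing from the input."""
    return ((word_mask ^ available_mask) & word_mask) == 0

def fits_exactly(word, letters):
    """Mask prefilter first, then a count check to handle repeated letters."""
    if not may_fit(letter_mask(word), letter_mask(letters)):
        return False
    available = Counter(letters)
    return all(available[ch] >= n for ch, n in Counter(word).items())

print(may_fit(letter_mask("grammar"), letter_mask("anagrm")))  # mask alone says yes
print(fits_exactly("grammar", "anagrm"))                       # counts say no
```

The mask check costs a couple of integer operations per word, so it can discard most of the dictionary before any counting is done.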

EDIT

Another idea

Sort the words in the vocabulary in alphabetical order.

If there are 12 letters and all of them are different, there are exactly 4095 possible combinations (the sum over i = 1..12 of binomial(12, i)). For example, the letters ABCD give 15 combinations: ABCD, ABC, ABD, ACD, BCD, AB, AC, AD, BC, BD, CD, A, B, C, D. As I said, 12 different letters give 4095, and even fewer if some of the letters are the same.

The complexity is 4095 * log2(250000), which is roughly 73,000 comparisons. That is well worth trying.

Do an exact search for each combination.
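The counts above are easy to check. The snippet below only reproduces the arithmetic; it assumes one binary search of about log2(250000) comparisons per combination.

```python
import math

n = 12
# Non-empty subsets of 12 distinct letters: sum of binomial(12, i) for i = 1..12.
subsets = sum(math.comb(n, i) for i in range(1, n + 1))

# Comparisons for one binary search over a sorted list of 250,000 words.
steps_per_lookup = math.log2(250_000)

estimate = subsets * steps_per_lookup
print(subsets, round(steps_per_lookup, 2), round(estimate, -3))
```

The subset count equals 2^12 - 1, and the total comes out at about 73,000 comparisons.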
