Best data structure for crossword puzzle search

Posted 2024-08-22 00:46:59

I have a large database for solving crossword puzzles, consisting of words and their descriptions. My application allows searching for words of a specific length with given characters at specific positions (this is currently done the hard way: iterate over all words and check each one), plus a search by description if necessary.

For instance, find the word _ _ A _ _ B (a 6-letter word with A as the third character and B as the last).

I would like to index the words in such a way that searching is really fast. My first idea was to use a balanced tree structure. Any other suggestions?

Answers (5)

默嘫て 2024-08-29 00:46:59

Okay, I am going to propose something a bit unusual: coming from C++, I have been using Boost for a long time, and I have come to appreciate its MultiIndex library.

The idea of this library is to create one collection but provide many different ways to query it; it could, in fact, model a database.

So, let's put our words in a table, and put the necessary indexes in place:

word                     |length|c0|c1|c2| ... |c26|
-------------------------|------|--|--|--| ... |---|
Singapour                |9     |S |i |n | ... |0  |

Now the query will look like:

Select word From table Where length=9 And c2='n' And c8='u';

Easy enough, isn't it?

For maximum efficiency, the table should be partitioned on the length, and the indexes (one per cX column) should be local to the partition.

For an in-memory solution you would have one container per length, holding as many indexes as there are positions, each index being a hash table pointing to a sorted list (for easier merging).

Here is a Python sketch:

import collections
import functools

class Dictionary:
  def __init__(self, length):
    self.length = length
    self.words = set()
    # Maps (position, character) -> set of words with that character there.
    self.indexes = collections.defaultdict(set)

  def add(self, word):
    if len(word) != self.length:
      raise ValueError(word + ' is not ' + str(self.length) + ' characters long')

    if word in self.words:
      raise ValueError(word + ' is already in the dictionary')

    self.words.add(word)

    for i in range(self.length):
      self.indexes[(i, word[i])].add(word)

  def search(self, constraints):
    """constraints: list of tuples (position, character)"""
    sets = [self.indexes[elem] for elem in constraints]
    # Intersect the smallest sets first to keep intermediate results small.
    sets.sort(key=len)
    return functools.reduce(set.intersection, sets)

I have deliberately made length a constructor argument, to minimize the size of the hashes and thus speed up the search. Also, the sets are sorted by size so that the intersection is computed more efficiently :)
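
For instance, a quick usage run of the sketch above, indexing a few 6-letter words and answering the _ _ A _ _ B query from the question (the sample words are my own):

d = Dictionary(6)
for w in ('SCARAB', 'ABACUS', 'SUPERB'):
  d.add(w)

# Pattern _ _ A _ _ B: 'A' at position 2 and 'B' at position 5 (0-based).
print(d.search([(2, 'A'), (5, 'B')]))  # {'SCARAB'}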

Go ahead and test it against other solutions if you wish :)

风流物 2024-08-29 00:46:59

This question: Good algorithm and data structure for looking up words with missing letters? started out exactly like the one you're asking, but was then edited into something rather different and easier. Still, you can find some ideas there.

In short, everyone recommends loading the whole dictionary into memory and dividing the words into groups based on their length. From there, you can go many different directions. The more memory you are willing to use up, the faster you can go.

One nice suggestion is to keep a hash table of lists of words of a given length that have a given letter in a given position. You can build it like this (in Python):

import collections

# Build a whole lot of sorted word lists, keyed by (length, position, letter).
wordlists = collections.defaultdict(list)
for word in sorted(all_words):
    for position, letter in enumerate(word):
        wordlists[len(word), position, letter].append(word)

Now if you need a 6-letter word ending in B, you can just ask for wordlists[6, 5, 'B'] and you've got the complete list. When you know more than one letter, as in ..A..B, you can pick whichever list is shortest and test each word against the desired pattern. My computer's dictionary only has 21 six-letter words ending with B, of which only SCARAB matches.
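
A minimal sketch of that lookup, assuming the wordlists table built above; the '.'-for-unknown pattern format and the function name are my own convention:

def find(wordlists, pattern):
    """pattern: a string such as '..A..B' with '.' for unknown letters;
    at least one letter must be known."""
    keys = [(len(pattern), i, c) for i, c in enumerate(pattern) if c != '.']
    # Start from the shortest candidate list, then filter against the pattern.
    candidates = min((wordlists[k] for k in keys), key=len)
    return [w for w in candidates if all(w[i] == c for _, i, c in keys)]

# find(wordlists, '..A..B') would return e.g. ['SCARAB']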

单身狗的梦 2024-08-29 00:46:59

Since you use a database, create a Suffixes table.
For example:

  Suffix          |   WordID   | SN
  ----------------+------------+----   
  StackOverflow           10      1
  tackOverflow            10      2
  ackOverflow             10      3
  ckOverflow              10      4
  kOverflow               10      5
  ...

With that table it's easy to get all words that contain a particular char in a specific position,
like this:

SELECT WordID FROM suffixes
WHERE suffix >= 't' AND suffix < 'u' AND SN = 2

This gets all words that have 't' at position 2.

Update: if you want to save space, and sacrifice a bit of speed, you can use a suffix array.

You can store all the words in one long array with a separator between them, e.g. $, and build a suffix array over it, holding pointers into the characters. Now, given a char c you can find all instances of words containing it rather fast. You still have to check whether it is in the right position (by checking how far it is from the surrounding $s).

With the above technique, the search will probably be around 10x faster than scanning all the words as in your original program.

Update 2: I had used the database approach in one of my utilities where I needed to locate suffixes such as "ne", and I forgot to adjust (optimize) it for this specific problem.

You can just store a single char as a suffix:

  Suffix   |   WordID   | SN
  ---------+------------+----   
  S                10      1
  t                10      2
  a                10      3
  c                10      4
  k                10      5
  ...

which saves a lot of space. Now, the query becomes

SELECT WordID FROM suffixes
WHERE suffix = 't' AND SN = 2
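
A minimal sketch of that single-character scheme using Python's built-in sqlite3; the table and column names follow the answer, while the sample data and index name are my own:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE suffixes (Suffix TEXT, WordID INTEGER, SN INTEGER)')
conn.execute('CREATE INDEX suffix_sn ON suffixes (Suffix, SN)')

words = {10: 'StackOverflow'}  # WordID -> word
for word_id, word in words.items():
    for sn, ch in enumerate(word, start=1):  # SN is 1-based, as in the table above
        conn.execute('INSERT INTO suffixes VALUES (?, ?, ?)', (ch, word_id, sn))

# All words with 't' at position 2:
print(conn.execute('SELECT WordID FROM suffixes WHERE Suffix = ? AND SN = ?',
                   ('t', 2)).fetchall())  # [(10,)]
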
妥活 2024-08-29 00:46:59

You can use a Suffix Tree, or a Trie.
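
For illustration, a minimal trie with wildcard search ('.' marks an unknown letter; the names here are my own, not any particular library's API):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on the node where a word ends

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(node, pattern, i=0):
    """Yield words matching pattern; '.' matches any letter."""
    if i == len(pattern):
        if node.word is not None:
            yield node.word
        return
    if pattern[i] == '.':
        for child in node.children.values():
            yield from search(child, pattern, i + 1)
    elif pattern[i] in node.children:
        yield from search(node.children[pattern[i]], pattern, i + 1)

root = TrieNode()
for w in ('SCARAB', 'ABACUS', 'SUPERB'):
    insert(root, w)
print(list(search(root, '..A..B')))  # ['SCARAB']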

阳光下慵懒的猫 2024-08-29 00:46:59

You could store your information in a trie of some sort (perhaps a ternary search tree). An algorithm for partial-match search using a trie is described in section 6 of this paper by Sedgewick and Bentley. You of course want separate tries for the various word lengths. The paper says the partial-match algorithm requires O(n^((k-s)/k)) time when s letters are specified in a trie of n words of length k.
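
A rough sketch of that partial-match search on a ternary search tree ('.' marks an unspecified letter; this toy code is my own, not the paper's):

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.eq = self.right = None
        self.word = None  # set on the node where a word ends

def insert(node, word, i=0):
    if node is None:
        node = TSTNode(word[i])
    if word[i] < node.ch:
        node.left = insert(node.left, word, i)
    elif word[i] > node.ch:
        node.right = insert(node.right, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.word = word
    return node

def partial_match(node, pattern, i=0):
    """Yield words matching pattern; '.' matches any letter."""
    if node is None:
        return
    ch = pattern[i]
    if ch == '.' or ch < node.ch:
        yield from partial_match(node.left, pattern, i)
    if ch == '.' or ch == node.ch:
        if i + 1 < len(pattern):
            yield from partial_match(node.eq, pattern, i + 1)
        elif node.word is not None:
            yield node.word
    if ch == '.' or ch > node.ch:
        yield from partial_match(node.right, pattern, i)

root = None
for w in ('SCARAB', 'ABACUS', 'SUPERB'):
    root = insert(root, w)
print(list(partial_match(root, '..A..B')))  # ['SCARAB']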
