Searching an array of strings with "." wildcards
I have an array with many strings and want to search it for a pattern. The pattern can contain "." wildcards, each of which matches exactly one (arbitrary) character.
For example:
myset = {"bar", "foo", "cya", "test"}
find(myset, "f.o") -> returns true (matches with "foo")
find(myset, "foo.") -> returns false
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")
I tried to find a way to implement this algorithm fast, because myset is actually a very big array, but none of my ideas has satisfactory complexity (for example O(size_of(myset) * length(pattern))).
Edit:
myset is a huge array, but the words in it aren't big. I can do slow preprocessing, but I'll have many find() queries, so I want find() to be as fast as possible.
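For reference, the brute-force baseline (the O(size_of(myset) * length(pattern)) approach mentioned above) can be sketched in Python like this; the names match the question's pseudocode:

```python
def matches(word, pattern):
    """True if word matches pattern, where '.' matches any single character."""
    if len(word) != len(pattern):
        return False
    return all(p == '.' or p == c for p, c in zip(pattern, word))

def find(myset, pattern):
    """Scan every word: O(len(myset) * len(pattern)) in the worst case."""
    return any(matches(word, pattern) for word in myset)

myset = {"bar", "foo", "cya", "test"}
print(find(myset, "f.o"))   # True (matches "foo")
print(find(myset, "foo."))  # False
print(find(myset, ".e.t"))  # True (matches "test")
```

The answers below all try to beat this linear scan by doing work up front during preprocessing.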
You could build a suffix tree of the corpus of all possible words in your set (see this link).
Using this data structure, your complexity would include a one-time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.
Once the tree is built, finding whether a string matches should take just O(n), where n is the length of the string.
If the set is fixed, you could pre-calculate the frequency of a character c appearing at position p (for as many values of p as you consider worthwhile), then search through the array once, testing each element's characters at specific positions in an order such that you are most likely to exit early.
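A sketch of this idea, assuming a fixed set: count how often each character occurs at each position, then check a pattern's literal positions rarest-first, so a mismatch is usually detected after very few comparisons (the helper names here are illustrative, not from the answer):

```python
from collections import Counter

def build_freqs(words):
    """freqs[p] counts how often each character appears at position p."""
    freqs = {}
    for w in words:
        for p, c in enumerate(w):
            freqs.setdefault(p, Counter())[c] += 1
    return freqs

def find(words, pattern, freqs):
    # Literal (non-'.') positions, rarest character-at-position first,
    # so a non-matching word is rejected as early as possible.
    probes = sorted(
        (p for p, c in enumerate(pattern) if c != '.'),
        key=lambda p: freqs.get(p, Counter())[pattern[p]],
    )
    for w in words:
        if len(w) == len(pattern) and all(w[p] == pattern[p] for p in probes):
            return True
    return False

myset = {"bar", "foo", "cya", "test"}
freqs = build_freqs(myset)
print(find(myset, "f.o", freqs))   # True
print(find(myset, "foo.", freqs))  # False
```

Note this is still a linear scan over the array; the frequency ordering only shortens the per-word comparison.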
First, divide the corpus into sets by word length. Then your find algorithm can search over just the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.
Next (for each set), create a hash map from a hash of (character, position) to a list of matching words. It is quite OK to have a large number of hash collisions. You can use delta and run-length encoding to reduce the size of each list of matching words.
To search, pick the appropriate hash map for the length of the find input; for each non-"." character, compute the hash for that (character, position) pair and AND together the corresponding lists of words to get a much reduced list. Then brute-force search through that much smaller list.
If you are sure that the length of the words in your set is not large, you could create a table that holds the following:
List of words whose first character is 'a', list of words whose first character is 'b', ...
List of words whose second character is 'a', list of words whose second character is 'b', ...
and so on.
When you are searching for the word, look up the list of words whose first character is the same as the search string's first character. With this refined list, look for the words whose second character is the same as the search string's second character, and so on. You can ignore '.' characters whenever you encounter them.
I understand that building the table may take a large amount of space, but the lookup time will come down significantly.
For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o', the moment you check the list of words starting with 'f', you eliminate the rest of the set. Just an idea... hope it helps.
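The table described above can be sketched as nested dicts, with the search narrowing the candidate set one position at a time and skipping '.' positions (a simplified sketch, not the answerer's code):

```python
from collections import defaultdict

def build_table(words):
    """table[position][char] -> set of words having `char` at `position`."""
    table = defaultdict(lambda: defaultdict(set))
    for w in words:
        for p, c in enumerate(w):
            table[p][c].add(w)
    return table

def find(table, words, pattern):
    # Start from all words of the right length and refine position by position.
    candidates = {w for w in words if len(w) == len(pattern)}
    for p, c in enumerate(pattern):
        if c == '.':
            continue  # wildcard: no refinement at this position
        candidates &= table[p].get(c, set())
        if not candidates:
            return False  # e.g. 'f.o' eliminates everything not starting with 'f'
    return bool(candidates)

myset = {"bar", "foo", "cya", "test"}
table = build_table(myset)
print(find(table, myset, "f.o"))  # True
```

The first refinement usually does most of the work, matching the 'f.o' example in the answer.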
I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.
See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcards, and then collecting results back into a common set. (See findNodes().) getMatchingWords() is similar in spirit, except that it returns the set of matching words instead of just a boolean as to whether or not the query matches anything.
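The Trie approach described can be sketched as follows. This is a hypothetical minimal version, not the answerer's actual code; find_nodes and is_word here only mirror the names mentioned above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a word ends at this node

def insert(root, word):
    node = root
    for c in word:
        node = node.children.setdefault(c, TrieNode())
    node.is_word = True

def find_nodes(nodes, pattern):
    """Advance a frontier of nodes through the pattern, branching on '.'."""
    for c in pattern:
        if c == '.':
            nodes = [child for n in nodes for child in n.children.values()]
        else:
            nodes = [n.children[c] for n in nodes if c in n.children]
        if not nodes:
            break
    return nodes

def is_word(root, pattern):
    # The pattern matches if any surviving node ends a word.
    return any(n.is_word for n in find_nodes([root], pattern))

root = TrieNode()
for w in {"bar", "foo", "cya", "test"}:
    insert(root, w)
print(is_word(root, ".e.t"))  # True (matches "test")
print(is_word(root, "foo."))  # False
```

Because the frontier only branches at '.' characters, literal prefixes shared by many words are walked once, which is where the Trie wins over a linear scan.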