使用通配符(GLOB)支持搜索数百万个文件名的更好方法是什么
我正在开发一个小型搜索引擎来显示带有完整路径的匹配文件名。重要的是我需要提供通配符(GLOB)搜索,例如 *.doc
或 *list*.xlx
或 *timesheet*
或???.doc
或类似的东西。
我找到了一些相关的解决方案
但我正在寻找有效的算法,可以在不到一秒的时间内找到百万个文件名中的匹配项,因此比 O(n) 更好必需..
我正在考虑两阶段算法,在第一阶段进行子字符串数组(后缀数组+前缀数组)搜索,并通过第一阶段第二阶段的结果进行正常的正则表达式搜索。
任何帮助将不胜感激...
i am working on a small search engine to display a matching file names with full path. and important thing is that i need to provide wildcard(GLOB) search like *.doc
or *list*.xlx
or *timesheet*
or ???.doc
or something like that.
i found some related solution
Search for strings matching the pattern "abc:*:xyz" in less than O(n)
but i am looking for efficient algorithms which can find matches out of million file names in a less than a second, so better than O(n) is required..
i am thinking of two phase algorithm with substring array (Suffix array + prefix array) search in first phase and normal RegEx search thru the results of first phase second phase.
any help would be greatly appreciated...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
据我所知,对于广义全局搜索来说,没有比 O(n) 更好的方法了。
然而,对于前缀和后缀搜索的特殊情况,您可以自己创建排序索引来执行二分搜索,从而导致前缀和后缀搜索的 O(log n) 。前缀索引将根据第一个字符排序,然后是第二个字符,依此类推。后缀索引将根据最后一个字符排序,然后是倒数第二个字符,依此类推(反向字符串)。
我会按照您在帖子中建议的那样进行搜索,分两个阶段进行搜索,搜索前缀或后缀索引,然后使用由 glob 生成的正则表达式对第一阶段提供的简化列表进行强力搜索。
由于字符串长度比较比正则表达式更快,因此我还会预过滤 ???.doc 示例的最小匹配字符串长度或固定长度匹配字符串。
从原始帖子的声音来看,索引还需要引用每个条目的完整路径,以便您可以在找到最终结果后显示它。
As far as I know there is no way to do better than O(n) for generalized glob searching.
However for the special cases of prefix and suffix search you can make yourself sorted indexes to do a binary search on, resulting in O(log n) for prefix and suffix search. The prefix index would be sorted based on the first character, then the second, and so on. The suffix index would be sorted based on the last character, then the second last, and so on (the reverse strings).
I would do as you suggest in your post and do the search in two phases, search the prefix or suffix indexes, followed by a brute force search through the reduced list provided from the first phase using a regex made from the glob.
Since string length comparisons are faster than regexes, I would also pre-filter for the minimum matching string length, or fixed length matching string for the ???.doc example.
From the sound of your original post the indexes would need to refer to the full path of each entry as well, so that you can display it after the final results are found.
查看自索引:此堆栈溢出问题,以及这篇 DrDobbs 文章。
Check out self indexing: This Stack Overflow question, and this DrDobbs article on it.