Fast regex search
Is there a way to index 50-100 GB of text lines so that regex searches run fast, or at least faster than scanning line by line?
The regex pattern varies from query to query, so the index can't be built around a specific pattern.
Is it possible to achieve something like this with Lucene?
I know it might be possible with suffix trees but the index takes too much memory (much more than those 100GB).
Comments (1)
The main thing you have to do is identify the common search terms in advance, and then index based on that.
For instance, maybe you anticipate that there will be a lot of searches for lines starting with "Foo". Then you can run that search in advance and store a list of lines starting with "Foo". Then, if someone searches for lines starting with "Foobar", you've already got a narrowed-down subset of lines to search.
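A minimal Python sketch of that idea, under some assumptions: the lines fit in a list (for 50-100 GB you would store byte offsets into the file instead), and the caller passes the literal prefix that the regex is anchored to, since extracting it from an arbitrary pattern is not attempted here. The names `build_prefix_index`, `search`, and `literal_prefix` are made up for illustration.

```python
import re

def build_prefix_index(lines, prefixes):
    """Precompute, for each anticipated prefix, the line numbers starting with it."""
    index = {prefix: [] for prefix in prefixes}
    for lineno, line in enumerate(lines):
        for prefix in prefixes:
            if line.startswith(prefix):
                index[prefix].append(lineno)
    return index

def search(lines, index, pattern, literal_prefix=None):
    """Run the regex over a narrowed subset when an indexed prefix applies.

    literal_prefix is an assumption of this sketch: the fixed text that every
    match must start the line with (e.g. "Foobar" for r"^Foobar\\d+").
    Without it, every line is scanned.
    """
    regex = re.compile(pattern)
    candidates = range(len(lines))
    if literal_prefix is not None:
        # Any indexed prefix of literal_prefix is safe; the longest narrows most.
        usable = [p for p in index if literal_prefix.startswith(p)]
        if usable:
            candidates = index[max(usable, key=len)]
    return [n for n in candidates if regex.search(lines[n])]

lines = ["Foobar123", "Foo baz", "Bar Foo", "Foobar", "Quux"]
index = build_prefix_index(lines, ["Foo"])
hits = search(lines, index, r"^Foobar\d+", literal_prefix="Foobar")
```

Here the search for lines starting with "Foobar" only runs the regex over the three lines already known to start with "Foo".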
If you want to get really clever, you can programmatically analyze common searches to find recurring search components, and then index based on those common components.
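One rough way to do that analysis, sketched in Python. It assumes you keep a log of past regex patterns, and it uses a crude heuristic for "search component": runs of literal word characters left after escapes, character classes, and `{m,n}` quantifiers are stripped. The counts only suggest what might be worth indexing; deciding whether a given index can safely narrow a given query (alternation and optional parts complicate this) is a separate problem. The function names and `min_len` are hypothetical.

```python
import re
from collections import Counter

def literal_components(pattern, min_len=3):
    """Crudely extract literal word-character runs from a regex pattern."""
    stripped = re.sub(r"\\.", " ", pattern)           # drop escapes such as \d or \b
    stripped = re.sub(r"\[[^\]]*\]", " ", stripped)   # drop character classes
    stripped = re.sub(r"\{[^}]*\}", " ", stripped)    # drop {m,n} quantifiers
    return set(re.findall(r"[A-Za-z0-9_]{%d,}" % min_len, stripped))

def common_components(query_log, top_n=10):
    """Count how many logged queries mention each literal component."""
    counts = Counter()
    for pattern in query_log:
        counts.update(literal_components(pattern))
    return counts.most_common(top_n)

query_log = [r"^Foobar\d+", r"^Foo\s+error", r"error: [a-z]+", r"^Foobar$"]
top = common_components(query_log, top_n=2)
```

The most frequent components are then candidates for precomputed line lists like the "Foo" list described above.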