如何有效地搜索字符串
我有一篇大约 300 - 500 字的文本。我还得到了大约 20 万个关键字,我想知道每个关键字是否都包含在文本中。字符串包含它非常慢,有什么方法可以预处理字符串吗?
我考虑过使用后缀树,但我不确定这是最好的选择。
另外,有什么好的库可以完成这项任务吗?例如,semanticdiscoverytoolkit 有一个后缀树实现,但在添加字符串后,我无法弄清楚如何查找字符串是否包含在树中。
问候,
尼科
I have a text with about 300 - 500 words. Also i got about 200k keywords and i want to know if each of the keywords is contained in the text. A String contains ist quite slow, is there some way to preprocess the String?
I thought about using a SuffixTree but im not sure this is the best choice.
Also, are there any good librarys for this task? semanticdiscoverytoolkit for example has a suffixtree implementation but after adding the string i cant figure out how to look up if a string is contained in the tree.
Greetings,
Nico
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以尝试 rabin-karp 字符串搜索算法。由于您主要进行哈希(整数)比较,因此性能比字符串比较要好得多。
打个比方,滚动哈希就像一个沿着文本滚动的“滑动窗口”。哈希比较是使用“滑动窗口”中子字符串的哈希与关键字的哈希来完成的。
you can try the rabin-karp string search algorithm. since you are doing mostly hash (integer) comparisons, the performance is much better than string comparisons.
as a analogy, the rolling hash is like a "sliding window" that scrolls along the text. the hash comparison is done using the hash of the substring in the "sliding window" against the hash of the keyword.
您可以使用 StringTokenizer 获取每个单词,然后填充您随后检查的哈希图。这需要仅浏览每个列表一次。查找时间应该非常快,考虑到您拥有的关键字数量,这一点很重要。
针对 Lucene 等工具来分析此方法可能是值得的。
You can use StringTokenizer to get each of the words then populate a hashmap which you check afterwards. This requires going through each list only once. Lookup times should then be very fast which is important given the amount of keywords you have.
It may be worth it to profile this method against something like Lucene.