后缀数组在哪里比后缀树更好?
两个密切相关的数据结构是后缀树和后缀数组。据我所知,后缀树比后缀数组更快、更强大、更灵活、更节省内存。但是,在这个早期问题中,热门答案提到后缀数组在实践中使用更广泛。我没有任何使用这些结构的经验,但现在看来,对于需要它们提供的功能(例如快速子字符串检查)的问题,我似乎总是更喜欢后缀树而不是后缀数组。
在什么情况下后缀数组比后缀树更好?
(顺便说一句,虽然这个问题与我链接的问题相关,但我不认为它是完全重复的,因为我只对后缀数组和后缀树的比较感兴趣,而完全不考虑尝试不过,如果您不同意,如果这个问题被关闭,我会理解。)
Two closely-related data structures are the suffix tree and suffix array. From what I've read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array. However, in this earlier question, one of the top answers mentioned that suffix arrays are more widely used in practice. I don't have any experience using either of these structures, but right now it seems like I would always prefer a suffix tree over a suffix array for problems that needed the functionality they provide (fast substring checking, for example).
In what circumstances would a suffix array be preferable to a suffix tree?
(By the way, while this question is related to the one I've linked, I don't think it's an exact duplicate as I'm interested solely in a comparison of suffix arrays and suffix trees, leaving tries completely out of the picture. If you disagree, though, I would understand if this question were to be closed.)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
引用自 http://www.youtube.com/watch?v=1DGZxd-PP7U
等人的论文写得很好,易于理解,并且引用了其他重要论文,例如 Abouelhoda 等人的论文。
Citing from http://www.youtube.com/watch?v=1DGZxd-PP7U
The Kim et al paper is well written, accessible and has references to other important papers, such as the one by Abouelhoda et al.
后缀数组几乎总是更可取,除非:
后缀数组可以用来实现后缀树。这意味着后缀树可以是后缀数组和一些附加数据结构来模拟后缀树功能。
因此:
如果您想要索引大量数据,例如超过 50 MB。后缀树使用了太多的空间,以至于您的计算机没有足够的内存来将其保存在中央内存中。因此,它开始使用辅助内存,您会看到速度大幅下降。 (例如,人类 DNA 使用 700 MB,该数据的后缀树“可以”使用 40 GB -> *“可以”取决于实现*)
因此,后缀树在实践中几乎从未使用过。在实践中,使用后缀数组,并且小的附加数据结构为其提供了一些额外的功能(不是完整的后缀树)。
然而它们是不同的。在许多情况下,由于速度高效、构建速度快且占用空间少,纯后缀数组更适合模式匹配。
A suffix array is nearly always preferable, except:
A suffix array can be used to implement the suffix tree. Meaning a suffix tree can be a suffix array and a few additional data structures to simulate the suffix tree functionality.
Therefore:
If you want to index a lot of data, like more than 50 megabytes. The suffix tree uses so much space that your computer does not have enough ram to keep it in central memory. Therefore it starts to use secondary memory and you will see a huge degradation in speed. (for example the human dna uses 700 megabytes, a suffix tree of that data "can" use 40 gigabytes -> * "can" depending on the implementation * )
Because of this the suffix tree is nearly never used in practice. In practice the suffix array is used and small additional data structures give it some extra functionality (never a complete suffix tree).
However they are different. There are many cases where a pure suffix array is preferable for pattern matching due to efficient speed, fast construction speed and low space use.