后缀数组在哪里比后缀树更好?

发布于 2024-11-30 18:40:52 字数 393 浏览 1 评论 0原文

两个密切相关的数据结构是后缀树和后缀数组。据我所知,后缀树比后缀数组更快、更强大、更灵活、更节省内存。但是,在这个早期问题中,热门答案提到后缀数组在实践中使用更广泛。我没有任何使用这些结构的经验,但现在看来,对于需要它们提供的功能(例如快速子字符串检查)的问题,我似乎总是更喜欢后缀树而不是后缀数组。

在什么情况下后缀数组比后缀树更好?

(顺便说一句,虽然这个问题与我链接的问题相关,但我不认为它是完全重复的,因为我只对后缀数组和后缀树的比较感兴趣,而完全不考虑尝试不过,如果您不同意,如果这个问题被关闭,我会理解。)

Two closely-related data structures are the suffix tree and suffix array. From what I've read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array. However, in this earlier question, one of the top answers mentioned that suffix arrays are more widely used in practice. I don't have any experience using either of these structures, but right now it seems like I would always prefer a suffix tree over a suffix array for problems that needed the functionality they provide (fast substring checking, for example).

In what circumstances would a suffix array be preferable to a suffix tree?

(By the way, while this question is related to the one I've linked, I don't think it's an exact duplicate as I'm interested solely in a comparison of suffix arrays and suffix trees, leaving tries completely out of the picture. If you disagree, though, I would understand if this question were to be closed.)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

往昔成烟 2024-12-07 18:40:52

引用自 http://www.youtube.com/watch?v=1DGZxd-PP7U

后缀数组和后缀树过去是不同的。但如今
后缀数组只是实现后缀树(或反之)的一种方法
反之亦然)。参见:Kim、Kim 和 Park。线性化后缀树:一种高效的
具有后缀树和后缀功能的索引数据结构
数组。算法,2007 年。

等人的论文写得很好,易于理解,并且引用了其他重要论文,例如 Abouelhoda 等人的论文。

Citing from http://www.youtube.com/watch?v=1DGZxd-PP7U

Suffix Arrays and Suffix Trees used to be different. But nowadays
Suffix Arrays are just a way of implementing a Suffix Tree (or vice
versa). See: Kim, Kim, and Park. Linearized suffix tree: an efficient
index data structure with the capabilities of suffix trees and suffix
arrays. Algorithmica, 2007.

The Kim et al paper is well written, accessible and has references to other important papers, such as the one by Abouelhoda et al.

仅此而已 2024-12-07 18:40:52

后缀数组几乎总是更可取,除非:

  • 如果您要索引少量数据。
  • 如果您正在研究蛋白质匹配或 DNA 突变,并且可以使用极其昂贵的计算机。
  • 如果您必须不惜一切代价使用带有通配符的错误搜索。

后缀数组可以用来实现后缀树。这意味着后缀树可以是后缀数组和一些附加数据结构来模拟后缀树功能。

因此:

  • 后缀数组使用更少的空间(少很多)
  • 后缀树构建速度较慢
  • 后缀树执行模式匹配操作更快
  • 后缀树可以执行更多操作,最好的是使用通配符进行错误模式匹配(后缀数组也执行模式匹配,但不执行)带通配符)

如果您想要索引大量数据,例如超过 50 MB。后缀树使用了太多的空间,以至于您的计算机没有足够的内存来将其保存在中央内存中。因此,它开始使用辅助内存,您会看到速度大幅下降。 (例如,人类 DNA 使用 700 MB,该数据的后缀树“可以”使用 40 GB -> *“可以”取决于实现*)

因此,后缀树在实践中几乎从未使用过。在实践中,使用后缀数组,并且小的附加数据结构为其提供了一些额外的功能(不是完整的后缀树)。

然而它们是不同的。在许多情况下,由于速度高效、构建速度快且占用空间少,纯后缀数组更适合模式匹配。

A suffix array is nearly always preferable, except:

  • If you are going to index little ammounts of data.
  • If you are doing research on protein matching or dna mutations and have access to extremely expensive computers.
  • If you must at all cost use the error search with wildcards.

A suffix array can be used to implement the suffix tree. Meaning a suffix tree can be a suffix array and a few additional data structures to simulate the suffix tree functionality.

Therefore:

  • Suffix arrays use less space (a lot less)
  • Suffix trees are slower to build
  • Suffix trees are faster doing pattern matching operations
  • Suffix trees can do more operations, the best is error pattern matching with wildcards (suffix array also does pattern matching but not with wildcards)

If you want to index a lot of data, like more than 50 megabytes. The suffix tree uses so much space that your computer does not have enough ram to keep it in central memory. Therefore it starts to use secondary memory and you will see a huge degradation in speed. (for example the human dna uses 700 megabytes, a suffix tree of that data "can" use 40 gigabytes -> * "can" depending on the implementation * )

Because of this the suffix tree is nearly never used in practice. In practice the suffix array is used and small additional data structures give it some extra functionality (never a complete suffix tree).

However they are different. There are many cases where a pure suffix array is preferable for pattern matching due to efficient speed, fast construction speed and low space use.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文