手指树(Data.Sequence)与绳索(Data.Rope)(Haskell,或一般情况)

发布于 2024-12-27 11:52:29 字数 2480 浏览 0 评论 0 原文

手指树之间的主要区别是什么(Data.Sequence) 和一根绳子 (Data.Rope) (爱德华Kmett 的版本 还是 Pierre-Etienne Meunier 的版本

在 Haskell 库中,Data .Sequence 有更多的功能,我认为绳索可以更有效地处理“块”,

例如处理 700 万个字符的序列,我需要(a)在任何地方插入,(b)剪切和粘贴。段(拼接),(c)搜索和替换子字符串,哪个更有效?

针对 ehird 的澄清:

  1. 我的算法的大部分正在运行数千个搜索替换操作,例如s/(ome)?reg[3]x/blah$1/g,重复改变数据,所以我需要有效的正则表达式模式匹配(可能是regex-tdfa?)以及拼接 (data[a: b] = newData),其中不一定(length(newData) == b-a+1)

  2. 惰性字节字符串可能没问题,但是拼接怎么样?拼接 ByteString 是 O(dataSize / chunkSize) 线性时间(用于搜索),加上(也许?)维护恒定大小块的开销。 (后半部分可能有误);与 FingerTree 的 O(log(dataSize)) 相比。

  3. 我的“containee”数据类型抽象地是一个有限字母表。它可以具体表示为 CharByteWord8 甚至类似于假设的 Word4 的东西(蚕食)。 ** 我有一个相关问题,关于如何有效地使用 newtypedata 以便我的代码可以引用抽象字母表,但编译后的程序仍然可以高效。 (我应该单独发布这个问题。)

  4. 性能问题:也许 Seq 比 ByteString 差得多(按 q 重要常量因子)。在简单的测试中,将 7MB 读取到严格的 ByteString 中,然后将其打印到控制台,实际内存使用量达到 60MB 峰值(根据 Windows 进程管理器),但将该内容加载到 Seq Char 中code>然后打印使用400MB! (我应该单独发布这个问题,并附上代码和分析详细信息。)

  5. 平台问题:我正在使用 EclipseFP 和 Haskell 平台。我的机器上安装了 Text,我想尝试一下,但是我的 Eclipse 环境找不到它。每当我使用cabal install时,我都会遇到严重的麻烦(安装了不兼容版本的软件包,--user--global混淆),所以我想坚持使用 EclipseFP 可以找到的平台包。我认为 Text 将进入下一版本的 Platform,所以这会很好。

  6. Trifecta:我短暂地看到了 Trifecta,这让我更加困惑。 (为什么它有自己已经发布的通用数据结构的新实现?它们更好吗?太多几乎相同的选项!)

编辑提供更多详细信息和改进的链接。

这个问题问得很大了。

@ehird 的总结是主要的要点。字节串或向量的绳索或手指树加上一个小的自定义幺半群。无论哪种方式,我都必须编写一个简单的正则表达式实现来粘合。

鉴于所有这些信息,我会推荐绳索或建筑 您自己的结构及其所基于的 Fingertree 包(而不是 比 Seq 更好,这样你就可以正确地实现长度之类的东西 测量类型类 — 请参阅 Monoids 和 Finger Trees),带有叶子 数据分块到未装箱的向量中。当然后者更 工作,但可以让您专门针对您的用例进行优化。无论哪种方式, 肯定将其包装在一个抽象接口中。

我今天晚些时候会回来并提出新的问题。我会把底层的技术问题理清,然后再回来进行整体比较。 我将更改问题标题以更好地反映我真正关心的问题“哪些 Haskell 模块提供或支持我有效需要的序列操作操作?”感谢 ehird 和其他回复者。

What are the key differences between a Finger Tree (Data.Sequence) and a Rope (Data.Rope) (Edward Kmett's version or Pierre-Etienne Meunier's version?

In the Haskell libraries, Data.Sequence has more functions. I think ropes handle "blocks" more efficiently.

As a programmer considered about efficiency dealing with, say a sequence of 7 million characters, where I need to do (a) insert anywhere, (b) cut and paste segments (splice), (c) search and replace for substrings, which is more efficient?

Clarifications in response to ehird:

  1. The bulk of my algorithm is running thousands of search-replace operations, like s/(ome)?reg[3]x/blah$1/g, to repeatedly mutate the data. So I need efficient regex pattern matching (perhaps form regex-tdfa?) as well as splicing (data[a:b] = newData), where not necessarily (length(newData) == b-a+1)

  2. Lazy ByteStrings might be OK, but what about splicing? Splicing a ByteString is O(dataSize / chunkSize) linear time (for the search), plus (perhaps?) overhead for maintaining the constant-size chunks. (Could be wrong about the latter part); vs O(log(dataSize)) for FingerTree.

  3. My "containee" data type is abstractly a finite alphabet. It could be represented concretely Chars or Bytes or Word8s or even something like a hypothetical Word4s (nibble).
    ** I have a related question about how to efficiently use a newtype or data so that my code can refer to the abstract alphabet, but the compiled program can still be efficient. (I should post this question separately.)

  4. Performance concerns: Perhaps Seq is far worse than ByteString (by q significant constant factor). In simple tests, reading 7MB into a strict ByteString and then printing it to console peaks at 60MB real mem usage (according to Windows Process Manager), but loading that content into a Seq Char and then printing uses 400MB! (I should post this question separately, with code and profiling details.)

  5. Platform concerns: I'm using EclipseFP and Haskell Platform. I have Text installed on my machine, and I wanted to try it, but my Eclipse environment can't find it. I get in serious trouble whenever I use cabal install (incompatible versions of packages get installed, --user vs --global confusion), so I want to stick with Platform packages that EclipseFP can find. I think Text is going into the next version of Platform, so that will be nice.

  6. Trifecta: I saw Trifecta briefly, and that just added to my confusion. (Why does it have its own new implementations of general data structures that have already been published? Are they better? Too many nearly-identical options!)

Edited with more details and improved links.

This question got big.

@ehird's summary is the main take-home point. Rope, or Finger Tree of ByteStrings or Vectors plus a small custom monoid. Either way, I'll have to write a simple regex implementation to glue in.

Given all this information, I would recommend either Rope, or building
your own structure with the fingertree package it's based on (rather
than Seq, so that you can implement things like length properly with
the Measured type-class — see Monoids and Finger Trees), with the leaf
data chunked into an unboxed Vector. The latter is, of course, more
work, but lets you optimise specially for your use-case. Either way,
definitely wrap it up in an abstract interface.

I will come back later today and split into new questions. I will get the low-level technical questions sorted out, and then come back to the overall comparison.
I will change the question title to better reflect my real concern "Which Haskell modules provide or support the sequence manipulation operations I need efficiently?" Thanks go to ehird and other responders.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

笑红尘 2025-01-03 11:52:29

对于这个答案的其余部分,我假设您实际上是在尝试存储原始字节,而不是字符。如果你想存储字符,那么你应该考虑使用 text (相当于 ByteString 对于 Unicode 文本)或基于它编写您自己的基于指树的结构。您还可以将 ByteStringData.ByteString.UTF8 模块rel="noreferrer">utf8-string 包;我认为这最终会更高效,但它的功能远不如 Unicode 文本的 Text 全面。

好吧,您链接的 Rope 包仅存储 ByteString 的等效项,而 Seq 是通用的,可以处理任何类型的数据;前者对于存储字节串可能更有效。

我怀疑这是相同的基本树结构,因为 Rope 实现了“字节串的指树”,而 Seq 是一个 2-3 指树;它依赖于(因此可能使用) fingertree 包,该包本质上与 Data 相同。顺序但更一般。绳子很可能将数据打包成短的 ByteString,这是 Seq 不可能做到的(不破坏 length 等操作) 。

总而言之,如果您存储字节字符串数据,绳索似乎是一个更好的结构,并且它似乎具有用于“注释”字符串段的奇特功能;然而,它最后一次更新是在 7 月份,新的 trifecta 解析器组合器库由同一作者 (首次发布于 8 月)包含其 自己的集合绳索模块,因此在其上建立新代码可能是不明智的。当然,对 trifecta 所做的更改可能与一般用途无关,并且将它们拆分为新版本的 rod 可能不会太困难;也许他们没有这样做的唯一原因是因为 trifecta 已经有大量的依赖项:)

但是,如果您在处理过程中的任何时候需要通用容器类型(例如将字节解析为更丰富的表示序列),或者想要坚持 Haskell 平台中的内容,那么您需要使用 Seq

您确定 ByteStringText (因为您提到了字符)不适合您正在做的事情吗?它们存储偏移量和长度字段,因此获取子字符串不会导致任何复制。如果您的插入操作足够不频繁,那么它就可以解决。某种基于 IntMap 的结构也可能值得考虑。


回答您更新的问题:

  1. 自定义字符串类型的正则表达式:请记住,要将现有的正则表达式实现与“不寻常”的字符串类型一起使用,您必须自己实现支持来粘合将其添加到现有的 regex-tdfa 代码中。我不确定最终的性能会如何。
  2. 与惰性ByteStrings拼接:请注意,惰性ByteString默认使用64 KiB块,您可以使用任意大小的块手动使用 fromChunks 。但你是对的,手指树可能更合适;这只是更多的工作要做,而惰性 ByteString 已经为您处理好了。
  3. 有限字母表: OK;我建议您抽象出(使用 newtype)表示该字母表序列的类型。这样,您可以尝试各种实现,同时希望本地化必须完成的工作,这样您就可以根据真实的性能数据而不是猜测进行选择:)当然,编写新的实现仍然需要前期成本。至于您的附加问题,newtype 在编译时被删除,因此 newtype 具有与其包装的类型相同的运行时表示形式。简而言之:不用担心将东西包装在 newtype 中。
  4. Seq 性能: 嗯,这并不奇怪。 Seq Char 是完全惰性且装箱的,并且不会像 Rope 那样将 Char 一起“分块”;它的内存效率可能比 String 还要低。像 Seq ByteString 这样的东西可能会表现得更好很多,但是除非你的块是恒定大小的,否则你将失去获得有意义的长度等的能力,而无需遍历整个事情。
  5. EclipseFP 包问题: 我不会根据这样的简单工具问题来选择使用哪种表示形式;我建议提出一个新问题。
  6. 三连胜:我认为三连胜与您的问题无关;它与 Rope 是由同一作者编写的,这就是为什么它与 Rope 的持续发展相关。它只是一个像 Parsec 这样的解析器组合器库,它更关注诊断等而不是性能,所以我不认为它可以取代你的正则表达式。

就 #3 而言,您可能需要考虑一个 未装箱的 ByteString,而不是 ByteString 向量;这样,您就可以使用抽象字母类型,而不是侵入 ByteString 的基于 Word8 的界面。

鉴于所有这些信息,我建议使用 Rope,或者使用 构建自己的结构它基于的 Fingertree 包(而不是 Seq),以便您可以使用 Measured 类型类正确实现诸如 length 之类的东西— 参见幺半群和手指树),叶子数据分块为一个未装箱的Vector。当然,后者需要更多工作,但可以让您针对您的用例进行专门优化。不管怎样,一定要把它包装在一个抽象接口中。

顺便说一句,正则表达式在 Haskell 生态系统中并没有得到应有的良好支持;如果有任何选择的话,可能值得考虑使用其他东西。但这在很大程度上取决于您的程序的具体细节,无法给出具体的建议。

For the rest of this answer, I'll assume you're actually trying to store raw bytes, not characters. If you want to store characters, then you should consider using text (the equivalent of ByteString for Unicode text) or writing your own fingertree-based structure based on it. You could also use a ByteString with the Data.ByteString.UTF8 module from the utf8-string package; I think that can end up being more efficient, but it's much less fully-featured than Text for Unicode text.

Well, the rope package you linked only stores the equivalent of ByteStrings, whereas Seq is generic and can handle any type of data; the former is likely to be more efficient for storing, well, strings of bytes.

I suspect it's the same essential tree structure, as rope implements "fingertrees of bytestrings", and Seq is a 2-3 finger tree; it depends on (and so presumably uses) the fingertree package, which is essentially the same as Data.Sequence but more general. It's likely that rope packs the data into short ByteStrings, which is impossible to do with Seq (without breaking operations like length, etc.).

All in all, rope seems like a better structure if you're storing byte-string data, and it seems to have fancy functionality for "annotating" segments of the string; however, it was last updated in July, and the new trifecta parser combinator library by the same author (first released in August) contains its own set of rope modules, so it may be unwise to base new code on it. Of course, the changes made for trifecta may not be relevant for general use, and it probably wouldn't be too difficult to split those out as a new version of rope; perhaps the only reason they haven't been is because trifecta already has a ton of dependencies :)

But, if you need a generic container type at any point in your processing (e.g. parsing the bytes into a sequence of some richer representation), or want to stick to what's in the Haskell Platform, then you'll need to use Seq.

Are you sure that ByteString or Text (since you mentioned characters) aren't suitable for what you're doing? They store offset and length fields, so that taking a substring doesn't cause any copying. If your insert operations are infrequent enough, then it could work out. An IntMap-based structure of some sort might be worth considering, too.


In response to your updated question:

  1. Regexes on custom string types: Bear in mind that to use an existing regex implementation with an "unusual" string type, you'll have to implement the support yourself to glue it to the existing regex-tdfa code. I'm not sure what the resulting performance would be.
  2. Splicing with lazy ByteStrings: Note that lazy ByteStrings use 64 KiB chunks by default, and you can use chunks as large as you wish by using fromChunks manually. But you're right, a finger tree is probably better suited; it's just more work to do that's already handled for you with lazy ByteStrings.
  3. Finite alphabet: OK; I'd suggest that you abstract out (with a newtype) a type representing a sequence of this alphabet. That way, you can try various implementations out while hopefully localising the work that has to be done, so you can choose based on real performance data rather than guesswork :) Of course, there's still an upfront cost to writing a new implementation. As far as your additional question, newtypes are erased at compile-time, so a newtype has the same runtime representation as the type it's wrapping. In short: don't worry about wrapping things in newtypes.
  4. Seq performance: Well, that's not surprising. Seq Char is fully lazy and boxed, and won't be "chunking" Chars together like Rope would; it's probably even less memory-efficient than String. Something like Seq ByteString might perform a lot better, but unless your chunks are constant-sized, you'll lose the ability to get a meaningful length, etc. without traversing the whole thing.
  5. EclipseFP package problems: I wouldn't choose which representation to use based on simple tool problems like that; I recommend asking a new question.
  6. Trifecta: I don't think trifecta is relevant to your problem; it's just written by the same author as rope, which is why it's relevant with regards to rope's continued development. It's just a parser combinator library like Parsec, and it focuses more on diagnostics and the like rather than performance, so I don't think it could replace your regexes.

As far as #3 goes, instead of ByteString, you might want to consider an unboxed Vector; that way, you can use your abstract alphabet type rather than hacking things into ByteString's Word8-based interface.

Given all this information, I would recommend either Rope, or building your own structure with the fingertree package it's based on (rather than Seq, so that you can implement things like length properly with the Measured type-class — see Monoids and Finger Trees), with the leaf data chunked into an unboxed Vector. The latter is, of course, more work, but lets you optimise specially for your use-case. Either way, definitely wrap it up in an abstract interface.

By the way, regexes aren't as well-supported in the Haskell ecosystem as they could be; it might be worth considering using something else if there's any option of doing so. But it depends too much on specific details of your program to give a specific recommendation.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文