当前位置：文江博客话题详情

为文本添加样式的最有效的数据结构

发布于 2024-10-02 04:01:17 字数 280 浏览 0 评论 0原文

我正在寻找最好的数据结构来向文本添加样式（例如在文本编辑器中）。该结构应允许以下操作：

在绝对位置快速查找所有样式 X
在任意位置快速插入文本（必须移动该位置之后的样式）。
文本的每个位置都必须支持任意数量的样式（重叠）。

我考虑过包含文本范围的列表/数组，但它们不允许在不重新计算插入点之后所有样式的位置的情况下快速插入。

具有相对偏移量的树结构支持#2，但是当我向文本添加大量样式时，树会快速退化。

还有其他选择吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

樱娆 2024-10-09 04:01:18

我从未开发过编辑器，但是这个怎么样：

我相信可以扩展用于存储文本字符本身的方案，当然取决于您的实现细节（语言、工具包等）和您的性能和资源使用要求。

我不想为样式使用单独的数据结构，而是更喜欢有一个引用，该引用将伴随每个字符并指向具有适用字符的数组或列表。具有相同样式集的字符可以指向相同的数组或列表，以便可以共享。

字符插入和删除不会影响样式本身，除了更改对它们的引用数量之外，这可以通过一些引用计数来处理。

根据您的编程语言，您甚至可以通过指向列表的一半来进一步压缩内容，尽管为此进行的额外簿记实际上可能会降低效率。

此建议的主要问题是内存使用。在用 C 编写的 ASCII 编辑器中，由于结构对齐填充，将指针与每个字符捆绑在一起会在 64 位系统上将其有效内存使用量从 1 字节提高到 12 字节。

我会考虑将文本分成小的可变大小的块，这样您就可以有效地压缩指针。例如，32 个字符的块在 C 中可能如下所示：

struct _BLK_ {
    unsigned char size;
    unsigned int styles;
    char content[];
}

有趣的部分是对结构变量部分的元数据处理，其中包含存储的文本和任何样式指针。 size 元素表示字符数。样式整数（因此有 32 个字符的限制）将被视为一组 32 个 1 位字段，每个字段指示字符是否具有自己的样式指针，或者是否应使用与前一个字符相同的样式。这样，具有单一样式的 32 字符块将仅具有大小字符、样式掩码和单个指针以及任何填充字节的额外开销。在这样的小数组中插入和删除字符应该相当快。

至于文本存储本身，树听起来是个好主意。也许是一个二叉树，其中每个节点值都是子节点值的总和，叶节点最终指向文本块，其大小作为节点值？根节点值将是文本的总大小，理想情况下每个子树保存文本的一半。不过，您仍然必须自动平衡它，有时必须合并半空的文本块。

如果您错过了它，我不是树方面的专家:-)

编辑：

显然我建议的是此数据结构的修改版本：

http://en.wikipedia.org/wiki/Rope_%28computer_science%29

如本文中所引用：

文本编辑器的数据结构

编辑2：

建议的数据结构中的删除应该相对较快，因为它会归结为字节数组中的移位以及样式掩码上的一些按位运算。插入几乎是一样的，除非一个块被填满。在每个块内保留一些空间（即样式掩码中的一些位）可能是有意义的，以便将来可以直接在块中插入，而不必为相对少量的新文本更改树本身。

像这样将字符和样式捆绑在块中的另一个优点是，其固有的数据局部性应该比其他替代方案更有效地使用 CPU 缓存，从而在一定程度上提高处理速度。

不过，与任何复杂的数据结构非常相似，您可能需要使用代表性测试用例进行分析或使用自适应算法来确定其操作的最佳参数（块大小、任何保留空间等）。

I have never developped an editor, but how about this:

I believe it would be possible to expand the scheme that is used to store the text characters themeselves, depending of course on the details of your implementation (language, toolkits etc) and your performance and resource usage requirements.

Rather than use a separate data structure for the styles, I'd prefer having a reference that would accompany each character and point to an array or list with the applicable characters. Characters with the same set of styles could point to the same array or list, so that one could be shared.

Character insertions and deletions would not affect the styles themeselves, apart from changing the number of references to them, which could be handled with a bit of reference counting.

Depending on your programming language you could even compress things a bit more by pointing halfway into a list, although the additional bookkeeping for this might in fact make it more inefficient.

The main issue with this suggestion is the memory usage. In an ASCII editor written in C, bundling a pointer with each char would raise its effective memory usage from 1 byte to 12 bytes on a 64 bit system, due to struct alignment padding.

I would look about breaking the text into small variable size blocks that would allow you to efficiently compress the pointers. E.g. a 32-character block might look like this in C:

struct _BLK_ {
    unsigned char size;
    unsigned int styles;
    char content[];
}

The interesting part is the metadata processing on the variable part of the struct, which contains both the stored text and any style pointers. The size element would indicate the number of characters. The styles integer (hence the 32-character limit) would be seen as a set of 32 1-bit fields, with each one indicating whether a character has its own style pointer, or whether it should use the same style as the previous character. This way a 32-char block with a single style would only have the additional overhead of the size char, the styles mask and a single pointer, along with any padding bytes. Inserting and deleting characters into a small array like this should be quite fast.

As for the text storage itself, a tree sounds like a good idea. Perhaps a binary tree where each node value would be the sum of the children values, with the leaf nodes eventually pointing to text blocks with their size as their node value? The root node value would be the total size of the text, with each subtree ideally holding half of your text. You'd still have to auto-balance it, though, with sometimes having to merge half-empty text blocks.

And in case you missed it, I am no expert in trees :-)

EDIT:

Apparently what I suggested is a modified version of this data structure:

http://en.wikipedia.org/wiki/Rope_%28computer_science%29

as referenced in this post:

Data structure for text editor

EDIT 2:

Deletion in the proposed data structure should be relatively fast, as it would come down to byte shifting in an array and a few bitwise operations on the styles mask. Insertion is pretty much the same, unless a block fills up. It might make sense to reserve some space (i.e. some bits in the styles mask) within each block to allow for future insertions directly in the blocks, without having to alter the tree itself for relatively small amounts of new text.

Another advantage of bundling characters and styles in blocks like this is that its inherent data locality should allow for more efficient use of the CPU cache than other alternatives, thus improving the processing speed to some extent.

Much like any complex data structure, though, you'd probably need either profiling with representative test cases or an adaptive algorithm to determine the optimal parameters for its operation (block size, any reserved space etc).

回复收藏 0 原文

~没有更多了~