Editor core: buffer type and syntax highlighting

Posted on 2024-07-13 14:26:48

I have been thinking about how to build an editor core that is vim-compatible, similar to yzis.

The biggest question is what type of buffer to use.

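The requirements are:

Allow fast syntax highlighting and regular expressions.
Allow multiple syntax highlightings within a single file, similar to TextMate.
Marks that move appropriately on deletion and insertion, so that their columns are adjusted correctly, unlike in vim.
Handle and highlight files of at least 100 MB without too much trouble or memory overhead.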

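Possible buffer types:

Gap buffer
Line-based editing

I have read that gap buffers can cause considerable memory fragmentation over long runs. Also, the emacs syntax highlighting engine is pretty slow. (I don't know why; it may not really be related to the buffer type.)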

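So the questions are:

Which buffer type is best suited to a fast programming editor?
What is a fast and complete regular expression engine? (Perhaps this overlaps with the next point.) TextMate uses oniguruma; is that a sensible choice?
What is a fast syntax highlighting engine?
Regarding marks and syntax highlighting: how do emacs overlays work, and would they help?

Thanks, Reza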

叫嚣ゝ 2024-07-20 14:26:48

A good text editor should be useful for all kinds of work a programmer might do, and that includes opening files that may be several gigabytes in size. I would therefore not recommend a design in which everything is buffered in RAM.

I would recommend setting up a search tree of slices representing the file, where a single slice may be:

A reference to a range of bytes in the actual file on disk, or
A reference to an edited "page".

When you open a file you start by inserting a single item into the tree, which is simply a range representing the whole file, e.g. for a 10-MiB file:

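// Key: byte offset at which the slice starts; value: slice metadata.
std::map<size_t, slice_info> slices;
slices[0].size = 10*1024*1024;  // one slice covering the whole file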

When the user edits the file, create a "page" which is some reasonable size, say 4 KiB, around the edit point. The tree is spliced at that point. In the example, the edit point is at 5 MiB:

size_t const PAGE_SIZE = 4*1024;
slices[0].size = 5*1024*1024;  // untouched bytes before the edit point
slices[5*1024*1024].size = PAGE_SIZE;  // editable page at the edit point
slices[5*1024*1024].buffer = create_buffer(file, 5*1024*1024, PAGE_SIZE);
slices[5*1024*1024 + PAGE_SIZE].size = 5*1024*1024 - PAGE_SIZE;  // untouched remainder
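
The splice above can be generalized to any edit offset. What follows is only a minimal sketch under my own assumptions: `SliceInfo`, `open_file` and `make_page_at` are hypothetical names, the page starts exactly at the edit point as in the example, and the buffer itself is left out so that only the bookkeeping is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical slice record (illustrative only): a slice either refers to
// a byte range of the original file or is an editable page.
struct SliceInfo {
    std::size_t size;   // number of bytes covered by this slice
    bool editable;      // false: range of the file on disk, true: edited page
};

using SliceMap = std::map<std::size_t, SliceInfo>;  // key: start offset

constexpr std::size_t kPageSize = 4 * 1024;

// A freshly opened file is a single read-only slice.
SliceMap open_file(std::size_t file_size) {
    SliceMap slices;
    slices[0] = SliceInfo{file_size, false};
    return slices;
}

// Carve an editable page of at most kPageSize bytes, starting at `offset`,
// out of the slice that contains `offset`. Returns the key of the page.
std::size_t make_page_at(SliceMap& slices, std::size_t offset) {
    assert(!slices.empty());
    auto it = std::prev(slices.upper_bound(offset));  // last key <= offset
    const std::size_t start = it->first;
    const std::size_t end = start + it->second.size;
    assert(offset < end);
    if (it->second.editable) return start;  // already inside an editable page

    const std::size_t page_end = std::min(offset + kPageSize, end);
    if (offset > start) it->second.size = offset - start;  // part before the page
    slices[offset] = SliceInfo{page_end - offset, true};   // the page itself
    if (page_end < end) {
        slices[page_end] = SliceInfo{end - page_end, false};  // part after the page
    }
    return offset;
}
```

Calling `make_page_at(slices, 5*1024*1024)` on a 10 MiB file gives the same three slices as the hand-written example. Note that keying by absolute offset means every later key shifts once a page grows or shrinks; a real tree would more likely store sizes in the nodes and derive offsets from them.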

You can use memory-mapped files both for the read-only buffer (the source file) and for the copied editable buffers (the latter would be placed in a temp directory). This also allows recovery should the editor crash.

Using fixed-size pages will reduce fragmentation of the memory heap a lot, since all blocks have the same size, and inserting text will never require moving more than 4 KiB of data past the insertion point.

This is a simplified description to give the general idea without getting into too many gritty details. A real implementation would most likely need to be more sophisticated, e.g. allow for a variable amount of data in a page to cope with pages that overflow, and merge together many small slices so that running a regex substitution across a large file does not create too many small buffers. There probably needs to be a limit for the number of slices you should have in the tree simultaneously, but a key point is that when you start inserting somewhere you should make sure that you are working with a slice that isn't too big.
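
To make the overflow case concrete, here is a rough sketch of inserting into a page and splitting it when it grows past the page size. It rests on assumptions of mine: pages are plain `std::string`s kept in a `std::list`, and `insert_in_page` is a made-up name. Merging small slices back together is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <utility>

constexpr std::size_t kPageCapacity = 4 * 1024;

using Page = std::string;          // holds at most kPageCapacity bytes at rest
using PageList = std::list<Page>;  // stand-in for the editable slices in the tree

// Insert `text` at byte `pos` of the page `it`. If the page overflows,
// the excess is moved into new pages placed right after it, so no page
// ends up larger than kPageCapacity.
void insert_in_page(PageList& pages, PageList::iterator it,
                    std::size_t pos, const std::string& text) {
    assert(pos <= it->size());
    it->insert(pos, text);
    while (it->size() > kPageCapacity) {
        Page tail = it->substr(kPageCapacity);  // bytes that no longer fit
        it->resize(kPageCapacity);
        it = pages.insert(std::next(it), std::move(tail));
    }
}
```

Only the bytes of the affected page are moved; the rest of the document is untouched, which is the point of keeping slices small.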

For regex, I don't think performance is much of a problem as long as the whole editor doesn't hang while a search is running. Try Boost.Regex: it will most likely be fast enough for your needs, and it is generic enough to plug into whatever buffering strategy you choose.
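
As an illustration of the "plug into any buffering strategy" point, the sketch below searches a buffer that is not one contiguous array. To keep it dependency-free it uses `std::regex` as a stand-in rather than Boost.Regex; as far as I know Boost.Regex offers an iterator-based `regex_search` of the same shape, but treat that as an assumption. `Buffer` and `find_first` are names I made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <regex>
#include <string>

// Stand-in for an editor buffer whose bytes are not stored contiguously.
using Buffer = std::deque<char>;

// Return the offset of the first match of `pattern` at or after `from`,
// or -1 if there is no match. The regex engine only sees iterators, so it
// does not care how the buffer is laid out in memory.
std::ptrdiff_t find_first(const Buffer& buf, const std::string& pattern,
                          std::size_t from = 0) {
    assert(from <= buf.size());
    const std::regex re(pattern);
    std::match_results<Buffer::const_iterator> m;
    const auto first = buf.begin() + static_cast<std::ptrdiff_t>(from);
    if (!std::regex_search(first, buf.end(), m, re)) return -1;
    // position() is relative to `first`, so add the starting offset back.
    return static_cast<std::ptrdiff_t>(from) + m.position(0);
}
```

An iterator that walks the slice tree could be substituted for the deque iterator, provided it is at least bidirectional.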

The same applies to syntax highlighting: if you run it in the background, it won't disturb the user much while they are typing. You can use the slice approach to your benefit here:

Each slice can have a mutex that can be locked during an editing operation, allowing syntax highlighting or "intellisense" type analysis to run in a background thread.
You can store the state of the syntax highlighting engine so that whenever you make edits in a slice you can restart the syntax highlighting from the beginning of that slice, rather than from the beginning of the file.
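
The second point can be sketched as follows. This is a toy under my own assumptions, not a real highlighter: the only "state" tracked is whether we are inside a `/* ... */` comment, delimiters that straddle a slice boundary are ignored, and `Slice`, `scan` and `rehighlight_from` are hypothetical names. Locking is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class LexState { Code, BlockComment };

struct Slice {
    std::string text;
    LexState entry;  // cached highlighter state at the start of this slice
};

// Scan one slice and return the highlighter state at its end.
// Toy lexer: only recognizes the "/*" and "*/" delimiters.
LexState scan(const std::string& text, LexState state) {
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (state == LexState::Code && text[i] == '/' && text[i + 1] == '*') {
            state = LexState::BlockComment;
            ++i;  // skip the second delimiter character
        } else if (state == LexState::BlockComment &&
                   text[i] == '*' && text[i + 1] == '/') {
            state = LexState::Code;
            ++i;
        }
    }
    return state;
}

// Re-scan starting at slice `first`, reusing its cached entry state instead
// of starting from the beginning of the file. Stops as soon as a following
// slice's cached entry state turns out to be unchanged.
// Returns the number of slices that were scanned.
std::size_t rehighlight_from(std::vector<Slice>& slices, std::size_t first) {
    assert(first < slices.size());
    std::size_t scanned = 0;
    LexState state = slices[first].entry;
    for (std::size_t i = first; i < slices.size(); ++i) {
        if (i > first) {
            if (slices[i].entry == state) break;  // later results still valid
            slices[i].entry = state;
        }
        state = scan(slices[i].text, state);
        ++scanned;
    }
    return scanned;
}
```

An edit that does not change the state at the end of its slice costs one slice scan; an edit that opens or closes a comment propagates forward only as far as the cached states actually change.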

I am not aware of any freestanding syntax highlighting engines, but they are usually based on regex substitution (see e.g. the syntax highlighting files in vim).
