Sparse matrix compression with fast access times

Posted on 2025-01-06 07:29:10

I'm writing a lexer generator as a spare-time project, and I'm wondering how to go about table compression. The tables in question are very sparse 2D arrays of shorts. One dimension always has 256 entries, one per character. The other dimension varies in size with the number of states in the lexer.

The basic requirements for the compression are:

The data should be accessible without decompressing the full data set, and in constant O(1) time.
The compressed table should be reasonably fast to compute.

I understand the row displacement method, which is what I currently have implemented. It might be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make this go faster using some established algorithm for string searching such as one of the algorithms found here.
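
For readers who have not met it, the row displacement idea mentioned above can be sketched as follows. This is a minimal first-fit version written for illustration, not the implementation from the question; the names `compress_rows` and `lookup` and the use of 0 as the "empty" value are assumptions.

```python
def compress_rows(table, empty=0):
    """Pack sparse rows into shared 1-D arrays using first-fit row displacement."""
    width = len(table[0])
    nxt, check, base = [], [], []   # check[i] = row that owns slot i, or -1 if free
    for r, row in enumerate(table):
        cols = [c for c, v in enumerate(row) if v != empty]
        offset = 0
        # First fit: slide the row right until none of its non-empty cells collide.
        while any(offset + c < len(check) and check[offset + c] != -1 for c in cols):
            offset += 1
        # Grow the arrays so that offset + any column index stays in range.
        grow = offset + width - len(check)
        if grow > 0:
            nxt.extend([empty] * grow)
            check.extend([-1] * grow)
        for c in cols:
            nxt[offset + c] = row[c]
            check[offset + c] = r
        base.append(offset)
    return base, nxt, check


def lookup(base, nxt, check, row, col, empty=0):
    """Constant-time access: one addition, one ownership test, one read."""
    i = base[row] + col
    return nxt[i] if check[i] == row else empty
```

The inner `while` loop is where a naive version spends its time, since every candidate offset is re-tested against every non-empty cell. Commonly suggested remedies are to place the densest rows first and to remember the lowest free slot so that the scan does not restart from 0; whether either helps enough depends on the tables involved.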

I suppose an option would be to use a Dictionary, but that feels like cheating, and I would like the fast access times that I would be able to get if I use straight arrays with some established algorithm. Perhaps I'm worrying needlessly about this.

From what I can gather, flex does not use this algorithm for its lexing tables. Instead, it seems to use something called row/column equivalence, which I haven't been able to find any explanation for.

I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
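
One plausible reading of "row/column equivalence" is that input characters whose columns are identical share one column, and states whose rows are identical share one row, with two small index arrays mapping the original indices to the shared ones. The sketch below illustrates that idea only; it is an assumption about the technique, not a description of flex's actual code, and all function names are hypothetical.

```python
def column_classes(table):
    """Map each column index to a class id; identical columns share an id."""
    ids = {}                        # column contents -> class id
    class_of = []
    for col in zip(*table):         # iterate over columns as tuples
        class_of.append(ids.setdefault(col, len(ids)))
    return class_of


def compress_equivalent(table):
    """Collapse identical columns, then identical rows, of a 2-D table."""
    col_class = column_classes(table)
    n_cols = max(col_class) + 1
    # Class ids are handed out in order of first appearance, so index(k)
    # finds the first (representative) column of class k.
    rep = [col_class.index(k) for k in range(n_cols)]
    narrow = [[row[c] for c in rep] for row in table]
    # Rows are classified the same way by transposing the narrowed table.
    row_class = column_classes(list(zip(*narrow)))
    n_rows = max(row_class) + 1
    small = [narrow[row_class.index(k)] for k in range(n_rows)]
    return row_class, col_class, small


def lookup(row_class, col_class, small, row, col):
    """Constant-time access: two class lookups, then one 2-D index."""
    return small[row_class[row]][col_class[col]]
```

Building the classes is a single pass over the table with hashing, so it is cheap to compute. The result is still a dense 2D array, only a smaller one, so it can be combined with a second scheme such as row displacement.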

Edit: To clarify what this data actually is: it is the state-transition information for the lexer. The data needs to be stored in a compressed format in memory, since the state tables can potentially be huge. It is also from this memory that the actual values will be accessed directly, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute, in part due to my silly implementation.

Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.

The data is very sparse. It is usually a big bunch of zeroes followed by a few shorts for each state. It would be trivial to run-length encode it, for instance, but that would spoil the constant access time.
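
To make that trade-off concrete, here is a small sketch of a run-length encoded row stored as run start positions plus run values. The function names are hypothetical. A lookup has to search the runs; binary search brings that down to O(log r) for r runs, but it is no longer a single indexed read.

```python
from bisect import bisect_right


def rle_encode(row):
    """Return (starts, values): the start index and the value of each run."""
    starts, values = [], []
    for i, v in enumerate(row):
        if not values or values[-1] != v:
            starts.append(i)
            values.append(v)
    return starts, values


def rle_lookup(starts, values, col):
    """Find the run containing col by binary search over the run starts."""
    return values[bisect_right(starts, col) - 1]
```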

Flex apparently has two pairs of tables: base and default form the first pair, and next and check the second. These tables seem to index one another in ways I don't understand. The dragon book attempts to explain this, but, as is often the case with that tome of arcane knowledge, what it says is lost on lesser minds such as mine.
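
As a reading aid, the four-array scheme described in the dragon book works roughly like this: `base[s]` says where state `s` starts in the shared `next`/`check` arrays, `check[i]` records which state owns slot `i`, and when the slot belongs to another state the lookup retries with `default[s]`. The sketch below follows that description with hand-built arrays for a four-symbol alphabet. The concrete values, the `NO_STATE` sentinel, and the name `nxt` (used because `next` is a Python builtin) are assumptions for illustration, not flex's generated tables.

```python
NO_STATE = -1   # assumed sentinel: "no such state" / "no transition"

# State 0: symbol 0 -> state 1, symbol 1 -> state 2.
# State 1: behaves like state 0, plus symbol 2 -> state 3.
# State 2: no transitions at all.
base    = [0, 0, 0]
default = [NO_STATE, 0, NO_STATE]
nxt     = [1, 2, 3, 0]
check   = [0, 0, 1, NO_STATE]


def next_state(state, symbol):
    """Follow default links until a slot owned by the current state is found."""
    while state != NO_STATE:
        i = base[state] + symbol
        if check[i] == state:
            return nxt[i]
        state = default[state]   # fall back to the state this one resembles
    return NO_STATE
```

State 1 stores only the one entry in which it differs from state 0 and inherits the rest through `default`. The price is that a lookup is no longer strictly O(1): it costs one step per link in the default chain, so the chains need to stay short.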
