具有快速访问时间的稀疏矩阵压缩
我正在编写一个词法分析器生成器作为业余项目,我想知道如何进行表压缩。所讨论的表是短
且非常稀疏的二维数组。它们在一维中始终为 256 个字符。另一个维度的大小根据词法分析器中的状态数量而变化。
压缩的基本要求是
- 无需解压整个数据集即可访问数据。并且可以在恒定的 O(1) 时间内访问。
- 计算压缩表的速度相当快。
我了解行位移方法,这就是我目前所实现的。这可能是我幼稚的实现,但我所拥有的生成速度非常慢,尽管访问速度相当快。我想我可以使用一些已建立的字符串搜索算法来加快速度,例如此处找到的算法之一。
我想一个选择是使用字典,但这感觉像是作弊,并且我希望如果我使用带有某种既定算法的直接数组,我将能够获得快速访问时间。也许我对此的担心是不必要的。
据我所知,flex 的词法分析表并未使用此算法。相反,它似乎使用了一种称为行/列等效的东西,我还没有真正找到任何解释。
我真的很想知道 flex 使用的行/列等价算法是如何工作的,或者是否有任何其他好的选择我应该考虑用于此任务。
编辑:详细说明此数据的实际含义。它是词法分析器中状态转换的状态信息。由于状态表可能很大,因此数据需要以压缩格式存储在内存中。也可以从该内存中直接访问实际值,而无需解压缩表。我有一个使用行位移的可行解决方案,但计算速度非常慢 - 部分原因是我的愚蠢实现。
也许行位移方法的我的实现会让人们更清楚如何访问这些数据。它有点冗长,我希望我可以把它放在 Pastebin 上而不是放在这里。
数据非常稀疏。它通常是一大堆零,后面跟着每个状态的几个短字符。例如,对它进行游程编码是微不足道的,但它会破坏 线性访问时间。
Flex 显然有两对表,第一对为 base
和 default
,next
和 check
第二对。这些表似乎以我不理解的方式相互索引。 《龙之书》试图解释这一点,但就像那本神秘知识巨著经常发生的情况一样,它所说的内容对于像我这样的低级头脑来说是迷失的。
I'm writing a lexer generator as a spare time project, and I'm wondering about how to go about table compression. The tables in question are 2D arrays of short
and very sparse. They are always 256 characters in one dimension. The other dimension is varying in size according to the number of states in the lexer.
The basic requirements of the compression is that
- The data should be accessible without decompressing the full data set. And accessible in constant O(1) time.
- Reasonably fast to compute the compressed table.
I understand the row displacement method, which is what I currently have implemented. It might be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make this go faster using some established algorithm for string searching such as one of the algorithms found here.
I suppose an option would be to use a Dictionary
, but that feels like cheating, and I would like the fast access times that I would be able to get if I use straight arrays with some established algorithm. Perhaps I'm worrying needlessly about this.
From what I can gather, flex does not use this algorithm for it's lexing tables. Instead it seems to use something called row/column equivalence which I haven't really been able to find any explanation for.
I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
Edit: To clarify more about what this data actually is. It is state information for state transitions in the lexer. The data needs to be stored in a compressed format in memory since the state tables can potentially be huge. It's also from this memory that the actual values will be accessed directly, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute - in partial due to my silly implementation.
Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.
The data is very sparse. It is usually a big bunch of zeroes followed by a few shorts for each state. It would be trivial to for instance run-length encode it but it would spoil the
linear access time.
Flex apparently has two pairs of tables, base
and default
for the first pair and next
and check
for the second pair. These tables seems to index one another in ways I don't understand. The dragon book attempts to explain this, but as is often the case with that tome of arcane knowledge what it says is lost on lesser minds such as mine.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
本文 http ://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf,描述了一种压缩稀疏的方法表,并且可能会感兴趣。
您是否事先知道表,并且只需要一种有效的方法来存储和访问它们?
我不太熟悉问题域,但是如果您的表沿一个轴具有固定大小(256),那么大小为 256 的数组(其中每个元素都是可变长度的向量)是否有效?您是否希望能够在给定 (x,y) 对的情况下选出一个元素?
我一直想使用的另一个很酷的解决方案是完美的哈希表, http:// burtleburtle.net/bob/hash/perfect.html,您可以在其中根据数据生成哈希函数,因此您将获得最小的空间需求和 O(1) 查找(即无冲突)。
这些解决方案都没有采用任何类型的压缩,但它们只是最大限度地减少浪费的空间量。
This paper, http://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf, describes a method for compressing sparse tables, and might be of interest.
Are you tables known beforehand, and you just need an efficient way to store and access them?
I'm not really familiar with the problem domain, but if your table has a fix size along one axis (256), then would a array of size 256, where each element was a vector of variable length work? Do you want to be able to pick out an element given a (x,y) pair?
Another cool solution that I've always wanted to use for something is a perfect hash table, http://burtleburtle.net/bob/hash/perfect.html, where you generate a hash function from your data, so you will get minimal space requirements, and O(1) lookups (ie no collisions).
None of these solutions employ any type of compression, tho, they just minimize the amount of space wasted..
不清楚的是您的表是否在一个维度或另一个维度上具有“序列属性”。
序列属性自然存在于人类语音中,因为一个单词是由许多字母组成的,并且字母的序列很可能稍后出现。它在二进制程序、源代码等中也很常见。
另一方面,采样数据(例如原始音频、地震值等)不会广告序列属性。他们的数据仍然可以被压缩,但使用另一种模型(例如简单的“增量模型”后跟“熵”)。
如果您的数据在任何二维中具有“序列属性”,那么您可以使用常见的压缩算法,这将为您提供速度和可靠性。您只需要为其提供“序列友好”的输入(即选择您的维度)。
如果您关心速度,您可以看看这个快速压缩器的 C# 实现,它也是一个非常快的解压缩器:https://github.com/stangelandcl/LZ4Sharp
What's unclear is if your table has "sequence property" in one dimension or another.
Sequence property naturally happens in human speech, since a word is composed of many letters, and the sequence of letters is likely to appear later on. It's also very common in binary program, source code, etc.
On the other hand, sampled data, such as raw audio, seismic values, etc. do not advertise sequence property. Their data can still be compressed, but using another model (such as a simple "delta model" followed by "entropy").
If your data has "sequence property" in any of the 2 dimensions, then you can use common compression algorithm, which will give you both speed and reliability. You just need to provide it with an input which is "sequence friendly" (i.e. select your dimension).
If speed is a concern for you, you can have a look at this C# implementation of a fast compressor which is also a very fast decompressor : https://github.com/stangelandcl/LZ4Sharp