Scala 中的压缩

发布于 2024-09-07 15:07:55 字数 633 浏览 7 评论 0原文

我正在 Scala 上使用非常大的 Int 列表(可能),我需要压缩它们并将其保存在内存中。

唯一的要求是我可以拉出(并解压缩)列表中的第一个数字来使用,而不触及列表的其余部分。

我有很多好主意,但大多数都是将数字转化为位。 示例:

您可以将任何数字 x 写为元组 |log(x)|,x-|log(x)|第一个元素我们将其右对齐为由 1 和结尾处的 0 组成的字符串(一元代码),第二个元素为二进制。例如:

1-> 0,1-> 0 1

...

5-> 2,1-> 110 01

...

8-> 3,0-> 1110 000

9-> 3,1-> 1110 001

...

虽然 Int 占用固定的 32 位内存和 long 64 位,但通过这种压缩,x 需要 2log(x) 位进行存储,并且可以无限增长。在大多数情况下,这种压缩确实会减少内存。

您将如何处理此类数据?有没有像bitarray之类的东西?

还有其他方法可以在 Scala 中压缩此类数据吗?

谢谢

I'm working on Scala with VERY larg lists of Int (maybe large) and I need to compress them and to hold it in memory.

The only requirement is that I can pull (and decompress) the first number on the list to work with, whithout touching the rest of the list.

I have many good ideas but most of them translate the numbers to bits.
Example:

you can write any number x as the tuple |log(x)|,x-|log(x)| the first element we right it as a string of 1's and a 0 at the end (Unary Code) and the second in binary. e.g:

1 -> 0,1 -> 0 1

...

5 -> 2,1 -> 110 01

...

8 -> 3,0 -> 1110 000

9 -> 3,1 -> 1110 001

...

While a Int takes a fixed 32 bits of memory and a long 64, with this compression x requires 2log(x) bits for storage and can grow indefinetly. This Compression does reducememory in most cases.

How would you handle such type of data? Is there something such as bitarray or something?

Any other way to compress such data in Scala?

Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

記柔刀 2024-09-14 15:07:55

根据数据集的稀疏性和范围,您可以将数据保留为增量列表而不是数字。例如,它用于声音压缩,可以是有损的,也可以是无损的,具体取决于您的需要。

例如,如果您有 Int 数字,但知道它们之间的距离几乎不会超过一个(有符号的)Byte

-1           // Use -1 to imply the next number cannot be computed as a byte delta
0, 0, 4, 0   // 1024 encoded as bytes
1            // 1025 as a delta
-5           // 1020 as a delta
-1           // Next number can't be computed as a byte delta
0, 0, -1, -1 // 65535 encoded as bytes -- -1 doesn't have special meaning here
10           // 65545 as a delta

那么 您可以执行以下字节列表之类的操作:不必使用这种特定的编码来处理位。但是,实际上,如果没有非常清楚地指示特定问题、数据特征等,您将无法获得好的答案。

重新阅读您的问题,似乎您并没有放弃压缩技术数据转化为位。如果没有,那么我建议 Huffman(如果需要的话可以进行预测)或来自 Lempel-Ziv 家族的东西。

不幸的是,Scala 没有处理二进制数据的库。尽管 paulp 编译器本身可能有类似的东西。

Depending on the sparseness and range of your data set, you may keep your data as a list of deltas instead of numbers. That's used for sound compression, for instance, and can be both lossy or lossless, depending on your needs.

For instance, if you have Int numbers but know they will hardly ever be more than a (signed) Byte apart, you could do something like this list of bytes:

-1           // Use -1 to imply the next number cannot be computed as a byte delta
0, 0, 4, 0   // 1024 encoded as bytes
1            // 1025 as a delta
-5           // 1020 as a delta
-1           // Next number can't be computed as a byte delta
0, 0, -1, -1 // 65535 encoded as bytes -- -1 doesn't have special meaning here
10           // 65545 as a delta

So you don't have to handle bits using this particular encoding. But, really, you won't get good answers without a very clear indication of the particular problem, the characteristics of the data, etc.

Rereading your question, it seems you are not discarding compression techniques that turn data into bits. If not, then I suggest Huffman -- predictive if needed -- or something from the Lempel-Ziv family.

And, no, Scala has no library to handle binary data, unfortunately. Though paulp probably has something like that in the compiler itself.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文