Optimizing a variable-length encoding
I've got a case where I need to compress a lot of often small values. Thus I compress them with a variable-length byte encoding (ULEB128, to be specific):
size_t
compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n > 127)
    {
        ++size;
        *data++ = (n & 127) | 128;
        n >>= 7;
    }
    *data++ = n;
    return ++size;
}
Is there a more efficient way to do this (maybe using SSE)?
Edit: After this compression, the result is stored into data
, taking size
bytes. Then, the compression function is called on the next unsigned int.
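For reference, a decoder that reverses this encoding might look like the following (a sketch added for context; `decompress_unsigned_int` is a hypothetical name, not part of the original code):

```c
#include <stddef.h>

/* Hypothetical counterpart to compress_unsigned_int: reads ULEB128
 * bytes until one without the continuation bit (0x80) is found.
 * Stores the decoded value in *out and returns the number of bytes
 * consumed. */
size_t
decompress_unsigned_int(const char* data, unsigned int* out)
{
    size_t size = 0;
    unsigned int n = 0;
    unsigned int shift = 0;
    unsigned char byte;
    do {
        byte = (unsigned char)data[size++];
        n |= (unsigned int)(byte & 127) << shift;
        shift += 7;
    } while (byte & 128);
    *out = n;
    return size;
}
```

Each encoded byte carries 7 payload bits, so the decoder accumulates them at increasing shift offsets until it sees a byte whose high bit is clear.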
6 Answers
After more browsing, I found another commonly used implementation in Sqlite3 (code version 3070900):
There is also a slightly optimized version for 32-bit ints:
It is disappointing that the Sqlite implementation performs the worst of all in my test. So if you are going to use Sqlite, consider replacing the default implementation with an optimized one.
Meanwhile, I am thinking about further possible optimizations.
Here's my optimization in x86 assembly language (32-bit). You can assemble it with NASM and link. I don't know if it's fast or slow; I just had fun coding :)
You might save a few operations by replacing
size_t size = 0; ... ++size; ...; return ++size;
with
char* base = data; ...; return data - base;
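Applied to the function from the question, that suggestion could look roughly like this (a sketch; the encoding itself is unchanged, only the byte count is derived from the pointer difference instead of a separate counter):

```c
#include <stddef.h>

/* Same ULEB128 encoding as the original, but the returned size is
 * computed as the distance the output pointer advanced. */
size_t
compress_unsigned_int(unsigned int n, char* data)
{
    char* base = data;
    while (n > 127)
    {
        *data++ = (char)((n & 127) | 128);
        n >>= 7;
    }
    *data++ = (char)n;
    return (size_t)(data - base);
}
```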
The first thing you want to do is test any possible solution against your current code.
I think you may want to try and get rid of data dependencies, to allow the processor to do more work at the same time.
What are data dependencies? As data flows through your function, the current value of n depends on the previous value of n, which depends on the value before that... which is a long chain of data dependencies. In the code below, n is never modified, so the processor can "skip ahead" and do a couple of different things at the same time without having to wait for the new n to be computed.
I tested the performance by executing it in a tight loop over 0..UINT_MAX. On my system, the execution times are:
Some minor tweaking may produce better results, but I doubt you'll get much more improvement unless you go to assembly. If your integers tend to be in specific ranges, then you can use profiling to get the compiler to add the right branch predictions to each branch. This might get you a few extra percentage points of speed. (EDIT: I got 8% from reordering the branches, but it's a perverse optimization because it relies on the fact that each number 0...UINT_MAX appears with equal frequency. I don't recommend this.)
SSE won't help. SSE is designed to operate on multiple pieces of data with the same width at the same time, it is notoriously difficult to get SIMD to accelerate anything with a variable length encoding. (It's not necessarily impossible, but it might be impossible, and you'd have to be pretty smart to figure it out.)
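The dependency-breaking idea described above can be sketched as follows: each output byte is computed from the original n by an independent shift, and a chain of comparisons picks the length up front, so no byte waits on a previous iteration's updated n (an illustration based on the answer's description, not the answerer's exact code; the function name is mine):

```c
#include <stddef.h>

/* ULEB128 without a loop-carried dependency: every byte is a pure
 * function of the original n, and the length is selected by
 * comparing n against powers of 128. */
size_t
compress_unsigned_int_nodep(unsigned int n, char* data)
{
    if (n < (1u << 7)) {
        data[0] = (char)n;
        return 1;
    } else if (n < (1u << 14)) {
        data[0] = (char)((n & 127) | 128);
        data[1] = (char)(n >> 7);
        return 2;
    } else if (n < (1u << 21)) {
        data[0] = (char)((n & 127) | 128);
        data[1] = (char)(((n >> 7) & 127) | 128);
        data[2] = (char)(n >> 14);
        return 3;
    } else if (n < (1u << 28)) {
        data[0] = (char)((n & 127) | 128);
        data[1] = (char)(((n >> 7) & 127) | 128);
        data[2] = (char)(((n >> 14) & 127) | 128);
        data[3] = (char)(n >> 21);
        return 4;
    } else {
        data[0] = (char)((n & 127) | 128);
        data[1] = (char)(((n >> 7) & 127) | 128);
        data[2] = (char)(((n >> 14) & 127) | 128);
        data[3] = (char)(((n >> 21) & 127) | 128);
        data[4] = (char)(n >> 28);
        return 5;
    }
}
```

Because the stores within a branch are independent of each other, the processor can issue them in parallel; only the branch itself has to be predicted.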
You might find a fast implementation in google protocol buffers:
http://code.google.com/p/protobuf/
Look at the CodedOutputStream::WriteVarintXXX methods.
The first method might be rewritten as:
According to my test, the google protocol buffers implementation is the best, and the other implementations follow. However, my test is rather artificial; it is better to test each approach in your application and choose the best. The presented optimizations work better on specific number values.
Here is the code of my test application. (Note that I've removed the code from compress_unsigned_int_google_buf. You might find the implementation in the following file from the google protocol buffers source: coded_stream.cc, method CodedOutputStream::WriteVarint32FallbackToArrayInline.)
On my machine with the Visual C++ compiler, I got the following results:
Plain copy: 358 ms
Original: 2497 ms
Improved: 2215 ms
More Improved: 2231 ms
Simple: 2059 ms
Google Buffers: 968 ms
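A minimal harness of the kind described might look like this (a sketch with an assumed `benchmark` helper; only the baseline encoder from the question is included, and the limit and timing method are illustrative, not the answerer's exact setup):

```c
#include <stddef.h>
#include <time.h>

/* Baseline encoder from the question. */
size_t
compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n > 127)
    {
        ++size;
        *data++ = (char)((n & 127) | 128);
        n >>= 7;
    }
    *data++ = (char)n;
    return ++size;
}

/* Times an encoder over the inputs [0, limit). The accumulated byte
 * count is returned so the compiler cannot optimize the loop away. */
size_t
benchmark(size_t (*encode)(unsigned int, char*), unsigned int limit,
          double* seconds)
{
    char buf[8];
    size_t total = 0;
    clock_t start = clock();
    for (unsigned int i = 0; i < limit; ++i)
        total += encode(i, buf);
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return total;
}
```

Running `benchmark(...)` once per candidate encoder over the same input range and comparing the elapsed times gives the kind of comparison shown in the table above.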
If your unsigned int values are limited to a specific range - say, 32 bits - you can unroll the loop:
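One way such an unrolled encoder could look (an illustration, not the original answer's code; it assumes GCC/Clang for __builtin_clz, and the function names are mine):

```c
#include <stddef.h>

/* Number of ULEB128 output bytes for a 32-bit value: ceil(bits / 7),
 * where bits is the position of the highest set bit (at least 1). */
static size_t
varint_size(unsigned int n)
{
    unsigned int bits = n ? 32u - (unsigned)__builtin_clz(n) : 1u;
    return (bits + 6) / 7;
}

/* The loop from the question, fully unrolled for 32-bit inputs:
 * at most five output bytes, selected by a single switch. */
size_t
compress_unsigned_int_unrolled(unsigned int n, char* data)
{
    size_t size = varint_size(n);
    switch (size) {  /* deliberate fall-through down to the last byte */
    case 5: *data++ = (char)((n & 127) | 128); n >>= 7; /* FALLTHRU */
    case 4: *data++ = (char)((n & 127) | 128); n >>= 7; /* FALLTHRU */
    case 3: *data++ = (char)((n & 127) | 128); n >>= 7; /* FALLTHRU */
    case 2: *data++ = (char)((n & 127) | 128); n >>= 7; /* FALLTHRU */
    case 1: *data   = (char)n;
    }
    return size;
}
```

Computing the length up front turns the data-dependent while loop into one predictable switch, at the cost of a bit-scan instruction.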