Fastest way to write a bitstream on modern x86 hardware
What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32 bits)

By writing a bitstream I mean the process of concatenating variable bit-length symbols into a contiguous memory buffer.

Currently I've got a standard container with a 32-bit intermediate buffer to write to:
void write_bits(SomeContainer<unsigned int>& dst, unsigned int& buffer,
                unsigned int& bits_left_in_buffer, int codeword, short bits_to_write)
{
    if (bits_to_write < bits_left_in_buffer) {
        // Codeword fits entirely into the partially filled 32-bit buffer.
        buffer |= codeword << (32 - bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;
    } else {
        // Codeword fills the buffer: flush a full 32-bit word and keep the
        // overflowing bits for the next one.
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite   = buffer | (codeword << (32 - bits_left_in_buffer));
        buffer = full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32 - full_bits;
    }
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,
4 Answers
I once wrote a quite fast implementation, but it has several limitations: it works on 32-bit x86 for both writing and reading the bitstream, and it doesn't check for buffer limits; I allocated a larger buffer up front and checked it from time to time from the calling code.
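That code isn't reproduced here; a minimal sketch of such an unchecked writer might look like the following (names and layout are my own illustration, not the answerer's code; the caller allocates enough space, checks the write position periodically, and flushes the final partial word itself):

    #include <cstdint>

    struct BitWriter {
        uint32_t* out;       // next 32-bit word in a caller-allocated buffer
        uint32_t  acc = 0;   // partially filled word, packed LSB-first
        unsigned  used = 0;  // bits already occupied in acc (0..31)

        // No bounds check: the caller guarantees the buffer is large enough.
        // 'codeword' must fit into 'bits' (1..32).
        void write(uint32_t codeword, unsigned bits) {
            acc |= codeword << used;
            if (used + bits >= 32) {
                *out++ = acc;                                // flush a full word
                acc = used ? (codeword >> (32 - used)) : 0;  // carry the overflow
            }
            used = (used + bits) & 31;
        }
    };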
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:

(1) Keep a 32- or 64-bit intermediate buffer and touch the underlying array only when the buffer runs out (fills up). That's the approach your write_bits method takes.

(2) Read (write) the underlying array unconditionally on every call, shifting and masking as needed.

The primary advantages of (1) include:

- The underlying array is accessed only infrequently, when the buffer actually needs to be refilled (flushed).
- If you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.

The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
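As an illustration (my own sketch, not from this answer), an approach-(2) style writer can store its 64-bit accumulator to memory on every call, so the hot path has no data-dependent branch; it assumes a little-endian target and at least 8 bytes of slack at the end of the destination buffer:

    #include <cstdint>
    #include <cstring>

    struct UncondBitWriter {
        uint8_t* out;        // next byte to commit in the destination buffer
        uint64_t acc = 0;    // pending bits, packed LSB-first
        unsigned used = 0;   // number of pending bits (0..7 between calls)

        // bits <= 56 so acc never overflows; no branches on the data.
        void write(uint64_t codeword, unsigned bits) {
            acc |= codeword << used;
            used += bits;
            std::memcpy(out, &acc, 8);  // unconditional, possibly unaligned store
            out  += used >> 3;          // advance past the bytes now finalized
            acc >>= used & ~7u;         // drop the committed bytes
            used &= 7;                  // keep only the partial trailing byte
        }
    };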
In particular, you can analyze the likely branching behavior of your algorithm based on two factors: how often the buffer will need to be refilled (which depends on the distribution of bit-sizes you read), and how predictable that refill pattern is.
For example, if you are reading 1 bit 50% of the time and 2 bits 50% of the time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying array [1]. So the branch will be well-predicted and (1) will stay fast.

Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword is known at compile-time.
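As a small self-contained illustration of the compile-time-constant-size point (my own example, with hypothetical field names): when the bit counts are literals, the shift amounts become constants and the compiler can fold consecutive calls into a couple of shifts and ORs after inlining:

    #include <cstdint>

    inline void put(uint64_t& acc, unsigned& used, uint32_t codeword, unsigned bits) {
        acc |= uint64_t(codeword) << used;  // caller keeps used + bits <= 64
        used += bits;                       // and codeword within 'bits'
    }

    uint64_t pack_header(uint32_t version, uint32_t type, uint32_t length) {
        uint64_t acc = 0; unsigned used = 0;
        put(acc, used, version, 3);   // constant sizes: the three calls can
        put(acc, used, type,    5);   // collapse into one expression after
        put(acc, used, length, 16);   // inlining
        return acc;
    }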
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills onto a predictable schedule, and lets you write simpler unchecked methods.

[1] This depends on exactly how your "read from underlying" routine is written, as there are a few options here: some always fill to 64 bits, some fill to between 57 and 64 bits (i.e., read an integral number of bytes), and some may fill to between 32 or 33 and 64 bits (like your example which reads 32-bit chunks).
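A hedged sketch of what such an API might look like (names are mine, not from any particular library): all the data-dependent branching is concentrated in ensure_available, and the actual reads are unchecked. Typical use is one ensure_available(32) followed by several read_unchecked calls whose total is at most 32 bits.

    #include <cstdint>

    struct BitReader {
        const uint8_t* src;  // next unread byte of the underlying array
        uint64_t acc = 0;    // buffered bits, packed LSB-first
        unsigned avail = 0;  // number of valid bits currently in acc

        // Refill so that at least 'size' bits (size <= 57 here) are buffered.
        void ensure_available(unsigned size) {
            while (avail < size) {               // byte-at-a-time for simplicity
                acc |= uint64_t(*src++) << avail;
                avail += 8;
            }
        }

        // Unchecked: the caller guarantees 'bits' is covered by a prior
        // ensure_available call, so no buffer test is needed here.
        uint32_t read_unchecked(unsigned bits) {
            uint32_t v = uint32_t(acc & ((uint64_t(1) << bits) - 1));
            acc  >>= bits;
            avail -= bits;
            return v;
        }
    };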
You'll probably have to wait until 2013 to get hold of real hardware, but the "Haswell" new instructions will bring proper vectorised shifts (i.e., the ability to shift each vector element by a different amount, specified in another vector) to x86/AVX. Not sure of the details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
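For reference, a tiny illustration (my own, assuming an AVX2-capable compiler and CPU) of the per-element variable shift this refers to, exposed as the vpsllvd instruction via the _mm256_sllv_epi32 intrinsic:

    #include <immintrin.h>
    #include <cstdint>

    // Shift eight 32-bit codewords left by eight independent bit offsets.
    inline __m256i shift_codewords(const uint32_t codes[8], const uint32_t offsets[8]) {
        __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(codes));
        __m256i s = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(offsets));
        return _mm256_sllv_epi32(c, s);  // vpsllvd: per-lane variable left shift
    }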
I don't have the time to write it for you (and I'm not too sure your sample is actually complete enough to do so), but if you must, I can think of using translation tables for the various input/output bit-shift offsets. This optimization would make sense for fixed units of n bits (with n sufficiently large, say 8 bits, to expect performance gains). In essence, you'd be able to replace the bit-shift arithmetic with table lookups (disclaimer: this is very sloppy, I just hope it conveys my idea of a lookup table to prevent bit-shift arithmetic). The hardware even has a table-lookup instruction in that spirit (XLAT), but then again, a good compiler might already use something like that; also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile.
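A rough sketch of the lookup-table idea (my own illustration, not this answer's original pseudo code): precompute value << offset for every (bit offset, byte value) pair, so the inner loop does table lookups and ORs instead of variable shifts. It assumes the destination buffer is zero-initialized and has one spare byte at the end.

    #include <cstdint>
    #include <cstddef>

    struct ShiftTable {
        uint16_t t[8][256];  // t[off][v] == v << off, spread across two bytes
        ShiftTable() {
            for (unsigned off = 0; off < 8; ++off)
                for (unsigned v = 0; v < 256; ++v)
                    t[off][v] = static_cast<uint16_t>(v << off);
        }
    };

    static const ShiftTable table;

    // OR one 8-bit unit into 'dst' at an arbitrary bit position.
    inline void put_unit(uint8_t* dst, std::size_t bitpos, uint8_t value) {
        uint16_t spread = table.t[bitpos & 7][value];  // value already shifted
        dst[bitpos >> 3]       |= static_cast<uint8_t>(spread);
        dst[(bitpos >> 3) + 1] |= static_cast<uint8_t>(spread >> 8);
    }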
Update
Warning: be sure to use a profiler and test your optimization for both correctness and speed. Using a lookup table can result in poorer performance because of locality-of-reference effects, so you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefit, and you might have to adapt the lookup table size to the processor's L2 cache.
Also, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.