Fastest way to write a bitstream on modern x86 hardware

Posted 2024-11-02 02:41:55

What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)

By writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.

Currently I've got a standard container with a 32-bit intermediate buffer to write to:

void write_bits(SomeContainer<unsigned int>& dst, unsigned int& buffer, unsigned int& bits_left_in_buffer, int codeword, short bits_to_write) {
    if (bits_to_write < bits_left_in_buffer) {
        buffer |= codeword << (32 - bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;
    } else {
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer | (codeword << (32 - bits_left_in_buffer));
        buffer = full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32 - full_bits;
    }
}

Does anyone know of any nice optimizations, fast instructions or other info that may be of use?

Cheers,

Comments (4)

缱倦旧时光 2024-11-09 02:41:55

I once wrote a quite fast implementation, but it has several limitations: it works on 32-bit x86 for both writing and reading the bitstream. I don't check for buffer limits here; I allocated a larger buffer and checked it from time to time from the calling code.

unsigned char* membuff; 
unsigned bit_pos; // current BIT position in the buffer, so its max size is 512 MB

// mask table assumed by get_bits: BIT_MASKS[n] has the low n bits set, i.e. (1u << n) - 1
extern const unsigned int BIT_MASKS[18];

// input bit buffer: the byte address is rounded down so that it's even, so the DWORD
// read from that address surely has at least 17 valid bits at and above (bit_pos & 0xF)
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u;  // round down to an even byte address
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF;
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
}

// output bit buffer: the whole destination should be memset'ed to 0 beforehand
inline void put_bits(unsigned int val, unsigned int bit_cnt){
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u;
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xF);
    bit_pos += bit_cnt;
}
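
A minimal usage sketch for the routines above, assuming they live in the same translation unit and keeping the answer's type-punning reads as-is; the BIT_MASKS definition, buffer size and symbol values are assumptions filled in for illustration, since the original answer leaves buffer management to the caller.

#include <cstdio>
#include <cstring>

// Assumed definition of the mask table used by get_bits: BIT_MASKS[n] has the low n bits set.
const unsigned int BIT_MASKS[18] = {
    0x00000, 0x00001, 0x00003, 0x00007, 0x0000F, 0x0001F, 0x0003F, 0x0007F,
    0x000FF, 0x001FF, 0x003FF, 0x007FF, 0x00FFF, 0x01FFF, 0x03FFF, 0x07FFF,
    0x0FFFF, 0x1FFFF
};

int main() {
    unsigned char storage[64];
    std::memset(storage, 0, sizeof storage);  // put_bits requires a zeroed destination
    membuff = storage;

    bit_pos = 0;
    put_bits(0x5, 3);       // write a 3-bit symbol
    put_bits(0x1234, 13);   // write a 13-bit symbol

    bit_pos = 0;
    unsigned a = get_bits(3);
    unsigned b = get_bits(13);
    std::printf("%x %x\n", a, b);  // prints: 5 1234
}
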
谁人与我共长歌 2024-11-09 02:41:55

It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:

  1. Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array when you need more bits. That's the approach your write_bits method takes.
  2. Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values (a branch-free sketch follows this list).
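
For what it's worth, a minimal branch-free sketch of approach (2) on the writing side might look like the following; the BitWriter name, the 57-bit limit, and the requirement that the buffer be zeroed and padded by 8 bytes are assumptions of this sketch, not something from the question.

#include <cstdint>
#include <cstddef>
#include <cstring>

// Approach (2): every call does an unconditional (unaligned) 64-bit
// read-modify-write at the current byte position, so there are no
// data-dependent branches. Assumes a little-endian target such as x86.
struct BitWriter {
    uint8_t* buf;       // destination, zero-initialized and padded by >= 8 bytes
    size_t   bit_pos;   // absolute bit position

    // codeword must only have its low bit_cnt bits set; bit_cnt <= 57
    void put(uint64_t codeword, unsigned bit_cnt) {
        uint64_t word;
        std::memcpy(&word, buf + (bit_pos >> 3), 8);   // unaligned load
        word |= codeword << (bit_pos & 7);             // splice in the new bits
        std::memcpy(buf + (bit_pos >> 3), &word, 8);   // unaligned store
        bit_pos += bit_cnt;
    }
};
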

The primary advantages of (1) include:

  • Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
  • The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
  • The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.

The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.

Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.

In particular, you can analyze the likely branching behavior of your algorithm based on two factors:

  • How often will the bitstream need to read from the underlying buffer?
  • How predictable is the number of calls before a read is needed?

For example, if you are reading 1 bit 50% of the time and 2 bits 50% of the time, you will do 64 / 1.5 ≈ 42 bitstream reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1), since reads of the underlying array are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying array every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying¹ array. So the branch will be well-predicted and (1) will stay fast.

Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword size is only known at runtime.

Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills onto a predictable schedule, and it lets you write simpler unchecked methods.
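
A rough sketch of that two-level API, under the assumption of a byte-wise refill; the BitReader, ensure_available and read_unchecked names are invented for this example, and the unchecked read assumes n < 32.

#include <cstdint>
#include <cstddef>

// Option (1) with the checked/unchecked split described above: only
// ensure_available() touches the underlying array, so its branch follows
// whatever schedule the caller imposes.
struct BitReader {
    const uint8_t* src;            // underlying byte array, padded at the end
    size_t         byte_pos = 0;
    uint64_t       bits     = 0;   // refill buffer
    unsigned       avail    = 0;   // number of valid bits in 'bits'

    // Guarantee that at least n bits (n <= 56) are buffered.
    void ensure_available(unsigned n) {
        while (avail < n) {
            bits |= uint64_t(src[byte_pos++]) << avail;
            avail += 8;
        }
    }

    // Unchecked read of n bits (n < 32); a prior ensure_available() must have covered it.
    uint32_t read_unchecked(unsigned n) {
        uint32_t v = uint32_t(bits) & ((1u << n) - 1);
        bits >>= n;
        avail -= n;
        return v;
    }
};

// Typical use: one checked call amortized over several unchecked ones, e.g.
//   r.ensure_available(30);
//   a = r.read_unchecked(10); b = r.read_unchecked(10); c = r.read_unchecked(10);
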


¹ This depends on exactly how your "read from underlying" routine is written, as there are a few options here: some always fill to 64 bits, some fill to between 57 and 64 bits (i.e., read an integral number of bytes), and some may fill to between 32 or 33 and 64 bits (like your example, which reads 32-bit chunks).

℉服软 2024-11-09 02:41:55

You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (i.e., the ability to shift each vector element by a different amount specified in another vector) to x86/AVX. Not sure of the details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
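
For context, these shipped with Haswell as part of AVX2 (VPSLLVD/VPSRLVD and friends). A minimal sketch of the per-lane variable shift the answer refers to; how the shifted lanes are then merged into a contiguous bitstream is a separate problem and is omitted here.

#include <immintrin.h>  // AVX2 intrinsics; compile with -mavx2 or equivalent

// Shift eight 32-bit codewords left by eight independent bit offsets in a
// single instruction (VPSLLVD).
__m256i shift_codewords(__m256i codewords, __m256i bit_offsets) {
    return _mm256_sllv_epi32(codewords, bit_offsets);
}
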

北座城市 2024-11-09 02:41:55

I don't have the time to write it for you (I'm not too sure your sample is actually complete enough to do so), but if you must, I can think of:

  • using translation tables for the various input/output bit-shift offsets; this optimization would make sense for fixed units of n bits (with n sufficiently large, say 8 bits, to expect performance gains).
    In essence, you'd be able to do

    destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);

Disclaimer: this is very sloppy pseudocode; I just hope it conveys my idea of a lookup table to prevent bit-shift arithmetic (a rough sketch of the idea follows after this list).

  • writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that).
    Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile.
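
A hedged, minimal reading of the translation-table idea for fixed 8-bit symbols; SHIFTED, init_tables and put_byte_symbol are names invented for this sketch, and whether the table lookup actually beats a plain shift is exactly what the profiling advice below is meant to settle.

#include <cstdint>
#include <cstddef>

// Every byte value pre-shifted by every possible in-byte offset (a 4 KB table),
// so the inner loop replaces a variable shift with a table load.
static uint16_t SHIFTED[8][256];

void init_tables() {
    for (unsigned off = 0; off < 8; ++off)
        for (unsigned v = 0; v < 256; ++v)
            SHIFTED[off][v] = uint16_t(v << off);
}

// OR one 8-bit symbol into the stream at absolute bit position bit_pos.
// dst must be zero-initialized and padded by at least one extra byte.
void put_byte_symbol(uint8_t* dst, size_t bit_pos, uint8_t symbol) {
    uint16_t spread = SHIFTED[bit_pos & 7][symbol];
    dst[bit_pos >> 3]       |= uint8_t(spread);       // low half
    dst[(bit_pos >> 3) + 1] |= uint8_t(spread >> 8);  // high half
}
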

Update

Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in light of locality of reference. So you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.

Also, have a look at the SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.
