如何加快crc32计算速度?
我正在尝试在linux上编写一个尽可能快的crc32实现,作为学习优化C的练习。我已经尽力了,但我无法在网上找到很多好的资源。我什至不确定我的缓冲区大小是否合理;它是通过反复实验而选择的。
#include <stdio.h>
#define BUFFSIZE 1048567
const unsigned long int lookupbase = 0xEDB88320;
unsigned long int crctable[256] = {
0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
/* LONG LIST OF PRECALCULTED VALUES */
0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};
int main(int argc, char *argv[]){
register unsigned long int x;
int i;
register unsigned char *c, *endbuff;
unsigned char buff[BUFFSIZE];
register FILE *thisfile=NULL;
for (i = 1; i < argc; i++){
thisfile = fopen(argv[i], "r");
if (thisfile == NULL) {
printf("Unable to open ");
} else {
x = 0xFFFFFFFF;
c = &(buff[0]);
endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
while (c != endbuff){
while (c != endbuff){
x=(x>>8) ^ crctable[(x&0xFF)^*c];
c++;
}
c = &(buff[0]);
endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
}
fclose(thisfile);
x = x ^ 0xFFFFFFFF;
printf("%0.8X ", x);
}
printf("%s\n", argv[i]);
}
return 0;
}
预先感谢我可以阅读的任何建议或资源。
I'm trying to write a crc32 implementation on linux that's as fast as possible, as an exercise in learning to optimise C. I've tried my best, but I haven't been able to find many good resources online. I'm not even sure if my buffer size is sensible; it was chosen by repeated experimentation.
#include <stdio.h>
#define BUFFSIZE 1048567
const unsigned long int lookupbase = 0xEDB88320;
unsigned long int crctable[256] = {
0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
/* LONG LIST OF PRECALCULTED VALUES */
0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};
int main(int argc, char *argv[]){
register unsigned long int x;
int i;
register unsigned char *c, *endbuff;
unsigned char buff[BUFFSIZE];
register FILE *thisfile=NULL;
for (i = 1; i < argc; i++){
thisfile = fopen(argv[i], "r");
if (thisfile == NULL) {
printf("Unable to open ");
} else {
x = 0xFFFFFFFF;
c = &(buff[0]);
endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
while (c != endbuff){
while (c != endbuff){
x=(x>>8) ^ crctable[(x&0xFF)^*c];
c++;
}
c = &(buff[0]);
endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
}
fclose(thisfile);
x = x ^ 0xFFFFFFFF;
printf("%0.8X ", x);
}
printf("%s\n", argv[i]);
}
return 0;
}
Thanks in advance for any suggestions or resources I can read through.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您将无法加快 CRC 计算的实际算术速度,因此您可以查看的区域是 (a) 读取文件和 (b) 循环的开销。
您使用了相当大的缓冲区大小,这很好(但为什么它是奇数?)。使用 read(2) 系统调用(假设您使用的是类 UNIX 系统)而不是 fread(3) 标准库函数可能会为您节省一次复制操作(从 fread 的内部复制数据)缓冲区到你的缓冲区中)。
对于循环开销,请查看循环展开。
您的代码还有一些您可能想要消除的冗余。
sizeof (unsigned char)
为 1(根据 C 中的定义);无需显式计算c = &(buff[0])
完全等同于c = buff
这些更改都不会提高代码的性能(假设有一个不错的编译器),但它们会使代码更加清晰和更符合通常的C风格。
You are not going to be able to speed up the actual arithmetic of the CRC calculation, so the areas you can look at are the overhead of (a) reading the file, and (b) looping.
You're using a pretty large buffer size, which is good (but why is it an odd number?). Using a read(2) system call (assuming you're on a unix-like system) instead of the fread(3) standard library function may save you one
copy
operation (copying the data from fread's internal buffer into your bufffer).For the loop overhead, look into loop unrolling.
Your code also has some redundancies that you might want to eliminate.
sizeof (unsigned char)
is 1 (by definition in C); no need to explicitly compute itc = &(buff[0])
is exactly equivalent toc = buff
Neither of these changes will improve the performance of the code (assuming a decent compiler), but they will make it clearer and more in accordance with usual C style.
在 Linux 上?忘记
register
关键字,这只是对编译器的建议,根据我使用gcc
的经验,这是浪费空间。gcc
不仅能够自己解决这个问题。我只是确保您使用疯狂的优化级别
-O3
进行编译,然后进行检查。我见过 gcc 生成的代码达到了我花了几个小时才理解的水平,而且它是如此的狡猾。并且,关于缓冲区大小,请使其尽可能大。即使使用缓冲,调用
fread
的成本仍然是一种成本,因此做得越少越好。如果将缓冲区大小从 1K 增加到 1M,您会看到巨大的改进,如果将缓冲区大小从 1M 增加到 2M,则改进幅度不大,但即使是少量的性能提升也是一种提升。而且,2M 并不是您可以使用的上限,如果可能的话,我会将其设置为一或更多千兆字节。然后,您可能希望将其放在文件级别(而不是放在
main
内)。在某些时候,堆栈将无法容纳它。与大多数优化一样,您通常可以用空间换取时间。请记住,对于小文件(小于 1M),您不会看到任何改进,因为无论缓冲区有多大,仍然只有一次读取。如果进程的加载需要花费更多时间来设置内存,您甚至可能会发现速度略有下降。
但是,由于这仅适用于小文件(无论如何性能都不是问题),所以这并不重要。大文件的性能是一个问题,希望能够得到改进。
我知道我不需要告诉你这一点(因为你表明你正在这样做),但无论如何我都会向那些不知道的人提及它:测量,不要'别猜!地上到处都是那些通过猜测进行优化的人的尸体:-)
On Linux? Forget about the
register
keyword, that's just a suggestion to the compiler and, from my experience withgcc
, it's a waste of space.gcc
is more than capable of figuring that out for itself.I would just make sure you're compiling with the insane optimisation level,
-O3
, and check that. I've seengcc
produce code at that level which took me hours to understand, so sneaky that it was.And, on the buffer size, make it as large as you possibly can. Even with buffering, the cost of calling
fread
is still a cost, so the less you do it, the better. You would see a huge improvement if you increased the buffer size from 1K to 1M, not so much if you up it from 1M to 2M, but even a small amount of increased performance is an increase. And, 2M isn't the upper bound of what you can use, I'd set it to one or more gigabytes if possible.You may then want to put it at file level (rather than inside
main
). At some point, the stack won't be able to hold it.As with most optimisations, you can usually trade space for time. Keep in mind that, for small files (less than 1M), you won't see any improvement since there is still only one read no matter how big you make the buffer. You may even find a slight slowdown if the loading of the process has to take more time to set up memory.
But, since this would only be for small files (where the performance isn't a problem anyway), it shouldn't really matter. Large files, where the performance is an issue, should hopefully find an improvement.
And I know I don't need to tell you this (since you indicate you are doing it), but I will mention it anyway for those who don't know: Measure, don't guess! The ground is littered with the corpses of those who optimised with guesswork :-)
您要求将三个值存储在寄存器中,但标准 x86 只有四个通用寄存器:这对最后一个剩余寄存器来说是一个巨大的负担,这也是我期望
register 实际上只会阻止您使用
&foo
来查找变量的地址。我认为现在没有任何现代编译器使用它作为提示。请随意删除所有三种用途并重新安排您的应用程序。由于您自己正在读取大量文件,因此您不妨直接使用
open(2)
和read(2)
,并删除所有标准 IO 处理在幕后。另一种常见的方法是将文件open(2)
和mmap(2)
放入内存:让操作系统在需要页面时将其分页。这可能允许您在进行计算时乐观地从磁盘读取未来的页面:这是一种常见的访问模式,操作系统设计者已尝试优化一种模式。 (一次映射整个文件的简单机制确实对您可以处理的文件大小设置了上限,在 32 位平台上可能约为 2.5 GB,在 64 位平台上绝对巨大以块的形式映射文件将允许您处理任意大小的文件,即使在 32 位平台上也是如此,但代价是像您现在用于读取和映射的循环一样。)正如 David Gelhar 指出的那样,您是 。使用奇数长度的缓冲区——这可能会使将文件读入内存的代码路径变得复杂。如果您想坚持从文件读取到缓冲区,我建议使用多个
8192
(两页内存),因为在最后一个循环之前不会有特殊情况。如果您确实想追求最后一点速度,并且不介意大幅增加预计算表的大小,则可以以 16 位块的形式查看文件,而不仅仅是 8 位块。通常,沿着 16 位对齐访问内存比沿着 8 位对齐访问内存要快,并且您可以将循环的迭代次数减少一半,这通常会带来巨大的速度提升。当然,缺点是增加了内存压力(65k 条目,每个 8 字节,而不是每个 4 字节 256 个条目),而且更大的表不太可能完全放入 CPU 缓存中。
我想到的最后一个优化想法是将
fork(2)
分成 2、3 或 4 个进程(或使用线程),每个进程都可以计算部分的 crc32< /em> 文件,然后在所有过程完成后合并最终结果。 crc32 的计算强度可能不足以真正受益于尝试使用 SMP 或多核计算机中的多个内核,并且弄清楚如何组合 crc32 的部分计算可能不可行 - 我自己还没有研究过:) - - 但它可能会得到回报,并且学习如何编写多进程或多线程软件无论如何都是值得的。You've asked for three values to be stored in registers, but standard x86 only has four general purpose registers: that's an awful lot of burden to place on the last remaining register, which is one reason why I expect
register
really only prevents you from ever using&foo
to find the address of the variable. I don't think any modern compiler even uses it as a hint, these days. Feel free to remove all three uses and re-time your application.Since you're reading in huge chunks of the file yourself, you might as well use
open(2)
andread(2)
directly, and remove all the standard IO handling behind the scenes. Another common approach is toopen(2)
andmmap(2)
the file into memory: let the OS page it in as pages are required. This may allow future pages to be optimistically read from disk while you're doing your computation: this is a common access pattern, and one the OS designers have attempted to optimize. (The simple mechanism of mapping the entire file at once does put an upper limit on the size of the files you can handle, probably about 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks will allow you to handle arbitrary sized files even on 32-bit platforms, but at the cost of loops like you've got now for reading, but for mapping.)As David Gelhar points out, you're using an odd-length buffer -- this might complicate the code path of reading the file into memory. If you want to stick with reading from files into buffers, I suggest using a multiple of
8192
(two pages of memory), as it won't have special cases until the last loop.If you're really into eeking out of the last bit of speed and don't mind drastically increasing the size of your pre-computation table, you can look at the file in 16-bit chunks, rather than just 8-bit chunks. Frequently, accessing memory along 16-bit alignment is faster than along 8-bit alignment, and you'd cut the number of iterations through your loop in half, which usually gives a huge speed boost. The downside, of course, is increased memory pressure (65k entries, each of 8 bytes, rather than just 256 entries each of 4 bytes), and the much larger table is much less likely to fit entirely in the CPU cache.
And the last optimization idea that crosses my mind is to
fork(2)
into 2, 3, or 4 processes (or use threading), each of which can compute the crc32 of a portion of the file, and then combine the end results after all processes have completed. crc32 may not be computationally intensive enough to actually benefit from trying to use multiple cores out of SMP or multicore computers, and figuring out how to combine partial computations of crc32 may not be feasible -- I haven't looked into it myself :) -- but it might repay the effort, and learning how to write multi-process or multi-threaded software is well worth the effort regardless.