char[] to hex string exercise
Below is my current char* to hex string function. I wrote it as an exercise in bit manipulation. It takes ~7ms on an AMD Athlon MP 2800+ to hexify a 10 million byte array. Is there any trick or other way that I am missing?
How can I make this faster?
Compiled with -O3 in g++
static const char _hex2asciiU_value[256][2] =
{ {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };
std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
std::string str;
str.resize(_len*2);
char* pszHex = &str[0];
const unsigned char* pEnd = _pArray + _len;
clock_t stick, etick;
stick = clock();
for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
pszHex[0] = _hex2asciiU_value[*pChar][0];
pszHex[1] = _hex2asciiU_value[*pChar][1];
}
etick = clock();
std::cout << "ticks to hexify " << etick - stick << std::endl;
return str;
}
Updates
Added timing code
Brian R. Bondy: replace the std::string with a heap alloc'd buffer and change ofs*16 to ofs << 4 - however the heap allocated buffer seems to slow it down? - result ~11ms
Antti Sykäri: replace inner loop with
// pHex here is a 16-character digit table, e.g. "0123456789ABCDEF"
int upper = *pChar >> 4;
int lower = *pChar & 0x0f;
pszHex[0] = pHex[upper];
pszHex[1] = pHex[lower];
result ~8ms
Robert: replace _hex2asciiU_value with a full 256-entry table, sacrificing memory space; result ~7ms!
HoyHoy: Noted it was producing incorrect results
I assume this is Windows+IA32.
Try to use short int instead of the two hexadecimal letters.
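A minimal sketch of the idea, packing both digits of a byte into one 16-bit value so the output can be written with a single store (assumes a little-endian target; names are illustrative):
#include <stdint.h>
#include <stdio.h>

// Packs both ASCII hex digits of byte b into one 16-bit value,
// first digit in the low byte so it lands first in memory.
static uint16_t hex_pair(unsigned char b)
{
    static const char digits[] = "0123456789ABCDEF";
    return (uint16_t)(digits[b >> 4] | (digits[b & 0x0F] << 8));
}

int main()
{
    uint16_t p = hex_pair(0xAB);
    fwrite(&p, 1, 2, stdout);   // writes "AB" with one 16-bit value
    return 0;
}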
Changing the inner loop to the nibble-split version quoted in the question's updates results in roughly a 5% speedup.
Writing the result two bytes at a time, as suggested by Robert, results in about an 18% speedup. The changed inner loop and the required initialization are sketched below.
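A sketch of what that might look like, assuming a 256-entry table of 16-bit values (the name hex_table and the exact layout are illustrative, not the answer's original code):
#include <stdint.h>

static uint16_t hex_table[256];

// Required initialization, run once before hexifying anything.
static void init_hex_table()
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; ++i)
        hex_table[i] = (uint16_t)digits[i >> 4] |
                       ((uint16_t)digits[i & 0x0F] << 8);   // little-endian layout assumed
}

// The inner loop of char_to_hex then does a single 16-bit store per input byte:
//     for ( ; pChar != pEnd; ++pChar, pszHex += 2 )
//         *reinterpret_cast<uint16_t*>(pszHex) = hex_table[*pChar];
// (memcpy(pszHex, &hex_table[*pChar], 2) is the alignment/aliasing-safe spelling.)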
Doing it 2 bytes at a time or 4 bytes at a time will probably result in even greater speedups, as pointed out by Allan Wind, but then it gets trickier when you have to deal with the odd characters.
If you're feeling adventurous, you might try to adapt Duff's device to do this.
Results are on an Intel Core 2 Duo processor with gcc -O3.
Always measure that you actually get faster results; a pessimization pretending to be an optimization is less than worthless.
Always test that you get the correct results — a bug pretending to be an optimization is downright dangerous.
And always keep in mind the tradeoff between speed and readability — life is too short for anyone to maintain unreadable code.
(Obligatory reference to coding for the violent psychopath who knows where you live.)
This assembly function (based on my previous post here, but I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a 3GHz Core 2 Conroe. Penryn is probably faster.
Note that it uses x264 assembly syntax, which makes it more portable (32-bit vs 64-bit, etc.). Converting it into the syntax of your choice is trivial: r0, r1, r2 are the three arguments to the function, held in registers. It's a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include that to run it natively.
P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?
This is my version, which, unlike the OP's version, doesn't assume that std::basic_string has its data in a contiguous region:
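A sketch of what such a version might look like, going through the std::string interface rather than writing through &str[0] (a guess at the shape, not the poster's original code):
#include <string>

std::string char_to_hex_portable(const unsigned char* data, unsigned int len)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string str;
    str.reserve(len * 2);
    for (unsigned int i = 0; i < len; ++i) {
        str.push_back(digits[data[i] >> 4]);    // high nibble first
        str.push_back(digits[data[i] & 0x0F]);  // then low nibble
    }
    return str;
}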
Not going to make a lot of difference... *pChar-(ofs*16) can be done with [*pChar & 0x0F]
For one, instead of multiplying by 16, do a bit shift << 4.
Also, don't use std::string; instead just create a buffer on the heap and then delete it. It will be more efficient than the object destruction that is needed for the string.
The function as it is shown when I'm writing this produces incorrect output even when _hex2asciiU_value is fully specified. A corrected version works, and on my 2.33GHz MacBook Pro runs in about 1.9 seconds for 200,000,000 characters.
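As a sketch of what such a corrected version could look like (not necessarily the poster's exact code):
#include <string>

std::string char_to_hex_fixed(const unsigned char* data, unsigned int len)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string str(len * 2, '0');
    for (unsigned int i = 0; i < len; ++i) {
        str[2 * i]     = digits[data[i] >> 4];    // high nibble
        str[2 * i + 1] = digits[data[i] & 0x0F];  // low nibble
    }
    return str;
}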
Consistently getting ~4ms on my Athlon 64 4200+ (~7ms with original code)
I'm not sure doing it more bytes at a time will be better... you'll probably just get tons of cache misses and slow it down significantly.
What you might try is to unroll the loop though, take larger steps and do more characters each time through the loop, to remove some of the loop overhead.
Make sure your compiler optimization is turned on to the highest working level.
You know, flags like '-O1' to '-O3' in gcc.
I have found that using an index into an array, rather than a pointer, can speed things up a tick. It all depends on how your compiler chooses to optimize. The key is that the processor has instructions to do complex things like [i*2+1] in a single instruction.
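For illustration, an index-based formulation of the loop (a sketch; whether it beats the pointer version depends on the compiler):
static const char digits[] = "0123456789ABCDEF";

void hexify_indexed(const unsigned char* in, unsigned int len, char* out)
{
    for (unsigned int i = 0; i < len; ++i) {
        out[i * 2]     = digits[in[i] >> 4];    // address computed as out + i*2
        out[i * 2 + 1] = digits[in[i] & 0x0F];  // and out + i*2 + 1
    }
}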
If you're rather obsessive about speed here, you can do the following:
Each character is one byte, representing two hex values. Thus, each character is really two four-bit values.
So, in computational steps: split each input byte into its high and low nibble, use pshufb to look up all 16 nibbles held in an SSE register against a 16-entry '0'..'F' digit table, and interleave the two results into the output.
Thus, in a single instruction, you will have performed 16 table lookups in fewer clocks than it normally takes to do just one (pshufb is 1 clock latency on Penryn).
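As a rough sketch of the same idea using SSSE3 intrinsics rather than the original hand-written assembly (compile with -mssse3; names are illustrative):
#include <emmintrin.h>
#include <tmmintrin.h>   // _mm_shuffle_epi8 (SSSE3)

// Hexify 16 input bytes into 32 output characters using pshufb lookups.
static void hexify16(const unsigned char* in, char* out)
{
    const __m128i lut  = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                       '8','9','A','B','C','D','E','F');
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    __m128i hi  = _mm_and_si128(_mm_srli_epi16(src, 4), mask);  // high nibbles
    __m128i lo  = _mm_and_si128(src, mask);                     // low nibbles
    __m128i hiC = _mm_shuffle_epi8(lut, hi);   // 16 table lookups in one pshufb
    __m128i loC = _mm_shuffle_epi8(lut, lo);   // and 16 more in another
    // Interleave so each input byte becomes "high digit, low digit".
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),      _mm_unpacklo_epi8(hiC, loC));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 16), _mm_unpackhi_epi8(hiC, loC));
}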
At the cost of more memory you can create a full 256-entry table of the hex codes, then index directly into it with no bit fiddling required.
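A sketch of the kind of table meant here (essentially the _hex2asciiU_value table from the question, built once at startup instead of being written out by hand):
static char hex_codes[256][2];

// Build the full table once; afterwards each byte maps straight to its two digits.
static void init_hex_codes()
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; ++i) {
        hex_codes[i][0] = digits[i >> 4];
        hex_codes[i][1] = digits[i & 0x0F];
    }
}

// Inner loop: direct index, no bit fiddling.
//     pszHex[0] = hex_codes[*pChar][0];
//     pszHex[1] = hex_codes[*pChar][1];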
It works for me with unsigned char.
Operate on 32 bits at a time (4 chars), then deal with the tail if needed. When I did this exercise with URL encoding, a full table lookup for each char was slightly faster than logic constructs, so you may want to test this in context as well to take caching issues into account.
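One way to read this suggestion, as a sketch (assumes little-endian byte extraction from the 32-bit word; the tail is peeled off separately):
#include <stdint.h>
#include <string.h>

static const char hexdig[] = "0123456789ABCDEF";

void hexify32(const unsigned char* in, unsigned int len, char* out)
{
    unsigned int i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t word;
        memcpy(&word, in + i, 4);                                // one 32-bit load
        for (int j = 0; j < 4; ++j) {
            unsigned char b = (unsigned char)(word >> (8 * j));  // little-endian byte order
            out[2 * (i + j)]     = hexdig[b >> 4];
            out[2 * (i + j) + 1] = hexdig[b & 0x0F];
        }
    }
    for (; i < len; ++i) {                                       // leftover tail bytes
        out[2 * i]     = hexdig[in[i] >> 4];
        out[2 * i + 1] = hexdig[in[i] & 0x0F];
    }
}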
Faster C Implementation
This runs nearly 3x faster than the C++ implementation. Not sure why, as it's pretty similar. The last C++ implementation that I posted took 6.8 seconds to run through a 200,000,000-character array; this C implementation took only 2.2 seconds.
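As a plain C-style sketch of the kind of implementation described (not the poster's original code):
#include <stdlib.h>

static const char hexdigits[] = "0123456789ABCDEF";

/* Caller is responsible for free()ing the returned buffer. */
char* char_to_hex_c(const unsigned char* in, size_t len)
{
    char* out = (char*)malloc(len * 2 + 1);
    const unsigned char* end = in + len;
    char* p = out;
    if (!out)
        return NULL;
    for (; in != end; ++in) {
        *p++ = hexdigits[*in >> 4];
        *p++ = hexdigits[*in & 0x0F];
    }
    *p = '\0';
    return out;
}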