有没有办法提高查找的速度或效率? (C/C++)

发布于 2024-08-10 09:36:09 字数 1576 浏览 10 评论 0原文

我编写了一个函数,用于将 64 位整数转换为 Base 62 字符串。最初,我是这样实现的:

char* charset = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = strlen(charset);

std::string integerToKey(unsigned long long input)
{
    unsigned long long num = input;
    string key = "";

    while(num)
    {
        key += charset[num % charsetLength];
        num /= charsetLength;
    }

    return key;
}

但是,这太慢了。

我通过提供生成查找表的选项来提高速度。该表的大小约为 624 字符串,生成方式如下:

// Create the integer to key conversion lookup table
int lookupChars;

if(lookupDisabled)
    lookupChars = 1;
else
    largeLookup ? lookupChars = 4 : lookupChars = 2;

lookupSize = pow(charsetLength, lookupChars);
integerToKeyLookup = new char*[lookupSize];

for(unsigned long i = 0; i < lookupSize; i++)
{
    unsigned long num = i;
    int j = 0;

    integerToKeyLookup[i] = new char[lookupChars];

    while(num)
    {
        integerToKeyLookup[i][j] = charset[num % charsetLength];
        num /= charsetLength;

        j++;
    }

    // Null terminate the string
    integerToKeyLookup[i][j] = '\0';
}

实际转换如下所示:

std::string integerToKey(unsigned long long input)
{
    unsigned long long num = input;
    string key = "";

    while(num)
    {
        key += integerToKeyLookup[num % lookupSize];
        num /= lookupSize;
    }

    return key;
}

这大大提高了速度,但我仍然相信它可以改进。 32 位系统上的内存使用量约为 300 MB,64 位系统上的内存使用量超过 400 MB。看来我应该能够减少内存和/或提高速度,但我不确定如何做。

如果有人能帮助我弄清楚如何进一步优化该表,我将不胜感激。

I have a function I've written to convert from a 64-bit integer to a base 62 string. Originally, I achieved this like so:

char* charset = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = strlen(charset);

std::string integerToKey(unsigned long long input)
{
    unsigned long long num = input;
    string key = "";

    while(num)
    {
        key += charset[num % charsetLength];
        num /= charsetLength;
    }

    return key;
}

However, this was too slow.

I improved the speed by providing an option to generate a lookup table. The table is about 624 strings in size, and is generated like so:

// Create the integer to key conversion lookup table
int lookupChars;

if(lookupDisabled)
    lookupChars = 1;
else
    largeLookup ? lookupChars = 4 : lookupChars = 2;

lookupSize = pow(charsetLength, lookupChars);
integerToKeyLookup = new char*[lookupSize];

for(unsigned long i = 0; i < lookupSize; i++)
{
    unsigned long num = i;
    int j = 0;

    integerToKeyLookup[i] = new char[lookupChars];

    while(num)
    {
        integerToKeyLookup[i][j] = charset[num % charsetLength];
        num /= charsetLength;

        j++;
    }

    // Null terminate the string
    integerToKeyLookup[i][j] = '\0';
}

The actual conversion then looks like this:

std::string integerToKey(unsigned long long input)
{
    unsigned long long num = input;
    string key = "";

    while(num)
    {
        key += integerToKeyLookup[num % lookupSize];
        num /= lookupSize;
    }

    return key;
}

This improved speed by a large margin, but I still believe it can be improved. Memory usage on a 32-bit system is around 300 MB, and more than 400 MB on a 64-bit system. It seems like I should be able to reduce memory and/or improve speed, but I'm not sure how.

If anyone could help me figure out how this table could be further optimized, I'd greatly appreciate it.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

秋凉 2024-08-17 09:36:09

使用某种字符串生成器而不是重复连接到“key”将提供显着的速度提升。

Using some kind of string builder rather than repeated concatenation into 'key' would provide a significant speed boost.

喜爱纠缠 2024-08-17 09:36:09

您可能需要提前为您的字符串键预留内存。这可能会给您带来不错的性能提升,以及内存利用率的提高。每当您在 std::string 上调用追加运算符时,如果必须重新分配,内部缓冲区的大小可能会增加一倍。这意味着每个字符串占用的内存可能比存储字符所需的内存多得多。您可以通过提前为字符串保留内存来避免这种情况。

You may want to reserve memory in advance for your string key. This may get you a decent performance gain, as well as a gain in memory utilization. Whenever you call the append operator on std::string, it may double the size of the internal buffer if it has to reallocate. This means each string may be taking up significantly more memory than is necessary to store the characters. You can avoid this by reserving memory for the string in advance.

山川志 2024-08-17 09:36:09

我同意罗布·沃克的观点——你专注于提高错误领域的表现。弦是最慢的部分。

我对代码进行了计时(顺便说一句,你的原始代码已损坏),而你的原始代码(修复后)为 100000 次查找的 44982140 个周期,以下代码约为 13113670。

const char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
#define CHARSET_LENGTH 62

// maximum size = 11 chars
void integerToKey(char result[13], unsigned long long input)
{
    char* p = result;
    while(input > 0)
    {
        *p++ = charset[input % CHARSET_LENGTH];
        input /= CHARSET_LENGTH;
    }

    // null termination
    *p = '\0';
    // need to reverse the output
    char* o = result;
    while(o + 1 < p)
        swap(*++o, *--p);
}

I agree with Rob Walker - you're concentrating on improving performance in the wrong area. The string is the slowest part.

I timed the code (your original is broken, btw) and your original (when fixed) was 44982140 cycles for 100000 lookups and the following code is about 13113670.

const char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
#define CHARSET_LENGTH 62

// maximum size = 11 chars
void integerToKey(char result[13], unsigned long long input)
{
    char* p = result;
    while(input > 0)
    {
        *p++ = charset[input % CHARSET_LENGTH];
        input /= CHARSET_LENGTH;
    }

    // null termination
    *p = '\0';
    // need to reverse the output
    char* o = result;
    while(o + 1 < p)
        swap(*++o, *--p);
}
若言繁花未落 2024-08-17 09:36:09

这几乎是教科书上关于如何不这样做的案例。在循环中连接字符串是一个坏主意,因为附加不是特别快,而且因为您不断分配内存。

注意:您的问题表明您要转换为 base-62,但代码似乎有 63 个符号。你想做什么?

给定一个 64 位整数,您可以计算出结果中不需要超过 11 位数字,因此使用静态 12 字符缓冲区肯定有助于提高速度。另一方面,您的 C++ 库可能有一个与 ultoa 相当的 long-long,这将是相当理想的。


编辑:这是我突发奇想的东西。它也允许您指定任何所需的基数:

std::string ullToString(unsigned long long v, int base = 64) {
    assert(base < 65);
    assert(base > 1);
    static const char digits[]="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
    const int max_length=65;
    static char buffer[max_length];

    buffer[max_length-1]=0;
    char *d = buffer + max_length-1;
    do {
        d--;
        int remainder = v % base;
        v /= base;
        *d = digits[remainder];
    } while(v>0);

    return d;
}

这只会创建一个 std::string 对象,并且不会不必要地移动内存。目前它不会对输出进行零填充,但是将其更改为您想要的任意位数的输出是微不足道的。

This is almost a textbook case of how not to do this. Concatenating strings in a loop is a bad idea, both because appending isn't particularly fast, and because you're constantly allocating memory.

Note: your question states that you're converting to base-62, but the code seems to have 63 symbols. Which are you trying to do?

Given a 64-bit integer, you can calculate that you won't need any more than 11 digits in the result, so using a static 12 character buffer will certainly help improve your speed. On the other hand, it's likely that your C++ library has a long-long equivalent to ultoa, which will be pretty optimal.


Edit: Here's something I whipped up. It allows you to specify any desired base as well:

std::string ullToString(unsigned long long v, int base = 64) {
    assert(base < 65);
    assert(base > 1);
    static const char digits[]="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
    const int max_length=65;
    static char buffer[max_length];

    buffer[max_length-1]=0;
    char *d = buffer + max_length-1;
    do {
        d--;
        int remainder = v % base;
        v /= base;
        *d = digits[remainder];
    } while(v>0);

    return d;
}

This only creates one std::string object, and doesn't move memory around unnecessarily. It currently doesn't zero-pad the output, but it's trivial to change it to do that to however many digits of output you want.

尬尬 2024-08-17 09:36:09

您不需要将输入复制到 num 中,因为您是按值传递的。您还可以在编译时计算字符集的长度,无需在每次调用该函数时在运行时计算它。

但这些都是非常小的性能问题。我认为您可以获得的最重要的帮助是避免循环中的字符串连接。当您构造键字符串时,将结果字符串的长度传递给字符串构造函数,以便该字符串只有一次分配。然后在循环中,当您连接到字符串时,您将不会重新分配。

如果将目标字符串作为参考参数,甚至像标准算法那样作为两个迭代器,则可以使事情变得更加高效。但这可以说是迈得太远了。

顺便问一下,如果输入传入的值为零怎么办?你甚至不会进入循环; key 不应该是“0”吗?

我看到输入的值不能为负,但我们要注意的是:C 余数运算符不是模运算符。

You don't need to copy input into num, because you pass it by value. You can also compute the length of charset in compiletime, there's no need to compute it in runtime every single time you call the function.

But these are very minor performance issues. I think the the most significant help you can gain is by avoiding the string concatenation in the loop. When you construct the key string pass the string constructor the length of your result string so that there is only one allocation for the string. Then in the loop when you concatenate into the string you will not re-allocate.

You can make things even slightly more efficient if you take the target string as a reference parameter or even as two iterators like the standard algorithms do. But that is arguably a step too far.

By the way, what if the value passed in for input is zero? You won't even enter the loop; shouldn't key then be "0"?

I see the value passed in for input can't be negative, but just so we note: the C remainder operator isn't a modulo operator.

寂寞美少年 2024-08-17 09:36:09

如果您可以再添加两个符号,以便将其转换为 base-64,则您的模数和除法运算将变成位掩码和移位。比除法快多了。

If you could add two more symbols so that it is converting to base-64, your modulus and division operations would turn into a bit mask and shift. Much faster than a division.

堇色安年 2024-08-17 09:36:09

如果您需要的只是一个短字符串密钥,则转换为 base-64 数字会大大加快速度,因为 div/mod 64 非常便宜(移位/掩码)。

If all you need is a short string key, converting to base-64 numbers would speed up things a lot, since div/mod 64 is very cheap (shift/mask).

小瓶盖 2024-08-17 09:36:09

为什么不直接使用 base64 库呢? 63 等于“11”而不是更长的字符串真的很重要吗?

size_t base64_encode(char* outbuffer, size_t maxoutbuflen, const char* inbuffer, size_t inbuflen);

std::string integerToKey(unsigned long long input) {
    char buffer[14];
    size_t len = base64_encode(buffer, sizeof buffer, (const char*)&input, sizeof input);
    return std::string(buffer, len);
}

是的,每个字符串都会以相同的大小结尾。如果您不想这样做,请去掉等号。 (如果需要解码数字,请记住将其添加回来。)

当然,我真正的问题是为什么要转换固定宽度的 8 字节值而不直接使用它作为“键”而不是可变长度字符串值?

脚注:我很清楚这方面的字节序问题。他没有说明密钥的用途,因此我认为它不会用于不同字节序机器之间的网络通信。

Why not just use a base64 library? Is really important that 63 equals '11' and not a longer string?

size_t base64_encode(char* outbuffer, size_t maxoutbuflen, const char* inbuffer, size_t inbuflen);

std::string integerToKey(unsigned long long input) {
    char buffer[14];
    size_t len = base64_encode(buffer, sizeof buffer, (const char*)&input, sizeof input);
    return std::string(buffer, len);
}

Yes, every string will end with an equal size. If you don't want it to, strip off the equal sign. (Just remember to add it back if you need to decode the number.)

Of course, my real question is why are you turning a fixed width 8byte value and not using it directly as your "key" instead of the variable length string value?

Footnote: I'm well aware of the endian issues with this. He didn't say what the key will be used for and so I assume it isn't being used in network communications between machines of disparate endian-ness.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文