C中的s_strip(s, del)函数，还有比这个更优化的版本吗？

发布于 2024-11-04 06:47:40 字数 1494 浏览 1 评论 0原文

我在这里发表的第一篇文章（很遗憾我没有早点发现这个伟大的社区）。

无论如何，我编写了一个 C 函数，该函数从字符串 s 中删除字符串 del 中包含的任何字符。我想知道在速度方面是否还有改进的空间，特别是对于在 for 循环内查找 del 中包含的字符的部分（我使用的是 strpbrk()，但 pmg 明智地建议了 strchr ())。

也非常欢迎 Bug 猎人！我认为它很强大，但你永远不知道。

这是代码（提前感谢您的回答）...

当前版本

// remove from string s any char contained in string del (return modified s)
// alg:
// parse s via cp1, keep desired *cp1's by copying them via cp2 to the start of s
// null terminate & return the trimmed s

char *s_strip(char *s, const char *del)
{
    char *cp1;                      // for parsing the whole s
    char *cp2;                      // for keeping desired *cp1's

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( !strchr(del, *cp1) )   // *cp1 is NOT contained in del (thanks pmg!)
            *cp2++ = *cp1;          // copy it via cp2

    *cp2 = 0;                       // null terminate the trimmed s
    return s;
}

原始版本

char *s_strip(char *s, const char *del)
{
    char *cp1;                              // for parsing the whole s
    char *cp2;                              // for keeping desired *cp1's

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( cp1 != strpbrk(cp1, del) ) {   // *cp1 is NOT contained in del
            *cp2 = *cp1;                    // copy it via cp2
            cp2++;
        }

    *cp2 = 0;                               // null terminate the trimmed s
    return s;
}

原文

my first post here (it's a real pity I hadn't discovered this great community earlier).

Anyway, I've coded a C function that removes from string s any character contained in string del. I was wondering if there's room for improvement, speed-wise, especially for the part that looks for chars contained in del, inside the for-loop (I was using strpbrk(), but pmg wisely suggested strchr() ).

Bug-hunters are most welcome too! I think it's robust, but you never know.

Here's the code (thanks in advance for any answers)...

Current version

// remove from string s any char contained in string del (return modified s)
// alg:
// parse s via cp1, keep desired *cp1's by copying them via cp2 to the start of s
// null terminate & return the trimmed s

char *s_strip(char *s, const char *del)
{
    char *cp1;                      // for parsing the whole s
    char *cp2;                      // for keeping desired *cp1's

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( !strchr(del, *cp1) )   // *cp1 is NOT contained in del (thanks pmg!)
            *cp2++ = *cp1;          // copy it via cp2

    *cp2 = 0;                       // null terminate the trimmed s
    return s;
}

Original version

char *s_strip(char *s, const char *del)
{
    char *cp1;                              // for parsing the whole s
    char *cp2;                              // for keeping desired *cp1's

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( cp1 != strpbrk(cp1, del) ) {   // *cp1 is NOT contained in del
            *cp2 = *cp1;                    // copy it via cp2
            cp2++;
        }

    *cp2 = 0;                               // null terminate the trimmed s
    return s;
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

硬不硬你别怂 2024-11-11 06:47:40

使用 strchr() 代替 strpbrk() 怎么样？

当您只需要检查第一个字符时，您的版本不必要地循环输入字符串。

/* ... */
if (!strchr(del, *cp1)) *cp2++ = *cp1;
/* ... */

What about using strchr() instead of strpbrk()?

Your version loops over the input string unnecessarily, when you only need to check the 1st character.

/* ... */
if (!strchr(del, *cp1)) *cp2++ = *cp1;
/* ... */

回复收藏 0 原文

乖乖哒 2024-11-11 06:47:40

输入字符串平均有多长，以及您要删除多少个字符。如果您使用 8 位代码集和更长的源字符串，请考虑：

char *s_strip(char *s, const char *del)
{
    char map[256] = { 0 };
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    while (*up1 != '\0')
        map[*up1++] = 1;

    for ( ; *up2 != '\0'; up2++)
    {
        if (map[*up2] == 0)
            *up3++ = *up2;
    }
    *up3 = '\0';

    return (char *)up3;
}

这用简单的数组查找替换了函数调用（strpbrk() 或 strchr()）以在进入函数时初始化数组为代价。如果处理后的字符串相当长，则可以通过大大减少的搜索时间轻松获得回报。

此版本的代码返回一个指向字符串末尾的 null 的指针 - 这允许您在返回时计算压缩字符串的长度，而无需调用 strlen()。我发现这是一个比返回指向字符串开头的指针更有用的设计 - 我已经知道字符串从哪里开始，但我不再知道它在哪里结束，但函数 s_strip() 确实知道那。（如果我们能够重新开始，我也会游说对 strcpy() 和 strcat() 进行相同的设计。）

对“unsigned char”的强制转换确保它可以在有符号和无符号字符的机器上正常工作。我对大量的演员表并不太满意，但很难避免它们（嗯，可以通过复制 `up2 来避免将 s 分配给 up3 的演员表）。

您可能会想出一种也可以处理签名字符的设计，大致如下：

char realmap[256] = { 0 };
char *map = &realmap[128];

然后，您可以依赖于 map[-128] 范围内的签名字符 .. map[+127] 仍然位于 realmap 数组中。

自首次发布以来，上面的函数已经被更正了几次 - 现在它似乎与问题中建议的两个解决方案一致。我将这三个解决方案放入测试工具中对其进行计时，主要是因为我很好奇地图代码将如何执行，而在过去，我输给了汇编代码内联函数。该映射解决方案很快就显示出在我的测试系统（Mac Mini 2 GHZ Intel Core 2 Duo、MacOS X 10.6.7、GCC 4.1.2）上速度最快。

时间是以秒为单位的平均时间。每个算法运行 10 次。
size是源字符串的大小；删除字符串为 22 个字符（小写字母、大写字母和数字各 6 个，加上 4 个标点符号）。数据是由 ASCII 中的字母、数字和标点符号构建的固定字符串，根据需要重复多次以达到规定的大小（不包括终止空值）。请注意，计时包括每次复制源字符串 - “空”列表示复制所花费的时间。

size     map        strchr     strpbrk    null       micro1     micro2
   2     0.000542   0.002292   0.001009   0.000106   0.000639   0.000707
   8     0.000654   0.004125   0.017524   0.000106   0.001012   0.000966
  32     0.001667   0.015815   0.063314   0.000196   0.002549   0.002247
 128     0.006385   0.064513   0.313749   0.000171   0.008455   0.007188
 512     0.022231   0.257910   1.293040   0.000282   0.013284   0.011829
2048     0.089066   1.035052   5.297966   0.000819   0.043391   0.037597

即使源字符串非常短，映射算法也比 strchr() 或 strpbrk() 算法快得多（是 的 5-10 倍） strchr() 算法，速度是 strpbrk() 算法的 5-50 倍），并且差异随着搜索字符串大小的增加而增大。（我没想到这个结果 - 因为地图代码中存在设置开销。）

“micro1”和“micro2”算法对应于阿谢利。当字符串足够长时（切换在 128 到 512 字节之间），微优化版本比简单映射更快。

测试代码

如果您想要 timer.c、timer.h 的源代码，请联系我（请参阅我的个人资料）。我通常会将它们构建到一个库中并与该库链接，但这次将这些文件包含在程序中更简单。

#include <string.h>

extern char *s_strip1(char *s, const char *del);
extern char *s_strip2(char *s, const char *del);
extern char *s_strip3(char *s, const char *del);

char *s_strip3(char *s, const char *del)
{
    char map[256] = { 0 };
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    while (*up1 != '\0')
        map[*up1++] = 1;

    for ( ; *up2 != '\0'; up2++)
    {
        if (map[*up2] == 0)
            *up3++ = *up2;
    }
    *up3 = '\0';

    return (char *)up3;
}

char *s_strip2(char *s, const char *del)
{
    char *cp1;
    char *cp2;

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( !strchr(del, *cp1) )
            *cp2++ = *cp1;

    *cp2 = 0;
    return s;
}

char *s_strip1(char *s, const char *del)
{
    char *cp1;
    char *cp2;

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( cp1 != strpbrk(cp1, del) ) {
            *cp2 = *cp1;
            cp2++;
        }

    *cp2 = 0;
    return s;
}

#include <stdio.h>
#include "timer.h"
#include "timer.c"

enum { NUM_REPEATS = 10000 };
typedef char *(*Function)(char *str, const char *del);

static void fill_bytes(char *buffer, size_t buflen)
{
    static const char source[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789[]{}\\|,./?><;:'\"=+-_)(*&^%$#@!";
    char *end = buffer + buflen;

    while (buffer < end)
    {
        size_t numbytes = sizeof(source) - 1;
        if ((size_t)(end - buffer) < sizeof(source)-1)
            numbytes = end - buffer;
        memmove(buffer, source, numbytes);
        buffer += numbytes;
    }
}

static void test(Function f, const char *fn, const char *del, size_t numbytes)
{
    Clock clk;
    char refbuf[numbytes];
    char buffer[numbytes];
    char clkbuf[32];
    fill_bytes(refbuf, sizeof(refbuf));
    strcpy(buffer, refbuf);
    clk_init(&clk);
    clk_start(&clk);
    for (size_t i = 0; i < NUM_REPEATS; i++)
    {
        memmove(buffer, refbuf, sizeof(buffer));
        if (f)
            (*f)(buffer, del);
    }
    clk_stop(&clk);
    printf("%-17s (%4zd) = %10s (%.64s)\n", fn, numbytes,
           clk_elapsed_us(&clk, clkbuf, sizeof(clkbuf)), buffer);
}

int main(void)
{
    for (int size = 2; size <= 2048; size = size * 4)
    {
        for (int i = 0; i < 10; i++)
        {
           test(s_strip1, "s_strip1:strpbrk:", "AJQRSTajqrst234567=+[]", size);
           test(s_strip2, "s_strip2:strchr:",  "AJQRSTajqrst234567=+[]", size);
           test(s_strip3, "s_strip3:map",      "AJQRSTajqrst234567=+[]", size);
           test(0,        "s_strip4:null",     "AJQRSTajqrst234567=+[]", size);
        }
    }
    return 0;
}

微观优化

extern char *s_strip4(char *s, const char *del);
extern char *s_strip5(char *s, const char *del);

char *s_strip5(char *s, const char *del)
{
    char map[256];
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    memset(map, 1, sizeof(map));

    while (*up1 != '\0')
        map[*up1++] = 0;

    for ( ; *up2 != '\0'; up2++)
    {
        *up3 = *up2;
        up3 += map[*up2];
    }
    *up3 = '\0';

    return (char *)up3;
}

char *s_strip4(char *s, const char *del)
{
    char map[256] = { 0 };
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    while (*up1 != '\0')
        map[*up1++] = 1;

    for ( ; *up2 != '\0'; up2++)
    {
        *up3 = *up2;
        up3 += !map[*up2];
    }
    *up3 = '\0';

    return (char *)up3;
}

How long are the input strings going to be, and how many characters are you going to be deleting, on average. If you work with 8-bit code sets and longer source strings, think about:

char *s_strip(char *s, const char *del)
{
    char map[256] = { 0 };
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    while (*up1 != '\0')
        map[*up1++] = 1;

    for ( ; *up2 != '\0'; up2++)
    {
        if (map[*up2] == 0)
            *up3++ = *up2;
    }
    *up3 = '\0';

    return (char *)up3;
}

This replaces the function call (strpbrk() or strchr()) with a simple array lookup at the cost of initializing the array on entry to the function. If the processed strings are fairly long, this could easily pay itself back in the vastly reduced search time.

This version of the code returns a pointer to the null at the end of the string - which allows you to compute the length of the compressed string on return without calling strlen(). I find this a more useful design than returning the pointer to the start of the string - I already knew where the string started, but I no longer know where it ends, yet the function s_strip() does know that. (If we were able to start over, I'd lobby for the same design for strcpy() and strcat() too.)

The coercions to 'unsigned char' ensure that it works properly on machines with signed and unsigned characters. I'm not really happy with the slathering of casts, but it is hard to avoid them (well, the cast assigning s to up3 could be avoided by copying `up2 instead).

You probably could come up with a design that handles signed characters too, along the lines of:

char realmap[256] = { 0 };
char *map = &realmap[128];

You can then rely on the signed characters being in a range such that map[-128] .. map[+127] still lands within the realmap array.

The function above has been corrected a couple of times since it was first posted - it now seems to work consistently with the two solutions suggested in the question. I put the three solutions into a test harness to time them, mainly because I was curious about how the map code would perform, having lost out to assembler code inline functions in the past. The mapping solution very quickly shows itself to be fastest on my test system (Mac Mini 2 GHZ Intel Core 2 Duo, MacOS X 10.6.7, GCC 4.1.2).

The times are average times in seconds. Each algorithm was given 10 runs.
The size is the size of the source string; the delete string was 22 characters (6 each of lower-case letters, upper-case letters and digits, plus 4 punctuation). The data was a fixed string built from the letters, digits and punctuation in ASCII, repeated as often as necessary to reach the stated size, which excludes the terminating null. Note that the timing includes copying the source string each time - the 'null' column indicates the time spent on the copying.

size     map        strchr     strpbrk    null       micro1     micro2
   2     0.000542   0.002292   0.001009   0.000106   0.000639   0.000707
   8     0.000654   0.004125   0.017524   0.000106   0.001012   0.000966
  32     0.001667   0.015815   0.063314   0.000196   0.002549   0.002247
 128     0.006385   0.064513   0.313749   0.000171   0.008455   0.007188
 512     0.022231   0.257910   1.293040   0.000282   0.013284   0.011829
2048     0.089066   1.035052   5.297966   0.000819   0.043391   0.037597

Even with very short source strings, the map algorithm is much faster than either the strchr() or strpbrk() algorithms (5-10 times as fast as the strchr() algorithm, and 5-50 times as fast as the strpbrk() algorithm), with the disparity growing with search string size. (I did not expect this result - because there is setup overhead in the map code.)

The 'micro1' and 'micro2' algorithms correspond to the modifications suggested by AShelly. When the strings are long enough (somewhere between 128 and 512 bytes is the switch-over), then the micro-optimized versions are quicker than the simple map.

Test Code

Contact me (see my profile) if you want the source for timer.c, timer.h. I'd normally build them into a library and link with the library, but it was simpler to include the files in the program this time.

#include <string.h>

extern char *s_strip1(char *s, const char *del);
extern char *s_strip2(char *s, const char *del);
extern char *s_strip3(char *s, const char *del);

char *s_strip3(char *s, const char *del)
{
    char map[256] = { 0 };
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    while (*up1 != '\0')
        map[*up1++] = 1;

    for ( ; *up2 != '\0'; up2++)
    {
        if (map[*up2] == 0)
            *up3++ = *up2;
    }
    *up3 = '\0';

    return (char *)up3;
}

char *s_strip2(char *s, const char *del)
{
    char *cp1;
    char *cp2;

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( !strchr(del, *cp1) )
            *cp2++ = *cp1;

    *cp2 = 0;
    return s;
}

char *s_strip1(char *s, const char *del)
{
    char *cp1;
    char *cp2;

    for (cp1=s, cp2=s; *cp1; cp1++ )
        if ( cp1 != strpbrk(cp1, del) ) {
            *cp2 = *cp1;
            cp2++;
        }

    *cp2 = 0;
    return s;
}

#include <stdio.h>
#include "timer.h"
#include "timer.c"

enum { NUM_REPEATS = 10000 };
typedef char *(*Function)(char *str, const char *del);

static void fill_bytes(char *buffer, size_t buflen)
{
    static const char source[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789[]{}\\|,./?><;:'\"=+-_)(*&^%$#@!";
    char *end = buffer + buflen;

    while (buffer < end)
    {
        size_t numbytes = sizeof(source) - 1;
        if ((size_t)(end - buffer) < sizeof(source)-1)
            numbytes = end - buffer;
        memmove(buffer, source, numbytes);
        buffer += numbytes;
    }
}

static void test(Function f, const char *fn, const char *del, size_t numbytes)
{
    Clock clk;
    char refbuf[numbytes];
    char buffer[numbytes];
    char clkbuf[32];
    fill_bytes(refbuf, sizeof(refbuf));
    strcpy(buffer, refbuf);
    clk_init(&clk);
    clk_start(&clk);
    for (size_t i = 0; i < NUM_REPEATS; i++)
    {
        memmove(buffer, refbuf, sizeof(buffer));
        if (f)
            (*f)(buffer, del);
    }
    clk_stop(&clk);
    printf("%-17s (%4zd) = %10s (%.64s)\n", fn, numbytes,
           clk_elapsed_us(&clk, clkbuf, sizeof(clkbuf)), buffer);
}

int main(void)
{
    for (int size = 2; size <= 2048; size = size * 4)
    {
        for (int i = 0; i < 10; i++)
        {
           test(s_strip1, "s_strip1:strpbrk:", "AJQRSTajqrst234567=+[]", size);
           test(s_strip2, "s_strip2:strchr:",  "AJQRSTajqrst234567=+[]", size);
           test(s_strip3, "s_strip3:map",      "AJQRSTajqrst234567=+[]", size);
           test(0,        "s_strip4:null",     "AJQRSTajqrst234567=+[]", size);
        }
    }
    return 0;
}

Micro-optimizations

extern char *s_strip4(char *s, const char *del);
extern char *s_strip5(char *s, const char *del);

char *s_strip5(char *s, const char *del)
{
    char map[256];
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    memset(map, 1, sizeof(map));

    while (*up1 != '\0')
        map[*up1++] = 0;

    for ( ; *up2 != '\0'; up2++)
    {
        *up3 = *up2;
        up3 += map[*up2];
    }
    *up3 = '\0';

    return (char *)up3;
}

char *s_strip4(char *s, const char *del)
{
    char map[256] = { 0 };
    const unsigned char *up1 = (const unsigned char *)del;
    unsigned char *up2 = (unsigned char *)s;
    unsigned char *up3 = (unsigned char *)s;

    while (*up1 != '\0')
        map[*up1++] = 1;

    for ( ; *up2 != '\0'; up2++)
    {
        *up3 = *up2;
        up3 += !map[*up2];
    }
    *up3 = '\0';

    return (char *)up3;
}

回复收藏 0 原文

红ご颜醉 2024-11-11 06:47:40

每次调用 strpbrk 时，它都会向前扫描字符串，在 del 中查找字符。如果您使用不包含“可删除”字符的 100 个字符长的字符串调用此函数，它将检查 100+99+98+97+...+1 个字符。

为什么不更好地利用 strpbrk 返回的信息呢？如果返回值为null，则可以返回原始字符串。否则，您可以使用 strpbrk 返回的指针之前的所有字符。

像这样的东西：

 dest=src=s;
 bad = strpbrk(src,del);
 if (!bad) return s;  //early exit if there is no work.
 while (bad)
 {
    while (src<bad) *dst++=*src++;
    bad = strpbrk(++src,del);
 }
 while (*src) *dst++=*src++; //copy the remainder
 *dst='\0';
 return s;

Every time you call strpbrk, it is going to scan forward through the string, looking for a character in del. If you call this with a 100 character long string which contains no "deletable" characters, it is going to check 100+99+98+97+...+1 characters.

Why not make better use of the information strpbrk returns? If the return value is null, you can return the original string. Otherwise, you can use all the characters up to the pointer returned by strpbrk.

Something like:

 dest=src=s;
 bad = strpbrk(src,del);
 if (!bad) return s;  //early exit if there is no work.
 while (bad)
 {
    while (src<bad) *dst++=*src++;
    bad = strpbrk(++src,del);
 }
 while (*src) *dst++=*src++; //copy the remainder
 *dst='\0';
 return s;

回复收藏 0 原文

~没有更多了~