在 C/C++ 中执行不区分大小写的子字符串搜索的最快方法？

发布于 2024-07-07 02:42:19 字数 5237 浏览 11 评论 0原文

注意

下面的问题是在 2008 年提出的，涉及 2003 年的一些代码。正如 OP 的更新所示，整篇文章已被 2008 年的老式算法所废弃，仅作为历史好奇心而保留在这里。

我需要在 C/C++ 中进行快速的不区分大小写的子字符串搜索。我的要求如下：

应该表现得像 strstr() （即返回指向匹配点的指针）。
必须不区分大小写 (doh)。
必须支持当前区域设置。
必须可在 Windows (MSVC++ 8.0) 上使用或轻松移植到 Windows（即从开源库）。

这是我当前使用的实现（取自 GNU C 库）：

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, [email protected] */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig ([email protected])
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != '\0')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '\0')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == '\0')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
        register const unsigned char *rhaystack, *rneedle;

        do
        {
          a = *++haystack;
          if (a == '\0')
              goto ret0;
          if (my_tolower (a) == (int) b)
              break;
          a = *++haystack;
          if (a == '\0')
              goto ret0;
        shloop:
          ;
        }
      while (my_tolower (a) != (int) b);

jin:      
      a = *++haystack;
      if (a == '\0')
          goto ret0;

        if (my_tolower (a) != (int) c)
          goto shloop;

        rhaystack = haystack-- + 1;
        rneedle = needle;

        a = my_tolower (*rneedle);

        if (my_tolower (*rhaystack) == (int) a)
          do
          {
              if (a == '\0')
                goto foundneedle;

              ++rhaystack;
          a = my_tolower (*++needle);
              if (my_tolower (*rhaystack) != (int) a)
                break;

          if (a == '\0')
                goto foundneedle;

          ++rhaystack;
              a = my_tolower (*++needle);
          }
          while (my_tolower (*rhaystack) == (int) a);

        needle = rneedle;       /* took the register-poor approach */

      if (a == '\0')
          break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}

你能让这个代码更快吗？或者你知道更好的实现吗？

注意：我注意到 GNU C 库现在有 strstr() 的新实现，但我不是确定它可以多么容易地修改为不区分大小写，或者它实际上是否比旧的更快（在我的例子中）。我还注意到旧的实现仍然用于宽字符串，所以如果有人知道原因，请分享。

更新

只是为了让事情变得清楚 - 如果还没有 - 我没有编写这个函数，它是 GNU C 库的一部分。我只是将其修改为不区分大小写。

另外，感谢您提供有关 strcasestr() 的提示，并查看其他来源（如 OpenBSD、FreeBSD 等）的其他实现。这似乎是必经之路。上面的代码来自 2003 年，这就是为什么我将其发布在这里，希望有更好的版本可用，显然确实如此。 :)

原文

Note

The question below was asked in 2008 about some code from 2003. As the OP's update shows, this entire post has been obsoleted by vintage 2008 algorithms and persists here only as a historical curiosity.

I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:

Should behave like strstr() (i.e. return a pointer to the match point).
Must be case-insensitive (doh).
Must support the current locale.
Must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).

Here is the current implementation I am using (taken from the GNU C Library):

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, [email protected] */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig ([email protected])
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != '\0')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '\0')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == '\0')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
        register const unsigned char *rhaystack, *rneedle;

        do
        {
          a = *++haystack;
          if (a == '\0')
              goto ret0;
          if (my_tolower (a) == (int) b)
              break;
          a = *++haystack;
          if (a == '\0')
              goto ret0;
        shloop:
          ;
        }
      while (my_tolower (a) != (int) b);

jin:      
      a = *++haystack;
      if (a == '\0')
          goto ret0;

        if (my_tolower (a) != (int) c)
          goto shloop;

        rhaystack = haystack-- + 1;
        rneedle = needle;

        a = my_tolower (*rneedle);

        if (my_tolower (*rhaystack) == (int) a)
          do
          {
              if (a == '\0')
                goto foundneedle;

              ++rhaystack;
          a = my_tolower (*++needle);
              if (my_tolower (*rhaystack) != (int) a)
                break;

          if (a == '\0')
                goto foundneedle;

          ++rhaystack;
              a = my_tolower (*++needle);
          }
          while (my_tolower (*rhaystack) == (int) a);

        needle = rneedle;       /* took the register-poor approach */

      if (a == '\0')
          break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}

Can you make this code faster, or do you know of a better implementation?

Note: I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or if it is in fact faster than the old one (in my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.

Update

Just to make things clear—in case it wasn't already—I didn't write this function, it's a part of the GNU C Library. I only modified it to be case-insensitive.

Also, thanks for the tip about strcasestr() and checking out other implementations from other sources (like OpenBSD, FreeBSD, etc.). It seems to be the way to go. The code above is from 2003, which is why I posted it here in hope for a better version being available, which apparently it is. :)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

長街聽風 2024-07-14 02:42:19

您发布的代码大约是 strcasestr 的一半。

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

main 函数是：

int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}

它被适当修改以测试这两种实现。我注意到，当我输入此内容时，我留在了 init_stristr 调用中，但它不应该改变太多事情。 bench 只是一个简单的 shell 脚本：

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")

The code you posted is about half as fast as strcasestr.

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

The main function was:

int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}

It was suitably modified to test both implementations. I notice as I am typing this up I left in the init_stristr call, but it shouldn't change things too much. bench is just a simple shell script:

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")

回复收藏 0 原文

暮色兮凉城 2024-07-14 02:42:19

您可以使用 StrStrI 函数来查找字符串中子字符串的第一次出现。比较不区分大小写。
不要忘记包含它的头文件 - Shlwapi.h。
看看这个：http:// msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx

回复收藏 0 原文

筱果果 2024-07-14 02:42:19

使用增强字符串算法。它是可用的、跨平台的，并且只有一个头文件（没有可链接的库）。更不用说你无论如何都应该使用 boost。

#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   iterator_range<char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}

use boost string algo. It is available, cross platform, and only a header file (no library to link in). Not to mention that you should be using boost anyway.

#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   iterator_range<char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}

回复收藏 0 原文

我很坚强 2024-07-14 02:42:19

对于独立于平台的使用：

const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
    char ch1, ch2;
    bool bSame;

    while (*cpws1 != L'\0')
    {
        bSame = true;
        if (*cpws1 != *s2)
        {
            ch1 = towlower(*cpws1);
            ch2 = towlower(*s2);

            if (ch1 == ch2)
                bSame = true;
        }

        if (true == bSame)
        {
            cpws1_ = cpws1;
            cpws2 = s2;
            while (*cpws1_ != L'\0')
            {
                ch1 = towlower(*cpws1_);
                ch2 = towlower(*cpws2);

                if (ch1 != ch2)
                    break;

                cpws2++;

                if (*cpws2 == L'\0')
                    return cpws1_-(cpws2 - s2 - 0x01);
                cpws1_++;
            }
        }
        cpws1++;
    }
    return NULL;
}

For platform independent use:

const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
    char ch1, ch2;
    bool bSame;

    while (*cpws1 != L'\0')
    {
        bSame = true;
        if (*cpws1 != *s2)
        {
            ch1 = towlower(*cpws1);
            ch2 = towlower(*s2);

            if (ch1 == ch2)
                bSame = true;
        }

        if (true == bSame)
        {
            cpws1_ = cpws1;
            cpws2 = s2;
            while (*cpws1_ != L'\0')
            {
                ch1 = towlower(*cpws1_);
                ch2 = towlower(*cpws2);

                if (ch1 != ch2)
                    break;

                cpws2++;

                if (*cpws2 == L'\0')
                    return cpws1_-(cpws2 - s2 - 0x01);
                cpws1_++;
            }
        }
        cpws1++;
    }
    return NULL;
}

回复收藏 0 原文

谁的年少不轻狂 2024-07-14 02:42:19

为什么使用 _strlwr(string); 在 init_stristr() 中？这不是标准功能。大概是为了区域设置支持，但由于它不是标准的，我只使用：

char_table[i] = tolower(i);

Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's for locale support, but as it's not standard, I'd just use:

char_table[i] = tolower(i);

回复收藏 0 原文

请持续率性 2024-07-14 02:42:19

我建议您采用一些已经存在的常见 strcasestr 实现。例如 glib、glibc、OpenBSD、FreeBSD 等。您可以通过 google.com/codesearch 搜索更多内容。然后，您可以进行一些性能测量并比较不同的实现。

回复收藏 0 原文

べ映画 2024-07-14 02:42:19

假设两个输入字符串都已经是小写。

int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
        int iCounter = 0;
        while((iCounter + iSearchTextSize) <= iTextSize)
        {
            if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
                return  iCounter;
            iCounter ++;
        }
    }

    return -1;
}

您也可以尝试使用掩码...例如，如果您要比较的大多数字符串仅包含从 a 到 z 的字符，也许值得做这样的事情。

long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != '\0')
    {       
        if (*p_cText>='a' && *p_cText<='z')
            lMask = lMask | (1 << (*p_cText - 'a') );
        else if(*p_cText != ' ')
        {
            lMask = 0;
            break;      
        }

        p_cText ++;
    }
    return lMask;
}

然后...

int main(int argc, char* argv[])
{

    char* p_cText = "this is a test";   
    char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
        if((lTextMask & lSearchMask) == lSearchMask)
        {       
             iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
        }
    }
    else
    {
        iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}

Assuming both input strings are already lowercase.

int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
        int iCounter = 0;
        while((iCounter + iSearchTextSize) <= iTextSize)
        {
            if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
                return  iCounter;
            iCounter ++;
        }
    }

    return -1;
}

You could also, try using masks... if for example most of the strings you are going to compare only contains chars from a to z, maybe it's worth to do something like this.

long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != '\0')
    {       
        if (*p_cText>='a' && *p_cText<='z')
            lMask = lMask | (1 << (*p_cText - 'a') );
        else if(*p_cText != ' ')
        {
            lMask = 0;
            break;      
        }

        p_cText ++;
    }
    return lMask;
}

Then...

int main(int argc, char* argv[])
{

    char* p_cText = "this is a test";   
    char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
        if((lTextMask & lSearchMask) == lSearchMask)
        {       
             iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
        }
    }
    else
    {
        iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}

回复收藏 0 原文

饮湿 2024-07-14 02:42:19

这不会考虑区域设置，但如果您可以更改 IS_ALPHA 和 TO_UPPER，则可以使其考虑它。

#define IS_ALPHA(c) (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))
#define TO_UPPER(c) ((c) & 0xDF)

char * __cdecl strstri (const char * str1, const char * str2){
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp){
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && (IS_ALPHA(*s1) && IS_ALPHA(*s2))?!(TO_UPPER(*s1) - TO_UPPER(*s2)):!(*s1-*s2))
                        ++s1, ++s2;

                if (!*s2)
                        return(cp);

                ++cp;
        }
        return(NULL);
}

This will not consider the locale, but If you can change the IS_ALPHA and TO_UPPER you can make it to consider it.

#define IS_ALPHA(c) (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))
#define TO_UPPER(c) ((c) & 0xDF)

char * __cdecl strstri (const char * str1, const char * str2){
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp){
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && (IS_ALPHA(*s1) && IS_ALPHA(*s2))?!(TO_UPPER(*s1) - TO_UPPER(*s2)):!(*s1-*s2))
                        ++s1, ++s2;

                if (!*s2)
                        return(cp);

                ++cp;
        }
        return(NULL);
}

回复收藏 0 原文

趁微风不噪 2024-07-14 02:42:19

如果您想减少 CPU 周期，您可以考虑这一点 - 假设我们正在处理 ASCII 而不是 Unicode。

制作一个包含 256 个条目的静态表。表中的每个条目都是 256 位。

要测试两个字符是否相等，请执行以下操作：

if (BitLookup(table[char1], char2)) { /* match */ }

要构建表，请在 table[char1] 中的任何位置设置一个位，其中您认为它与 char2 匹配。因此，在构建表时，您可以在第“a”条目（以及第“A”条目）中的“a”和“A”索引处设置位。

现在，执行位查找会很慢（位查找很可能是移位、掩码和添加），因此您可以使用字节表来代替，这样您就可以使用 8 位来表示 1 位。这将需要 32K - 太棒了 - 您已经实现了时间/空间的权衡！我们可能想让表格更加灵活，所以假设我们这样做——表格将定义同余。

当且仅当存在将两个字符定义为等效的函数时，两个字符才被视为全等。因此，“A”和“a”在不区分大小写的情况下是一致的。 'A'、'À'、'Á' 和 'â' 对于变音不敏感是一致的。

所以你定义了与你的同余相对应的位域

#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)

然后你的测试是这样的：

inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
    return (_congruencyTable[c1][c2] & congruency) != 0;
}

#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)

顺便说一句，这种对巨大表的位摆弄是 ctype 的核心。

If you want to shed CPU cycles, you might consider this - let's assume that we're dealing with ASCII and not Unicode.

Make a static table with 256 entries. Each entry in the table is 256 bits.

To test whether or not two characters are equal, you do something like this:

if (BitLookup(table[char1], char2)) { /* match */ }

To build the table, you set a bit everywhere in table[char1] where you consider it a match for char2. So in building the table you would set the bits at the index for 'a' and 'A' in the 'a'th entry (and the 'A'th entry).

Now this is going to be slowish to do the bit lookup (bit look up will be a shift, mask and add most likely), so you could use instead a table of bytes so you use 8 bits to represent 1 bit. This will take 32K - so hooray - you've hit a time/space trade-off! We might want to make the table more flexible, so let's say we do this instead - the table will define congruences instead.

Two characters are considered congruent if and only if there is a function that defines them as equivalent. So 'A' and 'a' are congruent for case insensitivity. 'A', 'À', 'Á' and 'Â' are congruent for diacritical insensitivity.

So you define bitfields that correspond to your congruencies

#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)

Then your test is something like this:

inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
    return (_congruencyTable[c1][c2] & congruency) != 0;
}

#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)

This kind of bit fiddling with ginormous tables is the heart of ctype, by the by.

回复收藏 0 原文