Fastest way to do a case-insensitive substring search in C/C++?
Note
The question below was asked in 2008 about some code from 2003. As the OP's update shows, this entire post has been obsoleted by vintage 2008 algorithms and persists here only as a historical curiosity.
I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:
- Should behave like strstr() (i.e. return a pointer to the match point).
- Must be case-insensitive (doh).
- Must support the current locale.
- Must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).
Here is the current implementation I am using (taken from the GNU C Library):
/* Return the offset of one string within another.
Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA. */
/*
* My personal strstr() implementation that beats most other algorithms.
* Until someone tells me otherwise, I assume that this is the
* fastest implementation of strstr() in C.
* I deliberately chose not to comment it. You should have at least
* as much fun trying to understand it, as I had to write it :-).
*
* Stephen R. van den Berg, [email protected] */
/*
* Modified to use table lookup instead of tolower(), since tolower() isn't
* worth s*** on Windows.
*
* -- Anders Sandvig ([email protected])
*/
#if HAVE_CONFIG_H
# include <config.h>
#endif
#include <ctype.h>
#include <string.h>
typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
    {
      string[0] = i;
      _strlwr(string);
      char_table[i] = string[0];
    }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle);
  if (b != '\0')
    {
      haystack--;               /* possible ANSI violation */
      do
        {
          c = *++haystack;
          if (c == '\0')
            goto ret0;
        }
      while (my_tolower (c) != (int) b);

      c = my_tolower (*++needle);
      if (c == '\0')
        goto foundneedle;
      ++needle;
      goto jin;

      for (;;)
        {
          register chartype a;
          register const unsigned char *rhaystack, *rneedle;

          do
            {
              a = *++haystack;
              if (a == '\0')
                goto ret0;
              if (my_tolower (a) == (int) b)
                break;
              a = *++haystack;
              if (a == '\0')
                goto ret0;
shloop:
              ;
            }
          while (my_tolower (a) != (int) b);

jin:
          a = *++haystack;
          if (a == '\0')
            goto ret0;

          if (my_tolower (a) != (int) c)
            goto shloop;

          rhaystack = haystack-- + 1;
          rneedle = needle;
          a = my_tolower (*rneedle);

          if (my_tolower (*rhaystack) == (int) a)
            do
              {
                if (a == '\0')
                  goto foundneedle;
                ++rhaystack;
                a = my_tolower (*++needle);
                if (my_tolower (*rhaystack) != (int) a)
                  break;
                if (a == '\0')
                  goto foundneedle;
                ++rhaystack;
                a = my_tolower (*++needle);
              }
            while (my_tolower (*rhaystack) == (int) a);

          needle = rneedle;     /* took the register-poor approach */

          if (a == '\0')
            break;
        }
    }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}
Can you make this code faster, or do you know of a better implementation?
Note: I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or whether it is in fact faster than the old one (in my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.
Update
Just to make things clear—in case it wasn't already—I didn't write this function, it's a part of the GNU C Library. I only modified it to be case-insensitive.
Also, thanks for the tip about strcasestr() and checking out other implementations from other sources (like OpenBSD, FreeBSD, etc.). It seems to be the way to go. The code above is from 2003, which is why I posted it here in the hope that a better version would be available, which apparently it is. :)
The code you posted is about half as fast as strcasestr.
The main function was:
It was suitably modified to test both implementations. I notice as I am typing this up that I left in the init_stristr call, but it shouldn't change things too much. bench is just a simple shell script:
You can use the StrStrI function, which finds the first occurrence of a substring within a string. The comparison is not case-sensitive.
Don't forget to include its header, Shlwapi.h.
Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx
Use the Boost string algorithms. They are available, cross-platform, and header-only (no library to link in). Not to mention that you should be using Boost anyway.
For platform-independent use:
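A minimal sketch of a platform-independent search, assuming nothing beyond the C standard library. The name portable_stristr is made up for illustration, and this is a plain quadratic scan, not the code from any particular library:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: case-insensitive strstr() built only on tolower().
   Returns a pointer to the first match in haystack, or NULL if none. */
char *portable_stristr(const char *haystack, const char *needle)
{
    if (*needle == '\0')
        return (char *) haystack;   /* same convention as strstr() */

    for (; *haystack != '\0'; ++haystack) {
        const char *h = haystack;
        const char *n = needle;

        /* Cast to unsigned char: tolower() is undefined for negative values.
           A NUL in the haystack never matches a non-NUL needle character,
           so the loop cannot run past the end of the haystack. */
        while (*n != '\0' &&
               tolower((unsigned char) *h) == tolower((unsigned char) *n)) {
            ++h;
            ++n;
        }
        if (*n == '\0')
            return (char *) haystack;   /* whole needle matched here */
    }
    return NULL;
}
```

Because it goes through tolower(), it follows whatever locale was selected with setlocale(), but only for single-byte character sets.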
Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's for locale support, but as it's not standard, I'd just use:
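A minimal sketch of what a standard-library version of that table setup could look like, assuming tolower() from <ctype.h> is an acceptable substitute for _strlwr(). The names lower_table and init_lower_table are made up for illustration:

```c
#include <ctype.h>

/* Hypothetical replacement for char_table: one lowercase value per byte. */
static unsigned char lower_table[256];

/* Fill the table using the standard tolower(), which follows the
   current C locale (the "C" locale unless setlocale() was called). */
void init_lower_table(void)
{
    int i;

    for (i = 0; i < 256; i++)
        lower_table[i] = (unsigned char) tolower(i);
}
```

Storing the table as unsigned char also avoids sign extension when an entry is widened to a larger unsigned type.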
I'd advise you to take one of the common strcasestr implementations that already exist, for example from glib, glibc, OpenBSD, FreeBSD, etc. You can search for more with google.com/codesearch. You can then make some performance measurements and compare the different implementations.
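For the measuring part, a rough timing harness might look like the sketch below. It assumes every candidate has the same signature as strstr(); search_fn and time_search are hypothetical names, and clock() only gives coarse CPU time, so use many iterations:

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any search routine with the same shape as strstr(). */
typedef char *(*search_fn)(const char *, const char *);

/* Hypothetical harness: calls fn(haystack, needle) `iterations` times and
   returns the CPU time used, in seconds. The number of successful
   searches is stored in *hits (if hits is not NULL) so the calls have
   an observable result. */
double time_search(search_fn fn, const char *haystack, const char *needle,
                   long iterations, size_t *hits)
{
    size_t found = 0;
    long i;
    clock_t start, end;

    start = clock();
    for (i = 0; i < iterations; i++) {
        if (fn(haystack, needle) != NULL)
            found++;
    }
    end = clock();

    if (hits != NULL)
        *hits = found;
    return (double) (end - start) / CLOCKS_PER_SEC;
}
```

Calling time_search(strstr, text, word, 1000000, &hits) and then the same with each case-insensitive candidate gives numbers that can be compared on the same input.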
Assuming both input strings are already lowercase.
You could also try using masks. If, for example, most of the strings you are going to compare contain only characters from a to z, it may be worth doing something like this.
Then...
This will not consider the locale, but if you can change IS_ALPHA and TO_UPPER, you can make it do so.
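A sketch of the mask idea, assuming ASCII. IS_ALPHA and TO_UPPER are named in the answer, but their definitions here are guesses, and chars_equal_nocase and ascii_stristr are hypothetical helpers:

```c
#include <stddef.h>

/* ASCII only: upper and lower case letters differ in the 0x20 bit. */
#define IS_ALPHA(c) ((unsigned) (((c) | 0x20) - 'a') < 26u)
#define TO_UPPER(c) ((c) & ~0x20)   /* only meaningful when IS_ALPHA(c) */

/* Two bytes match if they are identical, or if both are ASCII letters
   that differ only in case. */
static int chars_equal_nocase(unsigned char a, unsigned char b)
{
    if (a == b)
        return 1;
    return IS_ALPHA(a) && IS_ALPHA(b) && TO_UPPER(a) == TO_UPPER(b);
}

/* Hypothetical strstr()-like search built on the macros above. */
char *ascii_stristr(const char *haystack, const char *needle)
{
    size_t i;

    if (*needle == '\0')
        return (char *) haystack;

    for (; *haystack != '\0'; ++haystack) {
        for (i = 0; needle[i] != '\0'; ++i) {
            /* A NUL in the haystack never equals a non-NUL needle byte,
               so this stops before reading past the end. */
            if (!chars_equal_nocase((unsigned char) haystack[i],
                                    (unsigned char) needle[i]))
                break;
        }
        if (needle[i] == '\0')
            return (char *) haystack;
    }
    return NULL;
}
```

The IS_ALPHA check matters: without it, pairs such as '[' and '{' would compare equal, since they also differ only in the 0x20 bit.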
If you want to shed CPU cycles, you might consider this - let's assume that we're dealing with ASCII and not Unicode.
Make a static table with 256 entries. Each entry in the table is 256 bits.
To test whether or not two characters are equal, you do something like this:
To build the table, you set a bit everywhere in table[char1] where you consider it a match for char2. So in building the table you would set the bits at the index for 'a' and 'A' in the 'a'th entry (and the 'A'th entry).
Now the bit lookup is going to be slowish (it will most likely be a shift, a mask and an add), so you could instead use a table of bytes, spending 8 bits to represent 1 bit. This will take 64K (256 × 256 bytes), so hooray, you've hit a time/space trade-off! We might want to make the table more flexible, so let's say we do this instead: the table will define congruences.
Two characters are considered congruent if and only if there is a function that defines them as equivalent. So 'A' and 'a' are congruent for case insensitivity. 'A', 'À', 'Á' and 'Â' are congruent for diacritical insensitivity.
So you define bitfields that correspond to your congruences:
Then your test is something like this:
This kind of bit fiddling with ginormous tables is the heart of ctype, by the by.
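One way the congruence table could be laid out, as a sketch only: a 256 × 256 table of bytes (64 KB) where each byte holds one flag bit per kind of congruence. All names here are hypothetical, and only exact and case congruence are filled in, since diacritical congruence depends on the character encoding in use:

```c
#include <ctype.h>

/* One bit per kind of congruence (hypothetical flag names). */
enum {
    CONGRUENT_EXACT = 1 << 0,   /* identical bytes */
    CONGRUENT_CASE  = 1 << 1    /* equal when case is ignored */
};

/* congruence_table[c1][c2] holds the flags that relate c1 and c2. */
static unsigned char congruence_table[256][256];

/* Build the table once at startup; tolower() follows the current locale. */
void init_congruence_table(void)
{
    int a, b;

    for (a = 0; a < 256; a++) {
        for (b = 0; b < 256; b++) {
            unsigned char flags = 0;

            if (a == b)
                flags |= CONGRUENT_EXACT;
            if (tolower(a) == tolower(b))
                flags |= CONGRUENT_CASE;
            congruence_table[a][b] = flags;
        }
    }
}

/* The per-character test is then one table load and one AND. */
int chars_congruent(unsigned char c1, unsigned char c2, unsigned char kinds)
{
    return (congruence_table[c1][c2] & kinds) != 0;
}
```

A search routine would call chars_congruent(h, n, CONGRUENT_CASE) where it would otherwise compare two lowered characters.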
If you can control the needle string so that it is always in lower case, then you can write a modified version of stristr() that avoids the lookups for it, and thus speed up the code. It isn't as general, but it can be faster, if only slightly. Similar comments apply to the haystack, but you are more likely to be reading the haystack from sources outside your control, so you cannot be certain that the data meets the requirement.
Whether the gain in performance is worth it is another question altogether. For 99% of applications, the answer is "No, it is not worth it". Your application might be one of the tiny minority where it matters. More likely, it is not.
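To make the idea concrete, here is a sketch of a search that trusts the caller to pass a needle that is already lowercase, so only haystack characters go through tolower(). The name stristr_lower_needle is hypothetical, and the simple scan is for illustration rather than speed:

```c
#include <ctype.h>
#include <stddef.h>

/* Precondition (not checked): lower_needle contains no uppercase letters.
   Returns a pointer to the first case-insensitive match, or NULL. */
char *stristr_lower_needle(const char *haystack, const char *lower_needle)
{
    if (*lower_needle == '\0')
        return (char *) haystack;

    for (; *haystack != '\0'; ++haystack) {
        const char *h = haystack;
        const char *n = lower_needle;

        /* Only the haystack side is lowered; the needle is used as-is.
           A NUL in the haystack never equals a non-NUL needle byte. */
        while (*n != '\0' &&
               tolower((unsigned char) *h) == (unsigned char) *n) {
            ++h;
            ++n;
        }
        if (*n == '\0')
            return (char *) haystack;
    }
    return NULL;
}
```

If the precondition is violated, for example with the needle "The", the function simply fails to match, which is the loss of generality mentioned above.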