Check whether an array is present in a set of arrays
I have fixed-length arrays like this:
char x[10] = "abcdefghij"; // could also contain non-printable char
How can I efficiently test whether this array is present in a fixed set of 10,000 arrays of the same length? (Note: the arrays are either all binary data or all null-terminated strings; which of the two applies is fixed at the beginning.)
In Python, I would use a set/hashtable/dict and not a list (because of very fast O(1) lookup):
s = "abcdefghij"
S = {"foo1234567", "bar1234567", ..., "baz9876543"}
print(s in S) # True or False
How can I do the equivalent in C? (not C++)
Note: the linked question "How to check if a string is in an array of strings in C?" is about a naive way to do it, with no performance requirement (they loop over all strings and use strcmp). Here it's different: since there are 10k arrays, another method is needed for performance (a hash table maybe?).
Note about storing both null-terminated strings and binary data in an array
In your question, you specify that the fixed-length arrays could represent either null-terminated strings or binary data.
If your program has no way of knowing which one of these two possibilities is represented by the array contents, then, if the array contents are supposed to represent a null-terminated string, you will have to fill the remainder of the array with null characters (for example using the function strncpy). It won't be sufficient to write only a single terminating null character.
This is important, because if you don't do this, your program will have no way of knowing how to interpret the remaining data after it encounters a byte with the value 0. It won't know whether to interpret this byte as the terminating null character of a string and ignore all remaining bytes, or to interpret it as binary data and treat the remaining bytes as binary data too (i.e. not ignore them).
However, if you always fill all remaining bytes after the terminating null character of a string with null characters, then your program won't have to worry about whether the array contents represent a string or binary data, because it can treat both the same way.
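As a minimal sketch of that padding rule, assuming keys of 10 bytes: the helper names key_from_string, key_from_bytes and key_equal are hypothetical and only serve as illustration. strncpy fills the rest of the destination with null bytes when the source string is shorter than the given length, which is exactly the padding described above.

```c
#include <string.h>

#define KEY_LEN 10  /* assumed fixed array length, as in the question */

/* Copy a C string into a fixed-length key. strncpy zero-fills every byte
   after the end of the string when the string is shorter than KEY_LEN.
   A string of exactly KEY_LEN characters is stored without a terminator. */
static void key_from_string(unsigned char key[KEY_LEN], const char *s)
{
    strncpy((char *)key, s, KEY_LEN);
}

/* Copy exactly KEY_LEN bytes of raw binary data into a key. */
static void key_from_bytes(unsigned char key[KEY_LEN], const void *data)
{
    memcpy(key, data, KEY_LEN);
}

/* Two keys are equal when all KEY_LEN bytes match; no byte is ignored,
   so strings and binary data are handled the same way. */
static int key_equal(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, KEY_LEN) == 0;
}
```

With keys built this way, two copies of the same short string always compare equal, regardless of what the destination arrays contained beforehand.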
Therefore, in the remainder of my answer, I will assume that if the array contents represent a string, then all remaining bytes after the end of the string will be filled with bytes with the value 0. That way, I can assume that no bytes should ever be ignored.
Hash table solution
It is also possible to use a hash table in C, although you will have to program it yourself or use an already existing library. The C standard library does not provide any hashing functions.
The following code will randomly generate an array of 10,000 fixed-length (not null-terminated) strings of length 10 with the characters a to z, A to Z and 0 to 9, but it will also place three hard-coded words into the array in different places, so you can search for these words later, in order to test the search function. The program will then insert all words into a hash table, and then perform several hard-coded lookups into the hash table.
This program has the following output:
As you can see, it found all 3 strings that were explicitly inserted, and no other strings were found.
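For illustration, a hash set for fixed-length keys could be sketched as follows. This is a sketch under my own assumptions, not necessarily the approach of the program described above: it uses the FNV-1a hash and open addressing with linear probing, and the names key_set, hash_key, key_set_insert and key_set_contains are hypothetical. The table stores pointers only, so the key arrays must outlive the set, and the set must start out zero-initialized (for example by declaring it static).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN    10
#define TABLE_SIZE 32768u  /* power of two, more than twice 10,000 entries */

typedef struct {
    const unsigned char *slots[TABLE_SIZE];  /* NULL marks an empty slot */
} key_set;

/* 32-bit FNV-1a hash over all KEY_LEN bytes (zero bytes included). */
static uint32_t hash_key(const unsigned char *key)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < KEY_LEN; i++) {
        h ^= key[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Returns 1 if the key was added, 0 if it was already present,
   -1 if the table is full. */
static int key_set_insert(key_set *set, const unsigned char *key)
{
    uint32_t i = hash_key(key) & (TABLE_SIZE - 1);
    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        if (set->slots[i] == NULL) {
            set->slots[i] = key;
            return 1;
        }
        if (memcmp(set->slots[i], key, KEY_LEN) == 0)
            return 0;
        i = (i + 1) & (TABLE_SIZE - 1);  /* linear probing */
    }
    return -1;
}

/* Returns 1 if the key is in the set, 0 otherwise. */
static int key_set_contains(const key_set *set, const unsigned char *key)
{
    uint32_t i = hash_key(key) & (TABLE_SIZE - 1);
    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        if (set->slots[i] == NULL)
            return 0;
        if (memcmp(set->slots[i], key, KEY_LEN) == 0)
            return 1;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

Keeping the table at most about one third full keeps probe sequences short, so a lookup typically costs one hash computation and one or two memcmp calls.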
In the code, I am using unsigned char instead of char, because in C, a char * is usually used for passing null-terminated strings. Therefore, it seemed more appropriate to use unsigned char * for passing data that could be binary or not null-terminated.
bsearch solution
For comparison, here is a solution which generates the input array in the same way, but uses bsearch instead of a hash table for searching it:
This program has the following output:
The hash table solution is probably faster than the binary search solution though, assuming that a good hash function is selected for the input, so that the number of hash collisions is minimal.
The signature of the bsearch function is as follows:
Here, key would be either t1 or t2 and base would be T. nmemb and size would be 9 and 4 respectively. compar is a pointer to a callback function that does the comparison. This can simply be a wrapper around memcmp:
Then you call bsearch with these parameters:
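The pieces described above can be put together roughly like this. The contents of T are hypothetical placeholder data (9 rows of 4 bytes, already sorted in memcmp order), and table_contains is a helper name introduced only for this sketch.

```c
#include <stdlib.h>
#include <string.h>

#define NMEMB 9  /* number of rows in T */
#define SIZE  4  /* bytes per row */

/* bsearch is declared in <stdlib.h> as:
   void *bsearch(const void *key, const void *base, size_t nmemb,
                 size_t size, int (*compar)(const void *, const void *)); */

/* Hypothetical table; bsearch requires the rows to be sorted according
   to the same comparison function that is used for the search. */
static const unsigned char T[NMEMB][SIZE] = {
    {0, 0, 0, 1}, {0, 0, 2, 0}, {0, 3, 0, 0},
    {1, 2, 3, 4}, {1, 2, 3, 5}, {9, 9, 9, 9},
    {65, 66, 67, 68}, {200, 0, 0, 0}, {255, 255, 255, 255}
};

/* Wrapper around memcmp with the callback signature bsearch expects. */
static int compar(const void *a, const void *b)
{
    return memcmp(a, b, SIZE);
}

/* Returns 1 if key matches a row of T, 0 otherwise. */
static int table_contains(const unsigned char key[SIZE])
{
    return bsearch(key, T, NMEMB, SIZE, compar) != NULL;
}
```

If the table is not sorted to begin with, calling qsort once with the same compar function brings it into the order that bsearch needs.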
I think the following binary search works (thanks to @WeatherVane). If the array of arrays is sorted, then the complexity is O(log n). We can use memcmp for the comparison.
Result: