Good average-speed/memory-efficient way to create a set in C?
Let's say that I am streaming non-empty strings (char[]/char*s) into my program. I would like to create a set of them. That is, for any element a in set S, a is unique in S.
I have thought to approach this in a few ways, but have run into issues.
If I knew the number of items n I would be reading, I could create a hash table of that size with every slot initially null, and skip the insertion whenever there was a collision. Once the insertions are done, I would iterate over the hash table's array, count the non-null values to get the size, create an array of that size, and copy all the values into it.
I could also use a single array and resize it before each element is added, using a search algorithm to check whether the element already exists before resizing and adding it.
I realize the second method would work, but because the elements may not be sorted, it could take a very long time for large inputs, both because of the search and because of the resizing.
Any input would be appreciated. Please feel free to ask questions in the comment box below if you need further information. Libraries would be very helpful! (Google searching "Sets in C" and similar things doesn't help very much.)
Comments (1)
A hash table can work even if you don't know how many elements you are going to insert. You simply define your hash table to use "buckets" (i.e., each position is actually a linked list of the elements that hash to the same value), and before each insertion you search the matching "bucket" to make sure the element has not already been inserted into the hash table. The key to avoiding large "buckets" to search through is a good hash algorithm.
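The bucket idea can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names `strset`, `strset_add`, `strset_contains`, and the fixed `NBUCKETS` are all hypothetical, the table never grows, and FNV-1a is used only as one example of a reasonable string hash.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024 /* assumption: fixed bucket count, no rehashing */

struct node {
    char *key;          /* private copy of the string */
    struct node *next;  /* next element in the same bucket */
};

struct strset {
    struct node *buckets[NBUCKETS];
    size_t count;       /* number of distinct strings stored */
};

/* 32-bit FNV-1a, picked here purely for illustration. */
static uint32_t hash_str(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

struct strset *strset_new(void)
{
    struct strset *set = malloc(sizeof *set);
    if (set == NULL)
        return NULL;
    for (size_t i = 0; i < NBUCKETS; i++)
        set->buckets[i] = NULL;
    set->count = 0;
    return set;
}

/* Returns 1 if key is in the set, 0 otherwise. */
int strset_contains(const struct strset *set, const char *key)
{
    assert(set != NULL && key != NULL);
    const struct node *n = set->buckets[hash_str(key) % NBUCKETS];
    for (; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

/* Returns 1 if inserted, 0 if already present, -1 if allocation failed. */
int strset_add(struct strset *set, const char *key)
{
    assert(set != NULL && key != NULL);
    if (strset_contains(set, key))
        return 0;                       /* duplicate: leave the set unchanged */

    struct node *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return -1;
    size_t len = strlen(key) + 1;
    fresh->key = malloc(len);
    if (fresh->key == NULL) {
        free(fresh);
        return -1;
    }
    memcpy(fresh->key, key, len);

    size_t i = hash_str(key) % NBUCKETS;
    fresh->next = set->buckets[i];      /* push onto the front of the bucket */
    set->buckets[i] = fresh;
    set->count++;
    return 1;
}

void strset_free(struct strset *set)
{
    if (set == NULL)
        return;
    for (size_t i = 0; i < NBUCKETS; i++) {
        struct node *n = set->buckets[i];
        while (n != NULL) {
            struct node *next = n->next;
            free(n->key);
            free(n);
            n = next;
        }
    }
    free(set);
}
```

Because `count` is maintained on every successful insertion, the final array the question wants can be allocated with exactly `count` slots and filled by walking the buckets once. With a fixed bucket count the lists grow as the input grows, so a real implementation would probably also resize and rehash.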
You can also, if you can define a strict weak ordering on your objects, use a binary search tree. Then if !(A < B) and !(B < A), A and B can be treated as equal, so you would not insert any additional copies of that object into the tree, which again defines a set.
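For C strings, `strcmp` supplies such an ordering, and a result of 0 corresponds to the "neither is less than the other" case. The sketch below uses hypothetical names (`tnode`, `tree_add`, `tree_size`, `tree_free`) and a plain unbalanced tree to keep it short, so already-sorted input would degrade it to a linked list. A balanced variant such as a red-black tree avoids that.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct tnode {
    char *key;                   /* private copy of the string */
    struct tnode *left, *right;
};

/* Returns 1 if inserted, 0 if an equal key exists, -1 if allocation failed. */
int tree_add(struct tnode **root, const char *key)
{
    assert(root != NULL && key != NULL);

    /* Walk down until we find an equal key or an empty link. */
    while (*root != NULL) {
        int c = strcmp(key, (*root)->key);
        if (c == 0)
            return 0;            /* equivalent element: do not insert */
        root = (c < 0) ? &(*root)->left : &(*root)->right;
    }

    struct tnode *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    size_t len = strlen(key) + 1;
    n->key = malloc(len);
    if (n->key == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->key, key, len);
    n->left = n->right = NULL;
    *root = n;                   /* attach at the empty link we stopped on */
    return 1;
}

/* Number of elements in the tree (recursive, for brevity). */
size_t tree_size(const struct tnode *t)
{
    return t == NULL ? 0 : 1 + tree_size(t->left) + tree_size(t->right);
}

void tree_free(struct tnode *t)
{
    if (t == NULL)
        return;
    tree_free(t->left);
    tree_free(t->right);
    free(t->key);
    free(t);
}
```

A side effect of the tree approach is that an in-order traversal yields the strings in sorted order, which may or may not matter for the final array.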
While I know you're using C, consider the fact that in the C++ STL, `std::set` is typically implemented as an RB-tree (a red-black tree, which is a balanced binary search tree), and `std::unordered_set` uses a hash table.
Using an array is a bad idea: every insertion needs a linear search for duplicates, and resizing before each insertion will take a long time, whereas insertions into a balanced tree can be done in O(log N) time, and insertions into a hash table in amortized O(1) time.