内存 C 中的 IP 查找表
我目前正在尝试 libpcap 和各种 C 应用程序,并尝试完成以下任务。程序初始化后,我想从文件加载 IP 并将它们存储在内存中。当我收到一些要处理的数据包详细信息时,我想将 IP 与加载到内存中的 IP 集进行比较。
在 C 中实现此功能的最佳方法/数据结构是什么?我需要适应列表增长和高效匹配,所以我觉得简单的查找数组将是一个错误的解决方案。帮助?
I'm currently experimenting with libpcap and various C applications and trying to do accomplish the following. Upon program initialization, I'd like to load IPs from a file and store them in memory. When I receive some packet details for processing, I'd like to compare an IP with the set of IP's loaded into memory.
What's the best way/data struct to implement this in C? I need to accommodate a list growth and efficient matching, so I feel like a simple lookup array would be a wrong solution. Help?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
好吧,想必您永远不会在运行时删除 IP,而只是添加 IP。如果列表没有变得很大,那么对其进行排序实际上不会有太大的好处。
考虑到这两个事实,我可能会将它们全部放入一个(大尺寸)数组中,并在需要时进行线性搜索。跟踪数组中数据的末尾位置,在那里添加新条目将是一件微不足道的事情。
如果这真的太慢,你可以开发一个哈希表。当然,它需要根据 IP 映射的典型内容进行调整,以避免冲突(并进行开发和调试,因为 C 在标准中没有哈希值)。有点像 PITA,但应该是可行的。
我不会理会中间的任何事情(大概使用二进制搜索进行查找)。如果你那么渴望速度,那你不妨一路走下去。
Well, presumably you won't ever be removing IPs at runtime, just adding. If the list doesn't get huge, there would really be no big gain in sorting it.
Given those two facts, I'd probably just chuck them all in a (generously-sized) array and do linear searches when required. Keep track of where the end of data in the array is, and it will be a trivial matter to add new entries there.
If that is really too slow, you could develop a hash table. It would need to be tweaked based on the typical contents of your IP map to avoid collisions of course (and developed and debugged, as C doesn't have hashes in the standard). Bit of a PITA, but should be doable.
I wouldn't bother with anything in-between (presumably using binary searches for lookups). If you are that desperate for speed, you might as well go all the way.
为了获得真正不错的性能,绝对最少的工作量可能是仅使用 uint32_t 数组。
加载数据时,将每个 IP 放入数组中,并根据需要使用
realloc()
来增长它。请记住使用合理的增长模式,每次用完时将分配的长度加倍是常见的,并且可能会很好地工作。加载后,使用简单的
http://linux.die.net/man/3/qsort
调用对数组进行排序。然后你可以使用
bsearch()
快速搜索数组。由于这仅使用标准函数,因此代码量非常小,因此易于理解且易于编写。没有依赖关系,也不需要花时间去寻找健全的库,或者编写自己的更高级别的数据结构。但由于它使用二分搜索,所以速度会相当快。
The absolutely least amount of work, for really-decent performance, would probably be to just use an array of
uint32_t
.When loading your data, throw each IP into the array, using
realloc()
to grow it as necessary. Remember to use a sane growth pattern, doubling the allocated length each time it runs out is common and would probably work nicely.After load, sort the array using a simple
http://linux.die.net/man/3/qsort
call.Then you can search the array quickly using
bsearch()
.Since this uses only standard functions, it will be very small code-wise, and thus easy to understand and quick to write. No dependencies, and no time spent either chasing down sane libraries, or writing your own higher-level data structures. But since it uses binary search, it will be pretty fast.
很大程度上取决于表中可能包含的 IP 地址的数量。
对于少数情况,平衡二叉树(例如 AVL 树)应该工作得相当好。它有相当大的开销(每个节点 2 个指针),但只要节点数量很小,它可能就不是什么大问题(除非您的目标系统内存有限)。您还可以使用混合模式,其中单个节点在阵列中存储最多 N 个 IP 地址。通过半谨慎地选择 N,这可以减少指针开销,并提高缓存利用率。
如果您的数据可能超过 10K 左右,那么可能值得考虑使用 trie。
如果您的数字可能非常,您可以考虑使用简单的位集,每个 IP 地址一位。
编辑:我应该补充一点,与查找相比,它还取决于插入/删除的频率。我发现在许多情况下有用的一种混合结构是从已排序的主数组开始,然后在添加项目时将它们保留在未排序的单独数组中。当/如果辅助数组变得太大时,您可以对其进行排序并与主数组合并。
A lot depends on the number if IP addresses you're likely to have in your table.
For a small number, a balanced binary tree (e.g., AVL tree) should work reasonably well. It has a fair amount of overhead (2 pointers per node) but as long as the number of nodes is small, it's probably not much of an issue (unless you're targeting a system with constrained memory). You could also use a hybrid where a single node stores up to N IP addresses in an array. With semi-careful selection of N, this can reduce pointer overhead, and improve cache usage.
If you're likely to have more than 10K or so, it would probably be worth considering using a trie instead.
If you're likely to have a really large number, you might consider using a simple bitset, one bit per IP address.
Edit: I should add that it can also depend on the frequency of insertions/deletions compared to lookups. One hybrid structure I've found useful in many situations is to start with a main array that's sorted, then as items are added, keep them in a separate array that's not sorted. When/if the secondary array gets too big, you sort it and merge with the main array.