在内存中存储大量 IP 地址
问题有点长,请耐心等待。
我正在编写 java 代码来将全天网络跟踪中的流量聚合到每个子网的 84 秒 bin 中。目前,我最多有 256 个子网,每个子网有 1024 个容器。我用它来获取流量特征统计信息,例如每个子网每个窗口中的连接数、输入/输出字节数、外部 ip 地址数。虽然连接、输入/输出字节很简单,但获取唯一数量的外部 IP 地址会导致 OutOfMemory 错误。
为了确定外部 IP 地址的唯一数量,我需要将 IP 地址存储在某种数据结构(例如哈希表)中,并且在跟踪结束时我可以获得该哈希表的大小。这意味着我将拥有 1024*256 个哈希表,每个哈希表存储大量 12-15 字节的 IP 地址字符串(数十到数千)。这很快就会崩溃,系统内存不足(我尝试将 java 堆大小设置为最大 2GB,但没有成功)。谁能建议一种有效存储如此大量对象的方法?
我尝试使用 bitset(将 ip 转换为 int),但是考虑到 ip 地址非常稀疏,它对内存情况没有帮助。作为最后的手段,我可能会使用 colt 库稀疏矩阵,每个双精度存储最多 64 个 IP 地址,但我想获得意见,以防我遗漏了一些明显的东西,并且可以节省编写/调试此类包装器的时间。
旁注:为了了解规模,我在解析和聚合的每个跟踪中看到了数亿个流量。大多数情况下,我使用 256 个子网中的 10 到 20 个,但我希望该解决方案能够扩展到所有 256 个子网。
Question is a bit long so please bear with me.
I'm writing java code to aggregate flows from a full day network trace into 84 sec bins for each subnet. Currently, I have upto 256 subnets and 1024 bins for each subnet. I use this to acquire traffic feature statistics such as number of connections, in/out bytes, number of external ip addresses in each window of each subnet. While connections, in/out bytes is simple, getting unique number of exteran ip addresses is causing OutOfMemory errors.
To determine unique number of external ip addresses, I need to store IP address in some data structure such as hash table, and at the end of trace I can get the size of this hashtable. This means I'll have 1024*256 hashtables, each storing large number of 12-15 byte IP Address strings (tens to thousands). This quickly blows up and system goes out of memory (I've tried to set java heap size to upto 2GB with no avail). Can anyone suggest a way to store such a large number of objects efficiently?
I tried using bitset (converting ip to a int) however considering ip addresses are very very sparse, it doesn't help with memory situation. As the last resort, I might use colt library sparse matrices with each double storing upto 64 ip addresses but I wanted to get an opinion in case I'm missing something obvious and can save up on time writing/debugging such a wrapper.
Sidenotes: To get an idea of scale, I see several hundred million flows per trace that I parse and aggregate. I'm using 10 to 20 of 256 subnets in most cases but I'd like the solution to be scalable to all 256 subnets.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
更新:
如果您将整个 40 亿个 IPv4 地址存储为一个数组,那么您可以将时间表示为一个单独的短时间。
那就是 8GB,时间分辨率为 65K。只需考虑一下这一点,因为它对内存设置了上限,因为任何其他方案都必须低于该上限。如果使用一个字节,则每个 bin 的时间分辨率为 256,时间分辨率为 337.5 秒,容量为 4 GB。
现在,您只需要一点点就可以说您在该桶中至少看到了一个数据包。如果您需要再次耗尽内存的计数,但由于短路,您可以使用 1024 个具有潜在 6 位分辨率的存储桶进行计数:最多 64 个数据包。
现在,拥有 1 亿个唯一 IP,可将内存减少 10 倍,理论上从 8GB 增加到 800MB。虽然不分配您认为可以节省内存的整个空间,但您仍然必须为每个 IP 存储 4 个字节:400MB 仅用于 IP + 400MB 用于某种排序结构来保存它们(100M 指针 * 4 字节),以及 2 个字节时间:最少 1GB。通过分配完整空间,您可以跳过再次存储 IP,因为您的哈希值就是您的 IP。如果减少数组,您就不再拥有 IP,因为它已被散列掉。现在,您无法存储 IP 并仍然回答给定 IP 的问题,但您无法反省它。
如果您存储了一系列子网掩码,然后汇总该子网掩码下的所有 IP,并保留该子网掩码上的统计信息,该怎么办?例如,您有 256 个子网,它们都有自己的子网掩码。您的程序将采用掩码的下限。也就是说,如果您的掩码为 209.134.0.0/16 并使用下限 8。那么它将为该子网创建 256 个 bin,这些 bin 与 209.134.0.0-209.134.255.255 分开。您需要对拥有的所有 256 个子网重复相同的过程。下限为 8 位意味着每个子网的低 256 个地址将被汇总。您可以将任何 IP 地址散列到 bin 中并将统计信息保存在内存中。但是,您无法对单个 IP 地址说任何话。但是,如果您需要更高的分辨率,您可以将较低的子网掩码(例如 4)删除,现在有更多的垃圾箱。
仅当其中有 1 个 IP 时,您才会创建一个 bin,因此,如果您没有 IP 显示在那里,您可以节省一些空间,这样它就可以在足够低的下降分辨率和足够高的值之间进行平衡,以跳过为您不喜欢的事情创建 bin不需要。
然后,您可以写出每个 bin 的日志并跟踪磁盘上每个 bin 中发生的情况。如果您想回答有关单个 IP 的问题,您需要找出它属于哪个 bin,然后打开文件并在其中搜索以找到答案。此方案意味着您可以根据数据大小以及提高和降低界限来扩大或缩小规模。您可以通过更改每个 bin 写入的文件结构来提高性能。
我知道很抱歉的长度! :-)
Update:
If you stored the entire 4 Billion IPv4 addresses as a single array then you could represent time as an individual short.
That'd be 8GB with 65K time resolution. Just consider that because it puts an upper bound on the memory because any other scheme must be under that. If you used a byte it'd be 256 time resolution for 337.5 sec per bin, and 4 GB.
Now that gives you only a bit to say you saw at least a packet within that bucket. If you need a count that blows up the memory again, but with a short you could use 1024 buckets with a potential 6 bit resolution for counting: 64 packet max.
Now with 100 million unique IPs that reduces memory by 10x so you went from 8GB to 800MB in theory. While not allocating the entire space you think you can save memory, but you still have to store 4 bytes per IP: 400MB just for the IPs + 400MB for some sort structure to hold them (100M pointers * 4 bytes), and 2 bytes for time: 1GB minimum. By allocating the full space you get the ability to skip storing the IP over again because your hash is your IP. If you reduce the array you don't have the IP anymore because it's been hashed away. Now you could not store the IP and still answer questions given an IP, but you can't regurgitate it.
What if you stored a series of subnet masks, and then rolled up all the IPs underneath that, and keep your stats on that subnet mask. For example, you have 256 subnets with their own subnet mask. Your program would take in a lower bound of the mask. That being if you mask is 209.134.0.0/16 and use a lower bound of 8. Then it would create 256 bins for that subnet that were apart of 209.134.0.0-209.134.255.255. You'd repeat that same process for all 256 subnets you have. With a lower bound of 8 bits means the lower 256 addresses of each subnet would be rolled up. You can hash any IP address into the bin and keep statistics in memory. However, you can't say anything about a single IP address. But, if you need more resolution you can just drop the lower subnet mask say to 4, and now there are more bins.
You only create a bin if you have 1 IP within it so if you don't have IPs showing up there you can save some space so it's balancing act between low enough for descent resolution, but high enough to skip creating bins for things you don't need.
Then you could write out a log of each bin and track what occurred in each bin on the disk. If you want to answer a question about a single IP you figure out which bin it belongs in then open the file and search within there to find the answer. This scheme means you can scale up or down depending on the size of your data, and by raising and lowering your bounds. You could get performance boost by changing the file structure you write out per bin.
I know sorry for the length! :-)
不知道为什么你有 1024*256?
你只需要一个数据结构来保存所有数据;使用由 IP 作为 4 字节整数作为键控的红黑树。这为您提供了 O(log(n)) 查找时间,这意味着最坏的情况是 32 次比较才能找到 IP。或者一个由
Integer
键控的HashMap
。每个节点都有 84 个“bin”对象(存储在链接列表、数组或任何对您所拥有的访问模式有意义的内容中),其中包含您要存储的信息。如果您只需要聚合...仅存储聚合。这确实会减少你的内存使用量。
编辑:我倾向于忘记 Java
int
已签名。这不会造成问题,除非您确实想轻松地对它们进行排序,在这种情况下使用long
/Long
Not not sure why you have 1024*256?
You only need one data structure to hold all the data; Use a red-black tree keyed by the IP as a 4-byte integer. This gives you O(log(n)) lookup time which means worst case is 32 comparisons to find an IP. Or a
HashMap
keyed byInteger
.In each node have your 84 "bin" objects (stored in a Linked list, Array, or whatever makes sense in terms of the access pattern you have) that contain the information you want to store. If you only need the aggregate ... only store the aggregate. This really would cut down your memory usage.
Edit: I tend to forget about Java
int
s being signed. That doesn't pose a problem unless you actually want to sort them easily in which case use along
/Long
我会有多个位集,例如
每个地址只有 1 位,具体取决于 IP 地址的空闲程度。鉴于许多地址范围不会出现,内存消耗可能非常有效。如果没有真实的数据集,很难进行测试。 ;)
最坏情况下的内存大小是 512 MB,但对于实际数据集来说,它应该比这个小得多。
I would have multiple BitSets e.g.
This will as little as 1 bit per address, depending on how spare the IP address. Given many address ranges won't appear, memory consumption could be very efficient. Its very hard to test without a realistic dataset. ;)
The worst case memory size is 512 MB, but it should be much less than this for realistic data sets.