Java中IP地址过滤器内存数据结构的最佳选择

发布于 2024-12-18 12:57:11 字数 1314 浏览 4 评论 0原文

我有一个像 192.168.1.0/24 这样的 CIDR 格式的文件,它被转换成这样的两列结构

3232236030 3232235777

每个字符串 IP 地址转换都发生在这段代码中:

String subnet = "192.168.1.0/24";
SubnetUtils utils = new SubnetUtils(subnet);

Inet4Address a = (Inet4Address) InetAddress.getByName(utils.getInfo().getHighAddress());
long high = bytesToLong(a.getAddress());
Inet4Address b = (Inet4Address) InetAddress.getByName(utils.getInfo().getLowAddress());
long low = bytesToLong(b.getAddress());

private static long bytesToLong(byte[] address) {
   long ipnum = 0;
   for (int i = 0; i < 4; ++i) {
       long y = address[i];
       if (y < 0) {
           y += 256;
       }
       ipnum += y << ((3 - i) * 8);
   }
   return ipnum;
}

考虑有超过 500 万个 <代码>(低高:3232236030 3232235777)。
此外,还会有交叉点,因此 IP 可以源自多个范围。只要第一个就足够了。
数据是只读的。
查找 ipToBefiltered 所属范围的最快方法是什么?该结构将完全在内存中,因此不需要数据库查找。

更新:

我发现了这个 Peerblock 项目(它有超过一百万的下载量,所以我想它必须有一些快速算法): http://code.google.com/ p/peerblock/source/browse/trunk/src/pbfilter/filter_wfp.c

有谁知道该项目使用什么技术来创建范围列表并搜索它们?

I have file that is CIDR format like this 192.168.1.0/24 and it is converted into this two column strucutre

3232236030 3232235777

Each string IP address convertion happens with this code:

String subnet = "192.168.1.0/24";
SubnetUtils utils = new SubnetUtils(subnet);

Inet4Address a = (Inet4Address) InetAddress.getByName(utils.getInfo().getHighAddress());
long high = bytesToLong(a.getAddress());
Inet4Address b = (Inet4Address) InetAddress.getByName(utils.getInfo().getLowAddress());
long low = bytesToLong(b.getAddress());

private static long bytesToLong(byte[] address) {
   long ipnum = 0;
   for (int i = 0; i < 4; ++i) {
       long y = address[i];
       if (y < 0) {
           y += 256;
       }
       ipnum += y << ((3 - i) * 8);
   }
   return ipnum;
}

Consider that there are over 5 million entries of (low high : 3232236030 3232235777).
Also there will be intersects so the IP can originate from multiple ranges. Just the first one is more than OK.
The data is read only.
What would be the fastest way to find the range the ipToBefiltered belongs to? The structure will be entirely in memory so no database lookups.

UPDATE:

I found this Peerblock project (it has over million download so I'm thinking it must have some fast algorithms):
http://code.google.com/p/peerblock/source/browse/trunk/src/pbfilter/filter_wfp.c

Does anyone know what technique is the project using for creating the list of ranges and than searching them?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

纵性 2024-12-25 12:57:11

归根结底,我只需要知道 IP 是否存在于任何 5M 范围内。

我会考虑一个n-ary树,其中n=256,并且从点分地址而不是转换后的整数开始工作。

顶层是一个包含 256 个对象的数组。 null 条目表示“否”,不存在包含地址的范围,因此给定示例 192.168.1.0/24 array[192] 将包含一个对象,但 array [100] 可能为 null,因为没有为任何 100.xxx/n 定义范围

存储的对象包含(引用)另一个数组 [256] 和范围说明符,只会设置两者之一,因此192.0.0.0/8 最终会出现一个范围说明符,指示该范围内的所有地址都将被过滤。这将允许诸如 192.255.0.0/10 之类的内容,其中地址的前 10 位很重要 1100 0000 11xx xxxx ——否则您需要检查下一个八位字节第二级数组。

最初将重叠范围(如果有)合并为更大的范围...例如 3 .. 107 .. 16 变为 3 .. 16 ...允许这样做,因为您不需要将给定的 IP 与定义它的范围关联起来。

这应该需要不超过 8 次比较。每个八位字节最初直接用作索引,然后比较 null,比较终端节点(它是范围还是指向下一个树级别的指针)

最坏情况的内存消耗理论上为 4 GB (256 ^ 4) 如果每个 IP 地址都在过滤范围内,但当然这会合并成一个范围,因此实际上只有 1 个范围对象。更现实的最坏情况可能更像 (256 ^ 3) 或 16.7 MB。现实世界的使用可能会使每个级别的大多数 array[256] 节点为空。

这本质上类似于霍夫曼/前缀编码。一旦找到答案(范围),最短的不同前缀就会终止,因此通常您会得到 平均值4比较。

When it comes down to it I just need to know if the IP is present in any of the 5M ranges.

I would consider an n-ary tree, where n=256, and work from the dotted address rather than the converted integer.

The top level would be an array of 256 objects. A null entry means "No" there is no range that contains the address, so given your example 192.168.1.0/24 array[192] would contain an object, but array[100] might be null because no range was defined for any 100.x.x.x/n

The stored object contains a (reference to) another array[256] and a range specifier, only one of the two would be set, so 192.0.0.0/8 would end up with a range specifier indicating all addresses within that range are to be filtered. This would allow for things like 192.255.0.0/10 where the first 10 bits of the address are significant 1100 0000 11xx xxxx -- otherwise you need to check the next octet in the 2nd level array.

Initially coalescing overlapping ranges, if any, into larger ranges... e.g. 3 .. 10 and 7 .. 16 becomes 3 .. 16 ... allows this, since you don't need to associate a given IP with which range defined it.

This should require no more than 8 comparisons. Each octet is initially used directly as an index, followed by a compare for null, a compare for terminal-node (is it a range or a pointer to the next tree level)

Worst case memory consumption is theoretically 4 GB (256 ^ 4) if every IP address was in a filtering range, but of course that would coalesce into a single range so actually would be only 1 range object. A more realistic worst-case would probably be more like (256 ^ 3) or 16.7 MB. Real world usage would probably have the majority of array[256] nodes at each level empty.

This is essentially similar to Huffman / prefix coding. The shortest distinct prefix can terminate as soon as an answer (a range) is found, so often you would have averages of < 4 compares.

失退 2024-12-25 12:57:11

我将使用一个排序的 int 数组(基地址)和另一个相同大小的数组(结束地址)。这将使用 5M * 8 = 40 MB。第一个 IP 是基础地址,第二个 IP 是范围内的最后一个地址。您需要删除交叉点。

要查找地址是否被过滤为二分搜索 O(log N),如果不完全匹配,请检查它是否小于(或等于)上限。

I would use a sorted array of int (the base address) and another array the same size (the end address). This would use 5M * 8 = 40 MB. The first IP is the base and the second IP is the last address in range. You would need to remove intersections.

To find if an address is filtered to a binary search O(log N) and if not an exact match, check it is less than (or equal to) the upper bound.

呆橘 2024-12-25 12:57:11

我在 Vuze(又名 azureus) 项目中发现了这个二进制 Chop 算法:

public IpRange isInRange(long address_long) {
    checkRebuild();

    if (mergedRanges.length == 0) {
        return (null);
    }

    // assisted binary chop

    int bottom = 0;
    int top = mergedRanges.length - 1;
    int current = -1;

    while (top >= 0 && bottom < mergedRanges.length && bottom <= top) {

        current = (bottom + top) / 2;

        IpRange e = mergedRanges[current];

        long this_start = e.getStartIpLong();
        long this_end = e.getMergedEndLong();

        if (address_long == this_start) {
            break;
        } else if (address_long > this_start) {

            if (address_long <= this_end) {
                break;
            }

            // lies to the right of this entry

            bottom = current + 1;

        } else if (address_long == this_end) {
            break;
        } else {
            // < this_end

            if (address_long >= this_start) {
                break;
            }
            top = current - 1;
        }
    }

    if (top >= 0 && bottom < mergedRanges.length && bottom <= top) {

        IpRange e = mergedRanges[current];

        if (address_long <= e.getEndIpLong()) {
            return (e);
        }

        IpRange[] merged = e.getMergedEntries();

        if (merged == null) {
            //inconsistent merged details - no entries
            return (null);
        }

        for (IpRange me : merged) {
            if (me.getStartIpLong() <= address_long && me.getEndIpLong() >= address_long) {
                return (me);
            }
        }
    }
    return (null);
}

似乎表现得很好。如果您知道更快的事情,请告诉我。

I found this binary chop algorithm in Vuze (aka azureus) project:

public IpRange isInRange(long address_long) {
    checkRebuild();

    if (mergedRanges.length == 0) {
        return (null);
    }

    // assisted binary chop

    int bottom = 0;
    int top = mergedRanges.length - 1;
    int current = -1;

    while (top >= 0 && bottom < mergedRanges.length && bottom <= top) {

        current = (bottom + top) / 2;

        IpRange e = mergedRanges[current];

        long this_start = e.getStartIpLong();
        long this_end = e.getMergedEndLong();

        if (address_long == this_start) {
            break;
        } else if (address_long > this_start) {

            if (address_long <= this_end) {
                break;
            }

            // lies to the right of this entry

            bottom = current + 1;

        } else if (address_long == this_end) {
            break;
        } else {
            // < this_end

            if (address_long >= this_start) {
                break;
            }
            top = current - 1;
        }
    }

    if (top >= 0 && bottom < mergedRanges.length && bottom <= top) {

        IpRange e = mergedRanges[current];

        if (address_long <= e.getEndIpLong()) {
            return (e);
        }

        IpRange[] merged = e.getMergedEntries();

        if (merged == null) {
            //inconsistent merged details - no entries
            return (null);
        }

        for (IpRange me : merged) {
            if (me.getStartIpLong() <= address_long && me.getEndIpLong() >= address_long) {
                return (me);
            }
        }
    }
    return (null);
}

Seems to be performing pretty well. If you know about something faster please let me know.

茶色山野 2024-12-25 12:57:11

如果您只有一个 CIDR 地址(或它们的列表),并且想要检查某个 ipAddress 是否在该 CIDR(或 CIDR 列表)的范围内,只需定义一组 SubnetUtils 对象即可。

除非您要过滤非常大的 N 个地址,否则这都是字符串比较,并且执行速度非常快。您不需要基于高/低阶位和所有复杂的 Jazz 构建二叉树。

String subnet = "192.168.1.0/24";
SubnetUtils utils = new SubnetUtils(subnet);
//...
//for each subnet, create a SubnetUtils object
Set<SubnetUtils> subnets = getAllSubnets();
//...

使用 Guava 谓词来过滤不在子网集范围内的 ipAddresses:

   Set<String> ipAddresses = getIpAddressesToFilter();
   Set<String> ipAddressesInRange = 
       Sets.filter(ipAddresses, filterIpsBySubnet(subnets))


   Predicate<String> filterIpsBySubnet(final Set<SubnetUtils> subnets){
       return new Predicate<String>() {
            @Override
            public boolean apply(String ipAddress) {
                for (SubnetUtils subnet : subnets) {
                    if (subnet.getInfo().isInRange(ipAddress)) {
                        return true;
                    }
                }
                return false;
            }
        };
   }

现在,如果 IP 位于任何子网中,您就有了一个很好的简单过滤器,并且不必构建您将拥有的数据结构进行单元测试。如果这还不够性能,那么就进行优化。不要过早优化:)

If you just have a CIDR address (or a list of them) and you want to check if some ipAddress is in the range of that CIDR (or list of CIDR's), just define a Set of SubnetUtils objects.

Unless you are filtering a very large N addresses, this is all String comparison and will execute extremely fast. You dont need to build a binary tree based on the higher/lower order bits and all of that complicated Jazz.

String subnet = "192.168.1.0/24";
SubnetUtils utils = new SubnetUtils(subnet);
//...
//for each subnet, create a SubnetUtils object
Set<SubnetUtils> subnets = getAllSubnets();
//...

Use a Guava Predicate to filter the ipAddresses that are not in the range of your set of subnets:

   Set<String> ipAddresses = getIpAddressesToFilter();
   Set<String> ipAddressesInRange = 
       Sets.filter(ipAddresses, filterIpsBySubnet(subnets))


   Predicate<String> filterIpsBySubnet(final Set<SubnetUtils> subnets){
       return new Predicate<String>() {
            @Override
            public boolean apply(String ipAddress) {
                for (SubnetUtils subnet : subnets) {
                    if (subnet.getInfo().isInRange(ipAddress)) {
                        return true;
                    }
                }
                return false;
            }
        };
   }

Now if the IP is in any of the Subnets, you have a nice simple filter and you dont have to build a data structure that you will have to unit test. If this is not performant enough, then go to optimization. Don't prematurely optimize :)

深海里的那抹蓝 2024-12-25 12:57:11

这是答案的开始,当我有更多空闲时间时我会回来

设置:

  1. 按起始数字对范围进行排序。
  2. 由于这些是 IP 地址,我假设没有任何范围重叠。如果存在重叠,您可能应该运行列表合并范围并修剪不必要的范围(例如,如果您有范围 1 - 10,则可以修剪范围 5 - 7)。
    1. 要合并或修剪,请执行以下操作(假设范围 a 紧邻范围 b):
      1. 如果 b.end < a.end 那么范围 b 是范围 a 的子集,您可以删除范围 b。
      2. 如果 b.start < b.end和b.end> a.end 然后你可以合并范围a和b。设置 a.end = b.end 然后删除范围 b。

Here is the beginning of an answer, I'll come back when I get more freetime

Setup:

  1. Sort the ranges by the starting number.
  2. Since these are IP Addresses, I assume that none of the ranges overlap. If there are overlaps, you should probably run the list merging ranges and trimming unnecessary ranges (ex. if you have a range 1 - 10, you can trim the range 5 - 7).
    1. To merge or trim do this (assume range a immediately precedes range b):
      1. If b.end < a.end then range b is a subset of range a and you can remove range b.
      2. If b.start < b.end and b.end > a.end then you can merge range a and b. Set a.end = b.end then remove range b.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文