优化 bin 放置算法
好吧,我有两个集合,我需要将 collection1 中的元素放入 collection2 的容器(元素)中,具体取决于它们的值是否落在给定容器的范围内。
举一个具体的例子,假设我有一个排序的集合对象(bins),它有一个 int 范围([1...4]、[5..10]等)。我需要确定 int 所属的范围,并将其放入适当的容器中。
foreach(element n in collection1) {
foreach(bin m in collection2) {
if (m.inRange(n)) {
m.add(n);
break;
}
}
}
所以明显的 NxM 复杂度算法就在那里,但我真的很想看到 Nxlog(M)。为此,我想使用 BinarySearch 代替内部 foreach 循环。要使用 BinarySearch,我需要实现一个 IComparer 类来为我进行搜索。我遇到的问题是这种方法需要我创建一个 IComparer.Compare 函数来比较两种不同类型的对象(一个元素与其容器),但这似乎不可能或不正确。那么我想问一下,这个算法应该怎么写呢?
Alright, I've got two collections, and I need to place elements from collection1 into the bins (elements) of collection2, based on whether their value falls within a given bin's range.
For a concrete example, assume I have a sorted collection objects (bins) which have an int range ([1...4], [5..10], etc). I need to determine the range an int falls in, and place it in the appropriate bin.
foreach(element n in collection1) {
foreach(bin m in collection2) {
if (m.inRange(n)) {
m.add(n);
break;
}
}
}
So the obvious NxM complexity algorithm is there, but I really would like to see Nxlog(M). To do this I'd like to use BinarySearch in place of the inner foreach loop. To use BinarySearch, I need to implement an IComparer class to do the searching for me. The problem I'm running into is this approach would require me to make an IComparer.Compare function that compares two different types of objects (an element to its bin), and that doesn't seem possible or correct. So I'm asking, how should I write this algorithm?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
让我们重述一下这个问题。您希望
这样编写,使 GetBinIndex 的性能在 bin 数量方面优于 O(n)。
这取决于箱的拓扑结构。
如果 bin 只是连续的非负整数范围,例如 0..4、5..9、10..14 等,则只需将 item 除以 5 即可。那是 O(1)。
如果 bin 是不同大小的连续整数范围,例如 0..4、5..14、15..16、17..17、18..32,...,则
搜索的时间复杂度为 O(lg n),构建列表的时间复杂度为 O(n)。
如果箱是不连续的整数范围(也就是说,如果范围有空洞,或者它们重叠),那么您想要构建以有效查找最佳范围的数据结构是区间树。在非病态情况下搜索区间树的时间复杂度通常为 O(lg n),首先构建树的时间复杂度为 O(n lg n)。
Let's restate the problem. You wish to write
such that the performance of GetBinIndex is better than O(n) in the number of bins.
It depends on the topology of the bins.
If the bins are simply contiguous non-negative integer ranges, say, 0..4, 5..9, 10..14, and so on, then just divide item by 5, done. That's O(1).
If the bins are contiguous integer ranges of different sizes, say, 0..4, 5..14, 15..16, 17..17, 18..32, ... then:
List<int>
that stores the top of each range. So in this case, {4, 14, 16, 17, 32, ...}This is O(lg n) to search, and O(n) to build the list in the first place.
If the bins are noncontiguous integer ranges -- that is, if the ranges have holes, or if they overlap -- then the data structure you want to build to efficiently find the best range is an interval tree. Interval trees are typically O(lg n) to search in nonpathological situations, and O(n lg n) to build the tree in the first place.
我不确定我是否完全理解这个问题,因为我并没有真正理解这一部分:
尽管如此,我会尽力而为:
IComparer 用于对集合进行排序,以便您可以执行二分搜索。查看 MSDN 文章:http://msdn。 microsoft.com/en-us/library/system.collections.icomparer.aspx
因此,基本上,您需要确保首先使用 IComparer 对 Collection2 进行排序,IComparer 基本上只是从最低到最高范围对 Bins 进行排序。从您在第二个 foreach 内进行中断的事实来看,似乎您没有任何重叠,因此这不应该成为问题。
您不会使用 Array.BinarySearch (http: //msdn.microsoft.com/en-us/library/system.array.binarysearch.aspx) 方法,因为它搜索数组中的特定对象(也许这就是您上面引用的内容) ?),但是您可以毫无困难地实现自己的二分搜索。
I'm not sure I understand the question completely because I didn't really get this part:
Nevertheless, I'll try to do my best:
IComparer is used to sort the collection so that you can perform a binary search. Take a look at the MSDN article: http://msdn.microsoft.com/en-us/library/system.collections.icomparer.aspx
So basically, you want to make sure that you first sort Collection2 using your IComparer which basically just sorts the Bins from lowest to highest range. Judging by the fact that you're doing a break inside the second foreach, it seems you don't have any overlap so that shouldn't be an issue.
You're not going to use the Array.BinarySearch (http://msdn.microsoft.com/en-us/library/system.array.binarysearch.aspx) method because it searches for a specific object in the array (maybe this is what you were referring to with that quote above?), but you can implement your own binary search without much difficulty.
仅当 Bin2 中的元素已排序时,二分查找才有效。因此,将 Bin2 集合更改为排序集合(例如数组)。在
m*logm
时间内对其进行排序,然后使用二分查找在logm
时间内放置新项目。总而言之:m*logm + n*logm
。这可以进一步优化 - 但这只是一个开始。
Binary search will only work if the elements in Bin2 are sorted. So, change the Bin2 collection to a sorted collection (array, for example). Sort it in
m*logm
time, and then use binary search to place the new items inlogm
time. All in all:m*logm + n*logm
.This can be optimized further - but it's a start.
如果(这是一个很大的如果)您的垃圾箱具有可计算的上限和下限索引,那么您的问题就转化为相对简单且高效的问题,即设计一种哈希算法并运行一次要分箱的项目集合。如果你的垃圾箱没有可计算的索引,为什么不改变你的问题,让它们有呢?
进一步OP的评论:
与其说你的垃圾箱是否有固定的边界,不如说是否有规则来计算给定垃圾箱编号的边界。因此,如果您的垃圾箱的边界为 1..5、6..10、11..15 等,则规则是
bin_bounds = (bin_number-1)*5+1..(bin_number*5)< /code>
散列函数只是该函数的逆函数,即给定一个整数,计算 bin 的索引号。
但是如果你的 bin 的边界本质上是任意的,那么基本上不可能找到这样的哈希函数。根据我的经验,任意大小的垃圾箱相对不常见。当然,我不知道你的问题的任何细节,所以所有这些可能对你没有任何帮助。
IF (that's a big if) your bins have computable upper and lower indices then your problem translates into the relatively easy, and efficient, one of devising a hashing algorithm and running through your collection-of-items-to-be-binned once. And if your bins don't have computable indices, why not transform your problem so that they do have ?
FURTHER to OP's comment:
It's not so much whether your bins have fixed bounds as whether there is a rule to compute the bounds given the bin number. So, if your bins had bounds 1..5, 6..10, 11..15, etc, the rule is that
bin_bounds = (bin_number-1)*5+1..(bin_number*5)
The hashing function is simply the inverse of this function, ie given an integer, compute the index number of the bin.
BUT if the bounds on your bins are essentially arbitrary it will be essentially impossible to find such a hash function. In my experience it's relatively unusual for bins to be of arbitrary sizes. Of course, I don't know your problem in any detail so all of this may not help you a jot.