High-performance intersection and distinct element extraction?

Posted 2024-08-10 22:53:21


I have a line like the following in my code:

potentialCollisionsX.Intersect(potentialCollisionsY).Distinct().ToList();

Through profiling, I have determined that this line is eating approximately 56 percent of my time. I need to figure out how to provide an efficient implementation. I tried:

        List<Extent> probableCollisions = new List<Extent>();
        for (int j = 0; j < potentialCollisionsX.Count; j++)
        {
            if (potentialCollisionsY.Contains(potentialCollisionsX[j]) && !probableCollisions.Contains(potentialCollisionsX[j]))
            {
                probableCollisions.Add(potentialCollisionsX[j]);
            }
        }

but that only drops it to 42 percent. Optimizations or alternative ideas would be much appreciated.

Edit: Someone requested information about the Extent class, and I can't think of a better way to give them that information than by providing the class definition.

    private enum ExtentType { Start, End }
    private sealed class Extent
    {
        private ExtentType _type;
        public ExtentType Type
        {
            get
            {
                return _type;
            }
            set
            {
                _type = value;
                _hashcode = 23;
                _hashcode *= 17 + Nucleus.GetHashCode();
            }
        }
        private Nucleus _nucleus; //Nucleus is the main body class in my engine
        public Nucleus Nucleus
        {
            get
            {
                return _nucleus;
            }
            set
            {
                _nucleus = value;
                _hashcode = 23;
                _hashcode *= 17 + Nucleus.GetHashCode();
            }
        }

        private int _hashcode;

        public Extent(Nucleus nucleus, ExtentType type)
        {
            Nucleus = nucleus;
            Type = type;
            _hashcode = 23;
            _hashcode *= 17 + Nucleus.GetHashCode();
        }

        public override bool Equals(object obj)
        {
            return Equals(obj as Extent);
        }
        public bool Equals(Extent extent)
        {
            if (this.Nucleus == extent.Nucleus) //nucleus.Equals does an int comparison
            {
                return true;
            }
            return false;
        }
        public override int GetHashCode()
        {
            return _hashcode;
        }
    }

Edit 2: It would seem that using hash sets makes this part of my code as performant as I need, so thanks for your help!
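
For context, a minimal sketch of what a hash-set-based version might look like; the final code is not shown in the question, and this assumes potentialCollisionsX and potentialCollisionsY are the List<Extent> collections used above.

    // One possible hash-set variant (not necessarily the code actually used):
    // build a set from one list, then intersect it in place with the other.
    HashSet<Extent> probableCollisions = new HashSet<Extent>(potentialCollisionsX);
    probableCollisions.IntersectWith(potentialCollisionsY); // relies on Extent's GetHashCode/Equals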


Comments (5)

゛清羽墨安 2024-08-17 22:53:22

If you can't come up with a better solution, consider using unmanaged code as a last resort.

眼趣 2024-08-17 22:53:22

Two approaches:

Put the items in a hashmap if they are not there already, else mark them in the hashmap as duplicated. This is O(n). You then iterate over all items in the hashmap and see if they are marked as duplicate or not - O(n) again.

Another approach:

Sort the two lists. This is an O(n lg n) operation, but crucially it might be that you can happily maintain the two lists sorted at all times, and therefore the cost is not taken when specifically looking for the intersection etc.

Then go through the two lists in order, finding the distinct and duplicate entries as you go. This is O(n).
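
To make the first approach concrete, here is a minimal sketch, assuming potentialCollisionsX and potentialCollisionsY are the List<Extent> collections from the question:

    // Sketch of the "mark duplicates in a hash map" idea. The dictionary value
    // records whether the element has been seen in both lists.
    var seen = new Dictionary<Extent, bool>();
    foreach (Extent ex in potentialCollisionsX)
        seen[ex] = false;                       // seen in X only (so far)
    foreach (Extent ex in potentialCollisionsY)
        if (seen.ContainsKey(ex))
            seen[ex] = true;                    // seen in both lists

    var probableCollisions = new List<Extent>();
    foreach (var pair in seen)
        if (pair.Value)
            probableCollisions.Add(pair.Key);   // the intersection, already distinct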

圈圈圆圆圈圈 2024-08-17 22:53:21

Intersect returns distinct elements anyway, making the call to Distinct() unnecessary. That will be eating up at least some of your time.

Also, do you actually need to call ToList? What are you then doing with the result?

Does the order matter? If not, you should consider using a HashSet<T> instead of a List<T> for your "manual" code. (And probably create a HashSet<T> for potentialCollisionsY as well.) This will make the Contains call faster, at least if the collections are large enough...

By the way, don't believe the documentation for Intersect - it's wrong about the order of operations (at least in .NET 3.5)
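
To make that concrete, a minimal sketch of the trimmed-down call (whether ToList() can really be dropped depends on how the result is consumed):

    // Distinct() removed - Intersect already yields distinct elements.
    // Leaving the result as a lazy IEnumerable<Extent> also avoids the ToList() copy
    // if it is only enumerated once.
    IEnumerable<Extent> probableCollisions =
        potentialCollisionsX.Intersect(potentialCollisionsY);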

傻比既视感 2024-08-17 22:53:21

OK, I see the definition of the Extent class. First of all, it violates the rule that if obj1.Equals(obj2)==true then obj1.GetHashCode()==obj2.GetHashCode(). But that's beside the point and can be fixed (if you don't, algorithms which depend on hashing, like a HashSet, will fail).

Now, if the only operation you can do on an Extent object is to compare for equality, then it will not be possible to get worst-case performance better than O(N*M) (where N is the size of the first collection and M is the size of the second). That's because you will ultimately have to compare every element with every element.

This can be made better by the use of GetHashCode() and the fact that objects with different hash codes will also be different themselves. Other people have suggested to use the HashSet class, that would be such a solution. The best case performance in this case would be O(N+M), and the worst case - O(N+N*M). On average though you should win, unless the GetHashCode() method is very poorly implemented and returns the same hash codes for many objects.

I myself prefer a more stable solution. If the Extent class could be sorted reliably (that is, if you could compare two Extent objects to see which one is bigger and which one is smaller), then you could sort both lists and the performance could be brought down to O(sorting+M+N). The idea is that when the lists are sorted, you can go through them both simultaneously and look for equal elements there.

Now the sorting performance is the tricky thing here. If you only implement the comparison operation (as in, the IComparable interface), you will be able to sort both lists in time O(N*logN+M*logM). The standard List.Sort() method should do that for you. All in all, the total performance would be O(N*logN+M*logM+N+M). You should note, however, that this uses the QuickSort algorithm, which performs poorly on nearly-sorted lists. The worst case is a completely sorted list, in which case the sort degrades to quadratic time, O(N^2+M^2). If your lists are already close to sorted, you should consider another sorting algorithm (and implement it yourself).
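
As an illustration, here is a minimal sketch of the sorted-merge intersection, assuming (hypothetically) that Extent implements IComparable<Extent>, for example by comparing some ordering key on its Nucleus:

    // Two-pointer intersection of two sorted lists; repeated elements are skipped
    // so the result is distinct. Assumes both lists are sorted with the same comparer.
    static List<Extent> SortedIntersect(List<Extent> sortedX, List<Extent> sortedY)
    {
        var result = new List<Extent>();
        int i = 0, j = 0;
        while (i < sortedX.Count && j < sortedY.Count)
        {
            int cmp = sortedX[i].CompareTo(sortedY[j]);
            if (cmp < 0) i++;
            else if (cmp > 0) j++;
            else
            {
                Extent match = sortedX[i];
                result.Add(match);
                // Skip further copies of this element in both lists.
                while (i < sortedX.Count && sortedX[i].CompareTo(match) == 0) i++;
                while (j < sortedY.Count && sortedY[j].CompareTo(match) == 0) j++;
            }
        }
        return result;
    }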

The ultimate in reliable speed would be if you could convert each Extent to an integer (or more generally, some string) with the property that if the strings are equal, the Extents are equal as well, and if the strings are not equal, then the Extents are not equal either. The thing with strings is that they can be sorted in linear time with algorithms like radix sort, radix tree, etc. Then the sorting would take only the time of O(N+M). In fact, if you constructed a Radix tree, you would only have to sort the first list and you could search for strings in it directly (with every search taking O(1) time). All in all, the total performance would be O(N+M) which is the best available.
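
A sketch of the integer-key idea follows. It sidesteps the radix sort by using the key to index directly into a lookup array, which gives the same O(N+M) behavior; it assumes (hypothetically) that Nucleus exposes a small, non-negative int Id that uniquely identifies it:

    // Direct-indexing intersection using a hypothetical Nucleus.Id key.
    int maxId = 0;
    foreach (Extent ex in potentialCollisionsY)
        maxId = Math.Max(maxId, ex.Nucleus.Id);

    bool[] inY = new bool[maxId + 1];
    foreach (Extent ex in potentialCollisionsY)
        inY[ex.Nucleus.Id] = true;

    var probableCollisions = new List<Extent>();
    foreach (Extent ex in potentialCollisionsX)
    {
        int id = ex.Nucleus.Id;
        if (id <= maxId && inY[id])
        {
            probableCollisions.Add(ex);
            inY[id] = false;   // don't add the same nucleus twice
        }
    }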

One thing you should always keep in mind though - big algorithms have big constants. The radix approach might look the best on paper, but it is quite tricky to implement and is generally slower than the simpler approaches for small amounts of data. Only if your lists have thousands or tens of thousands of elements should you start to think about this. Also, these algorithms require creating a lot of new objects, and the cost of each new() operation becomes significant as well. You should think carefully about minimizing the number of allocations required.

書生途 2024-08-17 22:53:21

Try this:

HashSet<Extent> result = new HashSet<Extent>();
HashSet<Extent> potentialSetY = new HashSet<Extent>(potentialCollisionsY);
foreach (Extent ex in potentialCollisionsX)
    if (potentialSetY.Contains(ex))
        result.Add(ex);

Hash sets are good at doing Contains quickly, but don't preserve order


If you need to preserve order, here's something a little more complicated: An ordered hash set. It uses normal hash set semantics (well, a dictionary, but it's the same thing), but before enumeration it reorders the items according to the insertion order.

// Unchecked code

using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class OrderedHashSet<T> : IEnumerable<T> {
    int currentIndex = 0;

    // Maps each item to the index at which it was first inserted.
    Dictionary<T, int> items = new Dictionary<T, int>();

    public bool Add(T item) {
        if (Contains(item))
            return false;
        items[item] = currentIndex++;
        return true;
    }

    public bool Contains(T item) {
        return items.ContainsKey(item);
    }

    public IEnumerator<T> GetEnumerator() {
        // Yield the keys in insertion order.
        return items.Keys.OrderBy(key => items[key]).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() {
        return GetEnumerator();
    }
}

Now simply change HashSet to OrderedHashSet in the above sample and it should work.
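
For clarity, a sketch of that substitution:

    // The earlier snippet with the result collection swapped for the ordered set,
    // so enumeration follows the order of potentialCollisionsX.
    OrderedHashSet<Extent> result = new OrderedHashSet<Extent>();
    HashSet<Extent> potentialSetY = new HashSet<Extent>(potentialCollisionsY);
    foreach (Extent ex in potentialCollisionsX)
        if (potentialSetY.Contains(ex))
            result.Add(ex);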
