Which is better for building a collection of distinct elements: HashSet or Linq's Distinct()?

Posted 2024-11-14 21:38:53


I'm wondering whether I can get a consensus on which is the better approach to creating a distinct set of elements: a C# HashSet, or IEnumerable's .Distinct(), which is a Linq function.

Let's say I'm looping through query results from the DB with a DataReader, and my options are to add the objects I construct to a List<SomeObject> or to a HashSet<SomeObject>. With the List option, I would wind up having to do something like:

myList = myList.Distinct().ToList<SomeObject>();

With the HashSet, my understanding is that adding elements to it takes care of the non-duplication by itself, assuming you've overridden the GetHashCode() and Equals() methods in SomeObject. I'm concerned mainly with the risks and performance aspects of the two options.
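For reference, a minimal sketch of that setup (SomeObject here is hypothetical, with Id assumed to be the identity key) shows the HashSet rejecting the duplicate on Add:

```csharp
using System;
using System.Collections.Generic;

var set = new HashSet<SomeObject>();
set.Add(new SomeObject { Id = 1, Name = "a" });
set.Add(new SomeObject { Id = 1, Name = "a" }); // duplicate: Add returns false
Console.WriteLine(set.Count); // 1

// Hypothetical SomeObject; Id is assumed to be the identity key.
class SomeObject
{
    public int Id { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj) =>
        obj is SomeObject other && Id == other.Id;

    // Must agree with Equals: equal objects must return equal hash codes.
    public override int GetHashCode() => Id.GetHashCode();
}
```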

Thanks.


沉溺在你眼里的海 2024-11-21 21:38:54


If you're looping through the results of a DbReader, adding your results to a HashSet would be better than adding them to a List and then doing a Distinct on that. You would save one iteration. (Distinct internally uses a HashSet.)
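A sketch of that single-pass approach, using an in-memory DataTable as a stand-in for a real DB command's reader (the column layout is made up for illustration):

```csharp
using System;
using System.Data;
using System.Collections.Generic;

// Simulate a DB result set containing duplicates; in real code the reader
// would come from command.ExecuteReader().
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Rows.Add(1);
table.Rows.Add(2);
table.Rows.Add(1); // duplicate row

var ids = new HashSet<int>();
using (IDataReader reader = table.CreateDataReader())
{
    while (reader.Read())
        ids.Add(reader.GetInt32(0)); // Add ignores duplicates in one pass
}
Console.WriteLine(ids.Count); // 2
```

No separate Distinct pass is needed afterward, which is where the saved iteration comes from.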

请你别敷衍 2024-11-21 21:38:54


The implementation of Distinct may use HashSet. Take a look at Jon Skeet's Edulinq implementation.

半世晨晓 2024-11-21 21:38:53


Anthony Pegram has said it best: use the right tool for the job. I say this because Distinct and HashSet aren't that different when it comes to performance. Use a HashSet when the collection should always hold only distinct items; it also tells the programmer that you can't add duplicates to it. Use a normal List<T> and .Distinct() on it when you will have to add duplicates and remove them later. The intention matters.

In general,

a) a HashSet may not do any good if you're adding new objects from the db and you haven't specified a custom Equals of your own. Every object from the db can be a new instance for your hashset (if you are just new-ing them up), and that will lead to duplicates in the collection. In that case use a normal List<T>.

b) If you do have an equality comparer defined for hashset, and your collection should always hold only distinct objects, use hashset.

c) If you do have an equality comparer defined for hashset, and you want only distinct objects from db but collection need not always hold only distinct objects (ie duplicates needed to be added later), a faster approach is to get the items from db to a hashset and then return a regular list from that hashset.

d) The best thing you can do is give the task of removing duplicates to the database; that's the right tool for the job.
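For cases (b) and (c), the equality comparer mentioned above can also be supplied externally via IEqualityComparer<T> rather than by overriding Equals/GetHashCode on the type. A minimal sketch (Person and PersonIdComparer are hypothetical names, with equality assumed to be by Id):

```csharp
using System;
using System.Collections.Generic;

// The comparer is passed to the HashSet constructor; the element type
// itself stays untouched.
var set = new HashSet<Person>(new PersonIdComparer());
set.Add(new Person { Id = 1, Name = "a" });
set.Add(new Person { Id = 1, Name = "b" }); // same Id: rejected as duplicate
Console.WriteLine(set.Count); // 1

class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Hypothetical comparer: two Persons are equal when their Ids match.
class PersonIdComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y) => x?.Id == y?.Id;
    public int GetHashCode(Person p) => p.Id.GetHashCode();
}
```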

As for performance differences, in my testing I always found HashSet to be faster, but only marginally. That's expected, considering that with the List approach you have to first add everything and then do a Distinct on it.

Test method: Starting with two general functions,

public static void Benchmark(Action method, int iterations = 10000)
{
    // Requires System.Diagnostics for Stopwatch.
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    Console.WriteLine(sw.Elapsed.TotalMilliseconds);
}

public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}

Implementation:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 ms

A difference of 2.5 seconds is not bad for a list of 10000 objects when iterated another 10000 times. For normal cases the difference will be hardly noticeable.

The best approach possibly for you with your current design:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 ms

There isn't any significant difference, see..


Partly unrelated - after posting this answer, I was curious to know what's the best approach in removing duplicates, from a normal list.

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 ms

Here the right tool, Distinct, is faster than the hackish HashSet! Perhaps it's the overhead of creating a hash set.


I have tested with various other combinations like reference types, without duplicates in original list etc. The results are consistent.

_蜘蛛 2024-11-21 21:38:53


What's better is what's the most expressive of describing your intention. The internal implementation details are more or less going to be the same, the difference being "who's writing the code?"

If your intention is to create from the ground up a distinct collection of items from a source that is not a collection of said items, I would argue for the HashSet<T>. You have to create the item, you have to build the collection, you might as well build the right one from the beginning.

Otherwise, if you already have a collection of items and you want to eliminate duplicates, I would argue for invoking Distinct(). You already have a collection, you just want an expressive way to get the distinct items out of it.

多像笑话 2024-11-21 21:38:53


"Better" is a tricky word to use - it can mean so many different things to different people.

For readability, I would go for Distinct() as I personally find this more comprehensible.

For performance, I suspect a hand-crafted HashSet implementation might perform mildly quicker - but I doubt it would be very different as the internal implementation of Distinct will no doubt itself use some form of hashing.

For what I think of as "best" implementation... I think you should use Distinct but somehow push this down to the database layer - i.e. change the underlying database SELECT before you fill the DataReader.

一世旳自豪 2024-11-21 21:38:53


For large collections HashSet is likely to be faster. It relies on the hashcode of the objects to quickly determine whether or not an element already exists in the set.

In practice, it (most likely) won't matter (but you should measure if you care).

I instinctively guessed at first that HashSet would be faster, because of the fast hash checking it uses. However, I looked up the current (4.0) implementation of Distinct in the reference sources, and it uses a similar Set class (which also relies on hashing) under the covers. Conclusion: there is no practical performance difference.

For your case, I would go with .Distinct for readability - it clearly conveys the intent of the code. However, I agree with one of the other answers that you probably should perform this operation in the DB if possible.
