当前位置：文江博客话题详情

对于创建不同的数据结构，什么更好：HashSet 还是 Linq 的 Distinct()？

发布于 2024-11-14 21:38:53 字数 558 浏览 3 评论 0原文

我想知道是否可以就哪种方法是创建一组不同元素的更好方法达成共识：C# HashSet 或使用 IEnumerable 的 .Distinct()，哪个是 Linq 函数？

假设我正在使用 DataReader 循环访问数据库的查询结果，我的选择是将我构造的对象添加到 List 或 HashSet code> 使用 List 选项，我最终不得不执行以下操作：

myList = myList.Distinct().ToList();

对于 HashSet，我的理解是，向其中添加元素会自行处理不重复问题，假设您已经重写了 SomeObject 中的 GetHashCode() 和 Equals() 方法。我主要关心选项的风险和性能方面。

谢谢。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

沉溺在你眼里的海 2024-11-21 21:38:54

如果您循环遍历 DbReader 的结果，将结果添加到哈希集会比将其添加到列表并对其执行 Distinct 更好。您将节省一次迭代。（Distinct内部使用了HashSet）

回复收藏 0 原文

请你别敷衍 2024-11-21 21:38:54

Distinct的实现可以使用HashSet。看看 Jon Skeet 的 Edulinq 实现。

回复收藏 0 原文

半世晨晓 2024-11-21 21:38:53

安东尼·佩格拉姆说得最好。使用适合工作的正确工具。我这样说是因为 Distinct 或 HashSet 在性能方面并没有那么大的不同。当集合应该始终只包含不同的内容时，请使用 HashSet。它还告诉程序员你不能向其中添加重复项。当您稍后需要添加重复项和删除重复项时，请使用普通的 List 和 .Distinct() 。意图很重要。

一般来说，

a) 如果您从数据库添加新对象并且没有指定您自己的自定义 Equals，则 HashSet 可能没有任何用处。数据库中的每个对象都可以是哈希集的新实例（如果您只是新手），这将导致集合中出现重复项。在这种情况下，请使用普通的 List。

b) 如果您确实为哈希集定义了相等比较器，并且您的集合应始终仅保存不同的对象，请使用哈希集。

c) 如果您确实为哈希集定义了相等比较器，并且您只需要来自数据库的不同对象，但集合并不总是只包含不同的对象（即稍后需要添加重复项），则更快的方法是从数据库获取项目到一个哈希集，然后从该哈希集中返回一个常规列表。

d) 你应该做的最好的事情就是将删除重复项的任务交给数据库，这是正确的工具 这是一流的！

至于性能差异，在我的测试中，我总是发现 HashSet 更快，但这只是微不足道的。 考虑到使用 List 方法，您必须首先添加然后对其执行不同操作，这是显而易见的。

测试方法：从两个通用函数开始，

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}

public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}

实现：

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 毫秒

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 毫秒

对于包含 10000 个对象的列表，再迭代 10000 次时，2.5 秒的差异已经不错了。对于正常情况，差异几乎不会被注意到。

对于您当前的设计，可能的最佳方法是：

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 毫秒

没有任何显着差异，请参阅..

部分不相关 - 在发布此答案后，我很想知道从正常列表中删除重复项的最佳方法是什么。

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 毫秒

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 毫秒

这里，正确的工具 Distinct 比黑客 HashSet 更快！也许是创建哈希集的开销。

我已经测试了各种其他组合，例如引用类型，原始列表中没有重复项等。结果是一致的。

Anthony Pegram has said it the best. Use the right tool for the job. I say this because a Distinct or HashSet isn't that big different when it comes to performance. Use a HashSet when the collection should always hold only distinct stuffs. It also tells the programmer that you cant add duplicates to it. Use a normal List<T> and .Distinct() ont it when you will have to add duplicates and remove duplicates later. The intention matters.

In general,

a) a HashSet may not do any good if you're adding new objects from db and you haven't specified a custom Equals of your own. Every object from db can be a new instance for your hashset (if you are just new-ing) and that will lead to duplicates in the collection. In that case use normal List<T>.

b) If you do have an equality comparer defined for hashset, and your collection should always hold only distinct objects, use hashset.

c) If you do have an equality comparer defined for hashset, and you want only distinct objects from db but collection need not always hold only distinct objects (ie duplicates needed to be added later), a faster approach is to get the items from db to a hashset and then return a regular list from that hashset.

d) The best thing you should do is to give the task of removing duplicates to database, thats the right tool And that's first class!

As for performance differences, in my testing I always found HashSet to be faster, but then that's only marginal. That's obvious considering with List approach you have to first add and then do a distinct on it.

Test method: Starting with two general functions,

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}

public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}

Implementation:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 ms

A difference of 2.5 seconds is not bad for a list of 10000 objects when iterated another 10000 times. For normal cases the difference will be hardly noticeable.

The best approach possibly for you with your current design:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 ms

There isn't any significant difference, see..

Partly unrelated - after posting this answer, I was curious to know what's the best approach in removing duplicates, from a normal list.

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 ms

Here the right tool Distinct is faster than hackish HashSet! Perhaps its the overhead of creating a hash set.

I have tested with various other combinations like reference types, without duplicates in original list etc. The results are consistent.

回复收藏 0 原文

_蜘蛛 2024-11-21 21:38:53

更好的是最能表达您的意图。内部实现细节或多或少是相同的，区别在于“谁在编写代码？”

如果您的意图是从头开始创建一个来自源的不同项目集合，而不是所述项目的集合，那么我会支持HashSet< ;T>。您必须创建项目，必须构建集合，您不妨从一开始就构建正确的集合。

否则，如果您已经拥有项目集合并且想要消除重复项，我主张调用 Distinct()。您已经有了一个集合，您只是想要一种富有表现力的方式来从中获取不同的项目。

回复收藏 0 原文

多像笑话 2024-11-21 21:38:53

“更好”这个词用起来很棘手——它对不同的人来说可能有很多不同的含义。

为了便于阅读，我会选择 Distinct() 因为我个人认为这更容易理解。

就性能而言，我怀疑手工制作的 HashSet 实现可能会执行得稍微快一些 - 但我怀疑它会非常不同，因为 Distinct 的内部实现无疑会使用某种形式的哈希。

对于我认为的“最佳”实现...我认为您应该使用 Distinct 但以某种方式将其推送到数据库层 - 即在填充 DataReader 之前更改底层数据库 SELECT。

回复收藏 0 原文

一世旳自豪 2024-11-21 21:38:53

~~对于大型集合，HashSet 可能会更快。它依赖于对象的哈希码来快速确定集合中是否已存在元素。~~

实际上，它（很可能）并不重要（但您应该衡量是否关心）。

一开始我本能地猜测 HashSet 会更快，因为它使用快速哈希检查。但是，我在参考源中查找了 Distinct 的当前（4.0）实现，它在幕后使用了类似的 Set 类（也依赖于哈希）。结论;没有实际性能差异。

对于您的情况，为了可读性，我会选择 .Distinct - 它清楚地传达了代码的意图。但是，我同意其他答案之一，如果可能的话，您可能应该在数据库中执行此操作。

回复收藏 0 原文

~没有更多了~

关于作者

习惯那些不曾习惯的习惯

暂无简介

0 文章

0 评论

22 人气

关注发私信

友情链接

文江博客

对于创建不同的数据结构，什么更好：HashSet 还是 Linq 的 Distinct()？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

已经忘了多久

15867725375

LonelySnow

走过海棠暮

轻许诺言

信馬由缰

友情链接

对于创建不同的数据结构，什么更好：HashSet 还是 Linq 的 Distinct()？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

已经忘了多久

15867725375

LonelySnow

走过海棠暮

轻许诺言

信馬由缰

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。