Best way to determine if an IEnumerable<> has unique values

Posted 2024-10-25 08:55:30


I have a lot of code in which I do something like this

bool GetIsUnique<T>(IEnumerable<T> values)
{
    return values.Count() == values.Distinct().Count();
}

Is there a better, faster, nicer way to do this?

Comments (7)

如歌彻婉言 2024-11-01 08:55:30


I would make this a nice extension method

public static bool IsUnique<T>(this IEnumerable<T> list)
{
    var hs = new HashSet<T>();
    return list.All(hs.Add);  
}

Checks that all items can be added to a HashSet.
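
A minimal usage sketch (the numbers array is just an assumed example, with the extension method above in scope):

int[] numbers = { 1, 2, 3, 2 };
bool allUnique = numbers.IsUnique(); // false: 2 appears twice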

终难遇 2024-11-01 08:55:30


Your method needs to iterate through the sequence twice, which has a few potential drawbacks:

  1. Iterating twice will be slower than iterating once for sequences of any significant size.
  2. Some sequences will throw an exception if you try to iterate them more than once; others might return different results for subsequent iterations (see the sketch after the code below).
  3. Your method uses Count, which needs to iterate the entire sequence each time. There's no reason why you shouldn't break out early as soon as you know that there's a duplicate value.

The following method only needs to iterate through the sequence once, and will break out early as soon as any duplicate value is encountered:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    var set = new HashSet<T>();

    foreach (T item in values)
    {
        if (!set.Add(item))
            return false;
    }
    return true;
}
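
To illustrate drawback 2, here's a minimal sketch (an assumed iterator, not from the original answer) of a sequence that yields different values each time it's enumerated; calling Count() and then Distinct().Count() on it would compare two unrelated passes over the data:

static IEnumerable<int> RandomValues(int count)
{
    // The iterator body runs afresh for every enumeration,
    // so each pass sees a different sequence of values.
    var rng = new Random();
    for (var i = 0; i < count; i++)
        yield return rng.Next(10);
}
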
带刺的爱情 2024-11-01 08:55:30


I think it depends on what you want to do if there are non-unique values. @Jamiec's or @LukeH's answers are great, and probably the best for pure speed, but they can't tell you where the issues are.

You might also consider something like

var group = values.GroupBy(x => x);
return group.Any(g => g.Count() > 1);

On its own this is worse than the HashSet implementation. But if you keep that group around, you can find which elements are duplicated.

var group = values.GroupBy(x => x);
return group.Where(g => g.Count() > 1);

Or

var group = values.GroupBy(x => x);
return group.Where(g => g.Count() > 1).Select(g => g.Key);

Thinking about it with GroupBy lets you keep your options open for what to do next. But if all you care about is knowing whether all values are unique, I'd go with the HashSet.
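
Building on that, a small sketch (a hypothetical helper, not part of the original answer, assuming using System.Linq) that reports each duplicated value and how often it occurs:

static void ReportDuplicates<T>(IEnumerable<T> values)
{
    // Keep only the groups that contain more than one element.
    var duplicates = values.GroupBy(x => x).Where(g => g.Count() > 1);

    foreach (var g in duplicates)
        Console.WriteLine($"{g.Key} appears {g.Count()} times");
}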

烟凡古楼 2024-11-01 08:55:30


You would be doing two loops through the data for the above: once to get the count, once to get the distinct count. Especially bad if the first two items are identical! Try something like this:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    HashSet<T> hashSet = new HashSet<T>();
    foreach(var value in values)
    {
        if (hashSet.Contains(value))
        {
            return false;
        }
        hashSet.Add(value);
    }
    return true;
}

This one will finish as soon as it finds a duplicate. Obviously its speed depends on the speed of the hash lookup, but given that Distinct uses a set internally I'd still expect it to be quicker.
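
As an aside, HashSet<T>.Add already returns false when the item is present, so the Contains check can be folded into a single Add call, avoiding hashing each value twice; that's what the other answers do:

if (!hashSet.Add(value))
{
    return false;
}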

黑白记忆 2024-11-01 08:55:30


Two basic rules:

  1. The code that's easiest to read and understand is almost always the best way to code something. That code is deliciously easy to read, so I'd say you're good to go.
  2. Performance ("faster") should only be a concern if you can prove that this is the method that's slowing your program down, or if you're building a library that other people will use without access to the source code.

The other methods are going to be faster (they'll short-circuit when a repeated value is found, returning false), but I'd still stick with your version if it were my code.

稳稳的幸福 2024-11-01 08:55:30


A quick way of finding the first duplicate, if present, is:

public static bool TryFindFirstDuplicate<T>(this IEnumerable<T> source, out T duplicate)
{
    var set = new HashSet<T>();
    foreach (var item in source)
    {
        if (!set.Add(item))
        {
            duplicate = item;
            return true;
        }
    }
    duplicate = default(T);
    return false;
}
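
A usage sketch (the values array is just an assumed example, with the extension method in scope):

var values = new[] { "a", "b", "a" };
if (values.TryFindFirstDuplicate(out var duplicate))
{
    Console.WriteLine($"First duplicate: {duplicate}"); // prints "a"
}
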
梦里的微风 2024-11-01 08:55:30


I'm surprised no one has tested this yet:

Comparing Gluip's version in the question with Jamiec's, LukeH's, and MrKWatkins's, all three answers are better than the asker's version.

The three answers are all pretty comparable, but in most cases Jamiec's is slightly faster.

When the data has no duplicates, or the duplicate is at the end of the IEnumerable, there were no major differences between the algorithms at any size.

When the data has a duplicate near the middle, or at the beginning, Gluip's version in the original question shows its slowness compared to the other three.

The time to check a set appears to scale linearly with the size of the set, meaning no algorithm is preferable for large or small sets.

Here's a test program that spits out a CSV that you can load into a spreadsheet program and sort and graph if you like:

Test Program:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AreUniqueTest
{
class Program
{
    const int Iterations = 1000;

    enum DupeLocation
    {
        None,
        Early,
        Center,
        Late,
    }

    enum SetSize
    {
        Tiny = 10,
        Small = 100,
        Medium = 1000,
        Large = 10000,
        Huge = 100000,
    }

    static void Main()
    {
        Dictionary<string, Func<IEnumerable<int>, bool>> functions = new Dictionary<string, Func<IEnumerable<int>, bool>>
        {
            {"Gluip", GetIsUniqueGluip},
            {"LukeH", GetIsUniqueLukeH },
            {"Jamiec", GetIsUniqueJamiec },
            {"MrKWatkins", GetIsUniqueMrKWatkins }
        };

        var output = new StringBuilder();

        Console.WriteLine("Function,SetSize,DupeLocation,TotalTicks,AverageTicks");
        output.AppendLine("Function,SetSize,DupeLocation,TotalTicks,AverageTicks");

        foreach (SetSize size in Enum.GetValues(typeof(SetSize)))
        {
            var sizevalue = (int) size;
            foreach (DupeLocation location in Enum.GetValues(typeof(DupeLocation)))
            {
                var data = CreateTestData((int)size, location);
                foreach (string functionKey in functions.Keys)
                {
                    var ticks = RunSet(functions[functionKey], data, Iterations);
                    var avg = ticks / Iterations;
                    Console.WriteLine($"{functionKey},{sizevalue},{location},{ticks},{avg}");
                    output.AppendLine($"{functionKey},{sizevalue},{location},{ticks},{avg}");
                }
            }
        }

        File.WriteAllText("output.csv", output.ToString());
        Process.Start("output.csv");
    }

    static long RunSet<T>(Func<IEnumerable<T>, bool> getIsUnique, IEnumerable<T> values, int iterations)
    {
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            getIsUnique.Invoke(values);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedTicks;
    }

    static bool GetIsUniqueLukeH<T>(IEnumerable<T> values)
    {
        var set = new HashSet<T>();

        foreach (T item in values)
        {
            if (!set.Add(item))
                return false;
        }
        return true;
    }

    static bool GetIsUniqueGluip<T>(IEnumerable<T> values)
    {
        return values.Count() == values.Distinct().Count();
    }

    static bool GetIsUniqueJamiec<T>(IEnumerable<T> list)
    {
        var hs = new HashSet<T>();
        return list.All(hs.Add);
    }

    static bool GetIsUniqueMrKWatkins<T>(IEnumerable<T> values)
    {
        HashSet<T> hashSet = new HashSet<T>();
        foreach (var value in values)
        {
            if (hashSet.Contains(value))
            {
                return false;
            }
            hashSet.Add(value);
        }
        return true;
    }

    static int[] CreateTestData(int size, DupeLocation location)
    {
        var result = new int[size];
        Parallel.For(0, size, i => { result[i] = i; });
        return SetDupe(result, location);
    }

    static int[] SetDupe(int[] values, DupeLocation location)
    {
        switch (location)
        {
            case DupeLocation.Early:
                values[1] = values[0];
                break;
            case DupeLocation.Center:
                var midpoint = values.Length / 2;
                values[midpoint] = values[midpoint + 1];
                break;
            case DupeLocation.Late:
                values[values.Length - 1] = values[values.Length - 2];
                break;
            // case DupeLocation.None: // do nothing.
        }
        return values;
    }
}
}