I have a poorly performing method; how can I improve its efficiency?

Posted 2024-09-30 11:09:39


I have a simple method that compares an array of FileInfo objects against a list of filenames to check which files have already been processed. The unprocessed list is then returned.

The loop in this method iterates over about 250,000 FileInfo objects. This is taking an obscene amount of time to complete.

The inefficiency is obviously the Contains method call on the processedFiles collection.

First, how can I verify that my suspicion about the cause is correct, and second, how can I improve the method to speed the process up?

public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFiles.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
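One minimal way to check the first part is to time the lookup loop in isolation with System.Diagnostics.Stopwatch and compare the result against an alternative implementation. A rough sketch (the method name is illustrative):

// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.IO;
// Times the lookup loop on its own so the cost of Contains can be
// compared against, say, a HashSet-based version of the same loop.
public static long TimeLookups(FileInfo[] allFiles, List<string> processedFiles)
{
    Stopwatch sw = Stopwatch.StartNew();
    int unprocessedCount = 0;
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFiles.Contains(fileInfo.Name))
        {
            unprocessedCount++;
        }
    }
    sw.Stop();
    Console.WriteLine("{0} unprocessed, {1} ms", unprocessedCount, sw.ElapsedMilliseconds);
    return sw.ElapsedMilliseconds;
}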


Comments (6)

嘿哥们儿 2024-10-07 11:09:39


A List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence / non-existence of an item. I would suggest you use a HashSet<string> or similar instead. A HashSet<T>'s Contains method is designed to run in constant O(1) time, i.e. it shouldn't depend on the number of items in the set.

This small change should make the entire method run in linear time:

public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles,
                                                 List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    HashSet<string> processedFileSet = new HashSet<string>(processedFiles);

    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFileSet.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }

    return unprocessedFiles;
}

I would suggest 3 improvements, if possible:

  1. For extra efficiency, store the processed files in a set at the source, so that this method takes an ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.
  2. Try not to mix and match different representations of the same entity (string and FileInfo) in this fashion. Pick one and go with it.
  3. You might also want to consider the HashSet<T>.ExceptWith method instead of doing the looping yourself. Bear in mind that this will mutate the collection; a rough sketch follows below.
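For illustration, a sketch of the ExceptWith route. Note that it produces names rather than FileInfo objects, so mapping back to FileInfo would need an extra lookup:

// Requires: using System.Linq;
// Build a set of all names, then remove the processed ones in one call.
// ExceptWith mutates the set it is called on, so work on a fresh set.
HashSet<string> unprocessedNames = new HashSet<string>(allFiles.Select(f => f.Name));
unprocessedNames.ExceptWith(processedFiles);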

If you can use LINQ, and you can afford to build up a set on every call, here's another way:

public static IEnumerable<string> GetUnprocessedFiles
 (IEnumerable<string> allFiles, IEnumerable<string> processedFiles)
{
  // null-checks here
  return allFiles.Except(processedFiles);     
}
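Note that Enumerable.Except builds a set from processedFiles internally, so this version is also roughly O(n + m). There is also an overload that accepts an IEqualityComparer<string> (e.g. StringComparer.OrdinalIgnoreCase) if filename casing may differ between the two sources.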

笛声青案梦长安 2024-10-07 11:09:39


I would try to convert the processedFiles List to a HashSet. With a list, it needs to iterate the list every time you call Contains. A HashSet lookup is an O(1) operation.

吹梦到西洲 2024-10-07 11:09:39


You could use a dictionary/hashtable-like class to speed up the lookup process significantly. Even converting the incoming List into a hashtable once, and then using that, will be much quicker than what you're using.
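As a rough sketch of that idea (a HashSet<string> is the more natural fit for a pure membership test, but a Dictionary works the same way; this assumes processedFiles contains no duplicate names, since ToDictionary throws on duplicates):

// Requires: using System.Linq;
// One O(m) conversion up front; ContainsKey is then O(1) on average.
Dictionary<string, bool> processedLookup =
    processedFiles.ToDictionary(name => name, name => true);

List<FileInfo> unprocessedFiles = new List<FileInfo>();
foreach (FileInfo fileInfo in allFiles)
{
    if (!processedLookup.ContainsKey(fileInfo.Name))
    {
        unprocessedFiles.Add(fileInfo);
    }
}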

夜唯美灬不弃 2024-10-07 11:09:39
  1. Sort the searched array by file name.
  2. Employ Array.BinarySearch<T>() to search the array. This should come out at about O(log N) per lookup; a minimal sketch follows below.
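A minimal sketch of that approach, assuming it is acceptable to copy and sort the processed names (total cost is roughly O(m log m + n log m)):

// Requires: using System.Linq;
string[] processed = processedFiles.ToArray();
Array.Sort(processed, StringComparer.Ordinal);

List<FileInfo> unprocessedFiles = new List<FileInfo>();
foreach (FileInfo fileInfo in allFiles)
{
    // BinarySearch returns a negative value when the name is not present.
    if (Array.BinarySearch(processed, fileInfo.Name, StringComparer.Ordinal) < 0)
    {
        unprocessedFiles.Add(fileInfo);
    }
}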
妄断弥空 2024-10-07 11:09:39


Checking whether a list contains an element is faster with a sorted list.

远山浅 2024-10-07 11:09:39


Just to be excessively pedantic ...

If you know that both lists are sorted (FileInfo lists often come pre-sorted, so this approach might be applicable to you), then you can achieve true linear performance without the time and memory overhead of a hash set. Hash-set construction still requires linear time, so the overall complexity is closer to O(n + m); the hash set also has to allocate internal object references for up to 250k strings in your case, and that will have a cost in GC terms.

Something like this half-baked generalisation might help:

public static IEnumerable<string> GetMismatches(IList<string> fileNames, IList<string> processedFileNames, StringComparer comparer)
{
    var filesIndex = 0;
    var procFilesIndex = 0;

    // Merge-walk both sorted lists; anything in fileNames with no match
    // in processedFileNames is yielded as a mismatch.
    while (filesIndex < fileNames.Count)
    {
        if (procFilesIndex >= processedFileNames.Count)
        {
            // Processed list exhausted: everything remaining is unmatched.
            yield return fileNames[filesIndex++];
        }
        else
        {
            var rc = comparer.Compare(fileNames[filesIndex], processedFileNames[procFilesIndex]);
            if (rc < 0)
            {
                // Current file sorts before the current processed name: no match.
                yield return fileNames[filesIndex++];
            }
            else if (rc > 0)
            {
                procFilesIndex++;
            }
            else
            {
                filesIndex++;
                procFilesIndex++;
            }
        }
    }
}
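A hypothetical call site (the names are illustrative; both inputs must already be sorted with the same comparer that is passed in, or the merge walk will skip entries):

// Requires: using System.Linq;
List<string> allNames = allFiles
    .Select(f => f.Name)
    .OrderBy(n => n, StringComparer.OrdinalIgnoreCase)
    .ToList();
processedFiles.Sort(StringComparer.OrdinalIgnoreCase);

IEnumerable<string> unprocessedNames =
    GetMismatches(allNames, processedFiles, StringComparer.OrdinalIgnoreCase);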

I would strongly agree with Ani that sticking to a generic or canonical type is A Very Good Thing Indeed.
But I'll give mine -1 for unfinished generalisation and -1 for elegance...
