I have a poorly performing method; how can I improve its efficiency?
I have a simple method to compare an array of FileInfo objects against a list of filenames to check which files have already been processed. The unprocessed list is then returned.
The loop of this method iterates over about 250,000 FileInfo objects. This is taking an obscene amount of time to complete.
The inefficiency is obviously the Contains method call on the processedFiles collection.
First, how can I check that my suspicion about the cause is correct, and second, how can I improve the method to speed the process up?
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFiles.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
Comments (6)
A List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence or non-existence of an item. I would suggest you use a HashSet<string> or similar instead. A HashSet<T>'s Contains method is designed to run in constant O(1) time, i.e. it shouldn't depend on the number of items in the set. This small change should make the entire method run in linear time:
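The code block that originally followed wasn't captured in this copy; a minimal sketch of the suggested change, assuming the question's signature is kept, might look like this:

using System.Collections.Generic;
using System.IO;

public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    // Build the set once: O(m). Each subsequent Contains lookup is O(1) on average.
    HashSet<string> processedSet = new HashSet<string>(processedFiles);

    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedSet.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}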
I would suggest 3 improvements, if possible:
1. Use ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.
2. Avoid representing the same entity in two different forms (string and FileInfo) in this fashion. Pick one and go with it.
3. Use the HashSet<T>.ExceptWith method instead of doing the looping yourself. Bear in mind that this will mutate the collection.
If you can use LINQ, and you can afford to build up a set on every call, here's another way:
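The LINQ snippet was likewise lost in this copy; under the same assumptions, a sketch could be:

using System.Collections.Generic;
using System.IO;
using System.Linq;

public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, IEnumerable<string> processedFiles)
{
    // Building the set is O(m); the single Where pass over allFiles is O(n).
    var processedSet = new HashSet<string>(processedFiles);
    return allFiles.Where(file => !processedSet.Contains(file.Name)).ToList();
}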
I would try to convert the processedFiles List to a HashSet. With a list, it needs to iterate the list every time you call Contains. A HashSet lookup is an O(1) operation.
You could use a dictionary/hashtable-like class to speed up the lookup process significantly. Even translating the incoming List into a hashtable once, then using that, will be much quicker than what you're using.
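A sketch of that idea, with Dictionary<string, bool> standing in for the unnamed dictionary/hashtable class (only the keys matter here):

using System.Collections.Generic;
using System.IO;

public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    // Translate the incoming list into a dictionary once; ContainsKey is O(1).
    var processed = new Dictionary<string, bool>(processedFiles.Count);
    foreach (string name in processedFiles)
    {
        processed[name] = true; // indexer assignment tolerates duplicate names
    }

    var unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processed.ContainsKey(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}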
You could sort the processedFiles list first and then use Array.BinarySearch<T>() to search the array. This should come out at about O(log N) efficiency per lookup: checking if a list contains an element is faster with a sorted list.
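This answer's snippet also didn't survive extraction; a minimal sketch of the sorted binary-search variant, keeping the question's signature, might be:

using System;
using System.Collections.Generic;
using System.IO;

public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    // The one-off sort costs O(m log m); each lookup is then O(log m).
    string[] sortedNames = processedFiles.ToArray();
    Array.Sort(sortedNames, StringComparer.Ordinal);

    var unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        // BinarySearch returns a negative index when the item is absent;
        // the comparer must match the one used for sorting.
        if (Array.BinarySearch(sortedNames, fileInfo.Name, StringComparer.Ordinal) < 0)
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}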
Just to be excessively pedantic ...
If you know that both lists are sorted (FileInfo lists often come pre-sorted, so this approach might be applicable to you), then you can achieve true linear performance without the time and memory overhead of a HashSet. HashSet construction itself still takes linear time, so the overall complexity is closer to O(n + m); in your case the HashSet also has to allocate internal object references for up to 250k strings, and that's going to cost in GC terms.
Something like this half-baked generalisation might help:
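The generalisation itself wasn't captured in this copy; a hypothetical reconstruction (SortedExcept and keyOf are illustrative names, and both sequences are assumed to be pre-sorted with the same ordinal ordering) might look like:

using System;
using System.Collections.Generic;

// Walks two sorted sequences in lock-step and yields the items from 'source'
// whose keys do not appear in 'excludedKeys': O(n + m) with no HashSet allocation.
public static IEnumerable<T> SortedExcept<T>(
    IEnumerable<T> source,
    IEnumerable<string> excludedKeys,
    Func<T, string> keyOf)
{
    using (IEnumerator<string> ex = excludedKeys.GetEnumerator())
    {
        bool hasExcluded = ex.MoveNext();
        foreach (T item in source)
        {
            string key = keyOf(item);
            // Skip excluded keys that sort strictly before the current key.
            while (hasExcluded && string.CompareOrdinal(ex.Current, key) < 0)
            {
                hasExcluded = ex.MoveNext();
            }
            if (!hasExcluded || string.CompareOrdinal(ex.Current, key) != 0)
            {
                yield return item;
            }
        }
    }
}

Assuming both inputs are already sorted by name, it would be called as SortedExcept(allFiles, processedFiles, f => f.Name).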
I would strongly agree with Ani that sticking to a generic or canonical type is A Very Good Thing Indeed.
But I'll give mine -1 for unfinished generalisation and -1 for elegance...