Efficiently enumerating files from a large directory in random order
I want to enumerate files matching a specific search pattern (e.g., *.txt) recursively from a directory, with a couple of constraints:
- The mechanism should be very efficient. The goal is to enumerate files one by one (using IEnumerable), so that even with a huge list of files it does not take forever to get the first file for processing.
- The enumeration should return files in random order, so that if two instances of my program enumerate the same directory, they do not see the files in the same sequence.
Given these requirements, DirectoryInfo.EnumerateFiles looks promising, except that it does not fulfill the second requirement. If I drop the performance requirement, the solution is straightforward (just get the entire collection and randomize the sequence before accessing it).
Can someone suggest possible approaches for a C# implementation in .NET 3.5/4.0?
Comments (1)
What you are asking for is impossible.
A truly "random" enumeration (in the sense that the order likely changes each time) requires a "pick without replacement" strategy. Such a strategy necessarily requires two pools: one of "chosen" files, and one of "unchosen" files. The "unchosen" pool has to be fully populated before anything can be chosen from it at random, which means scanning the whole directory tree first. This breaks your #1 requirement.
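For reference, here is a minimal sketch of that "populate everything, then pick without replacement" baseline, written as a Fisher-Yates shuffle. The class and method names (`ShuffledFiles`, `ListShuffled`, `Shuffle`) are made up for illustration. Note that `Directory.GetFiles` only returns once the whole tree has been scanned, which is exactly the up-front cost described above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the .NET Framework.
static class ShuffledFiles
{
    // Baseline: materialize every matching path, then shuffle the array.
    // The caller gets nothing back until the full directory scan is done.
    public static IList<string> ListShuffled(string root, string pattern, Random rng)
    {
        string[] files = Directory.GetFiles(root, pattern, SearchOption.AllDirectories);
        Shuffle(files, rng);
        return files;
    }

    // In-place Fisher-Yates shuffle: each position i is swapped with a
    // uniformly chosen position j in [0, i].
    public static void Shuffle<T>(IList<T> items, Random rng)
    {
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i
            T tmp = items[i];
            items[i] = items[j];
            items[j] = tmp;
        }
    }
}
```

This works on .NET 3.5 as well, but the time to the first file grows with the size of the tree.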
Two thoughts on how to solve your problem:
- What is the problem with two instances seeing the files in the same order? If it's a file locking issue, use a read-only lock.
- You might be able to get away with a "holding pile" approach. Create your own enumerator class that starts by reading a small number of FileInfo records into a "Hold" collection. Then, each time the calling code requests a file, the enumerator reads the next record from EnumerateFiles and either returns it directly, or swaps it with one in the Hold pile and returns that one instead. The choice between the two is random. Once EnumerateFiles is exhausted, return whatever remains in the Hold pile. This does not produce a truly random order, but it may add enough fuzziness to the order to meet your needs. The maximum size of the Hold collection can be tuned to balance your need for "randomness" against the need to get the first file quickly.
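A rough sketch of the holding-pile idea as a lazy iterator might look like the following. Several things here are assumptions rather than part of the suggestion above: the names `HoldingPile` and `Fuzz` are hypothetical, the pass-through versus swap decision is a 50/50 coin flip, the slot to swap is picked uniformly, and the leftover pile is drained in random order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the .NET Framework.
static class HoldingPile
{
    // Lazily reorders 'source' using a buffer of at most 'holdSize' items.
    public static IEnumerable<T> Fuzz<T>(IEnumerable<T> source, int holdSize, Random rng)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (rng == null) throw new ArgumentNullException("rng");
        if (holdSize < 1) throw new ArgumentOutOfRangeException("holdSize");
        return FuzzIterator(source, holdSize, rng);
    }

    private static IEnumerable<T> FuzzIterator<T>(IEnumerable<T> source, int holdSize, Random rng)
    {
        var hold = new List<T>(holdSize);
        foreach (T item in source)
        {
            if (hold.Count < holdSize)
            {
                hold.Add(item); // still filling the pile, nothing returned yet
                continue;
            }

            if (rng.Next(2) == 0)
            {
                yield return item; // assumption: 50% chance to pass straight through
            }
            else
            {
                // Swap the new item into a random slot and return the old occupant.
                int slot = rng.Next(hold.Count);
                T held = hold[slot];
                hold[slot] = item;
                yield return held;
            }
        }

        // Source exhausted: empty the pile (assumption: in random order).
        while (hold.Count > 0)
        {
            int slot = rng.Next(hold.Count);
            T held = hold[slot];
            hold[slot] = hold[hold.Count - 1];
            hold.RemoveAt(hold.Count - 1);
            yield return held;
        }
    }
}
```

On .NET 4.0 this could wrap the lazy enumeration directly, for example `HoldingPile.Fuzz(new DirectoryInfo(root).EnumerateFiles("*.txt", SearchOption.AllDirectories), 64, new Random())`, where `root` and the pile size of 64 are placeholders. The first file is returned after only `holdSize + 1` records have been read. Keep in mind that on the .NET Framework `new Random()` is seeded from the system clock, so two instances started at the same moment could produce the same sequence; passing an explicit seed that differs per process avoids that.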