从文件系统中随机选择一个文件

发布于 2024-12-03 03:18:13 字数 606 浏览 1 评论 0原文

我需要随机选择文件和目录作为文件操作的参数，如重命名、写入、读取等。我打算做的是列出具有路径的所有文件和目录的列表，并从该列表中随机选择。但是，当在实际文件系统中创建和删除文件和目录时，列表也必须更新。我发现以这种方式维护列表并更新它效率很低，而且它还必须是原子的，以便以后的操作不会访问被先前操作删除的文件。

您能否建议一种不同的选择文件的方法……也许可以直接从文件系统中执行此操作……但是我们如何知道文件的路径呢？

谢谢，

我在这里发现了一些有趣的东西以完全公平的方式从目录树中随机选择一个文件特别是在 Michael J. Barber的回答，但由于我对Python的无知，无法完全遵循它

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

多彩岁月 2024-12-10 03:18:14

当文件系统就在那里时，您不想尝试维护文件列表。您应该能够直接从 C.Walk 从根目录中执行此操作，从中选择一个随机文件。您可以选择一个随机的最大深度，如果您在此时或之前击中常规文件，请使用它。如果是目录，则重复至最大深度。如果它是一个特殊文件，也许可以重新开始。

这应该很快。该操作不必是原子的。如果您想要执行操作时该文件不存在，请重试。不应该太复杂。您可以在找到目标文件时建立路径。这比直接使用 fs 更简单（我假设你的意思是在更低的级别），并且应该很容易实现。

回复收藏 0 原文

腹黑女流氓 2024-12-10 03:18:14

这是我提出的解决方案。它不是最快的，但应该很快（准备后），仅使用适度的内存，并且“分布得相当好”。当然，这是 100% 未经测试的，并且有些复杂（无论如何，与维护 RB 树或类似的东西一样复杂）——我为必须使用 C 感到遗憾;-)

对于目标域中的每个目录，使用文件系统的深度优先构建目录树，并记录“之前”文件计数（树中迄今为止找到的文件）和“之后”文件计数（“之前”计数加上目录中的文件数））。它不应该存储文件本身。快速查找号码的方法of files 给出了一些示例 C 代码。它仍然需要迭代目录内容，但不需要存储文件本身。
计算树中的文件总数。（这实际上应该只是树中最后一个节点的“之后”计数。）
在 [0，最大文件数）范围内选择一个随机数。
导航到树中的节点，使得“之前”文件计数 <= 随机数 < “之后”文件计数。这只是沿着 (RB-) 树结构走下去，本身就是 O(lg n) 或类似的。
在与所选节点关联的目录中选择一个随机文件 - 确保再次对目录进行计数并将其用作选择中的 [0, limit]（在运行失败的情况下进行回退）由于并发问题而结束）。如果文件数量发生变化，请确保使用此类信息更新树。如果目录已被删除，还要更新/修复树等。（这里的额外完整计数不应该像听起来那么糟糕，因为 readdir （平均）必须已经导航通过 1 /2 目录中的条目。但是，应该探索重新计数的好处（如果有）。）
根据需要重复步骤 2-5。

定期重建整个树（步骤＃1）以考虑文件系统的更改。删除/添加文件会慢慢扭曲随机性——步骤 #5 可以帮助在某些情况下更新树。重建的频率应通过实验来确定。还可以通过每次传递重建父/祖父节点或[随机]子节点等来减少错误引入。使用修改时间作为检测更改的快速方法也可能值得研究。

快乐编码。

Here is my proposed solution. It is not the fastest, but should be quick (after preparation), use only modest memory, and be "fairly well-distributed". This is, of course, 100% untested and somewhat complex (as complex as maintain a RB-tree or similar, anyway) -- I pitty one for having to use C ;-)

For each directory in the target domain, build a directory tree using a depth-first walk of the filesystem and record the "before" file count (files found to date, in tree) and the "after" file count (the "before" count plus the number of files in directory). It should not store the files themselves. Fast way to find the number of files gives some example C code. It still requires iteration of the directory contents but does not need to store the files themselves.
Count up the total number of files in the tree. (This should really just be the "after" count of the last node in the tree.)
Pick a random number in the range [0, max files).
Navigate to the node in the tree such that the "before" file count <= random number < "after" file count. This is just walking down the (RB-)tree structure and is itself O(lg n) or similar.
Pick a random file in the directory associated with the selected node -- make sure to count the directory again and use this as the [0, limit] in the selection (with a fallback in case of running-off-the-end due to concurrency issues). If the number of files changed, make sure to update the tree with such information. Also update/fix the tree if the directory has been deleted, etc. (The extra full count here shouldn't be as bad as it sounds, as readdir (on average) must already be navigated through 1/2 the entries in the directory. However, the benefit of the re-count, if any, should be explored.)
Repeat steps 2-5 as needed.

Periodically rebuild the entire tree (step #1) to account for filesystems changes. Deleting/adding files will slowly skew the randomness -- step #5 can help to update the tree in certain situations. The frequency of rebuild should be determined through experimentation. It might also be possible to reduce the error introduction with rebuilding the parent/grandparent nodes or [random] child nodes each pass, etc. Using the modified time as a fast way to detect changes may also be worth looking into.

Happy coding.

回复收藏 0 原文