从文件系统中随机选择一个文件

发布于 2024-12-03 03:18:13 字数 606 浏览 1 评论 0原文

此问题与 模拟文件系统访问有关。

我需要随机选择文件和目录作为文件操作的参数,如重命名、写入、读取等。我打算做的是列出具有路径的所有文件和目录的列表,并从该列表中随机选择。但是,当在实际文件系统中创建和删除文件和目录时,列表也必须更新。我发现以这种方式维护列表并更新它效率很低,而且它还必须是原子的,以便以后的操作不会访问被先前操作删除的文件。

您能否建议一种不同的选择文件的方法……也许可以直接从文件系统中执行此操作……但是我们如何知道文件的路径呢?

谢谢,

我在这里发现了一些有趣的东西 以完全公平的方式从目录树中随机选择一个文件特别是在 Michael J. Barber的回答,但由于我对Python的无知,无法完全遵循它

This question relates to Simulating file system access .

I need to choose files and directories randomly to act as arguments of file operations like rename, write, read etc. What I was planning to do was to make a list of all files and directories with thier paths and randomly select from this list. But, as files and directories are created and deleted in the actual file system, the list also has to be updated. I am finding maintaining the list and updating it in this manner to be inefficient and it also has to be atomic so that a later operation does not access a file that was deleted by a previous operation.

Can you suggest a different way of selecting the files ..maybe someway to do it diretly from the file system...but how would we know paths to files then.

Thanks

I found something interesting here Randomly selecting a file from a tree of directories in a completely fair manner specially in
Michael J. Barber's answer, but not being able to follow it completely due to my python ignorance

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

多彩岁月 2024-12-10 03:18:14

当文件系统就在那里时,您不想尝试维护文件列表。您应该能够直接从 C.Walk 从根目录中执行此操作,从中选择一个随机文件。您可以选择一个随机的最大深度,如果您在此时或之前击中常规文件,请使用它。如果是目录,则重复至最大深度。如果它是一个特殊文件,也许可以重新开始。

这应该很快。该操作不必是原子的。如果您想要执行操作时该文件不存在,请重试。不应该太复杂。您可以在找到目标文件时建立路径。这比直接使用 fs 更简单(我假设你的意思是在更低的级别),并且应该很容易实现。

You don't want to try to maintain a list of files when the filesystem is right there. You should be able to do it right from C. Walk from the root directory, selecting a random file from it. You can pick a random maximum depth, and if you hit a regular file, at or before this, use it. If it's a directory, repeat up to max depth. If it's a special file, maybe start over.

This should be pretty quick. The operation shouldn't have to be atomic. If the file's not there when you want do your operation, try again. Shouldn't be too complicated. You can build the path up as you find your target file. This will be simpler than fussing with the fs directly (I assume you meant at a much lower level) and should be simple to implement.

腹黑女流氓 2024-12-10 03:18:14

这是我提出的解决方案。它不是最快的,但应该很快(准备后),仅使用适度的内存,并且“分布得相当好”。当然,这是 100% 未经测试的,并且有些复杂(无论如何,与维护 RB 树或类似的东西一样复杂)——我为必须使用 C 感到遗憾;-)

  1. 对于目标域中的每个目录,使用文件系统的深度优先构建目录树,并记录“之前”文件计数(树中迄今为止找到的文件)和“之后”文件计数(“之前”计数加上目录中的文件数) )。它不应该存储文件本身。 快速查找号码的方法of files 给出了一些示例 C 代码。它仍然需要迭代目录内容,但不需要存储文件本身。

  2. 计算树中的文件总数。 (这实际上应该只是树中最后一个节点的“之后”计数。)

  3. 在 [0,最大文件数)范围内选择一个随机数。

  4. 导航到树中的节点,使得“之前”文件计数 <= 随机数 < “之后”文件计数。这只是沿着 (RB-) 树结构走下去,本身就是 O(lg n) 或类似的。

  5. 在与所选节点关联的目录中选择一个随机文件 - 确保再次对目录进行计数并将其用作选择中的 [0, limit](在运行失败的情况下进行回退)由于并发问题而结束)。如果文件数量发生变化,请确保使用此类信息更新树。如果目录已被删除,还要更新/修复树等。(这里的额外完整计数不应该像听起来那么糟糕,因为 readdir (平均)必须已经导航通过 1 /2 目录中的条目。但是,应该探索重新计数的好处(如果有)。)

  6. 根据需要重复步骤 2-5。

定期重建整个树(步骤#1)以考虑文件系统的更改。删除/添加文件会慢慢扭曲随机性——步骤 #5 可以帮助在某些情况下更新树。重建的频率应通过实验来确定。还可以通过每次传递重建父/祖父节点或[随机]子节点等来减少错误引入。使用修改时间作为检测更改的快速方法也可能值得研究。

快乐编码。

Here is my proposed solution. It is not the fastest, but should be quick (after preparation), use only modest memory, and be "fairly well-distributed". This is, of course, 100% untested and somewhat complex (as complex as maintain a RB-tree or similar, anyway) -- I pitty one for having to use C ;-)

  1. For each directory in the target domain, build a directory tree using a depth-first walk of the filesystem and record the "before" file count (files found to date, in tree) and the "after" file count (the "before" count plus the number of files in directory). It should not store the files themselves. Fast way to find the number of files gives some example C code. It still requires iteration of the directory contents but does not need to store the files themselves.

  2. Count up the total number of files in the tree. (This should really just be the "after" count of the last node in the tree.)

  3. Pick a random number in the range [0, max files).

  4. Navigate to the node in the tree such that the "before" file count <= random number < "after" file count. This is just walking down the (RB-)tree structure and is itself O(lg n) or similar.

  5. Pick a random file in the directory associated with the selected node -- make sure to count the directory again and use this as the [0, limit] in the selection (with a fallback in case of running-off-the-end due to concurrency issues). If the number of files changed, make sure to update the tree with such information. Also update/fix the tree if the directory has been deleted, etc. (The extra full count here shouldn't be as bad as it sounds, as readdir (on average) must already be navigated through 1/2 the entries in the directory. However, the benefit of the re-count, if any, should be explored.)

  6. Repeat steps 2-5 as needed.

Periodically rebuild the entire tree (step #1) to account for filesystems changes. Deleting/adding files will slowly skew the randomness -- step #5 can help to update the tree in certain situations. The frequency of rebuild should be determined through experimentation. It might also be possible to reduce the error introduction with rebuilding the parent/grandparent nodes or [random] child nodes each pass, etc. Using the modified time as a fast way to detect changes may also be worth looking into.

Happy coding.

牵你手 2024-12-10 03:18:14

您应该知道每个目录中有多少个文件,以便选择您应该遍历的目录。避免遍历符号链接以及对符号链接中的文件进行计数。

您可以使用与 pst 描述类似的解决方案。

例如,您有 3 个目录,每个目录中有 20,40 和 1000 个文件。
您的总数为 [20,60,1060],并选择随机数 0-1060。如果这个数字大于或等于 60,则进入第三个文件夹。

一旦到达没有文件夹的文件夹,您就停止遍历。

要通过此路径查找随机文件,您可以应用与以前相同的技巧。

这样您就可以以相同的概率选择任何文件。

All you should know is how many files are in each directory in order to pick directory in which you should traverse. Avoid traversing over symbolic links and counting files in symbolic links.

You can use similar solution as pst described.

Example you have 3 directories and there are 20,40 and 1000 files in each.
You make total [20,60,1060] and you pick random number 0-1060. if this number if greater or equal 60 you go 3rd folder.

You stop traversing once you reach folder whitout folders.

To find random file trough this path you can apply same trick as before.

This way you will pick any file whit equal probability.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文