Is there a way to efficiently yield every file in a directory containing millions of files?

Posted on 2024-10-18 23:32:47

I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory, and then returns the list. What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted while such a method is in use. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the collection's state at the beginning and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if there are filesystem changes (adding, removing, or renaming files within the iterated directory) that modify the collection?

There could potentially be a few cases that could cause the iterator to fail, and it all depends on how the iterator maintains state. Using S.Lott's example:

filea.txt
fileb.txt
filec.txt

Iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it were to use the filename filea.txt to find its current position in order to find the next file, and filea.txt is not there, what would happen? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt when yielding filea.txt, it could look up the position of fileb.txt, fail, and produce an error.

If the iterator instead was able to somehow maintain an index dir.get_file(0), then maintaining positional state would not be affected, but some files could be missed, as their indexes could be moved to an index 'behind' the iterator.

This is all theoretical of course, since there appears to be no built-in (python) way of iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.

Edit:

The OS of concern is Redhat. My use case is this:

Process A is continuously writing files to a storage location.
Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

Edit:

Definition of valid:

Adjective
1. Well grounded or justifiable, pertinent.

(Sorry S.Lott, I couldn't resist).

I've edited the paragraph in question above.

Comments (6)

七婞 2024-10-25 23:32:47

tl;dr (update): As of Python 3.5 (currently in beta), just use os.scandir.

As I've written earlier, since "iglob" is just a facade for a real iterator, you will have to call low-level system functions in order to get entries one at a time, the way you want. Fortunately, calling low-level functions is doable from Python.
The low-level functions are different for Windows and Posix/Linux systems.

  • If you are on Windows, you should check if win32api has any call to read "the next entry from a dir" or how to proceed otherwise.
  • If you are on Posix/Linux, you can call libc functions directly through ctypes and get directory entries (including the name information) one at a time.

The documentation on the C functions is here:
http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory

http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory

I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system but this code snippet may not work on your system[footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.

Here is the snippet, using ctypes and libc, that I've put together; it allows you to get each filename and perform actions on it. Note that ctypes automatically gives you a Python string when you do str(...) on the char array defined in the structure. (I am using the print statement, which implicitly calls Python's str.)

#!/usr/bin/env python2
from ctypes import *

libc = cdll.LoadLibrary("libc.so.6")

# Without explicit return types ctypes assumes "int", which truncates
# pointers on 64-bit systems -- declare both calls as returning void*.
libc.opendir.restype = c_void_p
libc.readdir64.restype = c_void_p

class Dirent(Structure):
    # Layout of struct dirent64 -- verify it against /usr/include/dirent.h
    _fields_ = [("d_ino", c_void_p),
                ("d_off", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)
            ]

dir_ = c_void_p(libc.opendir("/home/jsbueno"))

while True:
    p = libc.readdir64(dir_)
    if not p:          # NULL: no more entries in the directory
        break
    entry = Dirent.from_address(p)
    print entry.d_name

libc.closedir(dir_)

Update: Python 3.5 is now in beta, and in Python 3.5 the new os.scandir function call is available as the materialization of PEP 471 ("a better and faster directory iterator"). It does exactly what is asked for here, plus a number of other optimizations that can deliver up to a 9-fold speed increase over os.listdir when listing large directories under Windows (a 2-3 fold increase on Posix systems).
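
Roughly, a minimal usage sketch on Python 3.5+ (the directory path and the process() helper are placeholders):

import os

def process(path):
    # placeholder for whatever per-file work is needed
    print(path)

# os.scandir yields DirEntry objects lazily instead of materializing the
# whole listing in memory first.
for entry in os.scandir('/some/huge/directory'):
    if entry.is_file():
        process(entry.path)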

[footnote-1] The dirent64 C struct is determined at C compile time for each system.

夜未央樱花落 2024-10-25 23:32:47

The glob module in Python, from 2.5 onwards, has an iglob method which returns an iterator.
An iterator is exactly for the purpose of not storing huge values in memory.

glob.iglob(pathname)
Return an iterator which yields the same values as glob() without
actually storing them all simultaneously.

For example:

import glob
for eachfile in glob.iglob('*'):
    # act upon eachfile
    print(eachfile)

Oo萌小芽oO 2024-10-25 23:32:47

Since you are using Linux, you might want to look at pyinotify.
It would allow you to write a Python script which monitors a directory for filesystem changes -- such as the creation, modification or deletion of files.

Every time such a filesystem event occurs, you can arrange for the Python script to call a function. This would be roughly like yielding each filename once, while being able to react to modifications and deletions.

It sounds like you already have a million files sitting in a directory. In this case, if you were to move all those files to a new, pyinotify-monitored directory, then the filesystem events generated by the creation of new files would yield the filenames as desired.
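
A rough sketch of what that could look like with pyinotify; the watched path, the chosen event mask, and the handler bodies are placeholders to adapt:

import pyinotify

WATCH_DIR = '/some/storage/location'   # placeholder path

class Handler(pyinotify.ProcessEvent):
    # one callback per finished or moved-in file
    def process_IN_CLOSE_WRITE(self, event):
        print('ready to process: %s' % event.pathname)

    def process_IN_MOVED_TO(self, event):
        print('ready to process: %s' % event.pathname)

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO
wm.add_watch(WATCH_DIR, mask)

notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()   # blocks, dispatching one callback per filesystem event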

余厌 2024-10-25 23:32:47

@jsbueno's post is really useful, but is still kind of slow on slow disks since libc readdir() only reads 32K of disk entries at a time. I am not an expert on making system calls directly in python, but I outlined how to write code in C that will list a directory with millions of files, in a blog post at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/.

The ideal case would be to call getdents() directly in python (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) so you can specify a read buffer size when loading directory entries from disk.

That would be better than calling readdir(), which, as far as I can tell, has a buffer size defined at compile time.
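
A rough sketch of that idea via ctypes and the glibc syscall() wrapper, assuming Python 3 and x86-64 Linux (where SYS_getdents64 is 217); the syscall number, buffer size, struct offsets, and path are assumptions to verify against your own system:

import ctypes
import os
import struct

SYS_getdents64 = 217            # x86-64 value; check asm/unistd_64.h on your arch
BUF_SIZE = 1024 * 1024          # ask the kernel for up to 1 MB of entries per call

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long

def iter_dir(path):
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    buf = ctypes.create_string_buffer(BUF_SIZE)
    try:
        while True:
            nread = libc.syscall(SYS_getdents64, fd, buf, BUF_SIZE)
            if nread < 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            if nread == 0:
                break                                   # end of directory
            data = buf.raw[:nread]
            # Each linux_dirent64 record is: u64 d_ino, s64 d_off,
            # u16 d_reclen, u8 d_type, then a NUL-terminated d_name.
            pos = 0
            while pos < nread:
                d_reclen = struct.unpack_from('<H', data, pos + 16)[0]
                name = data[pos + 19:pos + d_reclen].split(b'\0', 1)[0]
                if name not in (b'.', b'..'):
                    yield name.decode('utf-8', 'surrogateescape')
                pos += d_reclen
    finally:
        os.close(fd)

for filename in iter_dir('/some/huge/directory'):
    print(filename)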

沉溺在你眼里的海 2024-10-25 23:32:47

What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

No method will reveal a filename which "changed". It's not even clear what you mean by "filenames change, new files are added, and files are deleted". What is your use case?

Let's say you have three files: a.a, b.b, c.c.

Your magical "iterator" starts with a.a. You process it.

The magical "iterator" moves to b.b. You're processing it.

Meanwhile a.a is copied to a1.a1, a.a is deleted. What now? What does your magical iterator do with these? It's already passed a.a. Since a1.a1 is before b.b, it will never see it. What's supposed to happen for "filenames change, new files are added, and files are deleted"?

The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?


Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

Don't use the naked file system for coordination.

Use a queue.

Process A writes files and enqueues the add/change/delete memento onto a queue.

Process B reads the memento from the queue and then does the follow-on processing on the file named in the memento.
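
For illustration, a minimal sketch of that idea with multiprocessing.Queue; since process A and process B are really separate programs, a real setup would swap the in-process queue for a named pipe, a socket, or a message broker, and all names and paths below are made up:

import multiprocessing
import os
import shutil

def process_a(q, storage_dir):
    # Stand-in for the real writer: after each file is fully written,
    # it enqueues a memento describing what it did.
    for i in range(5):
        path = os.path.join(storage_dir, 'file%03d.txt' % i)
        with open(path, 'w') as f:
            f.write('payload\n')
        q.put(('add', path))
    q.put(None)                      # sentinel: nothing more is coming

def process_b(q, done_dir):
    # The consumer never lists the directory; it only acts on mementos.
    while True:
        memento = q.get()
        if memento is None:
            break
        action, path = memento
        if action == 'add':
            shutil.move(path, os.path.join(done_dir, os.path.basename(path)))

if __name__ == '__main__':
    q = multiprocessing.Queue()
    os.makedirs('incoming', exist_ok=True)
    os.makedirs('done', exist_ok=True)
    a = multiprocessing.Process(target=process_a, args=(q, 'incoming'))
    b = multiprocessing.Process(target=process_b, args=(q, 'done'))
    a.start(); b.start()
    a.join(); b.join()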

楠木可依 2024-10-25 23:32:47

I think what you are asking is impossible due to the nature of file IO. Once python has retrieved the listing of a directory it cannot maintain a view of the actual directory on disk, nor is there any way for python to insist that the OS inform it of any modifications to the directory.

All python can do is ask for periodic listings and diff the results to see if there have been any changes.

The best you can do is create a semaphore file in the directory which lets other processes know that your python process desires that no other process modify the directory. Of course they will only observe the semaphore if you have explicitly programmed them to.
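
A small sketch of the periodic-listing-and-diff approach (the path and interval are placeholders); note that each poll still pulls the full listing into memory, which is exactly the cost the question is trying to avoid:

import os
import time

WATCH_DIR = '/some/storage/location'   # placeholder
POLL_SECONDS = 5

def poll_for_changes(path, interval):
    previous = set(os.listdir(path))
    while True:
        time.sleep(interval)
        current = set(os.listdir(path))
        for name in sorted(current - previous):
            print('added: %s' % name)
        for name in sorted(previous - current):
            print('removed: %s' % name)
        previous = current

poll_for_changes(WATCH_DIR, POLL_SECONDS)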
