Is there a way to efficiently yield every file in a directory containing millions of files?

Posted on 2024-10-18 23:32:47

I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory, and then returns the list. What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted while such a method is in use. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the collection's state at the beginning and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if there are filesystem changes (adding, removing, or renaming files within the iterated directory) that modify the collection?

There could potentially be a few cases that could cause the iterator to fail, and it all depends on how the iterator maintains state. Using S.Lott's example:

filea.txt
fileb.txt
filec.txt

Iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it were to use the filename filea.txt to find its current position in order to find the next file, and filea.txt is not there, what would happen? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt when yielding filea.txt, it could look up the position of fileb.txt, fail, and produce an error.

If the iterator instead was able to somehow maintain an index dir.get_file(0), then maintaining positional state would not be affected, but some files could be missed, as their indexes could be moved to an index 'behind' the iterator.

This is all theoretical of course, since there appears to be no built-in (python) way of iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.

Edit:

The OS of concern is Redhat. My use case is this:

Process A is continuously writing files to a storage location.
Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

Edit:

Definition of valid:

Adjective
1. Well grounded or justifiable, pertinent.

(Sorry S.Lott, I couldn't resist).

I've edited the paragraph in question above.

Comments (6)

七婞 2024-10-25 23:32:47

tl;dr (update): As of Python 3.5 (currently in beta), just use os.scandir.

As I've written earlier, since "iglob" is just a facade for a real iterator, you will have to call low-level system functions in order to get entries one at a time, the way you want. Fortunately, calling low-level functions is doable from Python.
The low-level functions are different for Windows and Posix/Linux systems.

  • If you are on Windows, you should check if win32api has any call to read "the next entry from a dir" or how to proceed otherwise.
  • If you are on Posix/Linux, you can call libc functions directly through ctypes and get directory entries (including the name information) one at a time.

The documentation on the C functions is here:
http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory

http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory

I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system but this code snippet may not work on your system[footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.

Here is the snippet, using ctypes and libc, that I've put together; it allows you to get each filename and perform actions on it. Note that ctypes automatically gives you a Python string when you do str(...) on the char array defined in the structure. (I am using the print statement, which implicitly calls Python's str.)

#!/usr/bin/env python2
from ctypes import *

libc = cdll.LoadLibrary("libc.so.6")

# Without explicit return types ctypes assumes "int", which truncates
# pointers on 64-bit systems -- declare both calls as returning void*.
libc.opendir.restype = c_void_p
libc.readdir64.restype = c_void_p

class Dirent(Structure):
    # Layout of struct dirent64 -- verify it against /usr/include/dirent.h
    _fields_ = [("d_ino", c_void_p),
                ("d_off", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)
            ]

dir_ = c_void_p(libc.opendir("/home/jsbueno"))

while True:
    p = libc.readdir64(dir_)
    if not p:          # NULL: no more entries in the directory
        break
    entry = Dirent.from_address(p)
    print entry.d_name

libc.closedir(dir_)

Update: Python 3.5 is now in beta, and in Python 3.5 the new os.scandir function call is available as the materialization of PEP 471 ("a better and faster directory iterator"). It does exactly what is asked for here, plus a number of other optimizations that can deliver up to a 9-fold speed increase over os.listdir when listing large directories under Windows (a 2-3 fold increase on Posix systems).
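
Roughly, a minimal usage sketch on Python 3.5+ (the directory path and the process() helper are placeholders):

import os

def process(path):
    # placeholder for whatever per-file work is needed
    print(path)

# os.scandir yields DirEntry objects lazily instead of materializing the
# whole listing in memory first.
for entry in os.scandir('/some/huge/directory'):
    if entry.is_file():
        process(entry.path)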

[footnote-1] The dirent64 C struct is determined at C compile time for each system.

夜未央樱花落 2024-10-25 23:32:47

The glob module in Python, from 2.5 onwards, has an iglob method which returns an iterator.
An iterator is exactly for the purpose of not storing huge values in memory.

glob.iglob(pathname)
Return an iterator which yields the same values as glob() without
actually storing them all simultaneously.

For example:

import glob
for eachfile in glob.iglob('*'):
    # act upon eachfile
    print(eachfile)

Oo萌小芽oO 2024-10-25 23:32:47

Since you are using Linux, you might want to look at pyinotify.
It would allow you to write a Python script which monitors a directory for filesystem changes -- such as the creation, modification or deletion of files.

Every time such a filesystem event occurs, you can arrange for the Python script to call a function. This would be roughly like yielding each filename once, while being able to react to modifications and deletions.

It sounds like you already have a million files sitting in a directory. In this case, if you were to move all those files to a new, pyinotify-monitored directory, then the filesystem events generated by the creation of new files would yield the filenames as desired.
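
A rough sketch of what that could look like with pyinotify; the watched path, the chosen event mask, and the handler bodies are placeholders to adapt:

import pyinotify

WATCH_DIR = '/some/storage/location'   # placeholder path

class Handler(pyinotify.ProcessEvent):
    # one callback per finished or moved-in file
    def process_IN_CLOSE_WRITE(self, event):
        print('ready to process: %s' % event.pathname)

    def process_IN_MOVED_TO(self, event):
        print('ready to process: %s' % event.pathname)

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO
wm.add_watch(WATCH_DIR, mask)

notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()   # blocks, dispatching one callback per filesystem event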

余厌 2024-10-25 23:32:47

@jsbueno's post is really useful, but is still kind of slow on slow disks since libc readdir() only reads 32K of disk entries at a time. I am not an expert on making system calls directly in python, but I outlined how to write code in C that will list a directory with millions of files, in a blog post at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/.

The ideal case would be to call getdents() directly in python (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) so you can specify a read buffer size when loading directory entries from disk.

That would be better than calling readdir(), which, as far as I can tell, has a buffer size defined at compile time.
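
A rough sketch of that idea via ctypes and the glibc syscall() wrapper, assuming Python 3 and x86-64 Linux (where SYS_getdents64 is 217); the syscall number, buffer size, struct offsets, and path are assumptions to verify against your own system:

import ctypes
import os
import struct

SYS_getdents64 = 217            # x86-64 value; check asm/unistd_64.h on your arch
BUF_SIZE = 1024 * 1024          # ask the kernel for up to 1 MB of entries per call

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long

def iter_dir(path):
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    buf = ctypes.create_string_buffer(BUF_SIZE)
    try:
        while True:
            nread = libc.syscall(SYS_getdents64, fd, buf, BUF_SIZE)
            if nread < 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            if nread == 0:
                break                                   # end of directory
            data = buf.raw[:nread]
            # Each linux_dirent64 record is: u64 d_ino, s64 d_off,
            # u16 d_reclen, u8 d_type, then a NUL-terminated d_name.
            pos = 0
            while pos < nread:
                d_reclen = struct.unpack_from('<H', data, pos + 16)[0]
                name = data[pos + 19:pos + d_reclen].split(b'\0', 1)[0]
                if name not in (b'.', b'..'):
                    yield name.decode('utf-8', 'surrogateescape')
                pos += d_reclen
    finally:
        os.close(fd)

for filename in iter_dir('/some/huge/directory'):
    print(filename)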

沉溺在你眼里的海 2024-10-25 23:32:47

What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

No method will reveal a filename which "changed". It's not even clear what you mean by "filenames change, new files are added, and files are deleted". What is your use case?

Let's say you have three files: a.a, b.b, c.c.

Your magical "iterator" starts with a.a. You process it.

The magical "iterator" moves to b.b. You're processing it.

Meanwhile a.a is copied to a1.a1, a.a is deleted. What now? What does your magical iterator do with these? It's already passed a.a. Since a1.a1 is before b.b, it will never see it. What's supposed to happen for "filenames change, new files are added, and files are deleted"?

The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?


Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

Don't use the naked file system for coordination.

Use a queue.

Process A writes files and enqueues the add/change/delete memento onto a queue.

Process B reads the memento from the queue and then does the follow-on processing on the file named in the memento.
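
For illustration, a minimal sketch of that idea with multiprocessing.Queue; since process A and process B are really separate programs, a real setup would swap the in-process queue for a named pipe, a socket, or a message broker, and all names and paths below are made up:

import multiprocessing
import os
import shutil

def process_a(q, storage_dir):
    # Stand-in for the real writer: after each file is fully written,
    # it enqueues a memento describing what it did.
    for i in range(5):
        path = os.path.join(storage_dir, 'file%03d.txt' % i)
        with open(path, 'w') as f:
            f.write('payload\n')
        q.put(('add', path))
    q.put(None)                      # sentinel: nothing more is coming

def process_b(q, done_dir):
    # The consumer never lists the directory; it only acts on mementos.
    while True:
        memento = q.get()
        if memento is None:
            break
        action, path = memento
        if action == 'add':
            shutil.move(path, os.path.join(done_dir, os.path.basename(path)))

if __name__ == '__main__':
    q = multiprocessing.Queue()
    os.makedirs('incoming', exist_ok=True)
    os.makedirs('done', exist_ok=True)
    a = multiprocessing.Process(target=process_a, args=(q, 'incoming'))
    b = multiprocessing.Process(target=process_b, args=(q, 'done'))
    a.start(); b.start()
    a.join(); b.join()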

楠木可依 2024-10-25 23:32:47

I think what you are asking is impossible due to the nature of file IO. Once python has retrieved the listing of a directory it cannot maintain a view of the actual directory on disk, nor is there any way for python to insist that the OS inform it of any modifications to the directory.

All python can do is ask for periodic listings and diff the results to see if there have been any changes.

The best you can do is create a semaphore file in the directory which lets other processes know that your python process desires that no other process modify the directory. Of course they will only observe the semaphore if you have explicitly programmed them to.
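
A small sketch of the periodic-listing-and-diff approach (the path and interval are placeholders); note that each poll still pulls the full listing into memory, which is exactly the cost the question is trying to avoid:

import os
import time

WATCH_DIR = '/some/storage/location'   # placeholder
POLL_SECONDS = 5

def poll_for_changes(path, interval):
    previous = set(os.listdir(path))
    while True:
        time.sleep(interval)
        current = set(os.listdir(path))
        for name in sorted(current - previous):
            print('added: %s' % name)
        for name in sorted(previous - current):
            print('removed: %s' % name)
        previous = current

poll_for_changes(WATCH_DIR, POLL_SECONDS)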
