List files in a folder as a stream to begin processing immediately

Posted 2024-10-07 08:48:55

I have a folder with 1 million files in it.

I would like to begin processing immediately, while the files in this folder are still being listed, in Python or another scripting language.

The usual functions (os.listdir in Python, for example) are blocking, so my program has to wait until the whole list has been built, which can take a long time.

What's the best way to list huge folders?

Comments (4)

空城缀染半城烟沙 2024-10-14 08:48:55

If convenient, change your directory structure; but if not, you can use ctypes to call opendir and readdir.

Here is a copy of that code; all I did was indent it properly, add the try/finally block, and fix a bug. You might have to debug it. Particularly the struct layout.

Note that this code is not portable. You would need to use different functions on Windows, and I think the structs vary from Unix to Unix.

#!/usr/bin/env python3
"""
An equivalent of os.listdir, but as a generator using ctypes
"""

import os
from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER, get_errno
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory streams, corresponds to struct DIR"""
    pass
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    _fields_ = (
        ('d_ino', c_long), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_byte), # type of file; not supported by all file system types
        ('d_name', c_char * 4096) # filename
        )
c_dirent_p = POINTER(c_dirent)

# use_errno=True lets us read errno after a failed call
c_lib = CDLL(find_library("c"), use_errno=True)
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# FIXME Should probably use readdir_r here
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    # c_char_p needs bytes, so encode the path first
    dir_p = opendir(os.fsencode(path))
    if not dir_p:
        # opendir returned NULL; calling readdir on it would crash
        err = get_errno()
        raise OSError(err, os.strerror(err), path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name  # bytes, truncated at the first NUL
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        closedir(dir_p)

if __name__ == "__main__":
    for name in listdir("."):
        print(name)

伤感在游骋 2024-10-14 08:48:55

This feels dirty but should do the trick:

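import subprocess

def listdirx(dirname='.', cmd='ls'):
    # Note: ls sorts its output by default, so it may read the whole
    # directory before printing the first name.
    proc = subprocess.Popen([cmd, dirname], stdout=subprocess.PIPE,
                            universal_newlines=True)  # text mode: lines are str
    try:
        for line in proc.stdout:
            yield line.rstrip('\n')
    finally:
        proc.stdout.close()
        proc.wait()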

Usage: listdirx('/something/with/lots/of/files')

痴者 2024-10-14 08:48:55

For people coming in from Google: PEP 471 added a proper solution to the Python 3.5 standard library, and it was backported to Python 2.6+ and 3.2+ as the scandir module on PyPI.

Source: https://stackoverflow.com/a/34922054/435253

Python 3.5+:

  • os.walk has been updated to use this infrastructure for better performance.
  • os.scandir returns an iterator over DirEntry objects.

Python 2.6/2.7 and 3.2/3.3/3.4:

  • scandir.walk is a more performant version of os.walk
  • scandir.scandir returns an iterator over DirEntry objects.

The scandir() iterators wrap opendir/readdir on POSIX platforms and FindFirstFileW/FindNextFileW on Windows.
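
As a minimal sketch of what this looks like in practice, the generator below streams names from os.scandir() so that processing can start before the whole directory has been read. The helper name iter_file_names is hypothetical, and the with-statement form of os.scandir() assumes Python 3.6 or later.

```python
import os
from itertools import islice

def iter_file_names(path="."):
    """Yield entry names from `path` lazily, one at a time (hypothetical helper)."""
    # os.scandir() returns an iterator; the with-block closes the
    # underlying directory handle once iteration finishes.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name

# Take at most the first 100 names without building the full listing.
first_100 = list(islice(iter_file_names("."), 100))
```

Because the names are produced lazily, a consumer that only needs the first few entries never pays for listing the rest.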

The point of returning DirEntry objects is to allow metadata to be cached, which minimizes the number of system calls made. (For example, DirEntry.stat(follow_symlinks=False) never makes a system call on Windows, because the FindFirstFileW and FindNextFileW functions throw in the stat information for free.)

Source: https://docs.python.org/3/library/os.html#os.scandir
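
To illustrate the cached metadata described above, here is a small sketch that sums the sizes of the regular files directly inside a folder using only DirEntry methods. The function name total_size_of_files is an assumption made for this example, and how many system calls are actually saved depends on the platform and file system.

```python
import os

def total_size_of_files(path="."):
    """Sum the sizes of regular files directly inside `path` (hypothetical helper)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # is_file() can often be answered from the type information
            # returned with the directory entry itself.
            if entry.is_file(follow_symlinks=False):
                # stat() results are cached on the DirEntry object.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

Subdirectories are skipped rather than descended into; os.walk would be the natural choice for a recursive version.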

暮色兮凉城 2024-10-14 08:48:55

Here is your answer on how to traverse a large directory file by file on Windows!

I searched like a maniac for a Windows DLL that will allow me to do what is done on Linux, but no luck.

So, I concluded that the only way is to create my own DLL that will expose those static functions to me, but then I remembered pywintypes.
And, YEEY! this is already done there. And, even more, an iterator function is already implemented! Cool!

A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So, I used pywintypes.

EDIT: They were hiding in plain sight in kernel32.dll. Please see ssokolow's answer, and my comment to it.

Sorry about the dependency. But I think that you can extract win32file.pyd from the ...\site-packages\win32 folder, along with any dependencies it turns out to need, and distribute it with your program independently of win32types if you have to.

I found this question when searching on how to do this, and some others as well.

Here:

How to copy first 100 files from a directory of thousands of files using python?

There I posted the full code, with the Linux version of listdir() from here (by Jason Orendorff) and the Windows version that I present here.

So anyone wanting a more or less cross-platform version, go there or combine two answers yourself.

EDIT: Or better still, use the scandir module, or os.scandir() in Python 3.5 and later. It handles errors better, among other things.

from win32file import FindFilesIterator
import os

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    if "*" not in path and "?" not in path:
        st = os.stat(path)  # Raises an error if the dir doesn't exist or access is denied
        # Check whether we got a directory or something else.
        # Check taken from stat.py (for fast checking); 0o... is the
        # octal literal syntax accepted by Python 2.6+ and 3.
        if (st.st_mode & 0o170000) != 0o040000:
            raise OSError(20, "Not a directory", path)  # 20 == ENOTDIR
        path = path.rstrip("\\/") + "\\*"
    # Otherwise, assume the user knows what they are doing
    for file in FindFilesIterator(path):
        name = file[-2]
        # Unfortunately, only drives (e.g. C:) don't include "." and ".." in the list.
        # The original test used "and", which can never be true; "or" is intended.
        if name == "." or name == "..":
            continue
        yield name