List files in a folder as a stream to begin processing immediately
I have a folder with 1 million files in it.
I would like to begin processing immediately while listing the files in this folder, in Python or another scripting language.
The usual functions (os.listdir in Python...) are blocking, and my program has to wait for the end of the listing, which can take a long time.
What's the best way to list huge folders?
Comments (4)
If convenient, change your directory structure; if not, you can use ctypes to call opendir and readdir.
Here is a copy of that code; all I did was indent it properly, add the try/finally block, and fix a bug. You might have to debug it, particularly the struct layout.
Note that this code is not portable. You would need to use different functions on Windows, and I think the structs vary from Unix to Unix.
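The code this answer refers to did not survive in this copy of the thread. As a rough sketch of the approach only (not the original code), the following generator calls opendir/readdir through ctypes. The name listdir_stream and the Dirent field layout are assumptions: the layout matches struct dirent on 64-bit Linux with glibc and would need adjusting on other systems.

```python
import ctypes
import os
import sys


class Dirent(ctypes.Structure):
    # ASSUMPTION: struct dirent as laid out on 64-bit Linux (glibc).
    # Other Unix systems order and size these fields differently.
    _fields_ = [
        ("d_ino", ctypes.c_ulong),
        ("d_off", ctypes.c_long),
        ("d_reclen", ctypes.c_ushort),
        ("d_type", ctypes.c_ubyte),
        ("d_name", ctypes.c_char * 256),
    ]


def listdir_stream(path):
    """Yield entry names one at a time, as readdir() returns them."""
    if not sys.platform.startswith("linux"):
        raise NotImplementedError("this sketch assumes the Linux struct dirent")

    # CDLL(None) gives access to the symbols of the running process (libc).
    libc = ctypes.CDLL(None, use_errno=True)
    libc.opendir.argtypes = [ctypes.c_char_p]
    libc.opendir.restype = ctypes.c_void_p
    libc.readdir.argtypes = [ctypes.c_void_p]
    libc.readdir.restype = ctypes.POINTER(Dirent)
    libc.closedir.argtypes = [ctypes.c_void_p]
    libc.closedir.restype = ctypes.c_int

    dirp = libc.opendir(os.fsencode(path))
    if not dirp:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    try:
        while True:
            entry = libc.readdir(dirp)
            if not entry:  # NULL pointer: end of directory (or an error)
                break
            name = entry.contents.d_name  # bytes up to the first NUL
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        # Runs even if the caller stops iterating early.
        libc.closedir(dirp)
```

Because it is a generator, the caller can start work on the first name before the rest of the directory has been read, for example `for name in listdir_stream('/something/with/lots/of/files'): ...`.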
This feels dirty but should do the trick:
Usage:
listdirx('/something/with/lots/of/files')
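The body of listdirx is missing from this copy of the thread. A minimal sketch of what such a generator might look like, assuming it streams the output of an external ls process (an assumption, not necessarily what the answer's author wrote). The -f flag asks ls not to sort, so names can arrive before the whole directory has been read; file names containing newlines would be split incorrectly.

```python
import os
import subprocess


def listdirx(dirname="."):
    """Yield file names from `ls -f` as the subprocess prints them."""
    # Assumption: a POSIX `ls` is on PATH. `-f` disables sorting and
    # also lists hidden entries, including "." and "..".
    proc = subprocess.Popen(["ls", "-f", "--", dirname], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            name = os.fsdecode(raw.rstrip(b"\n"))
            if name not in (".", ".."):
                yield name
    finally:
        # Closing the pipe lets `ls` exit if the caller stops early.
        proc.stdout.close()
        proc.wait()
```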
For people coming in off Google, PEP 471 added a proper solution to the Python 3.5 standard library, and it was backported to Python 2.6+ and 3.2+ as the scandir module on PyPI.
Source: https://stackoverflow.com/a/34922054/435253
Python 3.5+:
os.walk has been updated to use this infrastructure for better performance.
os.scandir returns an iterator over DirEntry objects.
Python 2.6/2.7 and 3.2/3.3/3.4:
scandir.walk is a more performant version of os.walk.
scandir.scandir returns an iterator over DirEntry objects.
The scandir() iterators wrap opendir/readdir on POSIX platforms and FindFirstFileW/FindNextFileW on Windows.
The point of returning DirEntry objects is to allow metadata to be cached, minimizing the number of system calls made. (For example, DirEntry.stat(follow_symlinks=False) never makes a system call on Windows, because the FindFirstFileW and FindNextFileW functions throw in stat information for free.)
Source: https://docs.python.org/3/library/os.html#os.scandir
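As an illustration, here is a short sketch of streaming a directory with os.scandir. The helper name iter_files is made up for this example, and the with-statement form of os.scandir assumes Python 3.6 or later.

```python
import os


def iter_files(path):
    """Yield (name, size) for each regular file as entries are read."""
    # os.scandir returns a lazy iterator, so the first entry is available
    # without waiting for the whole directory to be listed.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                # stat() may reuse metadata cached by the directory scan.
                yield entry.name, entry.stat(follow_symlinks=False).st_size
```

A caller could then write `for name, size in iter_files('/something/with/lots/of/files'): ...` and begin processing on the first entry.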
Here is your answer on how to traverse a large directory file by file on Windows!
I searched like a maniac for a Windows DLL that would let me do what is done on Linux, but no luck.
So I concluded that the only way was to create my own DLL that would expose those static functions to me, but then I remembered pywintypes.
And, YEEY! This is already done there. Even better, an iterator function is already implemented! Cool!
A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be somewhere out there, but I didn't find it. So I used pywintypes.
EDIT: They were hiding in plain sight in kernel32.dll. Please see ssokolow's answer, and my comment on it.
Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, together with any dependencies it needs, and distribute it with your program independently of win32types if you have to.
I found this question, and some others as well, when searching for how to do this.
Here:
How to copy first 100 files from a directory of thousands of files using python?
I posted the full code there, with the Linux version of listdir() (by Jason Orendorff) and the Windows version that I present here.
So anyone wanting a more or less cross-platform version can go there or combine the two answers themselves.
EDIT: Or better still, use the scandir module, or os.scandir() in Python 3.5 and later. It also handles errors and some other things better.