Fast folder size calculation with Python on Windows
I am looking for a fast way to calculate the size of a folder in Python on Windows. This is what I have so far:
import platform

import win32con
import win32file

# Directory names to skip; it must include '.' and '..' or the recursion never ends.
DIR_EXCLUDES = ['.', '..']

def get_dir_size(path):
    total_size = 0
    if platform.system() == 'Windows':
        try:
            items = win32file.FindFilesW(path + '\\*')
        except Exception:
            return 0
        # Add the size or perform recursion on folders.
        for item in items:
            attr = item[0]
            name = item[-2]
            # item[4]/item[5] hold the high/low 32-bit words of the file size.
            size = (item[4] << 32) + item[5]
            if (attr & win32con.FILE_ATTRIBUTE_DIRECTORY) and \
               not (attr & win32con.FILE_ATTRIBUTE_SYSTEM):  # skip system dirs
                if name not in DIR_EXCLUDES:
                    total_size += get_dir_size("%s\\%s" % (path, name))
            total_size += size
    return total_size
This is not good enough when the folder size is over 100 GB. Any ideas on how to improve it?
On a fast machine (2 GHz+, 5 GB of RAM), it took 72 seconds to go over 422 GB in 226,001 files and 12,043 folders. It takes 40 seconds using the Explorer Properties option.
I know I am being a bit greedy, but I am hoping for a better solution.
Laurent Luce
Comments (6)
A quick profiling of your code suggests that over 90% of the time is consumed in the FindFilesW() call alone. This means any improvement from tweaking the Python code would be minor. Tiny tweaks (if you were to stick with FindFilesW) could include ensuring DIR_EXCLUDES is a set instead of a list, avoiding the repeated lookups on other modules, indexing into item[] lazily, and moving the sys.platform check outside. The sketch below incorporates these changes and others, but it won't give more than a 1-2% speedup.
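The tweaked listing from this answer didn't survive the page extraction; here is a sketch of what those tweaks might look like, assuming pywin32's find-data tuple layout (attributes at index 0, size high/low words at indexes 4 and 5, filename at index -2):

import win32con
import win32file

DIR_EXCLUDES = {'.', '..'}  # a set, and it must exclude '.' and '..'

# Bind frequently used names to locals once instead of re-resolving
# module attributes inside the hot loop.
FIND_FILES = win32file.FindFilesW
DIR_ATTR = win32con.FILE_ATTRIBUTE_DIRECTORY
SYS_ATTR = win32con.FILE_ATTRIBUTE_SYSTEM

def get_dir_size(path):
    # The platform check is assumed to have been done once by the caller.
    total_size = 0
    try:
        items = FIND_FILES(path + '\\*')
    except win32file.error:
        return 0
    for item in items:
        attr = item[0]
        if attr & DIR_ATTR:
            if not (attr & SYS_ATTR):  # skip system dirs
                name = item[-2]  # index lazily: only fetch the name for dirs
                if name not in DIR_EXCLUDES:
                    total_size += get_dir_size("%s\\%s" % (path, name))
        else:
            total_size += (item[4] << 32) + item[5]
    return total_size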
The only significant speedup would come from using a different API, or a different technique. You mentioned in a comment doing this in the background, so you could structure it to do an incremental update using one of the packages for monitoring changes in folders. Possibly the FindFirstChangeNotification API or something like it. You could set up to monitor the entire tree, or depending on how that routine works (I haven't used it) you might be better off registering multiple requests on various subsets of the full tree, if that reduces the amount of searching you have to do (when notified) to figure out what actually changed and what size it is now.
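I haven't verified this end to end, but a minimal sketch of the notification loop, using pywin32's wrappers and a hypothetical watch path, might look like:

import win32con
import win32event
import win32file

WATCH_PATH = 'C:\\some\\tree'  # hypothetical root to monitor

# Ask Windows to signal a handle whenever a name or size changes
# anywhere under WATCH_PATH.
handle = win32file.FindFirstChangeNotification(
    WATCH_PATH,
    True,  # watch the whole subtree
    win32con.FILE_NOTIFY_CHANGE_FILE_NAME | win32con.FILE_NOTIFY_CHANGE_SIZE)
try:
    while True:
        rc = win32event.WaitForSingleObject(handle, 5000)  # 5 s timeout
        if rc == win32event.WAIT_OBJECT_0:
            # Something changed; update the cached total here, then re-arm.
            win32file.FindNextChangeNotification(handle)
finally:
    win32file.FindCloseChangeNotification(handle)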
Edit: I asked in a comment whether you were taking into account the heavy filesystem metadata caching that Windows XP and later do. I just checked performance of your code (and mine) against Windows itself, selecting all items in my C:\ folder and hitting Alt-Enter to bring up the properties window. After doing this once (using your code) and getting a 40s elapsed time, I now get 20s elapsed from both methods. In other words, your code is actually just as fast as Windows itself, at least on my machine.
You don't need to use a recursive algorithm if you use os.walk. Please check this question.
You should time both approaches, but this is supposed to be much faster:
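The code block for this answer didn't survive the page; an os.walk version along these lines is presumably what was meant:

import os

def get_dir_size(root):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total_size += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # unreadable or vanished file; skip it
    return total_size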
I don't have a Windows box to test on at the moment, but the documentation states that win32file.FindFilesIterator is "similar to win32file.FindFiles, but avoid the creation of the list for huge directories". Does that help?
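As a sketch (untested here, for the reason above), the question's routine rewritten around the iterator would be essentially a drop-in change:

import win32con
import win32file

DIR_EXCLUDES = {'.', '..'}

def get_dir_size(path):
    total_size = 0
    try:
        # Same find-data tuples as FindFilesW, but yielded one at a time
        # instead of materialized as one huge list.
        items = win32file.FindFilesIterator(path + '\\*')
    except win32file.error:
        return 0
    for item in items:
        attr = item[0]
        if attr & win32con.FILE_ATTRIBUTE_DIRECTORY:
            name = item[-2]
            if name not in DIR_EXCLUDES:
                total_size += get_dir_size("%s\\%s" % (path, name))
        else:
            total_size += (item[4] << 32) + item[5]
    return total_size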
It's a whopper of a directory tree. As others have said, I'm not sure you can speed it up... not like that, cold w/o data. And that means...
If you can cache data, somehow (not sure what the actual implication is), then you could speed things up (I think... as always, measure, measure, measure).
I don't think I have to tell you how to do caching, I guess, you seem like a knowledgeable person. And I wouldn't know off the cuff for Windows anyway. ;-)
This jumps out at me: the try/except wrapped around the FindFilesW() call. Exception handling can add significant time to your algorithm. If you can specify the path differently, in a way that you always know is safe, and thus avoid the need to catch exceptions (e.g., checking first whether the given path is a folder before finding files in it), you may see a significant speedup.
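For example, a guard along these lines (reusing the get_dir_size from the question) would trade the try/except for an up-front check:

import os

def safe_dir_size(path):
    # Verify the path is a folder before descending, rather than letting
    # FindFilesW raise and catching the exception.
    if not os.path.isdir(path):
        return 0
    return get_dir_size(path)  # the routine from the question, unchanged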