`os.path.getsize()` is slow on a network drive (Python, Windows)
I have a program that iterates over several thousand PNG files on an SMB shared network drive (a 2TB Samsung 970 Evo+) and adds up their individual file sizes. Unfortunately, it is very slow. After profiling the code, it turns out 90% of the execution time is spent on one function call:
`filesize += os.path.getsize(png)`
where each `png` variable is the file path to a single PNG file (of several thousand) in a `for` loop that iterates over each one obtained from `glob.glob()` (which, for comparison, is responsible for 7.5% of the execution time).
The code can be found here: https://pastebin.com/SsDCFHLX
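For reference, the loop described above can be sketched roughly as follows. This is a minimal reconstruction from the description, not the code behind the link; the function name `total_png_size` and the `*.png` pattern are assumptions.

```python
import glob
import os


def total_png_size(folder):
    """Sum the sizes of all PNG files directly inside `folder` (hypothetical helper)."""
    filesize = 0
    # glob.glob() lists the directory once; each getsize() call then
    # issues its own metadata request for a single file.
    for png in glob.glob(os.path.join(folder, "*.png")):
        filesize += os.path.getsize(png)
    return filesize
```

Calling `total_png_size()` with a path on the share and again with a local folder should reproduce the comparison described in the question.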
Clearly there is something about obtaining the file size over the network that is extremely slow, but I'm not sure what. Is there any way I can improve the performance? It takes just as long using `filesize += os.stat(png).st_size` too.
When the PNG files are stored on the computer locally, the speed is not an issue. It specifically becomes a problem when the files are stored on another machine that I access over the local network with a gigabit ethernet cable. Both are running Windows 10.
[2022-08-21 Update]
I tried it again with a 10 gigabit network connection this time and noticed something interesting. The very first time I run the code on the network share, the profiler looks like this:
[profiler screenshot]
But if I run it again afterward, `glob()` takes up significantly less time while `getsize()` is about the same:
[profiler screenshot]
If I instead run this code with the PNG files stored on a local NVMe drive (WD SN750) rather than a network drive, here's what the profiler looks like:
[profiler screenshot]
It seems that once it is run a second time on the network share, something has been cached that allows `glob()` to run much faster on the network share, at around the same speed as on the local NVMe drive. But `getsize()` remains extremely slow, about 1/10th of its local speed.
Can somebody help me understand these two points:
- Why is `getsize()` so much slower on the network share? Is there something that can be done to speed it up?
- Why is `glob()` slow the first time on the network share but not when I run it again immediately afterward?
Comments (3)
I don't know why `getsize()` is as slow as it is over the network; however, to speed it up you could try calling it concurrently, for example with a `ThreadPool(10)`.
You can also play around with the number of threads to potentially increase performance even further.
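The concurrent approach described above might look like the following. This is a minimal sketch, assuming `ThreadPool` refers to `multiprocessing.pool.ThreadPool`; the helper name `total_size_concurrent` and its `threads` parameter are hypothetical.

```python
import os
from multiprocessing.pool import ThreadPool


def total_size_concurrent(paths, threads=10):
    """Sum file sizes, issuing the getsize() calls from a pool of threads."""
    # Each getsize() call mostly waits on the network round trip, so
    # several requests can be in flight at once.
    with ThreadPool(threads) as pool:
        return sum(pool.map(os.path.getsize, paths))
```

Here `paths` would be the list returned by `glob.glob()`. The best thread count likely depends on the network and the server, so it is worth measuring a few values.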
Use `GetFileSizeEx`. You cannot have code with fewer syscalls.
This is trimmed-down code from this gist: https://gist.github.com/Pagliacii/774ed5d3ea78a36cdb0754be6a25408d
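As an illustration of calling `GetFileSizeEx` from Python, here is a minimal `ctypes` sketch. It is not the code from the gist: the helper names `_kernel32` and `get_file_size_ex` are hypothetical, the file is opened with `CreateFileW` requesting only `FILE_READ_ATTRIBUTES`, and whether this is actually faster than `os.path.getsize()` over SMB is an assumption that would need to be measured.

```python
import ctypes
import functools
import sys

FILE_READ_ATTRIBUTES = 0x0080
FILE_SHARE_ALL = 0x1 | 0x2 | 0x4  # share read | write | delete
OPEN_EXISTING = 3
FILE_ATTRIBUTE_NORMAL = 0x80


@functools.lru_cache(maxsize=None)
def _kernel32():
    """Load kernel32 once and declare the signatures used below."""
    k = ctypes.WinDLL("kernel32", use_last_error=True)
    k.CreateFileW.argtypes = [
        ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
        ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
    ]
    k.CreateFileW.restype = ctypes.c_void_p
    k.GetFileSizeEx.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.c_int64)]
    k.GetFileSizeEx.restype = ctypes.c_int
    k.CloseHandle.argtypes = [ctypes.c_void_p]
    k.CloseHandle.restype = ctypes.c_int
    return k


def get_file_size_ex(path):
    """Return the size of `path` in bytes via the Win32 GetFileSizeEx call."""
    if sys.platform != "win32":
        raise OSError("GetFileSizeEx is only available on Windows")
    k = _kernel32()
    handle = k.CreateFileW(path, FILE_READ_ATTRIBUTES, FILE_SHARE_ALL, None,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, None)
    # INVALID_HANDLE_VALUE is -1 interpreted as a pointer-sized integer.
    if handle is None or handle == ctypes.c_void_p(-1).value:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        size = ctypes.c_int64(0)
        if not k.GetFileSizeEx(handle, ctypes.byref(size)):
            raise ctypes.WinError(ctypes.get_last_error())
        return size.value
    finally:
        k.CloseHandle(handle)
```

In the loop from the question, `get_file_size_ex(png)` would replace `os.path.getsize(png)`. Note that each file still costs an open, a query, and a close.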
You can try using `Path` from `pathlib`.
If this does not help, learn more about this Python module [PathLib].
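A `pathlib` version of the loop could look like the sketch below; the helper name `total_png_size_pathlib` is hypothetical. Note that `Path.stat()` wraps the same `os.stat()` call that the question already tried, so it may well perform the same as `os.path.getsize()` on a network share.

```python
from pathlib import Path


def total_png_size_pathlib(folder):
    """Sum the sizes of all PNG files directly inside `folder` using pathlib."""
    # Path.glob() yields Path objects; stat() is called once per file.
    return sum(p.stat().st_size for p in Path(folder).glob("*.png"))
```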