How to open and read thousands of files quickly
My problem is that my application takes too long to load thousands of files. Yes, I know it's going to take a long time, but I would like to make it faster by any amount. What I mean by "load" is open the file to get its descriptor and then read the first 100 bytes or so of it.
So, my main strategy has been to create a second thread that opens and closes (without reading any contents) all the files. This seems to help because the thread runs ahead of the main thread, and I'm guessing the OS is caching the file metadata ahead of time so that when my main thread opens them, each open is quick. This has actually helped, because the lookahead thread can warm those caches while my main thread is parsing the data read in from the files.
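A minimal sketch of that lookahead thread, in POSIX terms (the path list and names here are illustrative); all it does is touch each file so the OS pulls the directory entries and file metadata into its caches:

```cpp
#include <string>
#include <thread>
#include <vector>
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // close

// Open and immediately close every file. No data is read; the goal is
// only to make the OS cache the directory entries and metadata before
// the main thread gets to each file.
void lookahead(const std::vector<std::string>& paths) {
    for (const std::string& p : paths) {
        int fd = open(p.c_str(), O_RDONLY);
        if (fd != -1)
            close(fd);
    }
}

int main() {
    std::vector<std::string> paths = {"a.dat", "b.dat"};  // thousands in practice
    std::thread warmer(lookahead, paths);  // runs ahead of the main loop

    // ... main thread opens each file and reads its first ~100 bytes ...

    warmer.join();
}
```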
So my real question is...what else can I do to make this faster? What approaches are there? Has anyone had success doing this?
I've heard of OS prefetching calls, but those were for virtual memory pages. Is there a way to tell the OS: hey, I'm going to need all these files pretty soon, so I suggest you start gathering them for me ahead of time? My lookahead thread is pretty crude.
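There is such a call on Unix: posix_fadvise(2) with POSIX_FADV_WILLNEED asks the kernel to start reading a region of a file into the page cache without blocking (Linux also offers readahead(2)). A sketch, assuming only the first block of each file matters since the reads are ~100 bytes:

```cpp
#include <fcntl.h>   // open, posix_fadvise, POSIX_FADV_WILLNEED
#include <unistd.h>  // close

// Ask the kernel to begin pulling the start of the file into the page
// cache in the background; the call returns without waiting for I/O.
void prefetch_head(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return;
    posix_fadvise(fd, 0, 4096, POSIX_FADV_WILLNEED);  // first 4 KB covers the 100 bytes
    close(fd);  // pages already read stay in the page cache after close
}
```

On Windows I'm not aware of a direct equivalent for files you haven't read yet; the closest hints are opening with FILE_FLAG_SEQUENTIAL_SCAN so the cache manager reads ahead aggressively, or simply having the lookahead thread perform the 100-byte reads itself instead of only opening the files.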
Are there low-level disk techniques I could use? Is there possibly a pattern of file access that would help? Right now, the files being loaded all come from the same folder. I suppose there is no way to determine where exactly they lie on disk and which ordering of file opens would be fastest for the disk. I'm also guessing that the disk has some hardware to make this as efficient as possible.
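On Unix there is a rough way to approximate disk order: on many filesystems the inode number loosely tracks a file's physical placement, so visiting files in inode order tends to reduce seeking. This is a heuristic, not a guarantee, and there is no portable Windows counterpart (NTFS extents can be queried with FSCTL_GET_RETRIEVAL_POINTERS if you want to go that far). A sketch for a single folder, as in your case:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>
#include <dirent.h>     // opendir, readdir, closedir
#include <sys/types.h>  // ino_t

// Return the file names in a directory sorted by inode number, which on
// many Unix filesystems roughly matches their on-disk layout.
std::vector<std::string> names_in_inode_order(const char* dir) {
    std::vector<std::pair<ino_t, std::string>> entries;
    if (DIR* d = opendir(dir)) {
        while (dirent* e = readdir(d))
            if (e->d_name[0] != '.')  // skip ".", "..", and dotfiles
                entries.emplace_back(e->d_ino, e->d_name);
        closedir(d);
    }
    std::sort(entries.begin(), entries.end());  // pairs sort by inode first
    std::vector<std::string> names;
    for (const auto& ent : entries)
        names.push_back(ent.second);
    return names;
}
```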
My application is mainly for Windows, but Unix suggestions would help as well.
I am programming in C++ if that makes a difference.
Thanks,
-julian
Answer:
My first thought is that this is going to be hard to work around from a programmatic level.
You'll find Linux and OSX can access thousands of files like this in a fraction of the time it takes Windows. I don't know how much control you have over the machine. If you can keep the thousands of files on a FAT partition, you should see better results than with NTFS.
How often are you scanning these files, and how often are they changing? If the ratio is heavily on the reading side, it would make sense to copy the start of each file into a cache. The cache could store the filename, the modification time, and the first 100 bytes of each of the thousands of files.
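A minimal sketch of such a cache, keyed by filename and invalidated by modification time; persisting it between runs (e.g. to a single cache file, which would replace thousands of opens with one) is left out:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <sys/stat.h>  // stat

// One cache entry per file: its last known mtime plus its first 100 bytes.
struct CachedHead {
    time_t mtime;
    std::string head;  // up to 100 bytes
};

// Return a pointer to the first 100 bytes of the file, rereading it only
// when its modification time has changed since the last scan.
const std::string* first_100_bytes(std::map<std::string, CachedHead>& cache,
                                   const std::string& path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
        return nullptr;                    // file missing or unreadable
    auto it = cache.find(path);
    if (it != cache.end() && it->second.mtime == st.st_mtime)
        return &it->second.head;           // cache hit: file unchanged

    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f)
        return nullptr;
    char buf[100];
    size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);

    CachedHead& entry = cache[path];       // insert or overwrite the entry
    entry.mtime = st.st_mtime;
    entry.head.assign(buf, n);
    return &entry.head;
}
```

The win comes from the read/change ratio: a stat() per file is much cheaper than an open-plus-read, so unchanged files cost only the stat.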