Millions of small graphics files and how to overcome slow file system access on XP
I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time; in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500GB and was formatted using the default operating system options.
I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or from the command line, the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder and moved files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below, I was amazed how much of a difference this made. Cheers :-)
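For reference, here is a minimal sketch of the approach (not the exact code I ran; the folder names, the .gif filter and the polling interval below are just placeholders):

import os
import shutil
import time

SRC = r"C:\gmapcreator_output"   # placeholder: folder GMapCreator renders into
DST = r"D:\tiles"                # placeholder: root of the new directory hierarchy

def target_path(filename):
    # abcdefg.gif -> DST\a\b\c\d\e\f\g.gif: one directory level per character
    # of the base name, with the last character keeping the extension.
    base, ext = os.path.splitext(filename)
    pieces = list(base[:-1]) + [base[-1] + ext]
    return os.path.join(DST, *pieces)

while True:                                  # stop with Ctrl-C
    for name in os.listdir(SRC):
        if not name.lower().endswith(".gif"):
            continue
        src = os.path.join(SRC, name)
        dst = target_path(name)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        try:
            shutil.move(src, dst)            # may fail if GMapCreator still has the file open
        except OSError:
            pass                             # leave it for the next pass
    time.sleep(5)                            # poll every few seconds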
Answers (5)
There are several things you could/should do:
Switch off the automatic generation of DOS 8.3 short filenames on the NTFS volume (a sketch of this is shown just below this list), or restrict file names to the 8.3 pattern themselves (e.g. i0000001.jpg, ...).
In any case, try making the first six characters of the filename as unique/different as possible.
If you use the same folder over and over (say adding files, removing files, re-adding files, ...), its directory index becomes fragmented, which slows things down further.
As already posted, consider splitting the files up into multiple directories; e.g. instead of keeping abcdefg.gif in one huge folder, store it as a\b\c\d\e\f\g.gif, as described in the update above.
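For the first point, a minimal sketch of switching 8.3 short-name generation off from a script (assuming administrator rights; the underlying command is fsutil, and the change only affects files created afterwards and may need a restart):

import subprocess

# Show the current setting, then disable automatic 8.3 short-name generation.
# Existing files keep the short names they already have.
subprocess.check_call(["fsutil", "behavior", "query", "disable8dot3"])
subprocess.check_call(["fsutil", "behavior", "set", "disable8dot3", "1"])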
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like
dir folder_with_millions_of_files
can take minutes. Switching to another FS or OS will not solve the problem; Linux has the same behavior, last time I checked. Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
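One rough way to do that grouping, as a sketch (the two-level, 256-bucket layout and the hash-of-filename scheme are my own assumptions, but with ~4.2 million tiles they keep each leaf folder down to a few dozen files):

import hashlib
import os

ROOT = r"D:\tiles"   # placeholder root for the tile store

def bucketed_path(filename):
    # Two directory levels of 256 buckets each (65,536 leaf folders), derived
    # from a hash of the filename, so ~4.2 million tiles average well under
    # a hundred files per folder.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return os.path.join(ROOT, digest[0:2], digest[2:4], filename)

# e.g. bucketed_path("abcdefg.gif") -> D:\tiles\<xx>\<yy>\abcdefg.gif,
# where <xx> and <yy> come from the hash of the name.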
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their names; for example, a file was stored under a subdirectory path built from the leading characters of its filename.
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
One solution is to implement haystacks: pack the many small files into a few large ones and keep an index of each file's offset and length. This is what Facebook does for photos, because the per-file metadata and the random reads required to fetch lots of small files are quite expensive and add no value for a data store like this.
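A toy sketch of the idea, assuming a single haystack file and an in-memory index (the real Haystack design persists the index, handles deletions, compaction and much more):

import os

class Haystack:
    # Append many small blobs (e.g. map tiles) into one large file and keep an
    # index of (offset, length) per name, so a read becomes a single seek
    # instead of a directory lookup per tile.
    def __init__(self, path):
        self.path = path
        self.index = {}                      # name -> (offset, length)

    def add(self, name, data):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)           # record the true end-of-file offset
            offset = f.tell()
            f.write(data)
        self.index[name] = (offset, len(data))

    def get(self, name):
        offset, length = self.index[name]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

# usage sketch
store = Haystack(r"D:\tiles.haystack")       # placeholder path
store.add("abcdefg.gif", b"GIF89a...")       # placeholder tile bytes
tile = store.get("abcdefg.gif")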