Millions of small graphics files and how to overcome slow filesystem access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time, and in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500GB and was formatted using the default operating system options.

I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or from the command line, the whole machine effectively locks up for several minutes before it recovers enough to do anything again.

I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering things, or whether I should look at using another file system altogether. Does anyone have any ideas on how I might overcome this issue?

Thanks,

Barry.

Update:

Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below, I was amazed how much of a difference this made. Cheers :-)
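
In case it helps anyone else, here is a minimal sketch of the kind of mover script described above. It is not the code I actually ran; the folder paths, the polling interval, and the assumption that every tile is a .gif whose basename maps one character per directory level are all illustrative.

import os
import shutil
import time

SRC = r"C:\gmapcreator\output"   # hypothetical: the folder GMapCreator renders into
DST = r"D:\tiles"                # hypothetical: root of the new directory hierarchy

def nested_target(name):
    # 'abcdefg.gif' -> directory a\b\c\d\e\f plus filename 'g.gif'
    stem, ext = os.path.splitext(name)
    return os.path.join(DST, *stem[:-1]), stem[-1] + ext

while True:
    for name in os.listdir(SRC):          # stays fast because the mover keeps SRC small
        if not name.lower().endswith(".gif"):
            continue
        target_dir, target_name = nested_target(name)
        os.makedirs(target_dir, exist_ok=True)
        try:
            shutil.move(os.path.join(SRC, name), os.path.join(target_dir, target_name))
        except OSError:
            pass                          # GMapCreator may still be writing it; retry next pass
    time.sleep(5)                         # simple polling rather than a filesystem watcher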


沐歌 2024-08-16 03:27:05


There are several things you could/should do:

  • Disable automatic NTFS short (8.3) file name generation (google it; a sketch of the registry tweak follows the example below)
  • Or restrict file names to the 8.3 pattern (e.g. i0000001.jpg, ...)

  • In any case, try to make the first six characters of each filename as unique/different as possible

  • If you keep using the same folder over and over (say adding files, removing files, re-adding files, ...)

    • Use contig to keep the directory's index file as unfragmented as possible (check this for an explanation)
    • Especially when removing many files, consider using the folder-remove trick to reduce the directory index file size
  • As already posted, consider splitting the files across multiple directories.

E.g. instead of

directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg

use

directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
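
To expand on the first bullet: on XP the switch is the registry value NtfsDisable8dot3NameCreation under HKLM\SYSTEM\CurrentControlSet\Control\FileSystem (fsutil behavior set disable8dot3 1 should do the same thing). Below is a rough Python sketch of checking and flipping it; it needs administrator rights, a reboot before it takes effect, and it only affects newly created files, so verify the details on your own machine.

import winreg   # Windows-only standard library module

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\FileSystem"
VALUE = "NtfsDisable8dot3NameCreation"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_READ | winreg.KEY_SET_VALUE) as key:
    current, _ = winreg.QueryValueEx(key, VALUE)
    print("current setting:", current)                        # 0 = short names are generated
    if current == 0:
        winreg.SetValueEx(key, VALUE, 0, winreg.REG_DWORD, 1)  # 1 = generation disabled
        print("8.3 name generation disabled; reboot for it to take effect")
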
嘿咻 2024-08-16 03:27:05


Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem; Linux has the same behavior, last time I checked.

Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
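
To put rough numbers on that (the 4.2 million figure comes from the question; the 300-file cap and the 16-way and 36-way fanouts are just examples), a quick back-of-the-envelope check:

import math

def min_depth(total_files, fanout, files_per_leaf=300):
    # Directory levels needed so no leaf folder holds more than files_per_leaf
    # files, assuming filenames spread evenly across 'fanout' subfolders per level.
    leaves_needed = math.ceil(total_files / files_per_leaf)
    return max(1, math.ceil(math.log(leaves_needed, fanout)))

print(min_depth(4_200_000, fanout=16))   # 0-9a-f prefixes -> 4 levels
print(min_depth(4_200_000, fanout=36))   # a-z0-9 prefixes -> 3 levels

So for 4.2 million tiles, three or four levels of single-character subfolders already brings each folder down to a few hundred files.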

深爱不及久伴 2024-08-16 03:27:05


The solution is most likely to restrict the number of files per directory.

I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.

gbp97m.xls

was stored in

g/b/p97m.xls

This works fine provided your files are named appropriately (we had a spread of characters to work with). So the resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
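
For what it's worth, a tiny sketch of that mapping, with the depth as a parameter (shard_path is my own name; two levels reproduces the gbp97m.xls example):

import os

def shard_path(root, filename, depth=2):
    # 'gbp97m.xls' with depth=2 -> root/g/b/p97m.xls
    return os.path.join(root, *filename[:depth], filename[depth:])

print(shard_path("data", "gbp97m.xls"))   # data/g/b/p97m.xls (backslashes on Windows)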

七秒鱼° 2024-08-16 03:27:05


One solution is to implement haystacks. This is what Facebook does for photos: the metadata lookups and random reads required to fetch each file as a separate filesystem object are quite expensive and add no value for a data store like this.

Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.
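
For illustration only, a toy version of that idea: append every image to one big store file and keep an in-memory index of (offset, length) per key, so a read is a single seek. This is just the shape of the approach, not Facebook's actual Haystack format.

import os

class TinyHaystack:
    # Toy append-only store: one big file on disk, needle offsets kept in memory.
    # A real implementation would also persist the index and store per-needle headers.
    def __init__(self, path):
        self.index = {}                       # key -> (offset, length)
        self.store = open(path, "a+b")

    def put(self, key, data):
        self.store.seek(0, os.SEEK_END)
        offset = self.store.tell()
        self.store.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        self.store.seek(offset)
        return self.store.read(length)

hs = TinyHaystack("tiles.haystack")
hs.put("abcdefg.gif", b"GIF89a...")           # tile bytes would go here
print(len(hs.get("abcdefg.gif")))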
