How do you deal with lots of small files?

Posted 2024-07-05 00:51:33


A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time-consuming.

I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.

Is there a way to optimize NTFS or Windows so that it can work with all these small files?


伤感在游骋 2024-07-12 00:51:33


To create a folder structure that will scale to a large unknown number of files, I like the following system:

Split the filename into fixed length pieces, and then create nested folders for each piece except the last.

The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.

12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg

This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.

And here's a beautiful PowerShell one-liner to get you going!

$s = '123456'

-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$', '' ), $s )
独孤求败 2024-07-12 00:51:33


Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?

遇见了你 2024-07-12 00:51:33


If there are any meaningful, categorical, aspects of the data you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.

The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).

Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into again in another 2 or 3 years... just looking at a list of 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to only look at smaller subsets of the files.

Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
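To make the year/month/day layout concrete, here is a minimal Python sketch; the `reading_path` helper name and the `.bin` extension are illustrative assumptions, not part of the original product:

```python
from datetime import datetime
from pathlib import Path

def reading_path(root: str, taken_at: datetime, reading_id: str) -> Path:
    """Map a reading to its year/month/day leaf directory.

    With a few thousand readings per day, each leaf directory stays
    in the 1-3k range mentioned above.
    """
    return (Path(root)
            / f"{taken_at:%Y}" / f"{taken_at:%m}" / f"{taken_at:%d}"
            / f"{reading_id}.bin")

print(reading_path("data", datetime(2024, 7, 5, 0, 51), "r0001").as_posix())
# data/2024/07/05/r0001.bin
```

Because the path is computed from the timestamp alone, both the writer and the researchers can locate a file without ever listing a large directory.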

可是我不能没有你 2024-07-12 00:51:33


Rename the folder each day with a time stamp.

If the application is saving the files into c:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.

Then you will get one folder for each day, each containing several thousand files.

You can extend the method further to group by month. For example, c:\Readings becomes c:\Archive\September\22.

You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
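A rough sketch of what that scheduled task could do, in Python; the date-stamp format and the `rotate_readings` name are my assumptions, since the answer only specifies "rename at midnight and create a new empty folder":

```python
import datetime
import tempfile
from pathlib import Path

def rotate_readings(folder: Path) -> Path:
    """Rename the live readings folder to a date-stamped name and
    recreate it empty, as the midnight task would."""
    stamp = datetime.date.today() - datetime.timedelta(days=1)  # yesterday's batch
    archived = folder.with_name(f"{folder.name}-{stamp:%Y-%m-%d}")
    folder.rename(archived)  # fails if the product still holds the folder open
    folder.mkdir()
    return archived

# Demo against a temporary directory instead of c:\Readings:
live = Path(tempfile.mkdtemp()) / "Readings"
live.mkdir()
(live / "r0001.bin").write_bytes(b"\x00" * 10)
archived = rotate_readings(live)
print(archived.name)  # e.g. Readings-2024-07-04
```

Note the `rename` call raising an error when the folder is in use is exactly the timing hazard the answer warns about.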

送君千里 2024-07-12 00:51:33


One common trick is to simply create a handful of subdirectories and divvy up the files.

For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.

寄意 2024-07-12 00:51:33


Aside from placing the files in sub-directories...

Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are displayed as individual files. In the background, the application would actually combine those files into larger ones (and since the sizes are always 64k, getting the data you need should be relatively easy), to get rid of the mess you have.

So you can still make it easy for the researchers to access the files they want, while also giving yourself more control over how everything is structured.
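Since every reading is exactly 64k, such a combined file can be addressed by plain offset arithmetic. A sketch of that idea in Python; the function names and single-pack-file format are my own illustration, not a known implementation:

```python
import os
import tempfile

RECORD_SIZE = 64 * 1024  # every reading is exactly 64k

def append_reading(pack_path: str, payload: bytes) -> int:
    """Append one fixed-size reading and return its record index."""
    if len(payload) != RECORD_SIZE:
        raise ValueError("readings are expected to be exactly 64k")
    with open(pack_path, "ab") as f:
        f.seek(0, os.SEEK_END)
        index = f.tell() // RECORD_SIZE
        f.write(payload)
    return index

def read_reading(pack_path: str, index: int) -> bytes:
    """Fetch reading #index with a single seek -- no directory scan."""
    with open(pack_path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE)

pack = os.path.join(tempfile.mkdtemp(), "readings.pack")
first = append_reading(pack, b"a" * RECORD_SIZE)
second = append_reading(pack, b"b" * RECORD_SIZE)
```

The fixed record size means no index file is needed: the record number alone determines the byte offset.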

小草泠泠 2024-07-12 00:51:33


Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.

If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.

The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from them. Consider the following files for example:

1.xml
24.xml
12331.xml
2304252.xml

You can sort them into directories like so:

data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml

This scheme will ensure that you will never have more than 100 files in each directory.
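Reading the examples above, the rule appears to be: drop the stem's first two characters and reverse the rest, with each remaining character becoming a one-character directory level (so each leaf holds at most the 100 files that share everything but their first two digits). A Python sketch of that interpretation:

```python
from pathlib import Path

def shard_path(filename: str, root: str = "data") -> Path:
    """Drop the stem's first two characters, reverse the rest, and
    use each remaining character as a directory level."""
    stem = filename.rsplit(".", 1)[0]
    levels = stem[2:][::-1]
    return Path(root, *levels, filename)

for name in ("1.xml", "24.xml", "12331.xml", "2304252.xml"):
    print(shard_path(name).as_posix())
# data/1.xml
# data/24.xml
# data/1/3/3/12331.xml
# data/2/5/2/4/0/2304252.xml
```

Using the trailing (fastest-varying) digits for the directory levels is what spreads sequentially numbered files evenly across the tree.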

情定在深秋 2024-07-12 00:51:33


I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.

半窗疏影 2024-07-12 00:51:33


I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.

You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.

In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR type of fashion, with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster, because NTFS can't handle the small files very well and the drive was better able to cache a 1MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of stored files.

冰雪之触 2024-07-12 00:51:33


If you can calculate names of files, you might be able to sort them into folders by date, so that each folder only have files for a particular date. You might also want to create month and year hierarchies.

Also, could you move files older than say, a year, to a different (but still accessible) location?

Finally, and again, this requires you to be able to calculate names, you'll find that directly accessing a file is much faster than trying to open it via explorer. For example, saying
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.

睡美人的小仙女 2024-07-12 00:51:33


You could try using something like Solid File System.

This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.

http://www.eldos.com/solfsdrv/

不顾 2024-07-12 00:51:33


The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.

One obvious way to resolve this issue, is moving the files to folders with a name based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc, create a directory structure like this:

ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db

Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
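A sketch of that lookup in Python; the three-character chunks and two levels mirror the example above, while `NAME_LEN` and the zero-padding rule are assumptions for handling variable-length names:

```python
from pathlib import Path

CHUNK = 3     # three characters per directory level, as in ABC\DEF\ above
LEVELS = 2    # two nested levels
NAME_LEN = 9  # assumed maximum stem length; shorter stems are zero-padded

def locate(filename: str) -> Path:
    """Derive the directory for a file from its (zero-padded) stem."""
    stem = filename.rsplit(".", 1)[0].rjust(NAME_LEN, "0")
    parts = [stem[i * CHUNK:(i + 1) * CHUNK] for i in range(LEVELS)]
    return Path(*parts, filename)

print(locate("ABCDEFGHI.db").as_posix())  # ABC/DEF/ABCDEFGHI.db
print(locate("XY.db").as_posix())         # 000/000/XY.db
```

Because the padding is deterministic, a file can always be found again from its name alone, with no directory listing required.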

鲜血染红嫁衣 2024-07-12 00:51:33


NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.

For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.

天涯沦落人 2024-07-12 00:51:33


NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16 bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name they are creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change as '8 dot 3' name files are only required by programs written for very old versions of Windows.

A reboot is required before this setting will take effect.
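For reference, the same change can be scripted rather than made by hand in regedit. Both commands below are standard Windows administrative tools, though you should verify the exact syntax on your particular Windows build before relying on it:

```bat
:: Disable 8.3 short-name generation (takes effect after a reboot)
fsutil behavior set disable8dot3 1

:: Equivalent registry edit via reg.exe
reg add "HKLM\System\CurrentControlSet\Control\FileSystem" ^
    /v NtfsDisable8dot3NameCreation /t REG_DWORD /d 1 /f
```

Note that disabling 8.3 name generation does not remove short names already recorded for existing files; it only stops new ones from being created.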
