Grand Central strategy for opening multiple files

I have a working implementation using Grand Central Dispatch queues that (1) opens a file and computes an OpenSSL DSA hash on "queue1", and (2) writes the hash out to a new "sidecar" file for later verification on "queue2".
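For reference, a minimal sketch of the kind of two-queue pipeline described above (names and helpers here are placeholders, not the actual implementation; ComputeDSAHash stands in for the OpenSSL hashing code):

// Hypothetical sketch of the existing two-queue pipeline; not the real code.
NSData *ComputeDSAHash(NSData *fileData);   // placeholder for the OpenSSL hashing step

void ProcessFile(dispatch_queue_t queue1, dispatch_queue_t queue2,
                 NSURL *fileURL, NSURL *sidecarURL)
{
    dispatch_async(queue1, ^{
        // (1) open the file and compute its hash on "queue1"
        NSData *fileData = [NSData dataWithContentsOfURL:fileURL];
        NSData *hash = ComputeDSAHash(fileData);
        dispatch_async(queue2, ^{
            // (2) write the hash out to the sidecar file on "queue2"
            [hash writeToURL:sidecarURL atomically:YES];
        });
    });
}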

I would like to open multiple files at the same time, but governed by some logic so that I don't "choke" the OS by having hundreds of files open and exceeding the hard drive's sustainable throughput. Photo-browsing applications such as iPhoto or Aperture seem to open multiple files and display them, so I'm assuming this can be done.

I'm assuming the biggest limitation will be disk I/O, as the application can (in theory) read and write multiple files simultaneously.

Any suggestions?

TIA

绮筵 2024-10-16 01:11:23

You are correct in that you'll be I/O bound, most assuredly. And it will be compounded by the random access nature of having multiple files open and being actively read at the same time.

Thus, you need to strike a bit of a balance. More likely than not, one file is not the most efficient, as you've observed.

Personally?

I'd use a dispatch semaphore.

Something like:

@property(nonatomic, assign) dispatch_queue_t dataQueue;
@property(nonatomic, assign) dispatch_semaphore_t execSemaphore;

And:

- (void) process:(NSData *)d {
    dispatch_async(self.dataQueue, ^{
        if (!dispatch_semaphore_wait(self.execSemaphore, DISPATCH_TIME_FOREVER)) {
            dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
                // ... do calculation work here on d ...
                dispatch_async(dispatch_get_main_queue(), ^{
                    // ... update main thread w/ new data here ...
                });
                dispatch_semaphore_signal(self.execSemaphore);
            });
        }
    });
}

Where it is kicked off with:

self.dataQueue = dispatch_queue_create("com.yourcompany.dataqueue", NULL);
self.execSemaphore = dispatch_semaphore_create(3);
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
.... etc ....

You'll need to determine how best you want to handle the queueing. If there are many items and there is a notion of cancellation, enqueueing everything is likely wasteful. Similarly, you'll probably want to enqueue URLs to the files to process, and not NSData objects like the above.

In any case, the above will process three things simultaneously, regardless of how many have been enqueued.
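For illustration, here is a rough sketch (assumptions mine, not part of the original answer) of the URL-enqueueing variant suggested above: the worker block opens and hashes the file itself, so only a lightweight NSURL sits in the queue. computeHashForData: is a hypothetical helper standing in for the hashing step.

- (void)processFileAtURL:(NSURL *)URL {
    dispatch_async(self.dataQueue, ^{
        // Block the serial queue until one of the three worker "slots" frees up.
        dispatch_semaphore_wait(self.execSemaphore, DISPATCH_TIME_FOREVER);
        dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
            NSData *fileData = [NSData dataWithContentsOfURL:URL]; // the read happens on the worker
            NSData *hash = [self computeHashForData:fileData];     // hypothetical helper
            dispatch_async(dispatch_get_main_queue(), ^{
                // update the UI / bookkeeping with the hash for this URL
                NSLog(@"hashed %@ -> %@", URL, hash);
            });
            dispatch_semaphore_signal(self.execSemaphore);
        });
    });
}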

走野 2024-10-16 01:11:23

You have received excellent answers already, but I wanted to add a couple points. I have worked on projects that enumerate all the files in a file system and calculate MD5 and SHA1 hashes of each file (in addition to other processing). If you are doing something similar, where you are searching a large number of files and the files may have arbitrary content, then some points to consider:

  • As noted, you will be I/O bound. If you read more than one file simultaneously, you will have a negative impact on the performance of each calculation. Obviously, the goal of scheduling calculations in parallel is to keep the disk busy between files, but you may want to consider structuring your work differently. For example, set up one thread that enumerates and opens the files, and a second thread that gets open file handles from the first thread one at a time and processes them (see the sketch after this list). The file system will cache catalog information, so the enumeration won't have a severe impact on reading the data, which is what will actually have to hit the disk.

  • If the files can be arbitrarily large, Chris' approach may not be practical since the entire content is read into memory.

  • If you have no other use for the data than calculating the hash, then I suggest disabling file system caching before reading the data.

If using NSFileHandles, a simple category method will do this per-file:

#import <Foundation/Foundation.h>
#include <fcntl.h>

@interface NSFileHandle (NSFileHandleCaching)
- (BOOL)disableFileSystemCache;
@end

@implementation NSFileHandle (NSFileHandleCaching)
- (BOOL)disableFileSystemCache {
    // F_NOCACHE tells the kernel not to populate the buffer cache with this file's data.
    return (fcntl([self fileDescriptor], F_NOCACHE, 1) != -1);
}
@end

  • If the sidecar files are small, you may want to collect them in memory and write them out in batches to minimize disruption of the processing.

  • The file system (HFS, at least) stores file records for files in a directory sequentially, so traverse the file system breadth-first (i.e., process each file in a directory before entering subdirectories).
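As a concrete illustration of the first bullet, here is a rough sketch (assumptions mine, not code from the answer) of the enumerate-on-one-queue / process-on-another structure. It lists a single directory shallowly, in keeping with the breadth-first advice, and hands open NSFileHandles to a serial processing queue; hashFileHandle:forURL: is a hypothetical hashing helper.

- (void)processDirectoryAtURL:(NSURL *)directoryURL
{
    // In real code these queues would be created once and reused.
    dispatch_queue_t enumerationQueue =
        dispatch_queue_create("com.example.enumeration", NULL);  // lists and opens files
    dispatch_queue_t processingQueue =
        dispatch_queue_create("com.example.processing", NULL);   // serial: hashes one file at a time

    dispatch_async(enumerationQueue, ^{
        // Shallow listing of one directory; descending into subdirectories
        // (breadth-first) is left to the caller.
        NSArray *contents =
            [[NSFileManager defaultManager]
                contentsOfDirectoryAtURL:directoryURL
              includingPropertiesForKeys:[NSArray arrayWithObject:NSURLIsRegularFileKey]
                                 options:0
                                   error:NULL];
        for (NSURL *fileURL in contents) {
            NSNumber *isRegularFile = nil;
            [fileURL getResourceValue:&isRegularFile forKey:NSURLIsRegularFileKey error:NULL];
            if (![isRegularFile boolValue]) continue;

            NSFileHandle *handle = [NSFileHandle fileHandleForReadingFromURL:fileURL error:NULL];
            if (handle == nil) continue;
            [handle disableFileSystemCache];  // category from above

            // Real code would also bound how many handles are open at once,
            // e.g. with a semaphore as in the earlier answer.
            dispatch_async(processingQueue, ^{
                [self hashFileHandle:handle forURL:fileURL];      // hypothetical helper
            });
        }
    });
}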

The above are just suggestions, of course. You will want to experiment and measure performance to confirm the actual impact.

故事未完 2024-10-16 01:11:23

I'd use NSOperation for this because of the ease of handling both dependencies and cancellation.

I'd create one operation each for reading the data file, computing the data file's hash, and writing the sidecar file. I'd make each write operation dependent on its associated compute operation, and each compute operation dependent on its associated read operation.

Then I'd add the read and write operations to one NSOperationQueue, the "I/O queue," with a restricted width. The compute operations I'd add to a separate NSOperationQueue, the "compute queue," with a non-restricted width.

The reason for the restricted width on the I/O queue is that your work will likely be I/O bound; you may want it to have a width greater than 1, but it's very likely to be directly related to the number of physical disks on which your input files reside. (Probably something like 2x; you'll want to determine this experimentally.)

The code would wind up looking something like this:

@implementation FileProcessor

static NSOperationQueue *FileProcessorIOQueue = nil;
static NSOperationQueue *FileProcessorComputeQueue = nil;

+ (void)initialize
{
    if (self == [FileProcessor class]) {
        FileProcessorIOQueue = [[NSOperationQueue alloc] init];
        [FileProcessorIOQueue setName:@"FileProcessorIOQueue"];
        [FileProcessorIOQueue setMaxConcurrentOperationCount:2]; // limit width

        FileProcessorComputeQueue = [[NSOperationQueue alloc] init];
        [FileProcessorComputeQueue setName:@"FileProcessorComputeQueue"];
    }
}

- (void)processFilesAtURLs:(NSArray *)URLs
{
    for (NSURL *URL in URLs) {
        __block NSData *fileData = nil; // set by readOperation
        __block NSData *fileHashData = nil; // set by computeOperation

        // Create operations to do the work for this URL

        NSBlockOperation *readOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileData = CreateDataFromFileAtURL(URL);
            }];

        NSBlockOperation *computeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileHashData = CreateHashFromData(fileData);
                [fileData release]; // created in readOperation
            }];

        NSBlockOperation *writeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                WriteHashSidecarForFileAtURL(fileHashData, URL);
                [fileHashData release]; // created in computeOperation
            }];

        // Set up dependencies between operations

        [computeOperation addDependency:readOperation];
        [writeOperation addDependency:computeOperation];

        // Add operations to appropriate queues

        [FileProcessorIOQueue addOperation:readOperation];
        [FileProcessorComputeQueue addOperation:computeOperation];
        [FileProcessorIOQueue addOperation:writeOperation];
    }
}

@end

It's pretty straightforward; rather than deal with multiply-nested layers of sync/async as you would with the dispatch_* APIs, NSOperation allows you to define your units of work and your dependencies between them independently. For some situations this can be easier to understand and debug.

笑,眼淚并存 2024-10-16 01:11:23

libdispatch actually provides APIs explicitly for this! Check out dispatch_io; it will handle parallelizing IO when appropriate, and otherwise serializing it to avoid thrashing the disk.
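For example, here is a minimal sketch (assumptions mine; the answer itself has no code) of streaming a file through a dispatch_io channel, available on Mac OS X 10.7 / iOS 5 and later, and feeding it into a hash context. It uses CommonCrypto's CC_SHA1 in place of the question's OpenSSL call, and omits error handling:

#import <Foundation/Foundation.h>
#include <dispatch/dispatch.h>
#include <fcntl.h>
#include <CommonCrypto/CommonDigest.h>

// Hypothetical helper: stream the file at `path` and hand back its SHA-1 digest.
static void HashFileAtPath(NSString *path, void (^completion)(NSData *digest))
{
    __block CC_SHA1_CTX ctx;
    CC_SHA1_Init(&ctx);

    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_io_t channel =
        dispatch_io_create_with_path(DISPATCH_IO_STREAM, [path fileSystemRepresentation],
                                     O_RDONLY, 0, queue, ^(int error) {
                                         // cleanup handler: channel closed or failed to open
                                     });

    dispatch_io_read(channel, 0, SIZE_MAX, queue,
                     ^(bool done, dispatch_data_t data, int error) {
        if (data != NULL) {
            // dispatch_data_t may be discontiguous; walk each region.
            dispatch_data_apply(data, ^bool(dispatch_data_t region, size_t offset,
                                            const void *buffer, size_t size) {
                CC_SHA1_Update(&ctx, buffer, (CC_LONG)size);
                return true;  // keep iterating
            });
        }
        if (done) {
            unsigned char md[CC_SHA1_DIGEST_LENGTH];
            CC_SHA1_Final(md, &ctx);
            dispatch_io_close(channel, 0);
            completion([NSData dataWithBytes:md length:sizeof(md)]);
        }
    });
}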

天邊彩虹 2024-10-16 01:11:23

The following link is to a BitBucket project I set up utilizing NSOperation and Grand Central Dispatch in a primitive file-integrity application.

https://bitbucket.org/torresj/hashar-cocoa

I hope it is of help/use.
