IPC bottleneck?

Posted 2024-09-15 14:51:34

I have two processes, a producer and a consumer. IPC is done with OpenFileMapping/MapViewOfFile on Win32.

The producer receives video from another source, which it then passes over to the consumer and synchronization is done through two events.

For the producer:

Receive frame
Copy to shared memory using CopyMemory
Trigger DataProduced event
Wait for DataConsumed event

For the consumer

Indefinitely wait for DataProducedEvent
Copy frame to own memory and send for processing
Signal DataConsumed event

Without any of this, the video averages 5 fps.
If I add the events on both sides, but without the CopyMemory, it's still around 5 fps, though a tiny bit slower.
When I add the CopyMemory operation, it drops to 2.5-2.8 fps. Memcpy is even slower.

I find it hard to believe that a simple memory copy can cause this kind of slowdown.
Any ideas on a remedy?

Here's my code to create the shared mem:

HANDLE fileMap = CreateFileMapping(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, fileMapSize, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);

The size is 1024 * 1024 * 3
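For reference, the consumer attaches to the same objects by name. Simplified, that side looks roughly like this (the event names are placeholders, not the exact ones I use):

HANDLE fileMap = OpenFileMapping(FILE_MAP_WRITE | FILE_MAP_READ, FALSE, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);

// Placeholder event names; the producer creates the same named events with CreateEvent,
// so both processes end up referring to the same kernel objects.
HANDLE dataProducedEvent = OpenEvent(SYNCHRONIZE, FALSE, L"DataProducedEvent");
HANDLE dataConsumedEvent = OpenEvent(EVENT_MODIFY_STATE, FALSE, L"DataConsumedEvent");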

Edit - added the actual code:

On the producer:

void OnFrameReceived(...)
{
    // get buffer
    BYTE *buffer = 0;
...

    // copy data to shared memory
    CopyMemory(((BYTE*)mapView) + 1, buffer, length);

    // signal data event
    SetEvent(dataProducedEvent);

    // wait for it to be signaled back!
    WaitForSingleObject(dataConsumedEvent, INFINITE);
}

On the consumer:

while (WAIT_OBJECT_0 == WaitForSingleObject(dataProducedEvent, INFINITE))
{
    SetEvent(dataConsumedEvent);
}

Well, it seems that copying from the DirectShow buffer onto shared memory was the bottleneck after all. I tried using a Named Pipe to transfer the data over and guess what - the performance is restored.

Does anyone know of any reasons why this may be?

To add a detail that I didn't think was relevant before: the producer is injected and hooks onto a DirectShow graph to retrieve the frames.
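The named-pipe variant I tried has roughly this shape (the pipe name is a placeholder and error handling is stripped out):

// Producer: create the pipe, wait for the consumer to connect, then write one frame per call.
HANDLE pipe = CreateNamedPipe(L"\\\\.\\pipe\\frames", PIPE_ACCESS_OUTBOUND,
    PIPE_TYPE_BYTE | PIPE_WAIT, 1, fileMapSize, fileMapSize, 0, 0);
ConnectNamedPipe(pipe, 0);

// Inside OnFrameReceived, instead of CopyMemory to the mapping:
DWORD written = 0;
WriteFile(pipe, buffer, length, &written, 0);

// Consumer: open the same pipe and read frames in a loop into its own buffer.
// (A byte-mode pipe may return partial reads, so loop until a full frame has arrived.)
HANDLE client = CreateFile(L"\\\\.\\pipe\\frames", GENERIC_READ, 0, 0, OPEN_EXISTING, 0, 0);
DWORD bytesRead = 0;
ReadFile(client, localBuffer, fileMapSize, &bytesRead, 0);   // localBuffer: the consumer's own frame buffer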

Comments (2)

马蹄踏│碎落叶 2024-09-22 14:51:34

Copying of memory involves certain operations under the hood, and for video this can be significant.

I'd try another route: create a shared block for each frame, or for every few frames. Name them consecutively, i.e. block1, block2, block3 etc., so that the recipient knows which block to read next. Receive the frame directly into the allocated blockX, notify the consumer that the new block is available, then immediately allocate and start filling another block. The consumer maps the block and doesn't copy it: the block now belongs to the consumer, and the consumer can use that original buffer in its further processing. Once the consumer closes its mapping of the block, the mapping is destroyed. So you get a stream of blocks and avoid blocking.

If frame processing doesn't take much time but creating a shared block does, you can create a pool of shared blocks, large enough to ensure that the producer and consumer never attempt to use the same block (you can extend the scheme with a semaphore or mutex guarding each block).

Hope the idea is clear: avoid copying by filling the block in the producer and then using that same block in the consumer.
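A rough sketch of the producer-side setup I have in mind, with made-up names (frame_block_1, frame_block_2, ...) and a small fixed pool; error handling omitted:

#include <windows.h>
#include <cwchar>

const int   BLOCK_COUNT = 4;               // pool size: large enough that the producer never laps the consumer
const DWORD FRAME_SIZE  = 1024 * 1024 * 3;

HANDLE blockHandle[BLOCK_COUNT];
BYTE*  blockView[BLOCK_COUNT];

void CreateFrameBlocks()
{
    for (int i = 0; i < BLOCK_COUNT; ++i)
    {
        wchar_t name[32];
        swprintf(name, 32, L"frame_block_%d", i + 1);   // consecutive names: block 1, 2, 3, ...

        blockHandle[i] = CreateFileMappingW(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE,
                                            0, FRAME_SIZE, name);
        blockView[i] = (BYTE*)MapViewOfFile(blockHandle[i],
                                            FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, FRAME_SIZE);
    }
}

// Producer: receive frame N straight into blockView[N % BLOCK_COUNT], then notify the consumer.
// Consumer: OpenFileMapping the same name, process the frame in place, then close its mapping.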

我不是你的备胎 2024-09-22 14:51:34

The time it takes to copy 3MB of memory really shouldn't be noticeable at all. A quick test on my old (and busted) laptop completed 10,000 memcpy(buf1, buf2, 1024 * 1024 * 3) operations in around 10 seconds. At roughly 1/1000th of a second per copy, it shouldn't be slowing down your frame rate by a noticeable amount.
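Something along these lines, if you want to repeat the measurement on your own machine (GetTickCount is coarse, so treat the numbers as rough):

#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    const size_t FRAME_BYTES = 1024 * 1024 * 3;
    char* src = (char*)malloc(FRAME_BYTES);
    char* dst = (char*)malloc(FRAME_BYTES);
    memset(src, 0xAB, FRAME_BYTES);

    DWORD start = GetTickCount();
    for (int i = 0; i < 10000; ++i)
        memcpy(dst, src, FRAME_BYTES);
    DWORD elapsedMs = GetTickCount() - start;

    // Read from dst so the copies can't be optimised away entirely.
    printf("10000 copies: %lu ms total, ~%.2f ms per 3MB copy (dst[0]=%d)\n",
           elapsedMs, elapsedMs / 10000.0, dst[0]);

    free(src);
    free(dst);
    return 0;
}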

Regardless, it would seem that there is some optimisation that could be made to speed things up. Currently you seem to be either double or triple handling the data. Double handling, because you "receive the frame" and then "copy to shared memory". Triple handling, if "copy frame to own memory and send for processing" means that you really copy into a local buffer and then process it, instead of just processing straight out of the shared buffer.

The alternative is to receive the frame into the shared buffer directly and process it directly out of that buffer. If, as I suspect, you want to be able to receive one frame while processing another, just increase the size of the memory mapping to accommodate more than one frame and use it as a circular array. On the consumer side it would look something like this.

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);  // Consume data produced event
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber++ % FRAMES_IN_ARRAY_COUNT)];
processFrame(frame);
ReleaseSemaphore(...);     // Generate data consumed event

And the producer

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);  // Consume data consumed event
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber++ % FRAMES_IN_ARRAY_COUNT)];
receiveFrame(frame);
ReleaseSemaphore(...);     // Generate data produced event

Just make sure that the data consumed semaphore is initialised to FRAMES_IN_ARRAY_COUNT and the data produced semaphore is initialised to 0.
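Creating those two semaphores would look something like this (names and ring size are only examples):

#include <windows.h>

const LONG FRAMES_IN_ARRAY_COUNT = 4;   // example ring size

// "Data consumed" starts full so the producer can fill every free slot before it blocks;
// "data produced" starts at 0 so the consumer blocks until the first frame arrives.
HANDLE dataConsumedSem = CreateSemaphoreW(0, FRAMES_IN_ARRAY_COUNT, FRAMES_IN_ARRAY_COUNT,
                                          L"DataConsumedSem");
HANDLE dataProducedSem = CreateSemaphoreW(0, 0, FRAMES_IN_ARRAY_COUNT,
                                          L"DataProducedSem");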
