DirectShow audio/video PTS clock calculation
Greetings,
I have written a DirectShow source filter, running on a WinCE/ARM video processor, that takes the AVC video frames and AAC access units from an ATSC-153 broadcast. The output pins (two of them, one for video, one for audio) are connected to the appropriate decoders and renderers. Currently, I take the PTS from the appropriate RTP headers, pass it into the source filter, and perform the conversion to the DirectShow clock. The video PTS runs at the 90 kHz rate; the audio PTS rate varies, and my current test stream has the audio ticking at 55.2 kHz.
What follows are the convert_to_dshow_timestamp() and FillBuffer() routines. When I print out the converted timestamps as the filter retrieves the video/audio, the times are within 100-200 ms of each other. That would not be too bad, and would be something to work with. However, the video trails the audio by 2-3 seconds.
/* Convert a timestamp in ticks of the given clock rate (Hz) to the
 * DirectShow REFERENCE_TIME scale of 100-ns units.
 */
static unsigned long long convert_to_dshow_timestamp(
unsigned long long ts,
unsigned long rate
)
{
long double hz;
long double multi;
long double tmp;
if (rate == 0)
{
return 0;
}
hz = (long double) 1.0 / rate; /* seconds per tick */
multi = hz / 1e-7; /* 100-ns units per tick */
tmp = ((long double) ts * multi) + 0.5; /* scale and round to nearest */
return (unsigned long long) tmp;
}
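As an aside, the same conversion can be done in pure integer arithmetic, which avoids long double rounding entirely. A minimal sketch, assuming ts is small enough (for example, a 33-bit RTP/MPEG PTS) that ts * 10,000,000 cannot overflow 64 bits:

/* Integer-only variant (sketch): scale ticks at `rate` Hz to 100-ns
 * REFERENCE_TIME units, rounding to nearest.
 * Assumes ts * 10,000,000 fits in 64 bits.
 */
static unsigned long long convert_to_dshow_timestamp_int(
    unsigned long long ts,
    unsigned long rate
)
{
    if (rate == 0)
    {
        return 0;
    }
    return ((ts * 10000000ULL) + (rate / 2)) / rate;
}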
/* Source filter FillBuffer() routine */
HRESULT OutputPin::FillBuffer(IMediaSample *pSamp)
{
BYTE *pData;
DWORD dataSize;
pipeStream stream;
BOOL retVal;
DWORD returnBytes;
HRESULT hr;
DWORD discont;
REFERENCE_TIME ts;
REFERENCE_TIME df;
unsigned long long difPts;
unsigned long long difTimeRef;
pSamp->GetPointer(&pData);
dataSize = pSamp->GetSize();
ZeroMemory(pData, dataSize);
stream.lBuf = pData;
stream.dataSize = dataSize;
/* Pin type 1 is H.264 AVC video frames */
if (m_iPinType == 1)
{
retVal = DeviceIoControl(
ghMHTune,
IOCTL_MHTUNE_RVIDEO_STREAM,
NULL,
0,
&stream,
sizeof(pipeStream),
&returnBytes,
NULL
);
if (retVal == TRUE)
{
/* Get the data */
/* Check for the first of the stream, if so, set the start time */
pSamp->SetActualDataLength(returnBytes);
hr = S_OK;
if (returnBytes > 0)
{
/* The discontinuity flag is set in upper layers when an RTP
* sequence number has been lost.
*/
discont = stream.discont;
/* Check for another break in stream time */
if (
m_PrevTimeRef &&
((m_PrevTimeRef > (stream.timeRef + 90000 * 10)) ||
((m_PrevTimeRef + 90000 * 10) < stream.timeRef))
)
{
dbg_log(TEXT("MY:DISC HERE\n"));
if (m_StartStream > 0)
{
discont = 1;
}
}
/* If the stream has not started yet, or there is a
* discontinuity, then reset the stream time.
*/
if ((m_StartStream == 0) || (discont != 0))
{
sys_time = timeGetTime() - m_ClockStartTime;
m_OtherSide->sys_time = sys_time;
/* For video, the clockRate is 90 kHz. m_RefGap maps the elapsed
* system time (ms) into clock-rate ticks, plus a half-second pad.
*/
m_RefGap = (sys_time * (stream.clockRate / 1000)) +
(stream.clockRate / 2);
/* timeRef is the PTS for the frame from the RTP header */
m_TimeGap = stream.timeRef;
m_StartStream = 1;
difTimeRef = 1;
m_PrevPTS = 0;
m_PrevSysTime = timeGetTime();
dbg_log(
TEXT("MY:StartStream %lld: %lld: %lld\n"),
sys_time,
m_RefGap,
m_TimeGap
);
}
else
{
m_StartStream++;
}
difTimeRef = stream.timeRef - m_PrevTimeRef;
m_PrevTimeRef = stream.timeRef;
/* Rebase the PTS (in 90 kHz ticks) against the stream-start reference */
ts = stream.timeRef - m_TimeGap + m_RefGap;
ts = convert_to_dshow_timestamp(ts, stream.clockRate);
if (discont != 0)
{
dbg_log(TEXT("MY:VDISC TRUE\n"));
pSamp->SetDiscontinuity(TRUE);
}
else
{
pSamp->SetDiscontinuity(FALSE);
pSamp->SetSyncPoint(TRUE);
}
difPts = ts - m_PrevPTS;
df = ts + 1;
m_PrevPTS = ts;
dbg_log(
TEXT("MY:T %lld: %lld = %lld: %d: %lld\n"),
ts,
m_OtherSide->m_PrevPTS,
stream.timeRef,
(timeGetTime() - m_PrevSysTime),
difPts
);
pSamp->SetTime(&ts, &df);
m_PrevSysTime = timeGetTime();
}
else
{
Sleep(10);
}
}
else
{
dbg_log(TEXT("MY: Fill FAIL\n"));
hr = E_FAIL;
}
}
else if (m_iPinType == 2)
{
/* Pin Type 2 is audio AAC Access units, with ADTS headers */
retVal = DeviceIoControl(
ghMHTune,
IOCTL_MHTUNE_RAUDIO_STREAM,
NULL,
0,
&stream,
sizeof(pipeStream),
&returnBytes,
NULL
);
if (retVal == TRUE)
{
/* Get the data */
/* Check for the first of the stream, if so, set the start time */
hr = S_OK;
if (returnBytes > 0)
{
discont = stream.discont;
if ((m_StartStream == 0) || (discont != 0))
{
sys_time = timeGetTime() - m_ClockStartTime;
m_RefGap = (sys_time * (stream.clockRate / 1000)) +
(stream.clockRate / 2);
/* Mark the first PTS from stream. This PTS is from the
* RTP header, and is usually clocked differently than the
* video clock.
*/
m_TimeGap = stream.timeRef;
m_StartStream = 1;
difTimeRef = 1;
m_PrevPTS = 0;
m_PrevSysTime = timeGetTime();
dbg_log(
TEXT("MY:AStartStream %lld: %lld: %lld\n"),
sys_time,
m_RefGap,
m_TimeGap
);
}
/* Let the video side stream in first before letting audio
* start to flow.
*/
if (m_OtherSide->m_StartStream < 32)
{
pSamp->SetActualDataLength(0);
Sleep(10);
return hr;
}
else
{
pSamp->SetActualDataLength(returnBytes);
}
difTimeRef = stream.timeRef - m_PrevTimeRef;
m_PrevTimeRef = stream.timeRef;
if (discont != 0)
{
dbg_log(TEXT("MY:ADISC TRUE\n"));
pSamp->SetDiscontinuity(TRUE);
}
else
{
pSamp->SetDiscontinuity(FALSE);
pSamp->SetSyncPoint(TRUE);
}
/* Rebase the audio PTS (testing at 55.2 kHz) against the stream-start reference */
ts = stream.timeRef - m_TimeGap + m_RefGap;
ts = convert_to_dshow_timestamp(ts, stream.clockRate);
difPts = ts - m_PrevPTS;
df = ts + 1;
m_PrevPTS = ts;
dbg_log(
TEXT("MY:AT %lld = %lld: %d: %lld\n"),
ts,
stream.timeRef,
(timeGetTime() - m_PrevSysTime),
difPts
);
pSamp->SetTime(&ts, &df);
m_PrevSysTime = timeGetTime();
}
else
{
pSamp->SetActualDataLength(0);
Sleep(10);
}
}
else
{
/* Mirror the video branch; otherwise hr is returned uninitialized
* when DeviceIoControl fails on the audio pin.
*/
dbg_log(TEXT("MY: Fill FAIL\n"));
hr = E_FAIL;
}
}
return hr;
}
/* End of code */
I have tried adjusting the video PTS by simply adding (90000 * 10), to see whether the video would run far ahead of the audio, but it does not: the video still trails the audio by 2 seconds or more. I really don't understand why this does not work. Each video frame should present 10 seconds early. Is that not correct?
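For what it's worth, the arithmetic of that experiment checks out; a quick sanity check against the convert_to_dshow_timestamp() routine above (the expected value is my own calculation):

/* A 10-second offset in 90 kHz ticks should convert to exactly
 * 10 s on the 100-ns REFERENCE_TIME scale. */
unsigned long long offset = convert_to_dshow_timestamp(90000ULL * 10, 90000);
/* offset == 100000000, i.e. 10 s * 10,000,000 units/s */

So the shift really should land each video frame 10 seconds early, which suggests the delay is not coming from the timestamp math itself.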
The main question is, basically: are the algorithms sound? They seem to work fine when running the video and audio independently.
The source filter is not a push filter; I am not sure whether that makes a difference. I am not having issues with the decoders getting out of sync with the broadcast input.
Many thanks.
Actually, I figured out the problem, of which there were two.
The first was a bad workaround for the H.264 SPS frame. When the decoder started, it would discard every frame until it found an SPS frame. The stream was encoded at 15 frames per second, so this threw off the timing: the decoder would consume up to a second's worth of video in less than 10 ms. Every frame presented after that was considered late, and it would try to fast-forward frames to catch up. Being a live source, it would then run out of frames again. The workaround was placed in the code ahead of mine, to make sure there was a buffer of at least 32 frames, which at 15 fps is about 2 seconds.
The second problem centers on the root cause. I was using the PTSs from the RTP headers as the time reference. While this works in the individual audio or video case, there is no guarantee that a video RTP PTS will match the corresponding audio RTP PTS, and typically it will not. Hence the use of the RTCP NTP time according to the following mapping, as per the spec (RFC 3550):
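In sketch form (the helper name below is illustrative, not from the original code), each RTCP sender report pairs an NTP wallclock time with the RTP timestamp sampled at the same instant, so any stream's RTP timestamp can be projected onto the shared NTP timeline:

#include <stdint.h>

/* RFC 3550 sender-report mapping (illustrative sketch):
 *   ntp(pts) = sr_ntp + (pts - sr_rtp) / clock_rate
 * Projects an RTP timestamp onto the NTP wallclock timeline that the
 * audio and video streams share. */
static double rtp_to_ntp_seconds(
    uint32_t rtp_ts,     /* RTP timestamp of the access unit       */
    uint32_t sr_rtp,     /* RTP timestamp from the latest RTCP SR  */
    double sr_ntp,       /* NTP time, in seconds, from the same SR */
    uint32_t clock_rate  /* 90000 for video; varies for audio      */
)
{
    /* Unsigned subtraction followed by a signed cast handles
     * 32-bit RTP timestamp wraparound. */
    int32_t delta = (int32_t)(rtp_ts - sr_rtp);
    return sr_ntp + (double)delta / (double)clock_rate;
}

Computing this per stream puts the video and audio timestamps on a single clock.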
This allows me to match the actual video PTS to the corresponding audio PTS.