What is the decoded output of a video codec?
Folks,
I am wondering if someone can explain to me what exactly the output of video decoding is. Let's say it is an H.264 stream in an MP4 container.
For displaying something on the screen, I guess the decoder can provide two different types of output:
- Point - the (x, y) coordinates of the location and the (R, G, B) color for the pixel
- Rectangle - the (x, y, w, h) coordinates of the rectangle and the (R, G, B) color to display
There is also the issue of timestamps.
Can you please enlighten me, or point me to the right link, on what is generated by a decoder and how a video client can use this information to display something on screen?
I intend to download VideoLAN source and examine it but some explanation would be helpful.
Thank you in advance for your help.
Regards,
Peter
None of the above.
Usually the output will be a stream of bytes that contains just the color data. The X,Y location is implied by the dimensions of the video.
In other words, the first three bytes might encode the color value at (0, 0), the next three bytes the value at (0, 1), and so on. Some formats might use four-byte groups, or even a number of bits per pixel that isn't a whole number of bytes -- for example, if you use 5 bits for each color component and you have three color components, that's 15 bits per pixel. This might be padded to 16 bits (exactly two bytes) for efficiency, since that aligns the data in a way that CPUs can process more easily.
When you've processed exactly as many values as the video is wide, you've reached the end of that row. When you've processed exactly as many rows as the video is high, you've reached the end of that frame.
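To make the "implied X,Y" point concrete, here is a minimal C sketch that locates one pixel in a packed 24-bit RGB frame. It assumes 3 bytes per pixel and rows stored back-to-back with no padding; real decoders usually report a per-row "stride"/"pitch" that may be larger than width * 3, so treat this only as an illustration of the indexing idea.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Minimal sketch: locate one pixel in a packed 24-bit RGB frame.
 * Assumes 3 bytes per pixel and no per-row padding (real decoders
 * often pad each row to a larger stride). */
typedef struct { uint8_t r, g, b; } rgb24_t;

static rgb24_t read_pixel(const uint8_t *frame, int width, int x, int y)
{
    size_t offset = ((size_t)y * width + x) * 3;   /* row-major indexing */
    rgb24_t px = { frame[offset], frame[offset + 1], frame[offset + 2] };
    return px;
}

int main(void)
{
    uint8_t frame[4 * 2 * 3];                      /* 4x2 dummy frame */
    memset(frame, 0, sizeof frame);
    frame[(1 * 4 + 2) * 3] = 255;                  /* make pixel (2, 1) pure red */

    rgb24_t px = read_pixel(frame, 4, 2, 1);
    printf("R=%u G=%u B=%u\n", px.r, px.g, px.b);
    return 0;
}
```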
As for the interpretation of those bytes, that depends on the color space used by the codec. Common color spaces are YUV, RGB, and HSL/HSV.
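As a rough illustration of what a color-space conversion looks like, here is a small C sketch of a full-range BT.601-style YUV-to-RGB conversion for a single pixel. The coefficients and value ranges are just one common convention; the correct ones depend on what the stream actually signals.

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of a full-range BT.601 YUV -> RGB conversion for one pixel.
 * The exact coefficients and ranges depend on the color space the
 * stream signals; this is only one common convention. */
static uint8_t clamp8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t *r, uint8_t *g, uint8_t *b)
{
    double d = u - 128.0;   /* Cb offset */
    double e = v - 128.0;   /* Cr offset */
    *r = clamp8(y + 1.402 * e);
    *g = clamp8(y - 0.344 * d - 0.714 * e);
    *b = clamp8(y + 1.772 * d);
}

int main(void)
{
    uint8_t r, g, b;
    yuv_to_rgb(76, 85, 255, &r, &g, &b);   /* approximately pure red in full-range BT.601 */
    printf("R=%u G=%u B=%u\n", r, g, b);
    return 0;
}
```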
It depends strongly on the codec in use and what input format(s) it supports; the output format is usually restricted to the set of formats that are acceptable inputs.
Timestamp data is a bit more complex, since it can be encoded in the video stream itself, or in the container. At a minimum, the stream needs a framerate; from that, the time of each frame can be determined by counting how many frames have been decoded already. Other approaches, like the one taken by AVI, include a byte offset for every Nth frame (or just the keyframes) at the end of the file to enable rapid seeking. (Otherwise, you would need to decode every frame up to the timestamp you're looking for in order to determine where in the file that frame is.)
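Here is a tiny C sketch of the "count frames and divide by the framerate" case. The 30000/1001 framerate is just an example (NTSC); containers that carry an explicit timestamp per frame make this calculation unnecessary.

```c
#include <stdio.h>

/* Sketch: derive a presentation time from a frame counter when the
 * stream only provides a fixed frame rate. */
static double frame_time_seconds(long frame_index, int fps_num, int fps_den)
{
    /* each frame lasts fps_den / fps_num seconds */
    return (double)frame_index * fps_den / fps_num;
}

int main(void)
{
    /* frame 300 of a 30000/1001 (~29.97) fps stream */
    printf("%.3f s\n", frame_time_seconds(300, 30000, 1001));
    return 0;
}
```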
And if you're considering audio data too, note that with most codecs and containers, the audio and video streams are independent and know nothing about each other. During encoding, the software that writes both streams into the container format performs a process called muxing. It will write out the data in chunks of N seconds each, alternating between streams. This allows whoever is reading the stream to get N seconds of video, then N seconds of audio, then another N seconds of video, and so on. (More than one audio stream might be included too -- this technique is frequently used to mux a video track plus English and Spanish audio tracks into a single file that contains three streams.) In fact, even subtitles can be muxed in with the other streams.
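For the interleaving idea, here is a simplified C sketch. It always emits a packet from whichever stream is furthest behind, which is one common way muxers keep streams in step; the fixed N-second chunking described above, the packet payloads, and the container's real layout rules are all omitted, and the stream list and durations are made-up example values.

```c
#include <stdio.h>

/* Simplified sketch of the interleaving idea behind muxing: repeatedly
 * emit a packet from whichever stream is furthest behind, so video and
 * audio stay roughly in step inside the container. */
typedef struct {
    const char *name;       /* "video", "audio", ... (example streams) */
    double next_ts;         /* timestamp of this stream's next packet */
    double packet_duration; /* how much time one packet covers */
} stream_t;

int main(void)
{
    stream_t streams[] = {
        { "video", 0.0, 1.0 / 25.0 },      /* e.g. 25 fps video */
        { "audio", 0.0, 1024.0 / 48000.0 } /* e.g. 1024-sample frames at 48 kHz */
    };
    const int n = sizeof streams / sizeof streams[0];

    for (int emitted = 0; emitted < 12; emitted++) {
        int pick = 0;                      /* find the stream lagging furthest behind */
        for (int i = 1; i < n; i++)
            if (streams[i].next_ts < streams[pick].next_ts)
                pick = i;

        printf("write %s packet at %.4f s\n",
               streams[pick].name, streams[pick].next_ts);
        streams[pick].next_ts += streams[pick].packet_duration;
    }
    return 0;
}
```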
cdhowie got most of it.
When it comes to timestamps, the MPEG4 container contains tables that tell the video client when to display each frame. You should look at the spec for MPEG4. I think you normally have to pay for it, but it's definitely downloadable in places.
http://en.wikipedia.org/wiki/MPEG-4_Part_14
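For a feel of how such a per-frame timing table works, here is a C sketch modeled on the MP4 time-to-sample ("stts") idea: each entry says "the next sample_count frames each last sample_delta ticks", and dividing the running tick total by the track's timescale gives seconds. The numbers are made up, box parsing isn't shown, and composition (reordering) offsets are ignored.

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch modeled on the MP4 time-to-sample ("stts") idea. */
typedef struct {
    uint32_t sample_count;
    uint32_t sample_delta;  /* duration of each of those samples, in ticks */
} stts_entry_t;

int main(void)
{
    const uint32_t timescale = 30000;             /* ticks per second (example) */
    const stts_entry_t table[] = { { 5, 1001 } }; /* 5 frames of 1001 ticks each */
    const int entries = sizeof table / sizeof table[0];

    uint64_t ticks = 0;
    int frame = 0;
    for (int e = 0; e < entries; e++) {
        for (uint32_t i = 0; i < table[e].sample_count; i++) {
            printf("frame %d starts at %.4f s\n",
                   frame++, (double)ticks / timescale);
            ticks += table[e].sample_delta;       /* accumulate decode time */
        }
    }
    return 0;
}
```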