Semper Ad Meliora: Towards Better Video Codecs with Machine Learning (How Machine Learning (ML) Is Used in Video Encoding, Part 3)

The increasing demand for video content and steady advances in display technology, such as higher resolutions, make designing efficient video codecs ever more important. That is why improving standardized video codecs has always been an active research topic. However, although each generation of video codecs improves coding efficiency quite significantly, it also comes at the cost of increased encoding time complexity. The figure below summarizes the findings of a blog post about VVC and a study conducted on video codec complexities.

Encoding time and bitrate changes across standard codecs. The bitrate of the encoded video decreases at the cost of increased encoding time.

There is also no denying that machine learning-based solutions can be found in almost every area nowadays. Video codecs have received their fair share of this trend, as machine learning-based improvements to video codecs have been getting attention recently. These improvements can target individual codec components to address the high encoding time complexity, or they can deliver better encoding performance than the default components.

The extremely complex decision process of encoding makes video codecs a natural fit for machine learning-based improvements. As discussed in the first part of this blog series, the first step in video encoding is partitioning the frame into smaller blocks. It might sound like a simple task; however, the number of possible partitionings for each block makes it extremely time-consuming.

It is important to allocate smaller blocks to areas where the motion is high, so that the motion is captured more precisely. On the other hand, we can allocate larger blocks to areas with little motion, as we do not need to capture precise information there. In video codecs, this allocation is usually done with a brute-force approach, meaning that every possible block size is tried and the most suitable combination is chosen based on its rate-distortion cost.
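To make the brute-force idea concrete, below is a minimal sketch in Python of such a recursive rate-distortion search over a quadtree. The cost model is a toy assumption (squared deviation from the block mean as distortion, a fixed per-block signalling rate, and an illustrative Lagrange multiplier), not the actual HEVC cost function:

```python
import numpy as np

LAMBDA = 0.1      # Lagrange multiplier trading rate against distortion (illustrative)
MIN_SIZE = 8      # smallest square block, as in the HEVC quadtree
BLOCK_BITS = 64   # assumed fixed signalling cost per coded block (toy model)

def rd_cost(block: np.ndarray) -> float:
    """Toy rate-distortion cost J = D + lambda * R for coding `block` whole."""
    distortion = float(((block - block.mean()) ** 2).sum())
    return distortion + LAMBDA * BLOCK_BITS

def best_partition(block: np.ndarray):
    """Exhaustively compare 'code whole' against 'split into four quadrants'."""
    whole_cost = rd_cost(block)
    size = block.shape[0]
    if size == MIN_SIZE:
        return whole_cost, "leaf"          # cannot split below 8x8
    half = size // 2
    split_cost, children = 0.0, []
    for r in (0, half):                    # recurse into the four quadrants
        for c in (0, half):
            cost, subtree = best_partition(block[r:r + half, c:c + half])
            split_cost += cost
            children.append(subtree)
    if split_cost < whole_cost:
        return split_cost, children
    return whole_cost, "leaf"

# Example: a 64x64 CTU with one textured (gradient) quadrant and flat rest.
ctu = np.zeros((64, 64))
ctu[:32, :32] = np.tile(np.arange(32, dtype=float), (32, 1))
cost, tree = best_partition(ctu)
print(tree)
```

Running this on the synthetic CTU shows the intended behavior: the textured quadrant is split down to 8×8 while the flat quadrants are kept whole, exactly the kind of allocation described above.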

For example, HEVC, one of the most widely deployed modern video codecs, uses a default block size of 64×64 called a Coding Tree Unit (CTU). Therefore, each frame is split into 64×64-sized CTUs as the first step, so a 1920×1080 frame yields 510 CTUs. However, each block can be recursively split into four smaller blocks up to three more times, bringing the smallest square size down to 8×8. As a result, there are 83,522 different possible partitioning schemes for each CTU. Considering that every single one of them has to be evaluated, and that this has to be done for all 510 CTUs in every frame, you can see how much time this takes for an entire video.
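The number 83,522 follows from a simple recurrence: an 8×8 block has exactly one option (stay whole), while any larger block either stays whole or splits into four quadrants, each of which is partitioned independently. A few lines of Python (an illustrative check, not encoder code) reproduce the figures above:

```python
def partitions(size: int, min_size: int = 8) -> int:
    """Number of quadtree partitionings of a square block of width `size`."""
    if size == min_size:
        return 1                                   # an 8x8 block cannot split
    # Either keep the block whole (the +1), or split it into four quadrants,
    # each of which is partitioned independently (hence the fourth power).
    return 1 + partitions(size // 2, min_size) ** 4

print(partitions(16))  # 2
print(partitions(32))  # 17
print(partitions(64))  # 83522

# CTUs needed to cover a 1920x1080 frame with 64x64 blocks (ceiling division):
print(-(-1920 // 64) * -(-1080 // 64))  # 30 * 17 = 510
```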

What if there were another solution? Instead of trying every combination one by one, what if we could eliminate some options or even predict the optimal partitioning in one shot? These are the questions researchers in this domain have asked themselves, and their answer was to utilize machine learning.
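As a flavor of what such a predictor might look like, here is a hedged sketch of a tiny convolutional classifier in PyTorch that maps a 64×64 luma block to a split/no-split probability. The architecture, input choice, and thresholds are invented for illustration; published ML-based partitioning methods differ considerably in their designs:

```python
import torch
import torch.nn as nn

class SplitPredictor(nn.Module):
    """Toy binary classifier: should this 64x64 block be split?"""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                                # global pooling
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, block: torch.Tensor) -> torch.Tensor:
        x = self.features(block)
        return torch.sigmoid(self.classifier(x.flatten(1)))

model = SplitPredictor()
ctu = torch.rand(1, 1, 64, 64)   # one 64x64 luma block, normalized to [0, 1]
p_split = model(ctu)             # probability that the block should be split
print(float(p_split))
```

A confident prediction (say, a probability near 0 or 1) would replace the exhaustive rate-distortion comparison at that node with a single inference, which is where the encoding-time savings would come from.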

This was just one example of how machine learning can be used to improve video codecs. There are other components in the video encoding pipeline where machine learning can provide guidance.

Video encoding pipeline and possible ML use cases

Machine learning-based improvements can be applied wherever there is a decision-making process in the encoding pipeline. Whether it is block partitioning prediction to reduce encoding time complexity or optical flow estimation to improve motion compensation performance, there will always be a use case for an ML-based enhancement in video codecs.

That was a brief sneak peek of what can be done to improve video codecs using machine learning. In the next post, we will discuss how we can utilize machine learning in the remainder of the video encoding pipeline.


Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.

