Solus Ipse?: The Status of End-to-End Video Encoding with Machine Learning (How Machine Learning Is Used in Video Encoding, Part 5)

In the previous posts, we saw how machine learning is used to improve individual components of video codecs. Replacing standard codec components with machine-learning-boosted ones can yield considerable improvements. But what would happen if we replaced the entire video codec with machine learning?

No human intervention, no manually designed features in between. The machine learning literature is full of successful examples of what can happen when we take the human factor out of the equation and let the machine solve the problem end-to-end.

The story of AlphaZero is the best example of this. For decades, computer chess engines were designed with human biases such as "if you can take the enemy queen, take it." They were given the rules of the game, told which pieces are more valuable, what the best move would be in a given board position, and so on. The rest was up to the engine to learn by playing against human players.

AlphaZero was different by design. It didn't know what chess was. Nobody told it the rules of the game or what the best move was in a given situation. There was no human intervention. It just kept playing against itself, again and again (44 million games in 9 hours). In the end, the machine that didn't even know what chess was 9 hours earlier could defeat not only the best human chess players in the world but also the most advanced chess engines of the time. It was a game changer.

A similar story played out in games like Go with AlphaGo Zero and even complex video games like Dota 2 with OpenAI Five. The pattern is simple: if you let the machine learn the task end-to-end, it performs better in most cases. So, naturally, the question arises: what about video encoding?

First of all, it is crucial to mention that end-to-end video codecs still lag behind standard video codecs in performance. However, there are already multiple successful and promising early studies in this domain.

DVC: An End-to-end Deep Video Compression Framework is one of the successful examples of an end-to-end machine-learning-based video codec. DVC follows the structure of a standard video codec, but every component is replaced by a machine-learning-based method, and all components are jointly optimized during training. Instead of the standard block-based motion compensation, a learning-based optical flow estimation is used, and two auto-encoders compress the motion and residual information. The loss function is designed to make the components collaborate with each other while balancing the trade-off between compressed size and visual quality. The authors showed that DVC can outperform AVC, which was standardized in 2003.
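To make this structure concrete, here is a minimal PyTorch sketch of one DVC-style training step. The network sizes, the toy one-layer flow estimator, and the L1 rate proxy are all illustrative assumptions rather than the paper's actual architecture; a real system uses a full optical-flow network and learned entropy models to estimate the bit rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AutoEncoder(nn.Module):
    """Stand-in for the learned motion/residual compression networks."""

    def __init__(self, channels):
        super().__init__()
        self.enc = nn.Conv2d(channels, 64, 5, stride=2, padding=2)
        self.dec = nn.ConvTranspose2d(64, channels, 5, stride=2,
                                      padding=2, output_padding=1)

    def forward(self, x):
        latent = self.enc(x)
        # Additive uniform noise approximates quantization during training.
        latent = latent + torch.rand_like(latent) - 0.5
        return self.dec(latent), latent


def warp(frame, flow):
    """Bilinearly warp `frame` by a dense flow field (motion compensation)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0   # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1),
                         align_corners=True)


flow_net = nn.Conv2d(6, 2, 3, padding=1)   # toy optical-flow estimator
motion_codec = AutoEncoder(2)              # compresses the 2-channel flow
residual_codec = AutoEncoder(3)            # compresses the RGB residual

prev, cur = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = flow_net(torch.cat([prev, cur], dim=1))
flow_hat, m_latent = motion_codec(flow)               # compressed motion
prediction = warp(prev, flow_hat)                     # motion compensation
res_hat, r_latent = residual_codec(cur - prediction)  # compressed residual
reconstruction = prediction + res_hat

# Joint rate-distortion loss, L = lambda * D + R. A real codec estimates the
# rate R with learned entropy models; an L1 proxy keeps this sketch short.
distortion = F.mse_loss(reconstruction, cur)
rate_proxy = m_latent.abs().mean() + r_latent.abs().mean()
loss = 256.0 * distortion + rate_proxy   # lambda weights quality vs. size
loss.backward()   # gradients flow through every component jointly
```

Because a single loss drives the flow estimator, both auto-encoders, and the reconstruction at once, no component is tuned in isolation, which is exactly what distinguishes this design from a standard codec with one learned module swapped in.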

Another end-to-end machine-learning-based codec is proposed in the Learned Video Compression paper. This study targets the low-latency mode, so encoding time complexity was a key design factor. The key contribution is an alternative design for motion compensation. In traditional codecs, prior information about the current frame is propagated from previous or future reference frames, which makes it hard to maintain long-term memory. As an alternative, the model learns an arbitrary state that is propagated from frame to frame to minimize information loss, as sketched below.

As a result, the proposed end-to-end encoding method achieves performance competitive even with state-of-the-art codecs.
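Here is a minimal PyTorch sketch of the propagated-state idea. The recurrent cell, channel sizes, and readout are illustrative assumptions, not the paper's architecture; the point is only that a learned latent state, rather than explicit reference frames, carries information across time steps.

```python
import torch
import torch.nn as nn


class StatefulFrameCoder(nn.Module):
    """Toy recurrent coder: each step consumes the current frame plus the
    propagated learned state and emits a reconstruction and the new state."""

    def __init__(self, state_ch=32):
        super().__init__()
        self.update = nn.Sequential(
            nn.Conv2d(3 + state_ch, state_ch, 3, padding=1), nn.Tanh())
        self.readout = nn.Conv2d(state_ch, 3, 3, padding=1)

    def forward(self, frame, state):
        state = self.update(torch.cat([frame, state], dim=1))
        return self.readout(state), state


coder = StatefulFrameCoder()
frames = torch.rand(8, 1, 3, 64, 64)   # a short toy sequence
state = torch.zeros(1, 32, 64, 64)     # learned state, not a reference frame

for frame in frames:
    recon, state = coder(frame, state)  # state carries long-term memory
    # A real codec would entropy-code a quantized latent here, and the
    # decoder would update an identical state from the decoded data.
```

Since the state is free to encode whatever the model finds useful, it can retain information from many frames back, whereas a codec that only warps the previous reconstruction forgets anything not visible in that single reference.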

This was a brief exploration of end-to-end machine-learning-based video codecs. In our final post, we will dive into machine-learning-based standardization efforts in the video encoding domain.


Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.

