In this work, we present a deep learning model for video super-resolution that enables real-time enhancement of video quality. The proposed architecture consists of three main components: a 2D convolutional module and a transformer module for spatial feature extraction, and a customized variant of the BasicVSR architecture for capturing temporal dependencies between video frames.
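A minimal sketch of how these three components might be composed is shown below. The module names, channel counts, upsampling factor, and fusion strategy are assumptions for illustration, not the exact implementation described in this work; the transformer branch is left as a placeholder here and sketched separately after the next paragraph.

```python
import torch
import torch.nn as nn

# Schematic sketch only: names, channel counts, and the 4x upsampling factor
# are assumptions, not the paper's exact design.
class HybridVSR(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.channels = channels
        # (1) 2D convolutional module for local spatial features
        self.conv_branch = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # (2) transformer module for long-range spatial dependencies
        #     (the patch-unfolding front end is sketched separately below)
        self.transformer_branch = nn.Identity()
        # (3) BasicVSR-style recurrent propagation of a hidden state across frames
        self.propagate = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * 16, 3, padding=1), nn.PixelShuffle(4))

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t, _, h, w = frames.shape
        state = frames.new_zeros(b, self.channels, h, w)
        outputs = []
        for i in range(t):
            feat = self.conv_branch(frames[:, i])      # per-frame spatial features
            feat = self.transformer_branch(feat)
            state = torch.relu(self.propagate(torch.cat([feat, state], dim=1)))
            outputs.append(self.upsample(state))       # 4x super-resolved frame
        return torch.stack(outputs, dim=1)             # (B, T, 3, 4H, 4W)
```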
The key contribution of this work is the introduction of a transformer module into the architecture of video super-resolution models. We use an unfolding technique to convert the input image into a 1D sequence, which serves as input to the transformer. This allows the model to capture long-range dependencies within the image, which can be crucial for the reconstruction itself.
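The following sketch illustrates the general unfolding idea: a feature map is split into patches, each patch is flattened into a token, and the resulting 1D sequence is fed to a standard transformer encoder layer. The patch size, embedding dimension, and layer names are assumptions chosen for the example, not values taken from this work.

```python
import torch
import torch.nn as nn

# Illustrative only: patch size and embedding dimension are assumptions.
class PatchUnfoldEmbed(nn.Module):
    """Unfold a feature map into a 1D sequence of patch tokens for a transformer."""

    def __init__(self, in_channels=64, patch_size=4, embed_dim=128):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
        self.proj = nn.Linear(in_channels * patch_size * patch_size, embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        tokens = self.unfold(x)                # (B, C*p*p, N), N = (H/p)*(W/p)
        tokens = tokens.transpose(1, 2)        # (B, N, C*p*p): one token per patch
        return self.proj(tokens)               # (B, N, embed_dim)

# Usage: self-attention over the token sequence captures long-range dependencies.
x = torch.randn(2, 64, 32, 32)
seq = PatchUnfoldEmbed()(x)                    # (2, 64, 128)
encoder = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
out = encoder(seq)                             # attention spans all patches at once
```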
The results show that our model achieves results comparable to currently established video super-resolution models, while improving execution time. Despite its higher memory requirements, our model successfully enhances the visual quality of videos in real time. We also emphasize that high PSNR and SSIM values are not always the best indicators of image quality, as visual evaluation remains important for assessing the results.