DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Chang-Han Yeh1 Chin-Yang Lin1 Zhixiang Wang2
Chi-Wei Hsiao3 Ting-Hsuan Chen1 Yu-Lun Liu1
1National Yang Ming Chiao Tung University  2University of Tokyo  3MediaTek Inc
Abstract

This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often need retraining for different settings and struggle with limited generalization across various degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8× super-resolution and high-standard-deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research leads to more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output. See our project page for video results: jimmycv07.github.io/DiffIR2VR_web.

1 Introduction

Video restoration transforms low-quality video into high-quality video. It usually involves video denoising, super-resolution, and deblurring. State-of-the-art methods that employ convolutional neural networks (CNNs) [2, 32, 57] or transformers [19, 49, 78] trained on large-scale data achieve impressive results. However, these regression-based methods often produce blurry outputs without realistic details (Fig. 2(a)). Furthermore, the degradations they address are typically well-defined (e.g., bicubic downsampling, a given noise standard deviation), and models are often tailored to specific degradations only. This restricts their generalization capabilities, as different settings often require additional paired data and retraining of the model.

Diffusion models have recently been adapted to image restoration [85, 45]. Because of their powerful generative ability, they can hallucinate realistic details. But this ability inherently comes with high randomness. As a result, directly performing per-frame inference to process videos leads to severe flickering (Fig. 2(b)). This phenomenon is even more pronounced in Latent Diffusion Models (LDMs) because the decoder magnifies the randomness.

A potential solution to reduce the temporal flickering is to fine-tune or train a single-image diffusion model by inserting 3D convolution and temporal attention layers into it. However, the modified model needs to be trained on videos to impose temporal consistency, which often requires unaffordable computational resources (e.g., 32 A100-80G GPUs for video upscaling [102]). Moreover, different tasks necessitate retraining the model.

Given limited computational resources, we present a novel zero-shot video restoration framework that transforms low-quality input videos into temporally consistent high-quality outputs. We design two training-free modules, hierarchical latent warping and hybrid flow-guided spatial-aware token merging, to enforce temporal consistency in both latent and token (features from the attention layers) spaces. Our method can be applied to any pre-trained image diffusion model without additional training or fine-tuning. Extensive experiments demonstrate that our video restoration method outperforms state-of-the-art approaches in both video quality and temporal consistency, even under extreme degradation conditions. To summarize, our main contributions are as follows:

  • We propose a novel, zero-shot video restoration method that achieves realistic results and maintains temporal consistency, compatible with any image-based diffusion models.

  • Our training-free framework manipulates both latent and token spaces to enforce semantic consistency across frames, introducing hierarchical latent warping to maintain consistency within and between batches and improving token merging with flow correspondence and spatial information.

  • Our method achieves state-of-the-art restoration results, especially under extreme degradation and large motion. It handles various levels of degradation with a single model, offering greater generalizability and robustness than traditional regression-based methods.

Figure 1: Zero-shot temporal-consistent diffusion model for video restoration. Given a pre-trained diffusion model for single-image restoration, our method generates temporally consistent restored video with fine details without any further training.
Figure 2: 4× video super-resolution results. (a) Traditional regression-based methods such as FMA-Net [96] are limited to the training data domain and tend to produce blurry results when encountering out-of-domain inputs. (b) Although applying image-based diffusion models such as DiffBIR [45] to individual frames can generate realistic details, these details often lack consistency across frames. (c) Our method leverages an image diffusion model to restore videos, achieving both realistic and consistent results without any additional training.

2 Related Work

Video Restoration.

Video restoration aims to restore high-quality frames from degraded videos, addressing issues such as noise, blur, and low resolution [7, 9, 31, 39, 96, 101, 47, 48]. This task is more challenging than image restoration because it requires maintaining temporal consistency across frames. Learning-based approaches often employ techniques such as optical flow warping [30, 58, 67, 68, 88], deformable convolutions [7, 8, 17, 77, 80, 81, 103], and attention mechanisms to handle temporal dependencies [5, 40, 43, 44, 98]. One major limitation is their dependency on paired high-quality (HQ) and low-quality (LQ) data for training [10, 86, 93], which is even more difficult to obtain for videos than for images. Moreover, most existing approaches [37, 36, 38, 40, 44] assume predefined degradation processes, reducing their effectiveness in real-world applications where degradations are unknown and diverse, thus leading to poor generalization performance. Additionally, these models often need retraining for different degradation levels or types [43, 46, 55, 95, 96], highlighting their limited generalization capabilities. Last but not least, these methods tend to lose significant detail, similar to image restoration [11, 42, 82, 99].

Diffusion Models for Image Restoration.

With significant advancements in diffusion models [13, 18, 25, 26, 63], many diffusion-based approaches have been proposed for image restoration [21, 26, 56, 70, 73, 79, 92]. One straightforward method involves training a diffusion model from scratch [63, 66, 85, 97], conditioned on low-quality images [66, 1]. However, this approach requires substantial computational resources. To reduce these costs, a common strategy is to introduce constraints into the reverse diffusion process of pre-trained models, as demonstrated by DDRM [34]. While efficient, these methods [13, 15, 21, 34, 72, 83, 89] depend on predefined image degradation processes or pretrained super-resolution (SR) models, which limits their generalizability. Recent works have enhanced performance by fine-tuning frozen pre-trained diffusion models with additional trainable layers [79, 92, 100], as seen in StableSR [79] and DiffBIR [45]. Despite their effectiveness, these methods encounter challenges in video restoration, where the inherent randomness of the diffusion process can cause temporal inconsistencies across frames. Our method allows these methods to work on video without any training.

Diffusion Models for Video Tasks.

Building on the success of text-to-image diffusion models [3, 13, 22, 24, 25, 53, 62, 65, 100], recent research explores using diffusion models for video tasks [20, 27, 28, 29, 50, 51, 52], extending pre-trained image diffusion models to video processing. Methods [12, 69], like Upscale-A-Video [102] and MGLD-VSR [94], achieve this by integrating and fine-tuning temporal layers. Upscale-A-Video adds temporal layers to UNet and VAE-Decoder, ensuring sequence consistency. Similarly, MGLD-VSR uses motion dynamics from low-resolution videos and calculated optical flows to align latent features. These methods require paired video data and substantial computational resources. Alternatively, zero-shot methods use existing image diffusion models to generate videos without training [6, 14, 16, 23, 61, 84, 91, 93, 100], employing techniques like token merging [4], noise shuffling, and latent warping. VidToMe [41] and TokenFlow [23] enhance temporal consistency by merging and aligning attention tokens across frames; Rerender-A-Video [90] employs latent warping [76, 87] and frame interpolation; RAVE [33] uses noise shuffling to maintain frame consistency in longer videos with reduced time complexity. These techniques are capable of generating impressive video sequences with minimal effort. However, they often produce blurry results and struggle with semantic consistency in demanding video restoration tasks. Inspired by these methods, our training-free framework manipulates latent and token spaces to ensure semantic consistency across frames, introducing hierarchical latent warping and improving token merging with flow correspondence and spatial information.

Figure 3: Pipeline of our proposed zero-shot video restoration method. We process low-quality (LQ) videos in batches using a diffusion model, with a keyframe randomly sampled within each batch. (a) At the beginning of the diffusion denoising process, hierarchical latent warping provides rough shape guidance both globally, through latent warping between keyframes, and locally, by propagating these latents within the batch. (b) Throughout most of the denoising process, tokens are merged before the self-attention layer. For the downsample blocks, optical flow is used to find the correspondence between tokens, and for the upsample blocks, cosine similarity is utilized. This hybrid flow-guided, spatial-aware token merging accurately identifies correspondences between tokens by leveraging both flow and spatial information, thereby enhancing overall consistency at the token level.

3 Method

Given a low-quality video with $n$ frames $\{lq_1, lq_2, \ldots, lq_n\}$, our goal is to restore it into a high-quality video $\{hq_1, hq_2, \ldots, hq_n\}$ using off-the-shelf image-based diffusion models. However, as illustrated in Fig. 2 and Fig. 7, directly applying these models to each frame individually results in temporal inconsistency due to the inherent stochasticity of diffusion models, particularly in cases of extreme degradation. Our method, as depicted in Fig. 3, addresses this by enforcing temporal stability in both latent and token space during restoration through two main components: Hierarchical Latent Warping (Sec. 3.2) and Hybrid Flow-guided Spatial-aware Token Merging (Sec. 3.3). In this section, we first briefly introduce the background of diffusion models and video token merging in Sec. 3.1. We then introduce the hierarchical latent warping strategy in Sec. 3.2, hybrid flow-guided spatial-aware token merging in Sec. 3.3, and their scheduling in Sec. 3.4.

Figure 4: An illustration of our key modules. Without requiring any training, these modules can achieve coherence across frames by enforcing temporal stability in both latent and token space. Hierarchical latent warping provides global and local shape guidance; Hybrid spatial-aware token merging before the self-attention layer improves temporal consistency by matching similar tokens using optical flow in the down blocks and cosine similarity in the up blocks of the UNet.
Figure 5: Token correspondences. Correspondences found by cosine similarity and by optical flow. (Top) At the beginning of the denoising process, the latents in the UNet downblocks are too noisy for cosine similarity to be effective, while optical flow estimated from LQ frames remains reliable. (Bottom) Flow and cosine similarity often identify different correspondences, so a hybrid approach is more effective.

3.1 Preliminaries

Diffusion Models.

Diffusion models are a type of generative model that models a data distribution $p_{\text{data}}$ through gradual diffusing and denoising. The forward process diffuses a clean image $x_0$ with Gaussian noise over $T$ steps, given by

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1} \;\Rightarrow\; x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \tag{1}$$

where $t \sim [1, T]$, $\epsilon_t, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The latent variable $x_T$ is nearly standard Gaussian when $T$ is large enough. A denoiser $\epsilon_\theta$, usually implemented with a UNet [64], is trained to estimate the noise $\epsilon_t$ by minimizing $\mathbb{E}_{t \sim [1, T],\, x_0,\, \epsilon_t}\left[\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2\right]$. During inference, the inverse process starts from i.i.d. noise $x_T$ and produces a clean image $x_0 \sim p_{\text{data}}$ by gradual denoising with the well-trained denoiser over $\bar{T}$ steps [26, 71, 74], where $\bar{T} \in [1, T]$. These unconditional generative models can be further enhanced with additional guidance, such as text prompts and images, for guided generation. The control signal $c$ is injected into the denoiser $\epsilon_\theta(x_t, t, c)$ through training [100] or optimization [34].
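The closed form on the right of Eq. (1) can be illustrated with a short NumPy sketch. The linear $\beta$ schedule, its endpoints, and the function names `make_alpha_bar` and `q_sample` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return np.cumprod(alphas)  # alpha_bar[t - 1] corresponds to step t

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 using the closed form of Eq. (1)."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 32, 32))          # stand-in for a clean latent
x_t, eps = q_sample(x0, 500, alpha_bar, rng)   # intermediate noise level
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)    # almost pure Gaussian noise
```

Under this assumed schedule $\bar{\alpha}_T$ is close to zero, which is why $x_T$ carries almost no information about $x_0$.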

Video Token Merging.

Video Token Merging (VidToMe) [41] enhances the temporal consistency of generated videos by utilizing image diffusion models to merge similar tokens within frame chunks in the attention blocks. This token merging not only improves temporal coherence but also reduces the computational overhead of the attention mechanism by decreasing the size of token chunks.

Given a token chunk $\mathbf{T} \in \mathbb{R}^{B \times A \times C}$ (for $B$ frames with $A$ tokens of dimension $C$ each), where $A = w \cdot h$, the algorithm first separates the tokens into source tokens $\mathbf{T}_{\text{src}} \in \mathbb{R}^{(B-1) \times A \times C}$ and target tokens $\mathbf{T}_{\text{tar}} \in \mathbb{R}^{1 \times A \times C}$. It then computes the cosine similarity between each source and target token, denoted $score \in \mathbb{R}^{((B-1) \cdot A) \times A}$. The algorithm then identifies the most similar target token for each source token by taking the maximum along the last dimension:

$$s(\mathbf{T}_{\text{src}}, \mathbf{T}_{\text{tar}}) = \frac{\mathbf{T}_{\text{src}} \cdot \mathbf{T}_{\text{tar}}}{\left\|\mathbf{T}_{\text{src}}\right\| \left\|\mathbf{T}_{\text{tar}}\right\|}, \quad c = \max_{\mathbf{t} \in \mathbf{T}_{\text{tar}}} s(\mathbf{T}_{\text{src}}, \mathbf{t}), \tag{2}$$

where $s(\cdot, \cdot)$ is the cosine similarity score and $c$ indicates the correspondences. Next, the $r$ most similar source-target token pairs are merged, and the remaining tokens are concatenated as the output. Merged tokens are subsequently unmerged after self-attention to preserve the original shape by simply assigning the merged source-target tokens the exact same value. The token merging and unmerging are defined as follows:

$$\mathbf{T}_{\text{merge}} = \mathcal{M}(\mathbf{T}_{\text{src}}, \mathbf{T}_{\text{tar}}, c, r), \quad \mathbf{T}_{\text{unmerge}} = \mathcal{U}(\mathbf{T}_{\text{merge}}, c), \tag{3}$$

where $\mathcal{M}$ and $\mathcal{U}$ denote the merging and unmerging operations, respectively.
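The matching, merging, and unmerging steps of Eqs. (2) and (3) can be sketched in NumPy as follows. This is a simplified illustration, not the VidToMe implementation: the function names are hypothetical, source tokens are flattened across frames, and merging by averaging is an assumption.

```python
import numpy as np

def match_tokens(src, tar):
    """Cosine-similarity correspondences (Eq. 2): src is (Ns, C), tar is (Nt, C)."""
    src_n = src / np.linalg.norm(src, axis=-1, keepdims=True)
    tar_n = tar / np.linalg.norm(tar, axis=-1, keepdims=True)
    score = src_n @ tar_n.T              # (Ns, Nt) similarity matrix
    c = score.argmax(axis=-1)            # best target index for each source token
    best = score.max(axis=-1)            # similarity of that best match
    return c, best

def merge(src, tar, c, best, r):
    """Merge the r most similar source tokens into their targets (averaging is assumed)."""
    order = np.argsort(-best)
    merged_idx, kept_idx = order[:r], order[r:]
    tar_sum = tar.copy()
    count = np.ones(len(tar))
    np.add.at(tar_sum, c[merged_idx], src[merged_idx])  # accumulate merged sources
    np.add.at(count, c[merged_idx], 1.0)
    tar_merged = tar_sum / count[:, None]
    out = np.concatenate([src[kept_idx], tar_merged], axis=0)
    return out, merged_idx, kept_idx

def unmerge(out, c, merged_idx, kept_idx, n_src):
    """Restore the original token count: merged sources copy their target's value."""
    n_kept = len(kept_idx)
    src_kept, tar = out[:n_kept], out[n_kept:]
    src = np.empty((n_src, out.shape[1]), dtype=out.dtype)
    src[kept_idx] = src_kept
    src[merged_idx] = tar[c[merged_idx]]
    return src, tar

rng = np.random.default_rng(0)
src = rng.standard_normal((32, 8))   # tokens of the source frames, flattened
tar = rng.standard_normal((16, 8))   # tokens of the target frame
c, best = match_tokens(src, tar)
out, merged_idx, kept_idx = merge(src, tar, c, best, r=10)
src_rec, tar_rec = unmerge(out, c, merged_idx, kept_idx, n_src=len(src))
```

In this sketch self-attention would run on `out`, which has $r$ fewer tokens than the original chunk, before `unmerge` restores the shape.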

3.2 Hierarchical Latent Warping

We introduce a hierarchical latent warping module that operates in the latent space. As illustrated in Fig. 4, this module provides rough shape guidance on both global and local scales by hierarchically propagating latents between keyframes and then from keyframes to their respective batches. Let $\hat{x}^{i}_{t \rightarrow 0}$ be the predicted $\hat{x}_0$ latent for the $i^{th}$ keyframe at denoising step $t$. We warp the latents between keyframes as follows:

$$\hat{x}^{i}_{t \rightarrow 0} \leftarrow M_{ji} \cdot \hat{x}^{i}_{t \rightarrow 0} + \left(1 - M_{ji}\right) \cdot \mathcal{W}(\hat{x}^{j}_{t \rightarrow 0}, f_{ji}), \tag{4}$$

where $j = i - 1$, and $f_{ji}$ and $M_{ji}$ denote the optical flow and the occlusion mask from $lq_j$ to $lq_i$ estimated by GMFlow [87]. After keyframe latent warping, these latents are further warped to the remaining frames following the same procedure. The flows and masks are downsampled to match the latent size; these operations are omitted for simplicity. This module primarily functions in the early stages of the denoising process, ensuring that corresponding points between frames share similar latents both globally and locally from the beginning.
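The blending rule of Eq. (4) can be sketched as below. The nearest-neighbour warp, the convention that the flow gives each pixel's displacement to its sampling location in the previous latent, and the function names are all assumptions for illustration; the text does not specify the interpolation or sampling convention.

```python
import numpy as np

def warp_nearest(latent, flow):
    """Warp `latent` (C, H, W) by sampling at (x + dx, y + dy), nearest neighbour.

    `flow` is (2, H, W) with flow[0] = dx and flow[1] = dy (assumed convention).
    """
    _, H, W = latent.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, H - 1)
    return latent[:, src_y, src_x]

def blend_with_previous(x_i, x_j, flow, mask):
    """Eq. (4): keep x_i where mask == 1, use the warped x_j where mask == 0."""
    return mask * x_i + (1.0 - mask) * warp_nearest(x_j, flow)

rng = np.random.default_rng(0)
x_prev = rng.standard_normal((4, 8, 8))    # predicted latent of keyframe j
x_cur = rng.standard_normal((4, 8, 8))     # predicted latent of keyframe i
zero_flow = np.zeros((2, 8, 8))
mask = np.zeros((1, 8, 8))
mask[:, :, :4] = 1.0                       # left half treated as occluded
x_new = blend_with_previous(x_cur, x_prev, zero_flow, mask)

shift = np.zeros((2, 8, 8))
shift[0] = 1.0                             # sample one pixel to the right
x_shifted = warp_nearest(x_prev, shift)
```

The same routine would be reused to propagate a keyframe latent to the other frames of its batch, which gives the hierarchy described above.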

3.3 Hybrid Flow-guided Spatial-aware Token Merging

While latent manipulation can achieve a certain degree of consistency, manipulating latents during the later stages of the denoising process would result in blurry outcomes. Additionally, the token space is highly semantically related to the image. Therefore, we propose hybrid flow-guided spatial-aware token merging to achieve consistency in the token space.

Flow-guided.

Even with low-quality video, we can identify correspondences between frames based on color, indicating that flow calculated from the $lq$ frames can still provide useful guidance. As shown in the top of Fig. 5, in the early stages of the denoising process, when the latents are still very noisy, cosine similarity struggles to find the correct correspondences, especially in the downsample blocks of the UNet. Therefore, we use flow for correspondences at the downsample blocks in the UNet and employ the confidence from a forward-backward consistency check as the criterion for selecting the $r$ most similar pairs of source tokens $\mathbf{T}_{\text{src}}$ and target tokens $\mathbf{T}_{\text{tar}}$:

$$\sigma = \exp\left(-\left\| f_{\text{src}\rightarrow\text{tar}}(X(\mathbf{T}_{\text{src}})) + f_{\text{tar}\rightarrow\text{src}}\left(X(\mathbf{T}_{\text{src}}) + f_{\text{src}\rightarrow\text{tar}}(X(\mathbf{T}_{\text{src}}))\right) \right\|_2^2\right), \tag{5}$$

where $\sigma$ is the confidence, $X(\mathbf{T}_{\text{src}})$ is the spatial location of $\mathbf{T}_{\text{src}}$, and $f_{\text{src}\rightarrow\text{tar}}$ and $f_{\text{tar}\rightarrow\text{src}}$ denote the forward and backward flow between $\mathbf{T}_{\text{src}}$ and $\mathbf{T}_{\text{tar}}$. Thus, the proposed flow-guided token merging is written as:

$$\mathbf{T}_{\text{merge}} = \mathcal{M}(\mathbf{T}_{\text{src}}, \mathbf{T}_{\text{tar}}, f_{\text{src}\rightarrow\text{tar}}, \sigma, r). \tag{6}$$

Fig. 4 provides a clearer illustration of our proposed component. Additionally, as shown at the bottom of Fig. 5, flow and cosine similarity identify different correspondences, so a hybrid approach can provide comprehensive guidance, leading to improved temporal consistency and overall video quality.
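The forward-backward consistency confidence of Eq. (5) can be sketched on a token grid as follows. Rounding the forward-flowed location to the nearest grid cell, clipping at the border, and the function names are assumptions for illustration.

```python
import numpy as np

def fb_confidence(flow_fwd, flow_bwd):
    """Confidence sigma (H, W) from Eq. (5).

    flow_fwd and flow_bwd are (2, H, W) arrays holding (dx, dy) per token.
    The backward flow is read at the location each source token lands on.
    """
    _, H, W = flow_fwd.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    tx = np.clip(np.rint(xs + flow_fwd[0]).astype(int), 0, W - 1)
    ty = np.clip(np.rint(ys + flow_fwd[1]).astype(int), 0, H - 1)
    bwd_at_target = flow_bwd[:, ty, tx]
    residual = flow_fwd + bwd_at_target      # zero when the two flows agree
    return np.exp(-np.sum(residual ** 2, axis=0))

def top_r_pairs(sigma, r):
    """Flat indices of the r source tokens with the highest confidence."""
    return np.argsort(-sigma.ravel())[:r]

fwd = np.zeros((2, 4, 6))
fwd[0] = 1.0                                 # every token moves one cell right
bwd = -fwd                                   # perfectly consistent backward flow
sigma = fb_confidence(fwd, bwd)
sigma_bad = fb_confidence(fwd, np.zeros_like(fwd))  # inconsistent backward flow
selected = top_r_pairs(sigma, 5)
```

A token whose forward and backward flows cancel gets $\sigma = 1$, while any round-trip error lowers $\sigma$ exponentially, so ranking by $\sigma$ favours reliable correspondences.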

Spatial-awareness and Padding Removal.

Finding correspondences with cosine similarity alone can easily lead to mismatches in regions with uniform textures, especially video backgrounds (e.g., sky, sand, grass), as shown at the bottom of Fig. 5, resulting in blurrier outcomes. Given that corresponding points in adjacent frames are typically spatially close in videos, leveraging spatial information is crucial for accurate correspondence. We can effectively utilize this information by weighting the cosine similarity scores with the tokens' spatial distances. The weighted scores are defined as:

$$s_{ij}^{\prime} = s_{ij} \cdot e^{-\tau}, \quad \text{with} \quad \tau = \left\lfloor \frac{\|X(i) - X(j)\|_2^2}{R} \right\rfloor, \tag{7}$$

where $X(i)$ and $X(j)$ are the spatial locations of the $i^{th}$ source token and the $j^{th}$ target token, and $R$ is a hyperparameter defining the radius of the region with uniform weight.
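The spatial weighting of Eq. (7) can be sketched as follows. The grid size, the value of $R$, and the function names are assumptions chosen only to make the example concrete.

```python
import numpy as np

def grid_positions(h, w):
    """(h*w, 2) array of (row, col) coordinates for a token grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)

def spatial_weighted_scores(score, pos_src, pos_tar, R):
    """Eq. (7): down-weight similarity scores by the floored squared distance over R."""
    d2 = np.sum((pos_src[:, None, :] - pos_tar[None, :, :]) ** 2, axis=-1)
    tau = np.floor(d2 / R)
    return score * np.exp(-tau)

pos = grid_positions(4, 4)
score = np.ones((16, 16))        # uniform texture: every pair looks equally similar
weighted = spatial_weighted_scores(score, pos, pos, R=4.0)
```

With uniform scores the unweighted maximum is ambiguous, whereas the weighted scores stay at their original value only for targets within the radius and decay for distant ones (for example, the pair of opposite corners in this 4×4 grid is scaled by $e^{-4}$).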

Images are often padded to propagate through the UNet, and we find that this padding can significantly impact the correspondences found in the tokens. While it is acceptable for padding to correspond to padding in another token, when the token feature dimension is low, cosine similarity may mistakenly identify padding as corresponding to the actual image content. This issue persists until the later stages of the denoising process. Please refer to the appendix for the non-padding correspondence figure. To address this, we remove padding before merging and add it back when unmerging.

Merging Ratio Annealing.

For restoration tasks that demand fine details, maintaining a high merging ratio in the later stages of the denoising process can result in blurred and unrealistic outcomes. To address this, we employ ratio annealing to gradually reduce the merging ratio, preserving detail and realism in the restored video. The merging ratio of the $i^{th}$ denoising step is computed as:

$$r_i = r \cdot \cos\left(\frac{\pi}{2} \cdot \max\left(\min\left(\delta \cdot \frac{i - i_{\text{beg}}}{i_{\text{end}} - i_{\text{beg}}},\, 1\right),\, 0\right)\right), \tag{8}$$

where $i_{\text{beg}}$ and $i_{\text{end}}$ are predefined steps indicating the beginning and end of the merging process, and $\delta$ is a hyperparameter controlling the annealing speed.
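Eq. (8) translates directly into a few lines of Python. The function name and the example values of $r$, $i_{\text{beg}}$, $i_{\text{end}}$, and $\delta$ are assumptions, as is the convention that the step index $i$ increases as denoising proceeds.

```python
import math

def merging_ratio(i, r, i_beg, i_end, delta=1.0):
    """Annealed merging ratio r_i from Eq. (8)."""
    progress = delta * (i - i_beg) / (i_end - i_beg)
    progress = max(min(progress, 1.0), 0.0)   # clamp to [0, 1]
    return r * math.cos(0.5 * math.pi * progress)

# Example schedule with an assumed base ratio of 0.6 over steps 0..10.
schedule = [merging_ratio(i, r=0.6, i_beg=0, i_end=10) for i in range(11)]
```

The ratio starts at $r$, follows a quarter cosine, and reaches zero at $i_{\text{end}}$; a value $\delta > 1$ makes it reach zero earlier.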

3.4 Scheduling

As depicted in Fig. 3, at the initial stage of the diffusion denoising process, hierarchical latent warping offers rough shape guidance on a global scale by warping latents between keyframes and on a local scale by propagating these latents within the batch. During the majority of the denoising process, tokens are processed with our hybrid spatial-aware token merging before entering the attention layer. This component further improves temporal consistency by matching similar tokens, utilizing both flow and spatial information.

Table 1: Quantitative comparisons. (Left) 4× and 8× video super-resolution on the SPMCS [75] and DAVIS [60] datasets. (Right) Video denoising at various noise levels on the REDS30 [55] dataset. The best and second-best results are marked in red and blue, respectively. $E_{\text{warp}}^{*}$ denotes $E_{\text{warp}}$ ($\times 10^{-3}$); $E_{\text{inter}}$ and LPIPS$_{\text{inter}}$ denote the interpolation error and interpolation LPIPS, respectively.

(Left) Video super-resolution.

| Dataset | Metrics | VidToMe | FMA-Net | SD ×4 (Frame) | SD ×4 (Ours) | DiffBIR (Frame) | DiffBIR (Ours) |
|---|---|---|---|---|---|---|---|
| SPMCS | PSNR ↑ | 20.516 | 21.910 | 20.573 | 20.636 | 21.534 | 21.843 |
| | SSIM ↑ | 0.471 | 0.617 | 0.490 | 0.517 | 0.544 | 0.572 |
| | LPIPS ↓ | 0.352 | 0.230 | 0.298 | 0.286 | 0.261 | 0.258 |
| | $E_{\text{warp}}^{*}$ ↓ | 0.531 | 0.157 | 1.058 | 8.290 | 0.571 | 0.571 |
| | $E_{\text{inter}}$ ↓ | 10.102 | 3.271 | 13.817 | 11.961 | 10.257 | 9.712 |
| | LPIPS$_{\text{inter}}$ ↓ | 0.218 | 0.015 | 0.241 | 0.226 | 0.177 | 0.158 |
| DAVIS ×4 | PSNR ↑ | 23.948 | 25.215 | 23.504 | 23.843 | 23.780 | 24.182 |
| | SSIM ↑ | 0.608 | 0.727 | 0.584 | 0.618 | 0.601 | 0.621 |
| | LPIPS ↓ | 0.298 | 0.347 | 0.277 | 0.272 | 0.264 | 0.262 |
| | $E_{\text{warp}}^{*}$ ↓ | 0.512 | 0.186 | 0.912 | 0.745 | 0.654 | 0.474 |
| | $E_{\text{inter}}$ ↓ | 14.615 | 11.558 | 18.125 | 17.431 | 16.529 | 14.666 |
| | LPIPS$_{\text{inter}}$ ↓ | 0.278 | 0.078 | 0.292 | 0.274 | 0.266 | 0.232 |
| DAVIS ×8 | PSNR ↑ | 22.570 | 22.690 | 20.268 | 20.519 | 21.964 | 22.331 |
| | SSIM ↑ | 0.527 | 0.594 | 0.446 | 0.424 | 0.502 | 0.519 |
| | LPIPS ↓ | 0.454 | 0.528 | 0.470 | 0.434 | 0.362 | 0.367 |
| | $E_{\text{warp}}^{*}$ ↓ | 0.523 | 0.351 | 2.199 | 1.759 | 0.964 | 0.699 |
| | $E_{\text{inter}}$ ↓ | 14.117 | 13.978 | 24.496 | 21.746 | 17.981 | 15.853 |
| | LPIPS$_{\text{inter}}$ ↓ | 0.379 | 0.132 | 0.457 | 0.442 | 0.372 | 0.333 |

(Right) Video denoising.

| $\sigma$ | Metrics | VRT | Shift-Net | VidToMe | DiffBIR (Frame) | DiffBIR (Ours) |
|---|---|---|---|---|---|---|
| 75 | PSNR ↑ | 25.050 | 21.033 | 23.791 | 24.585 | 24.520 |
| | SSIM ↑ | 0.787 | 0.381 | 0.618 | 0.649 | 0.649 |
| | LPIPS ↓ | 0.275 | 0.735 | 0.296 | 0.276 | 0.275 |
| | $E_{\text{warp}}^{*}$ ↓ | 0.314 | 1.757 | 0.765 | 0.751 | 0.706 |
| | $E_{\text{inter}}$ ↓ | 17.825 | 27.094 | 21.751 | 21.798 | 21.166 |
| | LPIPS$_{\text{inter}}$ ↓ | 0.095 | 0.501 | 0.287 | 0.275 | 0.264 |
| 100 | PSNR ↑ | 24.582 | 22.573 | 24.606 | 24.524 | 24.534 |
| | SSIM ↑ | 0.744 | 0.484 | 0.676 | 0.648 | 0.652 |
| | LPIPS ↓ | 0.346 | 0.518 | 0.318 | 0.275 | 0.271 |
| | $E_{\text{warp}}^{*}$ ↓ | 0.294 | 1.126 | 0.781 | 0.763 | 0.696 |
| | $E_{\text{inter}}$ ↓ | 17.079 | 23.424 | 21.460 | 21.835 | 20.639 |
| | LPIPS$_{\text{inter}}$ ↓ | 0.095 | 0.375 | 0.278 | 0.281 | 0.267 |
| random | PSNR ↑ | 24.989 | 21.113 | 23.692 | 24.579 | 24.508 |
| | SSIM ↑ | 0.780 | 0.386 | 0.615 | 0.650 | 0.649 |
| | LPIPS ↓ | 0.284 | 0.728 | 0.303 | 0.276 | 0.270 |
| | $E_{\text{warp}}^{*}$ ↓ | 0.363 | 1.896 | 0.772 | 0.755 | 0.713 |
| | $E_{\text{inter}}$ ↓ | 18.147 | 27.565 | 21.929 | 21.743 | 21.140 |
| | LPIPS$_{\text{inter}}$ ↓ | 0.099 | 0.542 | 0.291 | 0.282 | 0.272 |

Table 2: Ablation studies for 8× VSR on the DAVIS [59] test sets. (Left) Different correspondence matching methods. (Right) The proposed components applied at different stages of the denoising process. We apply our two proposed components, hierarchical latent warping (HLW) and hybrid spatial-aware token merging (HS-ToMe), at the early, mid, and late stages of the denoising process.

(Left) Correspondence matching.

| Down blocks | Up blocks | Spatial-aware | LPIPS ↓ | $E_{\text{warp}}^{*}$ ↓ | LPIPS$_{\text{inter}}$ ↓ |
|---|---|---|---|---|---|
| Flow | Flow | | 0.518 | 1.214 | 0.563 |
| Cos | Cos | | 0.390 | 0.736 | 0.350 |
| Cos | Flow | | 0.507 | 1.049 | 0.545 |
| Flow | Cos | | 0.375 | 0.677 | 0.347 |
| Flow | Cos | ✓ | 0.367 | 0.699 | 0.333 |

(Right) Applied stages.

| HLW (Sec. 3.2): Early | HLW: Mid | HLW: Late | HS-ToMe (Sec. 3.3): Early | HS-ToMe: Mid | HS-ToMe: Late | LPIPS ↓ | $E_{\text{warp}}^{*}$ ↓ | LPIPS$_{\text{inter}}$ ↓ |
|---|---|---|---|---|---|---|---|---|
| | | | | | | 0.362 | 0.964 | 0.372 |
| | | | | | | 0.368 | 0.887 | 0.369 |
| | | | | | | 0.43 | 0.804 | 0.383 |
| | | | | | | 0.411 | 0.704 | 0.339 |
| | | | | | | 0.367 | 0.699 | 0.333 |

Figure 6: Comparison of temporal profile. We examine a row of pixels and track changes over time. The profiles from Flow + Flow and Cosine + Flow methods exhibit noise, indicating flickering artifacts. The Cosine + Cosine method shows smoother profiles but contains some discontinuities. Flow + Cosine demonstrates improved consistency but retains some distortions. Utilizing flow, cosine, and spatial-aware techniques, our method achieves the most seamless and consistent transitions, effectively minimizing artifacts.
Figure 7: Qualitative comparisons on 8× video super-resolution. As shown in the first row, the low-quality input lacks almost all details. In the zoomed-in patches, our method produces clearer and more consistent results.
Figure 8: Qualitative comparisons on video denoising on the REDS30 [54] dataset for $\sigma = 100$. Our method effectively denoises and generates detailed results while maintaining temporal coherence.

4 Experiments

Testing Dataset.

For video super-resolution, we evaluate on the SPMCS [75] and DAVIS [59] test sets at two downsampling scales (×4 and ×8), following the degradation pipeline of RealBasicVSR [10] to generate LQ-HQ video pairs. For video denoising, we evaluate on REDS30 [54] under three noise settings: standard deviation $\sigma = 75$, $\sigma = 100$, and $\sigma$ sampled uniformly from the range [50, 100].
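
The three denoising settings can be sketched as follows. This is a toy illustration under stated assumptions: the noise is taken to be additive zero-mean Gaussian on a 0-255 pixel scale, one $\sigma$ is drawn per clip in the random setting, and noisy values are clipped to the valid range. The helper names are hypothetical and none of these details are confirmed by the text.

```python
import random


def sample_sigma(setting, rng):
    """Return the noise std for one clip under a given setting (0-255 scale assumed)."""
    if setting == "random":
        # Random setting: std drawn uniformly from [50, 100].
        return rng.uniform(50.0, 100.0)
    return float(setting)  # fixed level, e.g. 75 or 100


def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise to a flat list of pixel values, clipped to [0, 255]."""
    return [min(max(p + rng.gauss(0.0, sigma), 0.0), 255.0) for p in pixels]


rng = random.Random(0)
clean = [128.0] * 8  # toy stand-in for a frame
for setting in (75, 100, "random"):
    sigma = sample_sigma(setting, rng)
    noisy = add_gaussian_noise(clean, sigma, rng)
    print(setting, round(sigma, 1), [round(v) for v in noisy[:3]])
```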

Evaluation Metrics.

We evaluate restoration performance in two respects: (1) image quality, using LPIPS, SSIM, and PSNR; and (2) temporal consistency, using the warping error $E_{\text{warp}}$, the interpolation error, and interpolation LPIPS. Since LPIPS better reflects visual quality, we propose interpolation LPIPS, based on the interpolation error used in a previous study [41], to more accurately measure video continuity from a visual perspective. This involves interpolating a target frame from its previous and next frames and computing the LPIPS between the estimated and target frames.
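
The structure of the interpolation-based metric can be sketched as below. The frame interpolator and the perceptual distance are passed in as callables because the text does not specify which interpolation method is used. The neighbour-averaging interpolator and the mean-absolute-error distance in this example are simple stand-ins (assumptions) so the sketch runs on its own; in practice the distance would be an LPIPS model.

```python
def mean_interpolation_score(frames, interpolate, distance):
    """Average distance between each interior frame and an estimate from its neighbours."""
    if len(frames) < 3:
        raise ValueError("need at least three frames")
    scores = []
    for t in range(1, len(frames) - 1):
        # Estimate the target frame from its previous and next frames.
        estimate = interpolate(frames[t - 1], frames[t + 1])
        scores.append(distance(estimate, frames[t]))
    return sum(scores) / len(scores)


# Stand-ins so the sketch runs without external models (both are assumptions).
def average_neighbours(prev, nxt):
    return [(a + b) / 2.0 for a, b in zip(prev, nxt)]


def mean_abs_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy "videos" with three pixels per frame: a smooth ramp and a one-frame flicker.
smooth = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.3, 0.4, 0.5]]
flicker = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
print(mean_interpolation_score(smooth, average_neighbours, mean_abs_error))
print(mean_interpolation_score(flicker, average_neighbours, mean_abs_error))
```

A temporally smooth sequence scores near zero, while a sequence that flickers between frames scores high, which is the behaviour the metric is meant to capture.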

Implementation Details.

The experiments are conducted on an NVIDIA RTX 4090 GPU. We apply our method to DiffBIR [45] and the SD ×4 upscaler [1], both image-based diffusion models, to demonstrate the proposed method's compatibility with different models. Note that for models restricted to a 4× super-resolution scale, we apply the model twice and then use bicubic downsampling to obtain 8× results.
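
The size arithmetic behind the two-pass 8× procedure can be sketched as follows: two 4× passes give 16× overall, so a 2× bicubic downsample brings the result back to 8×. The function name and the 60×107 input size are hypothetical, chosen only to make the numbers concrete.

```python
def two_pass_8x_sizes(height, width, model_scale=4, target_scale=8):
    """Frame sizes when a fixed-scale SR model is applied twice, then downsampled."""
    after_first = (height * model_scale, width * model_scale)
    after_second = (after_first[0] * model_scale, after_first[1] * model_scale)
    # Two 4x passes give 16x overall, so a 2x downsample lands on the 8x target.
    downsample_factor = model_scale * model_scale // target_scale
    final = (after_second[0] // downsample_factor, after_second[1] // downsample_factor)
    return after_first, after_second, downsample_factor, final


# Hypothetical 60x107 low-quality frame.
print(two_pass_8x_sizes(60, 107))
# -> ((240, 428), (960, 1712), 2, (480, 856))
```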

4.1 Comparisons with State-of-the-Art Methods

To verify the effectiveness of our approach, we compare it with several state-of-the-art methods, including FMA-Net [96] for video super-resolution, and VRT [43] and Shift-Net [39] for video denoising. We also compare against per-frame restoration and against VidToMe [41], a zero-shot video editing method, applied to the same base model as ours.

Video Super-resolution.

As shown in Tab. 1, regression-based methods like FMA-Net [96] perform better on datasets like SPMCS that have minimal motion. However, their generalization ability diminishes significantly with increased motion or severe degradation. VidToMe [41] can generate highly consistent results, but they are often very blurry, leading to poor visual quality. In contrast, our method enhances temporal consistency while maintaining the generation quality of the original diffusion model, making it the most competitive approach. Fig. 7 provides visualizations of two challenging VSR cases. FMA-Net fails to produce sharp results due to domain gaps between training and testing. The diffusion-based image restoration methods DiffBIR [45] and the SD ×4 upscaler [1] generate sharp, detailed results, but per-frame processing makes the output video temporally inconsistent, with jitter across frames. By contrast, our zero-shot video restoration framework restores a low-quality input video into a temporally consistent, high-quality video.

Video Denoising.

Video denoising, compared to VSR, is a simpler task for regression models, as they can often find the correct pixel value given a sufficiently large batch size. However, our method consistently outperforms others in terms of visual quality (LPIPS) and remains highly robust even as degradation becomes severe. Fig. 8 visualizes the denoising results on the REDS30 dataset. Shift-Net [39] fails to remove all noise, likely due to the out-of-domain noise level; VRT [43] produces smooth results but lacks fine details. Although DiffBIR [45] generates highly detailed images, it suffers from poor temporal consistency, as evident in the changes to the pedestrian’s head and the statue’s face. In contrast, our method preserves both fine details and temporal consistency, effectively balancing these two aspects.

4.2 Ablation Study

Ways of Identifying Correspondence.

Tab. 2 presents an ablation study comparing different approaches (optical flow and cosine similarity) for finding correspondences and their order in the UNet. As detailed in Sec. 3.3, the hybrid approach of using optical flow at the downsample blocks and cosine similarity at the upsample blocks achieves the best performance. Additionally, our proposed spatial-aware token merging further enhances performance by utilizing spatial information to guide correspondences. The comparisons in Fig. 6 also indicate that our results are smoother, demonstrating better temporal stability.

Applied Stages in the Denoising Process.

Tab. 2 presents an ablation study evaluating the application of our two proposed components, hierarchical latent warping (HLW, Sec. 3.2) and hybrid spatial-aware token merging (HS-ToMe, Sec. 3.3), at the early, mid, and late stages of the denoising process. The results indicate that applying latent warping in the mid or late stages can significantly degrade the generated outcomes. Furthermore, ensuring consistency in the token space is crucial for achieving coherent and high-quality results.

5 Conclusion

We introduce a novel zero-shot video restoration framework utilizing pre-trained image-based diffusion models, eliminating the need for extensive retraining. Our approach integrates hierarchical latent warping and hybrid flow-guided, spatial-aware token merging, significantly enhancing temporal consistency and video quality under various degradation conditions. Experimental results demonstrate that our framework surpasses existing methods both in quality and consistency.

Limitations.

Our zero-shot video restoration framework has some limitations. Random keyframe sampling may not always select the most representative frames, potentially affecting restoration quality, especially if frames with severe degradation are chosen. Additionally, the sensitivity of the LDM decoder to minor variations in input latents can cause flickering, particularly in dynamic scenes. Future improvements will focus on refining keyframe selection and stabilizing the decoder output to enhance the practical application of diffusion-based video restoration methods.

References

  • sdx [2023] Stable diffusion x4 upscaler, 2023. URL https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler.
  • Albawi et al. [2017] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. Understanding of a convolutional neural network. In 2017 international conference on engineering and technology (ICET), pages 1–6. IEEE, 2017.
  • Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, June 2022.
  • Bolya et al. [2023] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023.
  • Cao et al. [2021] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
  • Ceylan et al. [2023] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
  • Chan et al. [2021a] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2021a.
  • Chan et al. [2021b] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Understanding deformable alignment in video super-resolution. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 973–981, 2021b.
  • Chan et al. [2021c] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. 2021c.
  • Chan et al. [2022] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
  • Chen et al. [2022] Chaofeng Chen, Xinyu Shi, Yipeng Qin, Xiaoming Li, Xiaoguang Han, Tao Yang, and Shihui Guo. Real-world blind super-resolution via feature matching with implicit high-resolution priors. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1329–1338, 2022.
  • Chen et al. [2024] Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, and Tao Mei. Learning spatial adaptation and temporal coherence in diffusion models for video super-resolution. arXiv preprint arXiv:2403.17000, 2024.
  • Choi et al. [2021] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • Chu et al. [2024] Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, and Jun-Cheng Chen. Medm: Mediating image diffusion models for video-to-video translation with temporal correspondence guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
  • Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems, 35:25683–25696, 2022.
  • Cong et al. [2023] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
  • Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017. doi: 10.1109/ICCV.2017.89.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
  • Fei et al. [2023] Ben Fei, Zhaoyang Lyu, Liang Pan, Junzhe Zhang, Weidong Yang, Tianyue Luo, Bo Zhang, and Bo Dai. Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9935–9946, 2023.
  • Gal et al. [2023] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.
  • Geyer et al. [2023] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
  • Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. In International Conference on Learning Representations, 2023.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
  • Hu et al. [2023] Yaosi Hu, Zhenzhong Chen, and Chong Luo. Lamd: Latent motion diffusion for video generation. arXiv preprint arXiv:2304.11603, 2023.
  • Huang et al. [2022] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. In European conference on computer vision, pages 668–685. Springer, 2022.
  • Isobe et al. [2020] Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network, 2020.
  • Kalchbrenner et al. [2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
  • Kara et al. [2024] Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, and Pinar Yanardag. Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • Kawar et al. [2022] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
  • Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. 2024.
  • Kim et al. [2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
  • Kim et al. [2017] Tae Hyun Kim, Seungjun Nah, and Kyoung Mu Lee. Dynamic video deblurring using a locally adaptive blur model. IEEE transactions on pattern analysis and machine intelligence, 40(10):2374–2387, 2017.
  • Kong et al. [2023] Lingshun Kong, Jiangxin Dong, Jianjun Ge, Mingqiang Li, and Jinshan Pan. Efficient frequency domain-based transformers for high-quality image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5886–5895, 2023.
  • Li et al. [2023] Dasong Li, Xiaoyu Shi, Yi Zhang, Ka Chun Cheung, Simon See, Xiaogang Wang, Hongwei Qin, and Hongsheng Li. A simple baseline for video restoration with grouped spatial-temporal shift. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9822–9832, June 2023.
  • Li et al. [2020] Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. Mucan: Multi-correspondence aggregation network for video super-resolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 335–351. Springer, 2020.
  • Li et al. [2024] Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. Vidtome: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.
  • Liang et al. [2022a] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer. arXiv preprint arXiv:2201.12288, 2022a.
  • Liang et al. [2022b] Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, and Luc V Gool. Recurrent video restoration transformer with guided deformable attention. Advances in Neural Information Processing Systems, 35:378–393, 2022b.
  • Lin et al. [2024] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior, 2024.
  • Liu and Sun [2013] Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013.
  • Liu et al. [2019] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. 2019.
  • Liu et al. [2021a] Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Hybrid neural fusion for full-frame video stabilization. 2021a.
  • Liu et al. [2021b] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021b.
  • Lu et al. [2023] Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: An empirical study on video diffusion with transformers. arXiv preprint arXiv:2305.13311, 2023.
  • Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
  • Mei and Patel [2023] Kangfu Mei and Vishal Patel. Vidm: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9117–9125, 2023.
  • Mou et al. [2024] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024.
  • Nah et al. [2019a] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1996–2005, 2019a. doi: 10.1109/CVPRW.2019.00251.
  • Nah et al. [2019b] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019b.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • O’shea and Nash [2015] Keiron O’shea and Ryan Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.
  • Pan et al. [2020] Jinshan Pan, Haoran Bai, and Jinhui Tang. Cascaded deep video deblurring using temporal sharpness prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3043–3051, 2020.
  • Perazzi et al. [2016a] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 724–732, 2016a. doi: 10.1109/CVPR.2016.85.
  • Perazzi et al. [2016b] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016b.
  • Qi et al. [2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022a.
  • Saharia et al. [2022b] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726, 2022b.
  • Shi et al. [2023a] Xiaoyu Shi, Zhaoyang Huang, Weikang Bian, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Videoflow: Exploiting temporal cues for multi-frame optical flow estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12469–12480, 2023a.
  • Shi et al. [2023b] Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1599–1610, 2023b.
  • Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  • Song et al. [2022] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2022.
  • Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023.
  • Tao et al. [2017] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE international conference on computer vision, pages 4472–4480, 2017.
  • Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  • Tian et al. [2020] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3360–3369, 2020.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2023] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
  • Wang et al. [2019] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
  • Wang et al. [2020] Xintao Wang, Ke Yu, Kelvin C.K. Chan, Chao Dong, and Chen Change Loy. Basicsr. https://github.com/xinntao/BasicSR, 2020.
  • Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021.
  • Wang et al. [2022] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490, 2022.
  • Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  • Xia et al. [2023] Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, and Luc Van Gool. Diffir: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13095–13105, 2023.
  • Xie et al. [2023] Liangbin Xie, Xintao Wang, Shuwei Shi, Jinjin Gu, Chao Dong, and Ying Shan. Mitigating artifacts in real-world video super-resolution models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2956–2964, 2023.
  • Xu et al. [2022] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8121–8130, 2022.
  • Xue et al. [2019] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision (IJCV), 127(8):1106–1125, 2019.
  • Yang et al. [2024a] Peiqing Yang, Shangchen Zhou, Qingyi Tao, and Chen Change Loy. Pgdiff: Guiding diffusion models for versatile face restoration via partial guidance. Advances in Neural Information Processing Systems, 36, 2024a.
  • Yang et al. [2023a] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023a.
  • Yang et al. [2024b] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Fresco: Spatial-temporal correspondence for zero-shot video translation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024b.
  • Yang et al. [2023b] Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023b.
  • Yang et al. [2021] Xi Yang, Wangmeng Xiang, Hui Zeng, and Lei Zhang. Real-world video super-resolution: A benchmark dataset and a decomposition based learning scheme. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4781–4790, 2021.
  • Yang et al. [2023c] Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. arXiv preprint arXiv:2312.00853, 2023c.
  • Yi et al. [2019] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3106–3115, 2019.
  • Youk et al. [2024] Geunhyuk Youk, Jihyong Oh, and Munchurl Kim. Fma-net: Flow-guided dynamic filtering and iterative feature refinement with multi-attention for joint video super-resolution and deblurring. In CVPR, 2024.
  • Yue et al. [2024] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36, 2024.
  • Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739, 2022.
  • Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • Zhang et al. [2018] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018.
  • Zhou et al. [2023] Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. arXiv preprint arXiv:2312.06640, 2023.
  • Zhu et al. [2019] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9308–9316, 2019.

Appendix A Supplementary material

In this supplementary material, we first provide additional details on the testing datasets and evaluation metrics. Subsequently, we present more visual comparisons of various methods.

A.1 Correspondences identified by cosine similarity without padding removal

Fig. 9 shows that padding values severely affect the matching when correspondences are identified by cosine similarity without removing the padded regions.

Figure 9: The padded regions affect the matching severely.
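
As a toy illustration of this effect, the following sketch performs nearest-neighbor matching by cosine similarity, with an optional validity mask that skips padded destination tokens. It is a minimal sketch and not the paper's implementation: the function names (`cosine`, `match_tokens`), the `dst_valid` mask, and the two-dimensional feature values are all hypothetical, chosen only so that a padded token captures a match when it is not masked out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def match_tokens(src, dst, dst_valid=None):
    """For each source token, return the index of the most similar
    destination token. Destination tokens whose dst_valid entry is
    False (assumed here to mark padding) are skipped."""
    matches = []
    for s in src:
        best_j, best_sim = -1, -math.inf
        for j, d in enumerate(dst):
            if dst_valid is not None and not dst_valid[j]:
                continue
            sim = cosine(s, d)
            if sim > best_sim:
                best_j, best_sim = j, sim
        matches.append(best_j)
    return matches


# Hypothetical features: dst[2] stands in for a token from a padded region.
src = [[1.0, 0.0], [0.6, 0.8]]
dst = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

with_padding = match_tokens(src, dst)
without_padding = match_tokens(src, dst, dst_valid=[True, True, False])
print(with_padding)     # [0, 2]: the second source token matches the padded token
print(without_padding)  # [0, 0]: masking padding restores a match to real content
```

In this toy case the padded token wins the match only because its made-up feature happens to align with a source token; the point is that excluding padded positions before the similarity search prevents such matches regardless of what features the padding produces.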

A.2 Additional Application: Consistent Video Depth

Our zero-shot framework is applicable to any pre-trained image-based diffusion model and can improve the temporal consistency of its per-frame predictions on video. To demonstrate this, we integrate the framework into a state-of-the-art latent diffusion-based monocular depth estimator, Marigold [35]. Fig. 10 shows that this integration improves the temporal consistency of video depth estimation.

[Figure 10 image grid: three examples, each shown as rows labeled Input, Marigold [35], and Ours, with five frames per row.]
Figure 10: Integrating our proposed framework into Marigold [35] helps improve the temporal consistency of video depth estimation.