DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Chang-Han Yeh¹ Chin-Yang Lin¹ Zhixiang Wang²
Chi-Wei Hsiao³ Ting-Hsuan Chen¹ Yu-Lun Liu¹
¹National Yang Ming Chiao Tung University ²University of Tokyo ³MediaTek Inc

Abstract

This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often need retraining for different settings and struggle with limited generalization across various degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8 $\times$ super-resolution and high-standard deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research leads to more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output. See our project page for video results: jimmycv07.github.io/DiffIR2VR_web.

1 Introduction

Video restoration transforms low-quality video into high-quality video and typically covers video denoising, super-resolution, and deblurring. State-of-the-art methods that employ convolutional neural networks (CNNs) [2, 32, 57] or transformers [19, 49, 78] trained on large-scale data are highly effective. However, these regression-based methods often produce blurry outputs without realistic details (Fig. 2(a)). Furthermore, the degradations they address are typically well-defined (e.g., bicubic downsampling, a given noise standard deviation), and models are often tailored to specific degradations only. This restricts their generalization capabilities, as different settings often require additional paired data and retraining of the model.

Diffusion models have recently been adapted to image restoration [85, 45]. Because of their powerful generative ability, they can hallucinate realistic details. But this ability inherently comes with high randomness. As a result, directly performing per-frame inference to process videos leads to severe flickering (Fig. 2(b)). This phenomenon is even more pronounced in Latent Diffusion Models (LDM) because the decoder magnifies the randomness.

A potential solution to reduce the temporal flickering is to fine-tune or train a single-image diffusion model by inserting 3D convolution and temporal attention layers into it. However, the modified model needs to be trained on videos to impose temporal consistency, which often requires unaffordable computational resources (e.g., 32 A100-80G GPUs for video upscaling [102]). Moreover, each new task requires retraining the model.

Given limited computational resources, we present a novel zero-shot video restoration framework that transforms low-quality input videos into temporally consistent high-quality outputs. We design two training-free modules, hierarchical latent warping and hybrid flow-guided spatial-aware token merging, to enforce temporal consistency in both latent and token (feature from the attention layer) spaces. Our method can be applied to any pre-trained image diffusion model without additional training or fine-tuning. Extensive experiments demonstrate that our video restoration method outperforms state-of-the-art approaches in both video quality and temporal consistency, even under extreme degradation conditions. To summarize, our main contributions are as follows:

• We propose a novel zero-shot video restoration method that achieves realistic results and maintains temporal consistency, and that is compatible with any image-based diffusion model.
• Our training-free framework manipulates both latent and token spaces to enforce semantic consistency across frames, introducing hierarchical latent warping to maintain consistency within and between batches and improving token merging with flow correspondence and spatial information.
• Our method demonstrates state-of-the-art restoration results, especially under extreme degradation and large motion. It handles various levels of degradation with a single model, offering greater generalizability and robustness than traditional regression-based methods.

Figure 1: Zero-shot temporal-consistent diffusion model for video restoration. Given a pre-trained diffusion model for *single-image* restoration, our method generates temporally consistent restored video with fine details *without* any further training.

2 Related Work

Video Restoration.

Video restoration aims to restore high-quality frames from degraded videos, addressing issues such as noise, blur, and low resolution [7, 9, 31, 39, 96, 101, 47, 48]. This task is more challenging than image restoration because it requires maintaining temporal consistency across frames. Learning-based approaches often employ techniques such as optical flow warping [30, 58, 67, 68, 88], deformable convolutions [7, 8, 17, 77, 80, 81, 103], and attention mechanisms to handle temporal dependencies [5, 40, 43, 44, 98]. One major limitation is their dependency on paired high-quality (HQ) and low-quality (LQ) data for training [10, 86, 93], which is even more difficult to obtain for videos than for images. Moreover, most existing approaches [37, 36, 38, 40, 44] assume predefined degradation processes, reducing their effectiveness in real-world applications where degradations are unknown and diverse, thus leading to poor generalization performance. Additionally, these models often need retraining for different degradation levels or types [43, 46, 55, 95, 96], highlighting their limited generalization capabilities. Finally, these methods tend to lose significant detail, as their image restoration counterparts do [11, 42, 82, 99].

Diffusion Models for Image Restoration.

With significant advancements in diffusion models [13, 18, 25, 26, 63], many diffusion-based approaches have been proposed for image restoration [21, 26, 56, 70, 73, 79, 92]. One straightforward method involves training a diffusion model from scratch [63, 66, 85, 97], conditioned on low-quality images [66, 1]. However, this approach requires substantial computational resources. To reduce these costs, a common strategy is to introduce constraints into the reverse diffusion process of pre-trained models, as demonstrated by DDRM [34]. While efficient, these methods [13, 15, 21, 34, 72, 83, 89] depend on predefined image degradation processes or pretrained super-resolution (SR) models, which limits their generalizability. Recent works have enhanced performance by fine-tuning frozen pre-trained diffusion models with additional trainable layers [79, 92, 100], as seen in StableSR [79] and DiffBIR [45]. Despite their effectiveness, these methods encounter challenges in video restoration, where the inherent randomness of the diffusion process can cause temporal inconsistencies across frames. Our method allows these methods to work on video without any training.

Diffusion Models for Video Task.

Building on the success of text-to-image diffusion models [3, 13, 22, 24, 25, 53, 62, 65, 100], recent research explores using diffusion models for video tasks [20, 27, 28, 29, 50, 51, 52], extending pre-trained image diffusion models to video processing. Methods [12, 69], like Upscale-A-Video [102] and MGLD-VSR [94], achieve this by integrating and fine-tuning temporal layers. Upscale-A-Video adds temporal layers to UNet and VAE-Decoder, ensuring sequence consistency. Similarly, MGLD-VSR uses motion dynamics from low-resolution videos and calculated optical flows to align latent features. These methods require paired video data and substantial computational resources. Alternatively, zero-shot methods use existing image diffusion models to generate videos without training [6, 14, 16, 23, 61, 84, 91, 93, 100], employing techniques like token merging [4], noise shuffling, and latent warping. VidToMe [41] and TokenFlow [23] enhance temporal consistency by merging and aligning attention tokens across frames; Rerender-A-Video [90] employs latent warping [76, 87] and frame interpolation; RAVE [33] uses noise shuffling to maintain frame consistency in longer videos with reduced time complexity. These techniques are capable of generating impressive video sequences with minimal effort. However, they often produce blurry results and struggle with semantic consistency in demanding video restoration tasks. Inspired by these methods, our training-free framework manipulates latent and token spaces to ensure semantic consistency across frames, introducing hierarchical latent warping and improving token merging with flow correspondence and spatial information.

3 Method

Given a low-quality video with $n$ frames $\left\{lq_{1},lq_{2},\ldots,lq_{n}\right\}$, our goal is to restore it into a high-quality video $\left\{hq_{1},hq_{2},\ldots,hq_{n}\right\}$ using off-the-shelf image-based diffusion models. However, as illustrated in Fig. 2 and Fig. 7, directly applying these models to each frame individually results in temporal inconsistency due to the inherent stochasticity of the diffusion models, particularly in cases of extreme degradation. Our method, as depicted in Fig. 3, addresses this by enforcing temporal stability in both latent and token space during restoration through two main components: Hierarchical Latent Warping (Sec. 3.2) and Hybrid Flow-guided Spatial-aware Token Merging (Sec. 3.3). In this section, we first briefly introduce the background of diffusion models and video token merging in Sec. 3.1. We then introduce the hierarchical latent warping strategy in Sec. 3.2, hybrid flow-guided spatial-aware token merging in Sec. 3.3, and the scheduling of both in Sec. 3.4.

3.1 Preliminaries

Diffusion Models.

Diffusion models are a type of generative model that models a data distribution $p_{\text{data}}$ through gradual diffusing and denoising. The forward process diffuses a clean image $x_{0}$ by Gaussian noises in $T$ steps, given by

$$x_{t}=\sqrt{\alpha_{t}}\,x_{t-1}+\sqrt{1-\alpha_{t}}\,\epsilon_{t-1}\;\Rightarrow\;x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon, \tag{1}$$

where $t\sim[1,T]$, $\epsilon_{t},\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$. The latent variable $x_{T}$ is nearly standard Gaussian when $T$ is large enough. A denoiser $\epsilon_{\theta}$, usually implemented with a UNet [64], is trained to estimate the noise $\epsilon_{t}$ by minimizing $\mathbb{E}_{t\sim\left[1,T\right],x_{0},\epsilon_{t}}\left[||\epsilon_{t}-\epsilon_{\theta}(x_{t},t)||^{2}\right]$. During inference, the reverse process starts from i.i.d. Gaussian noise $x_{T}$ and produces a clean image $x_{0}\sim p_{\text{data}}$ by gradual denoising with the trained denoiser over $\bar{T}$ steps [26, 71, 74], where $\bar{T}\in[1,T]$. These unconditional generative models can be further enhanced with additional guidance, such as text prompts and images, for guided generation. The control signal $c$ is injected into the denoiser $\epsilon_{\theta}(x_{t},t,c)$ through training [100] or optimization [34].
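As a minimal sketch of the closed form in Eq. (1), the NumPy snippet below samples $x_{t}$ directly from $x_{0}$ and inverts the same relation to recover a clean estimate from a noise estimate, which is the quantity written $\hat{x}_{t\rightarrow 0}$ in Sec. 3.2. The linear $\beta$ schedule, its endpoints, and the function names are assumptions made for illustration; they are not taken from this paper.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of alpha_s = 1 - beta_s (linear beta schedule is an assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 with the closed form of Eq. (1); t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

def predict_x0(x_t, t, alpha_bar, eps_hat):
    """Solve Eq. (1) for x_0 given a noise estimate eps_hat."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
x0 = rng.standard_normal((4, 8, 8))          # stand-in for a clean latent
x_t, eps = diffuse(x0, 500, alpha_bar, rng)
x0_hat = predict_x0(x_t, 500, alpha_bar, eps)  # exact when the true noise is known
```

With the true noise plugged in, `predict_x0` returns `x0` up to floating-point error; in practice `eps_hat` would come from the denoiser $\epsilon_{\theta}$.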

Video Token Merging.

Video Token Merging (VidToMe) [41] enhances the temporal consistency of generated videos by utilizing image diffusion models to merge similar tokens within frame chunks in the attention blocks. This token merging not only improves temporal coherence but also reduces the computational overhead of the attention mechanism by decreasing the size of token chunks.

Given a token chunk $\textbf{T}\in\mathbb{R}^{B\times A\times C}$ of $B$ frames with $A=w*h$ tokens each and $C$ channels, the algorithm first separates the tokens into source tokens $\textbf{T}_{\text{src}}\in\mathbb{R}^{(B-1)\times A\times C}$ and target tokens $\textbf{T}_{\text{tar}}\in\mathbb{R}^{1\times A\times C}$. It then computes the cosine similarity between each source token and each target token, giving a score matrix $score\in\mathbb{R}^{((B-1)*A)\times A}$. The algorithm then identifies the most similar target token for each source token by taking the maximum along the last dimension of the score matrix:

$$s(\textbf{T}_{\text{src}},\textbf{T}_{\text{tar}})=\frac{\textbf{T}_{\text{src}}\cdot\textbf{T}_{\text{tar}}}{\left\|\textbf{T}_{\text{src}}\right\|\left\|\textbf{T}_{\text{tar}}\right\|}\,,\quad c=\max_{\textbf{t}\in\textbf{T}_{\text{tar}}}s(\textbf{T}_{\text{src}},\textbf{t}), \tag{2}$$

where $s(\cdot,\cdot)$ is the cosine similarity score and $c$ indicates the correspondences. Next, the $r$ most similar paired source-target tokens are merged, and the remaining tokens are concatenated as the output. Merged tokens are subsequently unmerged after self-attention to preserve the original shape by simply assigning the merged source-target tokens the exact same value. The token merging and unmerging are defined as follows:

$$\textbf{T}_{\text{merge}}=\mathcal{M}(\textbf{T}_{\text{src}},\textbf{T}_{\text{tar}},\;c,\;r)\,,\quad\textbf{T}_{\text{unmerge}}=\mathcal{U}(\textbf{T}_{\text{merge}},\;c)\,, \tag{3}$$

where $\mathcal{M}$ and $\mathcal{U}$ denote the merging and unmerging operations, respectively.
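The matching, merging, and unmerging steps can be sketched in NumPy as follows. This is an illustrative reading of Eqs. (2) and (3), not the VidToMe implementation: the function names are hypothetical, source tokens are flattened to shape `(Ns, C)`, and merged tokens are averaged into their targets, which is an assumed reduction.

```python
import numpy as np

def match_tokens(src, tar):
    """Cosine similarity between all source/target tokens (Eq. 2).
    src: (Ns, C), tar: (Nt, C). Returns best target index and score per source token."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tar_n = tar / np.linalg.norm(tar, axis=1, keepdims=True)
    score = src_n @ tar_n.T                      # (Ns, Nt)
    return score.argmax(axis=1), score.max(axis=1)

def merge(src, tar, corr, sim, r):
    """Merge the r best-matched source tokens into their targets (averaging is assumed)."""
    order = np.argsort(-sim)
    merged_idx, kept_idx = order[:r], order[r:]
    sums, counts = tar.copy(), np.ones(len(tar))
    np.add.at(sums, corr[merged_idx], src[merged_idx])   # unbuffered scatter-add
    np.add.at(counts, corr[merged_idx], 1.0)
    out = np.concatenate([src[kept_idx], sums / counts[:, None]], axis=0)
    return out, (merged_idx, kept_idx)

def unmerge(tokens, corr, info, num_src):
    """Restore the original token count: merged sources copy their target's value."""
    merged_idx, kept_idx = info
    n_kept = len(kept_idx)
    src_kept, tar = tokens[:n_kept], tokens[n_kept:]
    src = np.empty((num_src, tokens.shape[1]), dtype=tokens.dtype)
    src[kept_idx] = src_kept
    src[merged_idx] = tar[corr[merged_idx]]
    return src, tar

rng = np.random.default_rng(0)
src, tar = rng.standard_normal((6, 8)), rng.standard_normal((4, 8))
corr, sim = match_tokens(src, tar)
merged, info = merge(src, tar, corr, sim, r=3)   # 6 - 3 + 4 = 7 tokens enter attention
src_out, tar_out = unmerge(merged, corr, info, num_src=6)
```

The attention layer would operate on `merged`, which is where the reduction in computation comes from; `unmerge` then restores the original shape.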

3.2 Hierarchical Latent Warping

We introduce a hierarchical latent warping module that operates in the latent space. As illustrated in Fig. 5, this module provides rough shape guidance on both global and local scales by hierarchically propagating latents within keyframes and further from keyframes to their respective batches. Let $\hat{x}^{i}_{t\rightarrow 0}$ be the predicted $\hat{x}_{0}$ latent for the $i^{th}$ keyframe at denoising step $t$ . We warp the latents between keyframes as follows:

$$\hat{x}^{i}_{t\rightarrow 0}\leftarrow M_{ji}\cdot\hat{x}^{i}_{t\rightarrow 0}+\left(1-M_{ji}\right)\cdot\mathcal{W}(\hat{x}^{j}_{t\rightarrow 0},f_{ji}), \tag{4}$$

where $j=i-1$, and $f_{ji}$ and $M_{ji}$ denote the optical flow and the occlusion mask from $lq_{j}$ to $lq_{i}$ estimated by GMFlow [87]. After keyframe latent warping, these latents are further warped to the remaining frames following the same procedure. The flows and masks are downsampled to match the latent size; these operations are omitted from the notation for simplicity. This module primarily functions in the early stages of the denoising process, ensuring that corresponding points between frames share similar latents both globally and locally from the beginning.
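A minimal NumPy sketch of the blend in Eq. (4) is given below. Several details are assumptions made for illustration: the warp uses nearest-neighbour backward sampling, the flow is stored as per-pixel `(dx, dy)` offsets telling each output location where to read from, and the chain uses the already-updated previous latent. The flow and mask would come from an optical flow estimator on the low-quality frames; here they are plain arrays.

```python
import numpy as np

def warp(latent, flow):
    """Backward-warp a (C, H, W) latent with a (2, H, W) flow (nearest neighbour, assumed)."""
    _, h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return latent[:, src_y, src_x]

def blend(latent_i, latent_j, flow, mask):
    """Eq. (4): keep latent_i where mask is 1, otherwise take the warped latent_j."""
    return mask * latent_i + (1.0 - mask) * warp(latent_j, flow)

def propagate_chain(latents, flows, masks):
    """Apply Eq. (4) sequentially with j = i - 1 (use of the updated latent is assumed)."""
    out = [latents[0]]
    for i in range(1, len(latents)):
        out.append(blend(latents[i], out[i - 1], flows[i - 1], masks[i - 1]))
    return out

# Toy usage: three keyframe latents, zero flow, fully "occluded" mask of zeros,
# so every keyframe inherits the first keyframe's latent.
rng = np.random.default_rng(0)
keyframes = [rng.standard_normal((4, 6, 6)) for _ in range(3)]
flows = [np.zeros((2, 6, 6)) for _ in range(2)]
masks = [np.zeros((6, 6)) for _ in range(2)]
warped = propagate_chain(keyframes, flows, masks)
```

The same `blend` call could then be reused to push each keyframe latent to the other frames of its batch, which is the local level of the hierarchy described above.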

3.3 Hybrid Flow-guided Spatial-aware Token Merging

While latent manipulation can achieve a certain degree of consistency, manipulating latents during the later stages of the denoising process would result in blurry outcomes. Additionally, the token space is highly semantically related to the image. Therefore, we propose hybrid flow-guided spatial-aware token merging to achieve consistency in the token space.

Flow-guided.

Even with low-quality video, we can identify correspondences between frames based on color, indicating that flow calculated from $lq$ can still provide useful guidance. As shown in the top of Fig. 5, in the early stages of the denoising process, when the latents are still very noisy, cosine similarity struggles to find the correct correspondences, especially in the downsample blocks of the UNet. Therefore, we use flow for correspondences at the downsample blocks in the UNet and employ the confidence from a forward-backward consistency check as the criterion for selecting the $r$ most similar pairs of source tokens $\textbf{T}_{\text{src}}$ and target tokens $\textbf{T}_{\text{tar}}$:

$$\sigma=\exp\left(-\left\|f_{\text{src}\rightarrow\text{tar}}(X(\textbf{T}_{\text{src}}))+f_{\text{tar}\rightarrow\text{src}}\left(X(\textbf{T}_{\text{src}})+f_{\text{src}\rightarrow\text{tar}}(X(\textbf{T}_{\text{src}}))\right)\right\|_{2}^{2}\right), \tag{5}$$

where $\sigma$ is the confidence, $X(\textbf{T}_{\text{src}})$ is the spatial location of $\textbf{T}_{\text{src}}$, and $f_{\text{src}\rightarrow\text{tar}}$ and $f_{\text{tar}\rightarrow\text{src}}$ denote the forward and backward flow between $\textbf{T}_{\text{src}}$ and $\textbf{T}_{\text{tar}}$. The proposed flow-guided token merging is thus defined as:

$$\textbf{T}_{\text{merge}}=\mathcal{M}(\textbf{T}_{\text{src}},\;\textbf{T}_{\text{tar}},\;f_{\text{src}\rightarrow\text{tar}},\;\sigma,\;r). \tag{6}$$

Fig. 5 provides a clearer illustration of our proposed component. Additionally, as shown at the bottom of Fig. 5, flow and cosine similarity identify different correspondences, so a hybrid approach can provide comprehensive guidance, leading to improved temporal consistency and overall video quality.
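The forward-backward confidence of Eq. (5) can be sketched as follows. The snippet assumes flows stored as `(2, H, W)` arrays of `(dx, dy)` offsets at token resolution and looks up the backward flow at the nearest target location; both choices, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def fb_confidence(flow_fwd, flow_bwd):
    """Eq. (5): sigma = exp(-||f_fwd(X) + f_bwd(X + f_fwd(X))||^2), one value per token."""
    _, h, w = flow_fwd.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Location reached in the target frame, rounded to the nearest token (assumed).
    tx = np.clip(np.rint(xs + flow_fwd[0]).astype(int), 0, w - 1)
    ty = np.clip(np.rint(ys + flow_fwd[1]).astype(int), 0, h - 1)
    residual = flow_fwd + flow_bwd[:, ty, tx]    # zero when the two flows agree
    return np.exp(-np.sum(residual ** 2, axis=0))

def select_top_r(sigma, r):
    """Flattened (row-major) indices of the r most confident source tokens."""
    return np.argsort(-sigma.ravel(), kind="stable")[:r]

# Consistent flows: moving right by one token and back again gives sigma = 1.
fwd = np.zeros((2, 4, 5)); fwd[0] = 1.0
bwd = np.zeros((2, 4, 5)); bwd[0] = -1.0
sigma = fb_confidence(fwd, bwd)
chosen = select_top_r(sigma, r=3)
```

Tokens whose round trip does not return to the starting location receive a confidence that decays exponentially with the squared round-trip error, so occluded or badly estimated regions are the last to be merged.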

Spatial-awareness and Padding Removal.

Finding correspondences by cosine similarity alone can easily lead to mismatches in regions with uniform textures, especially video backgrounds such as sky, sand, or grass (bottom of Fig. 5), resulting in blurrier outcomes. Given that corresponding points in adjacent frames are typically spatially close in videos, leveraging spatial information is crucial for accurate correspondence. We can effectively utilize this information by weighting the cosine similarity scores with the tokens’ spatial distances. The weighted scores are defined as:

$$s_{ij}^{\prime}=s_{ij}\cdot e^{-\tau}\,,\quad\text{with}\quad\tau=\left\lfloor\|X(i)-X(j)\|_{2}^{2}/R\right\rfloor, \tag{7}$$

where $X(i)$ and $X(j)$ are the spatial locations of the $i^{th}$ source token and the $j^{th}$ target token, and $R$ is a hyperparameter defining the radius of the region with uniform weight.
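A small NumPy sketch of the weighting in Eq. (7) follows. Token locations are assumed to be integer grid coordinates, and the function name is hypothetical.

```python
import numpy as np

def spatial_weighted_scores(score, src_xy, tar_xy, radius):
    """Eq. (7): damp similarity scores by exp(-floor(squared distance / R)).
    score: (Ns, Nt), src_xy: (Ns, 2), tar_xy: (Nt, 2)."""
    d2 = np.sum((src_xy[:, None, :] - tar_xy[None, :, :]) ** 2, axis=-1)
    tau = np.floor(d2 / radius)
    return score * np.exp(-tau)

# One source token at the origin, three targets at squared distances 0, 1, and 9.
score = np.ones((1, 3))
src_xy = np.array([[0.0, 0.0]])
tar_xy = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
weighted = spatial_weighted_scores(score, src_xy, tar_xy, radius=4.0)
```

With $R=4$, the first two targets fall inside the uniform-weight region ($\tau=0$), while the third has $\tau=\lfloor 9/4\rfloor=2$ and its score is scaled by $e^{-2}$.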

Images are often padded to propagate through the UNet, and we find that this padding can significantly impact the correspondences found in the tokens. While it is acceptable for padding to correspond to padding in another token, when the token feature dimension is low, cosine similarity may mistakenly identify padding as corresponding to the actual image content. This issue persists until the later stages of the denoising process. Please refer to the appendix for the non-padding correspondence figure. To address this, we remove padding before merging and add it back when unmerging.
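The padding removal step can be sketched as below. The snippet assumes row-major token order and padding located at the bottom and right of the token grid; the actual padding layout depends on the underlying model, and the function names are hypothetical.

```python
import numpy as np

def strip_padding(tokens, h, w, pad_h, pad_w):
    """Drop padded rows/columns from (h*w, C) tokens (bottom/right padding is assumed)."""
    grid = tokens.reshape(h, w, -1)
    return grid[: h - pad_h, : w - pad_w].reshape(-1, tokens.shape[1])

def restore_padding(valid_tokens, padded_tokens, h, w, pad_h, pad_w):
    """Write processed valid tokens back, leaving the padded positions untouched."""
    grid = padded_tokens.reshape(h, w, -1).copy()
    grid[: h - pad_h, : w - pad_w] = valid_tokens.reshape(h - pad_h, w - pad_w, -1)
    return grid.reshape(h * w, -1)

tokens = np.arange(4 * 5 * 3, dtype=float).reshape(20, 3)   # 4x5 grid, 3 channels
valid = strip_padding(tokens, 4, 5, pad_h=1, pad_w=2)        # 3x3 valid region
restored = restore_padding(valid + 100.0, tokens, 4, 5, pad_h=1, pad_w=2)
```

Only `valid` would take part in correspondence matching and merging, so padding tokens can never be matched to image content.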

Merging Ratio Annealing.

For restoration tasks that demand fine details, maintaining a high merging ratio in the later stages of the denoising process can result in blurred and unrealistic outcomes. To address this, we employ ratio annealing to gradually reduce the merging ratio, preserving detail and realism in the restored video. The merging ratio of the $i^{th}$ denoising step is computed as:

$$r_{i}=r\cdot\cos\left(\frac{\pi}{2}\cdot\max\left(\min\left(\delta\cdot\frac{i-i_{\text{beg}}}{i_{\text{end}}-i_{\text{beg}}},1\right),0\right)\right), \tag{8}$$

where $i_{\text{beg}}$ , $i_{\text{end}}$ are predefined steps indicating the beginning and end of the merging process, and $\delta$ represents a hyperparameter for controlling the annealing speed.
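Eq. (8) translates directly into a few lines of Python. The function name and the example values of $r$, $i_{\text{beg}}$, $i_{\text{end}}$, and $\delta$ are chosen for illustration only.

```python
import math

def merging_ratio(i, r, i_beg, i_end, delta=1.0):
    """Eq. (8): cosine-annealed merging ratio at denoising step i."""
    progress = delta * (i - i_beg) / (i_end - i_beg)
    progress = max(min(progress, 1.0), 0.0)   # clamp to [0, 1]
    return r * math.cos(0.5 * math.pi * progress)

# Ratio over a hypothetical 10-step merging window with base ratio 0.6.
schedule = [merging_ratio(i, r=0.6, i_beg=0, i_end=10) for i in range(11)]
```

The ratio starts at $r$, decays slowly at first, and reaches zero at $i_{\text{end}}$; setting $\delta>1$ makes it reach zero earlier, at $i_{\text{beg}}+(i_{\text{end}}-i_{\text{beg}})/\delta$.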

3.4 Scheduling

As depicted in Fig. 3, at the initial stage of the diffusion denoising process, hierarchical latent warping offers rough shape guidance on a global scale by warping latents between keyframes and on a local scale by propagating these latents within the batch. During the majority of the denoising process, tokens are processed with our hybrid spatial-aware token merging before entering the attention layer. This component further improves temporal consistency by matching similar tokens, utilizing both flow and spatial information.

Table 1: Quantitative comparisons. (Left) 4$\times$ and 8$\times$ video super-resolution on the SPMCS [75] and DAVIS [60] datasets. (Right) video denoising of various noise levels on the REDS30 [55] dataset. The best and second-best performances are marked in red and blue, respectively. $E_{\text{warp}}^{*}$ denotes $E_{\text{warp}}(\times 10^{-3})$, and $E_{\text{inter}}$ and LPIPS${}_{\text{inter}}$ denote interpolation error and interpolation LPIPS.

				SD $\times$ 4		DiffBIR
	Metrics	VidToMe	FMA-Net	Frame	Ours	Frame	Ours
SPMCS	PSNR $\uparrow$	20.516	21.910	20.573	20.636	21.534	21.843
	SSIM $\uparrow$	0.471	0.617	0.490	0.517	0.544	0.572
	LPIPS $\downarrow$	0.352	0.230	0.298	0.286	0.261	0.258
	$E_{\text{warp}}^{*}\downarrow$	0.531	0.157	1.058	8.290	0.571	0.571
	$E_{\text{inter}}\downarrow$	10.102	3.271	13.817	11.961	10.257	9.712
	LPIPS ${}_{\text{inter}}\downarrow$	0.218	0.015	0.241	0.226	0.177	0.158
DAVIS $\times$ 4	PSNR $\uparrow$	23.948	25.215	23.504	23.843	23.780	24.182
	SSIM $\uparrow$	0.608	0.727	0.584	0.618	0.601	0.621
	LPIPS $\downarrow$	0.298	0.347	0.277	0.272	0.264	0.262
	$E_{\text{warp}}^{*}\downarrow$	0.512	0.186	0.912	0.745	0.654	0.474
	$E_{\text{inter}}\downarrow$	14.615	11.558	18.125	17.431	16.529	14.666
	LPIPS ${}_{\text{inter}}\downarrow$	0.278	0.078	0.292	0.274	0.266	0.232
DAVIS $\times$ 8	PSNR $\uparrow$	22.570	22.690	20.268	20.519	21.964	22.331
	SSIM $\uparrow$	0.527	0.594	0.446	0.424	0.502	0.519
	LPIPS $\downarrow$	0.454	0.528	0.470	0.434	0.362	0.367
	$E_{\text{warp}}^{*}\downarrow$	0.523	0.351	2.199	1.759	0.964	0.699
	$E_{\text{inter}}\downarrow$	14.117	13.978	24.496	21.746	17.981	15.853
	LPIPS ${}_{\text{inter}}\downarrow$	0.379	0.132	0.457	0.442	0.372	0.333

					DiffBIR
$\sigma$	Metrics	VRT	Shift-Net	VidToMe	Frame	Ours
75	PSNR $\uparrow$	25.050	21.033	23.791	24.585	24.520
	SSIM $\uparrow$	0.787	0.381	0.618	0.649	0.649
	LPIPS $\downarrow$	0.275	0.735	0.296	0.276	0.275
	$E_{\text{warp}}^{*}\downarrow$	0.314	1.757	0.765	0.751	0.706
	$E_{\text{inter}}\downarrow$	17.825	27.094	21.751	21.798	21.166
	LPIPS ${}_{\text{inter}}\downarrow$	0.095	0.501	0.287	0.275	0.264
100	PSNR $\uparrow$	24.582	22.573	24.606	24.524	24.534
	SSIM $\uparrow$	0.744	0.484	0.676	0.648	0.652
	LPIPS $\downarrow$	0.346	0.518	0.318	0.275	0.271
	$E_{\text{warp}}^{*}\downarrow$	0.294	1.126	0.781	0.763	0.696
	$E_{\text{inter}}\downarrow$	17.079	23.424	21.460	21.835	20.639
	LPIPS ${}_{\text{inter}}\downarrow$	0.095	0.375	0.278	0.281	0.267
random	PSNR $\uparrow$	24.989	21.113	23.692	24.579	24.508
	SSIM $\uparrow$	0.780	0.386	0.615	0.650	0.649
	LPIPS $\downarrow$	0.284	0.728	0.303	0.276	0.270
	$E_{\text{warp}}^{*}\downarrow$	0.363	1.896	0.772	0.755	0.713
	$E_{\text{inter}}\downarrow$	18.147	27.565	21.929	21.743	21.140
	LPIPS ${}_{\text{inter}}\downarrow$	0.099	0.542	0.291	0.282	0.272

Table 2: Ablation studies for 8$\times$ VSR on DAVIS [59] test sets. (Left) different correspondence matching methods. (Right) the proposed components applied at different stages of the denoising process. We apply our two proposed components, hierarchical latent warping (HLW) and hybrid spatial-aware token merging (HS-ToMe), at the early, mid, and late stages of the denoising process.

Down blocks	Up blocks	Spatial- aware	LPIPS $\downarrow$	$E_{\text{warp}}^{*}\downarrow$	LPIPS ${}_{\text{inter}}\downarrow$
Flow	Flow	–	0.518	1.214	0.563
Cos	Cos	–	0.390	0.736	0.350
Cos	Flow	–	0.507	1.049	0.545
Flow	Cos	–	0.375	0.677	0.347
Flow	Cos	✓	0.367	0.699	0.333

HLW (Sec. 3.2)			HS-ToMe (Sec. 3.3)
Early	Mid	Late	Early	Mid	Late	LPIPS $\downarrow$	$E_{\text{warp}}^{*}\downarrow$	LPIPS ${}_{\text{inter}}\downarrow$
–	–	–	–	–	–	0.362	0.964	0.372
✓	–	–	✓	–	–	0.368	0.887	0.369
✓	✓	–	✓	✓	✓	0.43	0.804	0.383
✓	✓	✓	✓	✓	✓	0.411	0.704	0.339
✓	–	–	✓	✓	✓	0.367	0.699	0.333

4 Experiments

Testing Dataset.

For video super-resolution, we evaluate on the SPMCS [95] and DAVIS [59] testing sets with two downsampling scales ($\times$4, $\times$8), following the degradation pipeline of RealBasicVSR [10] to generate LQ-HQ video pairs. For video denoising, we evaluate on REDS30 [54] with three noise settings: std. $=$ 75, std. $=$ 100, and a random setting in which std. is uniformly sampled from the range [50, 100].

Evaluation Metrics.

We evaluate the restoration performance based on two aspects: (1) image quality, using LPIPS, SSIM, and PSNR; (2) temporal consistency, using warping error $E_{\text{warp}}$ , interpolation error, and interpolation LPIPS. Since LPIPS better reflects visual quality, we propose interpolation LPIPS, based on the interpolation error used in a previous study [41], to more accurately measure video continuity from a visual perspective. This involves interpolating a target frame from its previous and next frames and computing the LPIPS between the estimated and target frames.

Implementation Details.

The experiments are conducted on an NVIDIA RTX 4090 GPU. We apply our method to DiffBIR [85] and the SD$\times$4 upscaler [1], both image-based diffusion models, to demonstrate the proposed method’s compatibility with different models. Note that for models restricted to a super-resolution scale of 4$\times$, we apply the process twice and then use bicubic downsampling to achieve 8$\times$ results.
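The arithmetic behind the 8$\times$ setting can be made explicit with a short sketch: two passes of a 4$\times$ model give 16$\times$, so a final downsampling by a factor of 2 is needed to land on 8$\times$. The helper below is hypothetical and only computes this plan; it does not perform any resampling.

```python
def cascade_plan(target_scale, model_scale=4):
    """Return (passes, reached_scale, downsample_factor) for a fixed-scale upscaler."""
    passes, reached = 0, 1
    while reached < target_scale:
        reached *= model_scale
        passes += 1
    return passes, reached, reached / target_scale

passes, reached, down = cascade_plan(8)   # two 4x passes reach 16x, then downsample by 2
```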

4.1 Comparisons with State-of-the-Art Methods

To verify the effectiveness of our approach, we compare it with several state-of-the-art methods, including FMA-Net [96] for video super-resolution, and VRT [43] and Shift-Net [39] for video denoising. We also compare our method to per-frame restoration and the application of VidToMe [41], a zero-shot video editing method, onto the same model as ours.

Video Super-resolution.

As shown in Tab. 1, regression-based methods like FMA-Net [96] perform better on datasets like SPMCS that have minimal motion. However, their generalization ability diminishes significantly with increased motion or severe degradation. VidToMe [41] can generate highly consistent results, but they are often very blurry, leading to poor visual quality. In contrast, our method enhances temporal consistency while maintaining the generation quality of the original diffusion model, making it the most competitive approach. Fig. 7 provides visualizations of two challenging VSR cases. FMA-Net fails to produce sharp results due to domain gaps between training and testing. The diffusion-based image restoration methods DiffBIR [45] and the SD$\times$4 upscaler [1] can generate sharp results with details, but per-frame processing makes the resulting video temporally inconsistent, with jitter across frames. In contrast, our zero-shot video restoration framework restores a low-quality input video into a temporally consistent high-quality video.

Video Denoising.

Video denoising, compared to VSR, is a simpler task for regression models, as they can often find the correct pixel value given a sufficiently large batch size. However, our method consistently outperforms others in terms of visual quality (LPIPS) and remains highly robust even as degradation becomes severe. Fig. 8 visualizes the denoising results on the REDS30 dataset. Shift-Net [39] fails to remove all noise, likely due to the out-of-domain noise level; VRT [43] produces smooth results but lacks fine details. Although DiffBIR [45] generates highly detailed images, it suffers from poor temporal consistency, as evident in the changes to the pedestrian’s head and the statue’s face. In contrast, our method preserves both fine details and temporal consistency, effectively balancing these two aspects.

4.2 Ablation Study

Ways of Identifying Correspondence.

Tab. 2 presents an ablation study comparing different approaches (optical flow and cosine similarity) for finding correspondences and their order in the UNet. As detailed in Sec. 3.3, the hybrid approach of using optical flow at the downsample blocks and cosine similarity at the upsample blocks achieves the best performance. Additionally, our proposed spatial-aware token merging further enhances performance by utilizing spatial information to guide correspondences. The comparisons in Fig. 6 also indicate that our results are smoother, demonstrating better temporal stability.

Applied Stages in the Denoising Process.

Tab. 2 presents an ablation study evaluating the application of our two proposed components, hierarchical latent warping (HLW, Sec. 3.2) and hybrid spatial-aware token merging (HS-ToMe, Sec. 3.3), at the early, mid, and late stages of the denoising process. The results indicate that applying latent warping in the mid or late stages can significantly degrade the generated outcomes. Furthermore, ensuring consistency in the token space is crucial for achieving coherent and high-quality results.

5 Conclusion

We introduce a novel zero-shot video restoration framework utilizing pre-trained image-based diffusion models, eliminating the need for extensive retraining. Our approach integrates hierarchical latent warping and hybrid flow-guided, spatial-aware token merging, significantly enhancing temporal consistency and video quality under various degradation conditions. Experimental results demonstrate that our framework surpasses existing methods both in quality and consistency.

Limitations.

Our zero-shot video restoration framework has some limitations. Random keyframe sampling may not always select the most representative frames, potentially affecting restoration quality, especially if frames with severe degradation are chosen. Additionally, the sensitivity of the LDM decoder to minor variations in input latents can cause flickering, particularly in dynamic scenes. Future improvements will focus on refining keyframe selection and stabilizing the decoder output to enhance the practical application of diffusion-based video restoration methods.

References

sdx [2023] Stable diffusion x4 upscaler, 2023. URL https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler.
Albawi et al. [2017] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET), pages 1–6. IEEE, 2017.
Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, June 2022.
Bolya et al. [2023] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023.
Cao et al. [2021] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
Ceylan et al. [2023] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
Chan et al. [2021a] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2021a.
Chan et al. [2021b] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Understanding deformable alignment in video super-resolution. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 973–981, 2021b.
Chan et al. [2021c] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. 2021c.
Chan et al. [2022] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
Chen et al. [2022] Chaofeng Chen, Xinyu Shi, Yipeng Qin, Xiaoming Li, Xiaoguang Han, Tao Yang, and Shihui Guo. Real-world blind super-resolution via feature matching with implicit high-resolution priors. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1329–1338, 2022.
Chen et al. [2024] Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, and Tao Mei. Learning spatial adaptation and temporal coherence in diffusion models for video super-resolution. arXiv preprint arXiv:2403.17000, 2024.
Choi et al. [2021] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
Chu et al. [2024] Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, and Jun-Cheng Chen. Medm: Mediating image diffusion models for video-to-video translation with temporal correspondence guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems, 35:25683–25696, 2022.
Cong et al. [2023] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017. doi: 10.1109/ICCV.2017.89.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
Fei et al. [2023] Ben Fei, Zhaoyang Lyu, Liang Pan, Junzhe Zhang, Weidong Yang, Tianyue Luo, Bo Zhang, and Bo Dai. Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9935–9946, 2023.
Gal et al. [2023] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.
Geyer et al. [2023] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. In International Conference on Learning Representations, 2023.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
Hu et al. [2023] Yaosi Hu, Zhenzhong Chen, and Chong Luo. Lamd: Latent motion diffusion for video generation. arXiv preprint arXiv:2304.11603, 2023.
Huang et al. [2022] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. In European conference on computer vision, pages 668–685. Springer, 2022.
Isobe et al. [2020] Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network, 2020.
Kalchbrenner et al. [2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
Kara et al. [2024] Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, and Pinar Yanardag. Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Kawar et al. [2022] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. 2024.
Kim et al. [2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
Kim et al. [2017] Tae Hyun Kim, Seungjun Nah, and Kyoung Mu Lee. Dynamic video deblurring using a locally adaptive blur model. IEEE transactions on pattern analysis and machine intelligence, 40(10):2374–2387, 2017.
Kong et al. [2023] Lingshun Kong, Jiangxin Dong, Jianjun Ge, Mingqiang Li, and Jinshan Pan. Efficient frequency domain-based transformers for high-quality image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5886–5895, 2023.
Li et al. [2023] Dasong Li, Xiaoyu Shi, Yi Zhang, Ka Chun Cheung, Simon See, Xiaogang Wang, Hongwei Qin, and Hongsheng Li. A simple baseline for video restoration with grouped spatial-temporal shift. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9822–9832, June 2023.
Li et al. [2020] Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. Mucan: Multi-correspondence aggregation network for video super-resolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 335–351. Springer, 2020.
Li et al. [2024] Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. Vidtome: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.
Liang et al. [2022a] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer. arXiv preprint arXiv:2201.12288, 2022a.
Liang et al. [2022b] Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, and Luc V Gool. Recurrent video restoration transformer with guided deformable attention. Advances in Neural Information Processing Systems, 35:378–393, 2022b.
Lin et al. [2024] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior, 2024.
Liu and Sun [2013] Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013.
Liu et al. [2019] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. 2019.
Liu et al. [2021a] Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Hybrid neural fusion for full-frame video stabilization. 2021a.
Liu et al. [2021b] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021b.
Lu et al. [2023] Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: An empirical study on video diffusion with transformers. arXiv preprint arXiv:2305.13311, 2023.
Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
Mei and Patel [2023] Kangfu Mei and Vishal Patel. Vidm: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9117–9125, 2023.
Mou et al. [2024] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024.
Nah et al. [2019a] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1996–2005, 2019a. doi: 10.1109/CVPRW.2019.00251.
Nah et al. [2019b] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019b.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
O’shea and Nash [2015] Keiron O’shea and Ryan Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.
Pan et al. [2020] Jinshan Pan, Haoran Bai, and Jinhui Tang. Cascaded deep video deblurring using temporal sharpness prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3043–3051, 2020.
Perazzi et al. [2016a] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 724–732, 2016a. doi: 10.1109/CVPR.2016.85.
Perazzi et al. [2016b] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016b.
Qi et al. [2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022a.
Saharia et al. [2022b] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726, 2022b.
Shi et al. [2023a] Xiaoyu Shi, Zhaoyang Huang, Weikang Bian, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Videoflow: Exploiting temporal cues for multi-frame optical flow estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12469–12480, 2023a.
Shi et al. [2023b] Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1599–1610, 2023b.
Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
Song et al. [2022] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2022.
Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023.
Tao et al. [2017] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE international conference on computer vision, pages 4472–4480, 2017.
Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
Tian et al. [2020] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3360–3369, 2020.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Wang et al. [2023] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
Wang et al. [2019] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019.
Wang et al. [2020] Xintao Wang, Ke Yu, Kelvin C.K. Chan, Chao Dong, and Chen Change Loy. Basicsr. https://github.com/xinntao/BasicSR, 2020.
Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021.
Wang et al. [2022] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490, 2022.
Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
Xia et al. [2023] Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, and Luc Van Gool. Diffir: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13095–13105, 2023.
Xie et al. [2023] Liangbin Xie, Xintao Wang, Shuwei Shi, Jinjin Gu, Chao Dong, and Ying Shan. Mitigating artifacts in real-world video super-resolution models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2956–2964, 2023.
Xu et al. [2022] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8121–8130, 2022.
Xue et al. [2019] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision (IJCV), 127(8):1106–1125, 2019.
Yang et al. [2024a] Peiqing Yang, Shangchen Zhou, Qingyi Tao, and Chen Change Loy. Pgdiff: Guiding diffusion models for versatile face restoration via partial guidance. Advances in Neural Information Processing Systems, 36, 2024a.
Yang et al. [2023a] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023a.
Yang et al. [2024b] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Fresco: Spatial-temporal correspondence for zero-shot video translation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2024b.
Yang et al. [2023b] Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023b.
Yang et al. [2021] Xi Yang, Wangmeng Xiang, Hui Zeng, and Lei Zhang. Real-world video super-resolution: A benchmark dataset and a decomposition based learning scheme. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4781–4790, 2021.
Yang et al. [2023c] Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. arXiv preprint arXiv:2312.00853, 2023c.
Yi et al. [2019] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3106–3115, 2019.
Youk et al. [2024] Geunhyuk Youk, Jihyong Oh, and Munchurl Kim. Fma-net: Flow-guided dynamic filtering and iterative feature refinement with multi-attention for joint video super-resolution and deblurring. In CVPR, 2024.
Yue et al. [2024] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36, 2024.
Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739, 2022.
Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
Zhang et al. [2018] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018.
Zhou et al. [2023] Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. arXiv preprint arXiv:2312.06640, 2023.
Zhu et al. [2019] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9308–9316, 2019.

Appendix A Supplemental material

In this supplementary material, we first show how padding affects the correspondences identified by cosine similarity. Subsequently, we present an additional application of our framework to consistent video depth estimation.

A.1 Correspondences identified by cosine similarity without padding removal

Fig. 9 shows that padding values severely degrade the matching when correspondences are identified by cosine similarity without first removing the padding.
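The effect can be illustrated with a minimal NumPy sketch. It is not the implementation used in the paper: the function name `cosine_nn_match`, the validity mask `tgt_valid`, and the toy 2-D features are all assumptions chosen for illustration. The sketch matches each source token to its most similar target token by cosine similarity, once with padded target tokens left in and once with them excluded.

```python
import numpy as np

def cosine_nn_match(src, tgt, tgt_valid=None, eps=1e-8):
    """For each row of src (N, C), return the index of the most similar row of tgt (M, C).

    tgt_valid is a hypothetical boolean mask of shape (M,); False marks padded
    target tokens, which are excluded from the search.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    src_n = src / (np.linalg.norm(src, axis=1, keepdims=True) + eps)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
    sim = src_n @ tgt_n.T  # (N, M) cosine similarity matrix
    if tgt_valid is not None:
        # Padded tokens can never win the argmax.
        sim = np.where(tgt_valid[None, :], sim, -np.inf)
    return sim.argmax(axis=1)

# Toy target features: two real tokens followed by two padded tokens that
# share one constant feature vector (an assumed stand-in for padding).
tgt = np.array([
    [1.0, 0.0],  # real token 0
    [0.6, 0.8],  # real token 1
    [1.0, 1.0],  # padded token
    [1.0, 1.0],  # padded token
])
tgt_valid = np.array([True, True, False, False])

# Two source tokens; the second happens to point in the padding direction.
src = np.array([
    [0.9, 0.1],
    [0.7, 0.7],
])

naive = cosine_nn_match(src, tgt)              # padding kept in the search
masked = cosine_nn_match(src, tgt, tgt_valid)  # padding removed from the search
print(naive.tolist(), masked.tolist())
```

In this toy case the second source token is matched to a padded token when padding is kept, and to a real token once padding is masked out, which mirrors the failure mode described above.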

A.2 Additional Application: Consistent Video Depth

Our zero-shot framework is applicable to any pre-trained image-based diffusion model and can improve the temporal consistency of its video predictions. To demonstrate this, we integrate it into Marigold [35], a state-of-the-art latent diffusion-based monocular depth estimator. Fig. 10 shows that the integration improves the temporal consistency of video depth estimation.

[Figure: video depth comparison. Panels: Input, Marigold [35], Ours.]