ShadowMaskFormer: Mask Augmented Patch Embedding for Shadow Removal

Zhuohao Li, Guoyang Xie^∗, Guannan Jiang, and Zhichao Lu

Zhuohao Li is with the School of Ocean Engineering and Technology, Sun Yat-Sen University, Zhuhai 519082, China (Email: lizhh9810@gmail.com). Guoyang Xie is with the Department of Intelligent Manufacturing, CATL, Ningde 352000, China, and also with the Department of Computer Science, City University of Hong Kong, Hong Kong, China (Email: guoyang.xie@ieee.org). Guannan Jiang is with the Department of Intelligent Manufacturing, CATL, Ningde 352000, China (Email: jianggn@catl.com). Zhichao Lu is with the Department of Computer Science, City University of Hong Kong, Hong Kong, China (Email: luzhichaocn@gmail.com).

^∗ Corresponding author.

Abstract

Transformers have recently emerged as the de facto models for computer vision tasks and have also been successfully applied to shadow removal. However, existing methods rely heavily on intricate modifications to the attention mechanisms within the transformer blocks while using a generic patch embedding. This often leads to complex architectural designs requiring additional computation resources. In this work, we aim to explore the efficacy of incorporating shadow information within the early processing stage. Accordingly, we propose a transformer-based framework with a novel patch embedding that is tailored for shadow removal, dubbed ShadowMaskFormer. Specifically, we present a simple and effective mask-augmented patch embedding to integrate shadow information and promote the model’s emphasis on acquiring knowledge for shadow regions. Extensive experiments conducted on the ISTD, ISTD+, and SRD benchmark datasets demonstrate the efficacy of our method against state-of-the-art approaches while using fewer model parameters.

Impact Statement

Our research introduces ShadowMaskFormer, a transformer-based framework designed to enhance shadow removal in images. This new approach simplifies the process and improves efficiency, requiring fewer model parameters compared to existing methods. Technologically, ShadowMaskFormer integrates shadow information early in the processing stage, enabling more accurate and less resource-intensive image analysis. This can lead to more cost-effective AI applications where computational resources or power efficiency is a concern. Economically, the reduction in computational demand may lower the barriers to implementing advanced image-processing technologies in consumer electronics and other devices. Socially, by improving the quality of shadow removal, our method could enhance the visual experience in applications such as digital photography and video, making these technologies more accessible and enjoyable for users. ShadowMaskFormer contributes to the ongoing development of AI in visual computing by offering a more streamlined approach to a common problem, potentially influencing future advancements in the field.

Index Terms

Patch Embedding, Shadow Mask, Shadow Removal, Vision Transformer, Deep Learning.


1 Introduction


Deep learning-based approaches have been widely used for various computer vision tasks, delivering excellent performance over traditional model-based approaches [1, 2, 3, 4]. Shadow removal is highly challenging as it involves restoring irregular shadow regions. Recently, deep learning-based approaches [5, 6, 7, 8, 9] have gradually emerged as the dominant paradigm in this field. It should be emphasized that the shadow mask, which distinguishes shadow regions from non-shadow regions in an image, has been widely employed in deep learning methods and proven to effectively assist in the shadow removal task [8, 10, 11].

Limitation of Shadow Transformer. Recently, as an emerging backbone model of choice for vision tasks, transformers have also been applied to shadow removal tasks. However, there are two issues. ❶ The existing transformer-based methods overlook the shadow information in the early processing stage and directly employ the generic patch embedding, which cannot fully unleash the representational power of the transformer-based methods. As shown in Figure 1(a), the state-of-the-art transformer-based methods (i.e., CRFormer [12] and ShadowFormer [13]) acquire contextual knowledge from the non-shadow regions, resulting in performance degradation. ❷ As shown in Figure 1(c), the transformer-based methods mainly necessitate the creation of new modules with shadow masks within the main computation blocks (i.e., transformer blocks and convolution blocks) for shadow removal, resulting in a noticeable increase in the number of parameters. This leads to the following research question: Can we incorporate shadow information in patch embedding to avoid complicated modifications to transformer blocks and highlight the shadow region in the feature extraction stage?

ShadowMaskFormer. To overcome the constraints of current transformer-based techniques and address the research question, we introduce a new framework called ShadowMaskFormer. As shown in Figure 1(b), ShadowMaskFormer combines the transformer model with a shadow mask in patch embedding, presenting a novel approach to effectively remove shadows from images. Specifically, in the early processing stage (i.e., patch embedding), we propose a simple and effective patch embedding, namely Mask Augmented Patch Embedding (MAPE). MAPE is motivated by our observations on the limited utilization of masks in existing work, as detailed in Section 3.2. Therefore, in our ShadowMaskFormer, the shadow mask is carefully utilized in MAPE with two complementary binarization schemes (the $0/1$ and $-1/+1$ Binarization) to enhance the shadow region pixels, as detailed in Section 4.2. In conclusion, with the proposed MAPE, our ShadowMaskFormer can not only leverage the contextual information acquired through the learning capability of the transformer model but also begin restoring the shadow region pixels at an earlier stage. This approach allows the model to more effectively acquire the distinctive features of shadow regions and perform shadow removal more purposefully. It is noteworthy that MAPE operates as a single-layer module, primarily performing pixel-level operations, and it requires computation only once per training epoch, significantly reducing computational demands. To the best of our knowledge, our approach is the first exploration of utilizing patch embedding in vision transformer models for the task of shadow removal. Experimental results demonstrate that ShadowMaskFormer achieves outstanding shadow removal results on the three widely-used shadow removal datasets, surpassing the state-of-the-art performance.
Furthermore, compared with other SOTA methods, it has only 2.28M network parameters (Table 1), i.e., roughly half of CRFormer [12] and fewer than the others, as illustrated in Figure 1(c). The main contributions of this work are as follows:

$\bullet$ We propose an innovative transformer-based framework called ShadowMaskFormer, which incorporates mask information in the patch embedding stage (our implementation is available at https://www.github.com/lizhh268/ShadowMaskFormer.git). To the best of our knowledge, we are the first to explore the task of shadow removal from the perspective of patch embedding in vision transformer models.

$\bullet$ We propose the concept of a patch embedding tailored for shadow removal and further introduce the corresponding MAPE, which is simple and effective. By carefully utilizing the shadow mask with two complementary binarization schemes, MAPE effectively enhances the shadow region to assist in restoring each pixel without introducing any model parameters.

$\bullet$ Comprehensive experiments conducted on publicly available datasets, namely ISTD, ISTD+, and SRD, demonstrate the outstanding performance of our proposed method, placing it at the state-of-the-art level.

The remainder of this paper is structured as follows: Section 2 reviews related work in shadow removal and vision transformers. Section 3 introduces the task background and the motivation of our proposed method. Section 4 describes the detailed architecture and workflow of ShadowMaskFormer. Sections 5 and 6 present our method’s experimental results and ablation studies. Finally, Section 7 concludes our work and discusses future directions.

2 Related Work

In this section, we present prior works related to image shadow removal, briefly outline vision transformers, and discuss their application in the task of shadow removal.

2.1 Shadow Removal

In the field of image processing, shadows pose a pervasive challenge that has negative implications for various downstream tasks, e.g., object detection, tracking, and face recognition [14, 15, 16]. Consequently, shadow removal has been a fundamental task in the field of computer vision and has received extensive research attention over the years. Numerous methods have emerged for shadow removal in images. These methods can be broadly categorized into two flavors: traditional model-based approaches and deep learning-based approaches. Traditional model-based approaches rely on the physical models of shadow images, which are limited by their dependence on prior knowledge and often struggle to effectively remove shadows in real-world scenes [17, 18, 19]. For instance, [17] proposed a physical model based on constant illumination conditions; [20] proposed an optimization model for removing shadow under various illumination conditions.

In the recent past, deep learning-based (DL) methods have achieved excellent performance in the field of shadow removal, leveraging their end-to-end capabilities.

DL methods can be primarily categorized into two main approaches: convolutional neural network (CNN) based approaches and generative modeling based approaches, particularly those using generative adversarial networks (GANs). CNN-based approaches employ multi-level convolutions to extract context-based features for shadow removal [5, 7, 21, 22, 23, 24, 6, 25, 26]. For instance, DeshadowNet [5] utilizes image contextual information to learn shadow mask features and remove shadows. GAN-based approaches generate shadow mask images and shadow-free images that closely resemble reality by adhering to a series of criteria for discerning between real and fake data [9, 11, 8, 10, 27, 28, 29]. For instance, in DC-ShadowNet [28], three dedicated losses were defined to characterize shadow images for accurate shadow synthesis. Besides, the most recent work [30] predicts the adaptive weights between two features that are extracted from the shadow image and the shadow mask image. Our method differs from these works in that we define the problem as the integration of transformer models and shadow masks, which enables us to draw inspiration from existing mask utilization methods.

2.2 Vision Transformer

The transformer model was originally designed for natural language processing tasks [31] and has recently been adopted for vision tasks. Specifically, vision transformers segment images into patches and utilize them as inputs for the subsequent computation (i.e., multi-head attention) [32]. Owing to the ability to learn global contextual information, transformers have achieved remarkable performance in vision tasks [33, 34]. Apart from regular image classification [33] and segmentation [35], transformer models have also been applied to various low-level vision tasks, such as image restoration [36], colorization [37], and inpainting [38]. Differing from other image tasks, our objective is to propose a transformer-based framework for shadow removal. This allows us to design a module for utilizing shadow masks, achieving efficient shadow removal.

2.3 Transformer for Shadow Removal

Given the remarkable performance of transformer models in computer vision tasks, they have also been applied to shadow removal tasks [12, 13]. CRFormer [12] attempts to guide the restoration of shadow regions by leveraging non-shadow region information with shadow masks. However, it places excessive emphasis on non-shadow region information, resulting in insufficient model attention allocation to the crucial shadow regions and shadow masks. ShadowFormer [13] proposes a Shadow-Interaction Module Attention to exploit the global contextual correlation between shadow and non-shadow regions. It is worth noting that these transformer-based methods, like other DL-based methods, still rely on designing new modules within the transformer blocks to leverage shadow masks for the shadow removal task, as shown in Figure 1(a). In this work, we aim to explore the efficacy of incorporating shadow information during the early processing stage and propose a novel patch embedding module tailored to the shadow removal task.

3 Preliminaries

In this section, we will provide the essential background knowledge about DL models and image shadow removal. Furthermore, we will elucidate the shadow model in this work and its motivation.

3.1 Background

In conventional computer vision models, such as Convolutional Neural Networks (CNNs), an input image is processed as a whole by sliding a filter over the entire image to capture local patterns. In contrast, a vision transformer (ViT) model [32] does not directly operate on an input image; rather, it breaks down an input image into smaller, non-overlapping patches. Each patch is then linearly transformed into a fixed-dimensional vector representation for subsequent computation. This process is known as Patch Embedding.
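As a minimal sketch of the patch embedding step described above, the following dependency-free example splits a single-channel image into non-overlapping patches and applies one shared linear map to each flattened patch. The helper names (`split_into_patches`, `embed_patches`) and the toy weights are illustrative assumptions, not part of any ViT implementation.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches


def embed_patches(patches, weight, bias):
    """Apply the same linear map (weight: D x P, bias: D) to every flattened patch."""
    return [[sum(w * p for w, p in zip(row, patch)) + b
             for row, b in zip(weight, bias)]
            for patch in patches]


# Toy 4 x 4 image with 2 x 2 patches -> 4 patches of 4 pixels each.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
patches = split_into_patches(image, patch_size=2)

# Hypothetical projection to a 2-dimensional embedding:
# first output = sum of the patch, second output = top-left pixel.
weight = [[1, 1, 1, 1],
          [1, 0, 0, 0]]
bias = [0, 0]
tokens = embed_patches(patches, weight, bias)
print(tokens)
```

In a real ViT the projection weights are learned and the image has several channels; the sketch only illustrates that every patch becomes one fixed-dimensional token.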

The main computing backbone of a ViT model comprises $N$ sequentially connected transformer blocks. Each block comprises a multi-head attention (MHA), a feed-forward network (FFN), and LayerNorms [39].

3.2 Motivation

Shadow masks. With a steady stream of promising empirical results confirming the effectiveness of shadow masks for shadow removal [8, 11, 12, 13], we observe two patterns emerging from existing works: ❶ Shadow masks are typically represented as $0/1$ binary masks (i.e., $\textbf{M}\in\{0,1\}^{H\times W}$ with ${H\times W}$ being the size of an input image) where “ $1$ ” indicates shadow regions and “ $0$ ” indicates non-shadow regions [8, 23, 12, 13]; ❷ Shadow masks are primarily utilized by the main computing units of the models, e.g., convolution operations in CNNs [40, 23] and attention modules in ViTs [12, 13].

Regarding observation ❶, binarization to $0/1$ poses a potential risk of losing useful information as features corresponding to non-shadow regions will be completely suppressed if one directly applies M to input signals. Instead of designing alternative ways to indirectly apply M as done in most prior arts [8, 23, 12, 13], we opt for designing two complementary binarization schemes for direct utilization of M in this work.

In response to observation ❷, we re-visit the location within a deep learning model where M should be incorporated. Specifically, we seek to investigate the efficacy of incorporating M during the input preprocessing stage, i.e., the patch-embedding stage for ViT models, eliminating the need for repeated applications of M in every transformer block, which in turn leads to improved model efficiency.

Physical models of shadow removal. Based on the preceding works [41, 42, 43], shadows form because an occluder obstructs the direct illumination and a portion of the ambient illumination. This implies that a shadowed pixel ${I}_{x}^{shadow}$ will exhibit diminished intensity compared to its corresponding shadow-free pixel ${I}_{x}^{shadow-free}$ . Following [21, 43], we start from the original shadow illumination model, which describes a mapping function $T$ that transforms a shadow pixel $I^{shadow}_{x}$ to its shadow-free counterpart $I^{shadow-free}_{x}$ . This mapping can be summarized as a linear function, and the intensity of a lit pixel is formulated as:

$$I^{shadow-free}_{x}(\lambda)=L^{d}_{x}(\lambda)R_{x}(\lambda)+L^{a}_{x}(\lambda)R_{x}(\lambda) \tag{1}$$

where $\lambda$ is the wavelength and $I^{shadow-free}_{x}$ is the intensity reflected from an image pixel, $L$ and $R$ are the illumination and reflectance respectively, $L^{d}$ and $L^{a}$ denote the direct illumination and the ambient illumination, respectively. A more detailed implementation of the shadow physics model can be found in the Appendix. For a real shadow scene, an occluder blocks the direct illumination $L^{d}$ and part of the ambient illumination $L^{a}$ , thus the shadowed pixel can be represented as:

$$I^{shadow}_{x}(\lambda)=a_{x}(\lambda)L^{a}_{x}(\lambda)R_{x}(\lambda) \tag{2}$$

where $a_{x}(\lambda)$ is the attenuation factor indicating the remaining fraction of $L^{a}$ that arrives at an image point $x$ .

From Eqs. 1 and 2, the shadow-free pixel can also be expressed as follows:

$$I^{shadow-free}_{x}(\lambda)=L^{d}_{x}(\lambda)R_{x}(\lambda)+a_{x}(\lambda)^{-1}I^{shadow}_{x}(\lambda) \tag{3}$$
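For completeness, the step from Eqs. 1 and 2 to Eq. 3 can be written out as follows. This is a short worked derivation that additionally assumes $a_{x}(\lambda)\neq 0$ so that the attenuation factor can be inverted.

```latex
\begin{align*}
I^{shadow}_{x}(\lambda)
  &= a_{x}(\lambda)\,L^{a}_{x}(\lambda)\,R_{x}(\lambda)
  && \text{(Eq. 2)} \\
\Rightarrow\quad
L^{a}_{x}(\lambda)\,R_{x}(\lambda)
  &= a_{x}(\lambda)^{-1}\,I^{shadow}_{x}(\lambda)
  && \text{(divide by } a_{x}(\lambda)\text{)} \\
\Rightarrow\quad
I^{shadow-free}_{x}(\lambda)
  &= L^{d}_{x}(\lambda)\,R_{x}(\lambda)
   + a_{x}(\lambda)^{-1}\,I^{shadow}_{x}(\lambda)
  && \text{(substitute into Eq. 1)}
\end{align*}
```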

Furthermore, following [21], we can establish a mapping function between ${I}_{x}^{shadow}$ and ${I}_{x}^{shadow-free}$ . This mapping expresses the shadow-free pixel as a linear function of the shadowed pixel.

$${I}_{x}^{shadow-free}(k)=w(k)\times{I}_{x}^{shadow}(k)+b_{k} \tag{4}$$

where $k$ represents the color channel ( $k\in\{R,G,B\}$ ), $b_{k}$ is the response of the camera to direct illumination, and $w(k)$ is responsible for the attenuation factor of the ambient illumination at this pixel in this color channel. Additionally, $w=[w_{R},w_{G},w_{B}]$ and $b=[b_{R},b_{G},b_{B}]$ are constant across all pixels $x$ in the umbra area of the shadow. The crucial aspect of the learned mapping by the model lies in determining the parameters $w$ and $b$ for individual shadowed pixels. Inspired by this, we further assume the expression of Eq. 4 can be reformulated as follows:

$${I}_{x}^{shadow-free}(k)=S(k)\times{I}_{x}^{shadow}(k)$$

In the above expression, the gain factor $S(k)$ in our method is derived from the assumption underlying Eq. 4, and $S(k)$ will be learned by the model for the shadowed pixel ${I}_{x}^{shadow}(k)$ . Specifically, considering the natural attribute of shadows where the attenuation of light sources results in shadow pixel values significantly lower than those of non-shadowed pixels, we assume that in Eq. 4, $w(k)$ will be noticeably greater than 1, and $b_{k}$ will be small compared to ${I}_{x}^{shadow}(k)\times w(k)$ . The term ${I}_{x}^{shadow}(k)$ appears to be roughly equivalent to the body reflection. Hence, the crux of shadow removal lies in ensuring that the gain factor $S(k)$ learned by the model closely approximates the correct solution. In this paper, we perform a preliminary exploration of the gain factor $S(k)$ during the patch embedding stage. This approach allows the model to enhance shadow removal performance more effectively at an early stage, thereby reducing unnecessary model exploration. Specifically, to address the task of shadow removal, we introduce a simple and effective patch embedding, namely Mask Augmented Patch Embedding (MAPE).
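To make the relation between the linear mapping of Eq. 4 and the gain factor $S(k)$ concrete, the sketch below compares the exact gain, $S = w + b/I^{shadow}$, with the approximation $S \approx w$ that holds when $b_{k}$ is small. The numeric values of `w`, `b`, and `i_shadow` are hypothetical and chosen only for illustration.

```python
def shadow_free_linear(i_shadow, w, b):
    """Eq. 4: linear mapping from a shadowed pixel to its shadow-free value."""
    return w * i_shadow + b


def exact_gain(i_shadow, w, b):
    """Gain S such that S * i_shadow reproduces Eq. 4 exactly: S = w + b / i_shadow."""
    return w + b / i_shadow


# Hypothetical values for one color channel of one shadowed pixel.
w, b, i_shadow = 2.5, 5.0, 40.0

target = shadow_free_linear(i_shadow, w, b)   # 2.5 * 40 + 5 = 105
gain = exact_gain(i_shadow, w, b)             # 2.5 + 5 / 40 = 2.625
approx = w * i_shadow                         # dropping b gives 100

relative_error = abs(target - approx) / target
print(f"exact gain S = {gain:.3f}, approximation w = {w}")
print(f"relative error from dropping b: {relative_error:.1%}")
```

The smaller $b_{k}$ is relative to $w(k)\times{I}_{x}^{shadow}(k)$, the closer the pure gain model is to the linear model.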

Algorithm 1 Mask Augmented Patch Embedding (MAPE)

    # x: shadow images (values in [0, 255])
    # mask: shadow mask images (values in {0, 255})
    # w1, w2: weight factors for shadow / non-shadow pixels
    # proj: Linear Projection
    import torch.nn as nn

    class MAPE(nn.Module):
        def __init__(self, patch_size, in_chans, embed_dim, kernel_size):
            super().__init__()
            # PatchEmbed is not defined in this listing; a strided Conv2d
            # stands in for it here (an assumption).
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=kernel_size,
                                  stride=patch_size,
                                  padding=kernel_size // 2)

        def forward(self, x, mask, w1, w2):
            x = (x / 255) * 2 - 1       # normalize the image to [-1, 1]
            Ms = mask / 255             # 0/1 mask
            # the first step of mask augmentation
            Ts = (Ms * w1 + (1 - Ms) * w2) * x
            Mp = (mask / 255) * 2 - 1   # -1/+1 mask
            # the second step of mask augmentation
            Tm = Mp * Ts
            F = self.proj(Tm)
            return F

4 ShadowMaskFormer

In this section, we will elucidate the overarching workflow framework of the model, with a particular emphasis on detailing the implementation of MAPE.

4.1 Overview

An overview of the proposed transformer with mask augmented in the patch embedding stage (ShadowMaskFormer) is depicted in Figure 2. This approach employs two types of shadow masks to enhance the shadow region pixels. To achieve this, in the early processing stage of the model, ShadowMaskFormer integrates a Mask Augmented Patch Embedding (MAPE) specifically designed for shadow removal.

Specifically, in the patch embedding stage, MAPE takes the shadow image $\textbf{I}_{s}$ and its corresponding shadow mask M as inputs. The main idea behind MAPE is to enhance the shadow region pixels in the early stage by utilizing the information of the M and $\textbf{I}_{s}$ . In ShadowMaskFormer, the determination of $S(k)$ is achieved through two forms of mask utilization: initialization at the early stage and adaptive refinement through model exploration. After MAPE, the transformer blocks, based on the vision transformer, leverage its powerful ability to learn contextual information. It aims to learn a nonlinear mapping function $f(\textbf{I}_{s},\textbf{M};\theta)$ from $\textbf{I}_{s}$ to $\textbf{I}_{gt}$ . By applying this mapping function, the model reconstructs the shadow-free image J which represents the image after shadow removal.

4.2 Mask Augmented Patch Embedding

Based on the discussions above, aiming to find an accurate approximation of the gain factor $S(k)$ for shadow region pixels early in the model’s training, we introduce the concept of Mask Augmented Patch Embedding (MAPE), which is based on the two complementary binarization schemes (the $0/1$ and $-1/+1$ Binarization) as depicted in Figure 2. In particular, to ensure that the non-linear mapping function $f(\textbf{I}_{s},\textbf{M};\theta)$ learned by the model explores the correct $S(k)$ early and avoids exploring information that is useless for shadow region enhancement, we position MAPE in the early stage of the model. In the patch embedding stage, the shadow image $\textbf{I}_{s}$ and its corresponding shadow mask M are first used as inputs to the model. Subsequently, the $0/1$ and $-1/+1$ binarization operations are performed pixel-wise on M, resulting in two complementary masks $\textbf{M}_{s}$ and $\textbf{M}_{p}$ , respectively. These operations can be expressed using the following formulas:

$$\textbf{M}_{s}=\frac{\textbf{M}}{255} \tag{5}$$

$$\textbf{M}_{p}=\frac{\textbf{M}}{255}\cdot 2-1 \tag{6}$$

After applying binarization, pixels corresponding to the shadow regions of $\textbf{M}_{s}$ are set to 1, while the non-shadow regions are set to 0. $\textbf{M}_{p}$ undergoes pixel-wise operations to adjust the pixel distribution to the range [-1, 1].
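Eqs. 5 and 6 can be illustrated with a small dependency-free sketch. It assumes, as the division by 255 suggests, that the mask is stored with 255 for shadow pixels and 0 for non-shadow pixels; the function names are illustrative.

```python
def binarize_01(mask):
    """Eq. 5: map a 0/255 mask to 0/1 (1 = shadow, 0 = non-shadow)."""
    return [[m / 255 for m in row] for row in mask]


def binarize_pm1(mask):
    """Eq. 6: map a 0/255 mask to -1/+1 (+1 = shadow, -1 = non-shadow)."""
    return [[(m / 255) * 2 - 1 for m in row] for row in mask]


# Toy 2 x 3 mask: 255 marks shadow pixels, 0 marks non-shadow pixels.
mask = [[0, 255, 255],
        [0, 0, 255]]

M_s = binarize_01(mask)    # [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
M_p = binarize_pm1(mask)   # [[-1.0, 1.0, 1.0], [-1.0, -1.0, 1.0]]
```

The two masks carry the same region information; they differ only in whether non-shadow pixels are zeroed out ($0/1$) or sign-flipped ($-1/+1$) when multiplied with an image.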

First, we use a set of $0/1$ masks $\textbf{M}_{s}$ to learn weighted masks for adjusting the intensity of pixels from the shadow (S) and non-shadow (NS) regions, which in essence provides an initial estimation of $S(k)$ . Thus we can obtain the first enhanced shadow image $\textbf{T}_{s}$ . Subsequently, the region information carried by $\textbf{M}_{p}$ is combined with $\textbf{T}_{s}$ through further operations, yielding the feature $\textbf{T}_{m}$ with element-wise multiplication. Finally, we employ linear projection on the $\textbf{T}_{m}$ , introducing linear transformation features. These operations of mask-augmented patch embedding can be expressed as follows.

$$\textbf{T}_{s}=(w1\cdot\textbf{M}_{s}+w2\cdot(\textbf{1}-\textbf{M}_{s}))\cdot\textbf{I}_{s} \tag{7}$$

$$\textbf{T}_{m}=\textbf{M}_{p}\cdot\textbf{T}_{s} \tag{8}$$

$$\textbf{F}=LinearProjection(\textbf{T}_{m}) \tag{9}$$

where $w1$ and $w2$ are the weights for region reassignment, F is the output of MAPE, and $LinearProjection$ denotes a convolutional layer with a kernel size of 3 × 3. Note that, motivated by the performance and efficiency gains obtained by introducing convolutions into ViT [44], we opt for a convolutional projection rather than a positional projection.

The mask augmentation process can be viewed as two steps, and its pseudocode is summarized in Algorithm 1. In the first step, we reassign the pixel information in the shadow and non-shadow regions using the region reassignment weights $w1$ and $w2$ , respectively, together with the $0/1$ mask. This can be expressed as $(w1\cdot\textbf{M}_{s}+w2\cdot(1-\textbf{M}_{s}))\cdot\textbf{I}_{s}$ , which means $S(k)=w1$ . Note that, to enhance the shadow region pixels, $w1$ must be set greater than $w2$ . In our experiments, we empirically set $w1$ and $w2$ to 2.5 and 1, respectively. In addition, we conducted a feature analysis of shadow images. For the ISTD dataset, for example, the results indicate that after the $-1/+1$ processing, the proportion of positive values in the non-shadow regions reaches 70%, whereas in the shadow regions negative values overwhelmingly dominate, reaching 97% of the total values. As shadow region pixel values tend to be low, the majority of shadow region pixels of $\textbf{T}_{s}$ are negative, which contrasts with the mostly positive pixels in the non-shadow region. Therefore, the second step is introduced to further enhance shadow region pixels by adaptive refinement. We use $-1/+1$ masks to re-balance the distribution of the normalized pixel intensity from the shadow and non-shadow regions, where most shadow pixels are negative-valued while most non-shadow pixels are positive-valued. This step essentially aims to make the restored shadow regions closely resemble non-shadow regions. Finally, we apply linear projection to $\textbf{T}_{m}$ , transforming it into the input for the transformer block. At this point, $S(k)=LinearProjection(w1;k)$ . Note that the transformer blocks also simultaneously explore the genuine $S(k)$ values for each pixel during the training phase.
In addition, MAPE avoids intricate modifications to the transformer blocks and eliminates the involvement of shadow masks in the computation of every transformer block, leading to noticeable computational savings.
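The effect of the two steps on pixel signs can be traced with a single-pixel sketch of Eqs. 5 to 8. The weights $w1=2.5$ and $w2=1$ are the values reported above; the two example pixel intensities are hypothetical, and the final linear projection is omitted.

```python
def mape_pixel(pixel, mask, w1=2.5, w2=1.0):
    """Apply Eqs. 5-8 to one pixel; pixel and mask are in [0, 255]."""
    x = (pixel / 255) * 2 - 1              # normalize intensity to [-1, 1]
    m_s = mask / 255                       # 0/1 mask (Eq. 5)
    m_p = m_s * 2 - 1                      # -1/+1 mask (Eq. 6)
    t_s = (w1 * m_s + w2 * (1 - m_s)) * x  # step 1: region reassignment (Eq. 7)
    t_m = m_p * t_s                        # step 2: sign re-balancing (Eq. 8)
    return x, t_s, t_m


# A dark pixel inside the shadow (mask = 255) and a bright pixel outside it (mask = 0).
shadow = mape_pixel(pixel=51, mask=255)     # x ~ -0.6, t_s ~ -1.5, t_m ~ -1.5
non_shadow = mape_pixel(pixel=204, mask=0)  # x ~ +0.6, t_s ~ +0.6, t_m ~ -0.6

for name, (x, t_s, t_m) in [("shadow", shadow), ("non-shadow", non_shadow)]:
    print(f"{name:>10}: x={x:+.2f}  T_s={t_s:+.2f}  T_m={t_m:+.2f}")
```

In this toy case the two pixels have opposite signs after step 1 and the same sign after step 2, which is one way to read the re-balancing described above.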

4.3 Loss Function

Following prior shadow removal works [21, 24, 12], we only adopt the image consistency loss $\mathcal{L}_{1}$ , which is mathematically defined as follows:

$$\mathcal{L}_{1}=||\ \mathbf{J}-\mathbf{I}_{gt}\ ||_{1} \tag{10}$$

where $\mathbf{J}$ and $\mathbf{I}_{gt}$ are the predicted shadow-free image and the ground truth, respectively.
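A minimal sketch of Eq. 10 on toy images follows. It averages the absolute differences over pixels, which is a common implementation choice and an assumption here, since Eq. 10 itself only specifies the $\ell_{1}$ norm.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized images (lists of rows)."""
    diffs = [abs(p - t)
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


# Toy 2 x 2 prediction J and ground truth I_gt with intensities in [0, 1].
J = [[0.5, 0.25],
     [1.0, 0.0]]
I_gt = [[0.5, 0.75],
        [0.5, 0.0]]

loss = l1_loss(J, I_gt)   # (0 + 0.5 + 0.5 + 0) / 4 = 0.25
```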

5 Experiments

This section briefly describes our experimental setup followed by results on multiple shadow removal benchmarks.

Table 1: The RMSE results of shadow regions (S), non-shadow regions (NS), and whole images (All) on the ISTD dataset. Our results are in shades and the best result in each section is in bold. ^† indicates that the results are evaluated under an input size of $400\times 400$.

Input size	Method	Params (M)	S RMSE $\downarrow$	NS RMSE $\downarrow$	All RMSE $\downarrow$
256 $\times$ 256	Input Image	-	32.1	7.09	10.9
	Guo et al. [45]	-	18.7	7.76	9.26
	MaskShadow-GAN [11]	13.8	12.7	6.68	7.41
	ST-CGAN [8]	31.8	9.99	6.05	6.65
	DSC [7]	22.3	8.72	5.04	5.59
	G2R [29]	22.8	10.7	7.55	7.85
	DHAN [10]	21.8	7.49	5.30	5.66
	Fu et al. [6]	187	7.91	5.51	5.88
	DC-ShadowNet [28]	21.2	11.4	5.81	6.57
	Zhu et al. [23]	10.1	8.29	4.55	5.09
	CRFormer^† [12]	4.89	7.32	5.82	6.07
	ShadowFormer [13]	2.40	6.16	3.90	4.27
	MAPE (ours)	2.28	6.08	3.86	4.23
480 $\times$ 640	Input Image	-	33.23	7.25	11.4
	ARGAN [9]	-	9.21	6.27	6.63
	DHAN [10]	21.8	8.13	5.94	6.29
	CANet [22]	358	8.86	6.07	6.15
	ShadowFormer [13]	2.40	6.93	4.59	4.96
	MAPE (ours)	2.28	6.83	4.50	4.87

Table 2: The RMSE results of shadow regions (S), non-shadow regions (NS), and whole images (All) on the ISTD+ dataset. Our results are in shades and the best result in each section is in bold. ^† indicates that the results are evaluated under an input size of $400\times 400$.

Method	Params (M)	S RMSE $\downarrow$	NS RMSE $\downarrow$	All RMSE $\downarrow$
Input Image	-	40.2	2.6	8.5
DHAN [10]	21.8	11.2	7.1	7.8
Param-Net [25]	-	9.7	2.9	4.1
G2R [29]	22.8	8.8	2.9	3.9
SP+M-Net [21]	141	7.9	2.8	3.6
Fu et al. [6]	187	6.6	3.8	4.2
SG-ShadowNet^† [24]	6.20	5.9	2.9	3.4
ShadowFormer [13]	2.40	5.4	2.4	2.8
AdaptiveFusionNetwork [30]	23.9	5.9	2.9	3.4
MAPE (ours)	2.28	5.4	2.2	2.7

5.1 Experimental Setup

Datasets. We evaluate our method on three widely-used benchmarks for shadow removal: (1) ISTD dataset [8] comprises 1,870 image triplets (i.e., shadow images, shadow-free images, and shadow masks), from which 1,330 and 540 triplets are used for training and testing, respectively; (2) Adjusted ISTD (ISTD+) dataset [21] reduces illumination inconsistency between shadow and shadow-free images using an image processing algorithm while retaining the same number of triplets as the original ISTD dataset; (3) SRD dataset [5] comprises 2,680 training and 408 testing pairs of shadow and shadow-free images without ground truth shadow masks. Accordingly, for SRD we adopt the predicted masks provided by [10] for training and evaluation.

Baselines. Towards obtaining a representative evaluation of our method, we consider a wide range of baselines, including a traditional method [45], generative modeling-assisted approaches [11, 8, 29, 10, 9, 30], two transformer-based approaches [12, 13], among others [7, 28, 22].

Evaluation metrics. We use root mean square error (RMSE) between the estimated shadow-free images $\mathbf{J}$ and the ground truth $\mathbf{I}_{gt}$ in the LAB color space, where lower values indicate better results. We also use the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [46] metrics to quantitatively compare the performance of various methods in the RGB color space, where higher values indicate better results.
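The region-wise RMSE and the PSNR can be sketched as follows on flattened pixel lists. This is a simplified illustration under stated assumptions: the LAB color conversion used for RMSE and the SSIM metric are omitted, and the helper names and toy values are hypothetical.

```python
import math


def rmse(pred, target, region=None):
    """Root mean square error over pixels where region is truthy (all pixels if None)."""
    squared = [(p - t) ** 2
               for i, (p, t) in enumerate(zip(pred, target))
               if region is None or region[i]]
    return math.sqrt(sum(squared) / len(squared))


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = rmse(pred, target) ** 2
    return float("inf") if mse == 0 else 10 * math.log10(max_value ** 2 / mse)


# Toy flattened images; the first two pixels lie in the shadow region.
pred = [10, 20, 30, 40]
target = [13, 24, 30, 40]
shadow = [1, 1, 0, 0]
non_shadow = [not s for s in shadow]

rmse_all = rmse(pred, target)               # sqrt((9 + 16) / 4) = 2.5
rmse_s = rmse(pred, target, shadow)         # sqrt((9 + 16) / 2)
rmse_ns = rmse(pred, target, non_shadow)    # 0.0
psnr_all = psnr(pred, target)
```

Restricting the error to the shadow mask or its complement is what yields the separate S and NS columns reported in the tables.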

Implementation details. We implement our method on top of a recent image restoration framework [47] with Swin Transformer [48]. For training, we use the AdamW [49] optimizer with a batch size of one and an initial learning rate of $2\times 10^{-4}$ , which is annealed to zero in 300 epochs following the cosine schedule [50]. All experiments are carried out on two NVIDIA GeForce RTX 3090 GPUs.
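The learning-rate schedule described above can be sketched with the standard cosine annealing formula. Updating the rate once per epoch (rather than per iteration) and the function name `cosine_lr` are assumptions made for illustration.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=2e-4, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no warm restarts)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for epoch in (0, 75, 150, 225, 300):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

The rate starts at $2\times 10^{-4}$, passes through half that value at epoch 150, and reaches zero at epoch 300.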

Table 3: Comparison of shadow removal performance, measured by PSNR, SSIM, and RMSE metrics, on the SRD dataset. Our results are in shades and the best result in each section is in bold. “-” indicates that the information is not available publicly.

Method	Params (M)	S PSNR $\uparrow$	S SSIM $\uparrow$	S RMSE $\downarrow$	NS PSNR $\uparrow$	NS SSIM $\uparrow$	NS RMSE $\downarrow$	All PSNR $\uparrow$	All SSIM $\uparrow$	All RMSE $\downarrow$
Input Image	-	18.96	0.871	36.69	31.47	0.975	4.83	18.19	0.830	14.05
DeshadowNet [5]	-	-	-	11.78	-	-	4.84	-	-	6.64
DSC [7]	22.3	30.65	0.960	8.62	31.94	0.956	4.41	27.76	0.903	5.71
DHAN [10]	21.8	33.67	0.978	8.94	34.79	0.979	4.80	30.51	0.949	5.67
Fu et al. [6]	187	32.26	0.966	9.55	31.87	0.945	5.74	28.40	0.893	6.50
DC-ShadowNet [28]	21.2	34.00	0.975	7.70	35.53	0.981	3.65	31.53	0.955	4.65
CANet [22]	358	-	-	7.82	-	-	5.88	-	-	5.98
SG-ShadowNet [24]	6.20	-	-	7.53	-	-	2.97	-	-	4.23
Zhu et al. [23]	10.1	34.94	0.980	7.44	35.85	0.982	3.74	31.72	0.952	4.79
CRFormer [12]	4.89	-	-	7.14	-	-	3.15	-	-	4.25
ShadowFormer [13]	2.40	36.13	0.988	6.05	35.95	0.986	3.55	32.38	0.955	4.09
MAPE (ours)	2.28	37.71	0.988	5.55	38.23	0.984	2.98	34.43	0.968	3.64

5.2 Experimental Results

Results on ISTD. Table 1 summarizes the results on the ISTD dataset. In general, we observe that our method consistently outperforms all considered baselines on different regions (i.e., shadow and non-shadow regions) under both input resolution settings (i.e., $256\times 256$ and $480\times 640$ ) with fewer model parameters. In particular, our method achieves 1.6 ${\times}$ lower RMSE (averaged over whole images) with 9.3 ${\times}$ fewer parameters than DC-ShadowNet [28]. Our method also achieves 1.4 ${\times}$ lower RMSE (averaged over whole images) with 2.1 ${\times}$ fewer parameters than CRFormer [12]. Our method is also all-around competitive against ShadowFormer [13].
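The ratios quoted above follow directly from the $256\times 256$ rows of Table 1, as the short check below reproduces. The dictionary layout is only an illustrative way of organizing the table values.

```python
# (params in M, RMSE on whole images) at 256 x 256, copied from Table 1.
table1 = {
    "DC-ShadowNet": (21.2, 6.57),
    "CRFormer": (4.89, 6.07),
    "ShadowFormer": (2.40, 4.27),
    "MAPE (ours)": (2.28, 4.23),
}

ours_params, ours_rmse = table1["MAPE (ours)"]

# For each baseline: (RMSE ratio, parameter ratio) relative to our method.
ratios = {
    name: (round(rmse / ours_rmse, 1), round(params / ours_params, 1))
    for name, (params, rmse) in table1.items()
    if name != "MAPE (ours)"
}

for name, (rmse_ratio, param_ratio) in ratios.items():
    print(f"{name}: {rmse_ratio}x RMSE, {param_ratio}x parameters")
```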

Additional qualitative comparisons are provided in Figure 3. In general, the empirical improvements of our method (Table 1) translate well to the visual results. Our method significantly outperforms existing approaches in removing shadows from complex scenes (i.e., the second row), achieving a more natural transition between shadow and non-shadow regions. For example, SP+M-Net [21] tends to over-process the non-shadow regions, e.g., the results in the second and third rows of the fourth column. DC-ShadowNet [28] fails to remove the shadows and also alters the non-shadow regions, as shown in the third row of the third column.

Results on ISTD+. Table 2 summarizes the results on the ISTD+ dataset. In general, we observe that (i) all methods exhibit better performance than that on the original ISTD dataset [8] after adjustments of illumination inconsistency introduced by [21]; (ii) our method continues to outperform all the considered baseline competitors as indicated by RMSE while using fewer model parameters. Specifically, our method achieves better RMSE on shadow, non-shadow, and all regions than both SG-ShadowNet [24] and ShadowFormer [13].

Results on SRD. Table 3 presents the comparison results on the SRD [5] dataset. Evidently, we observe that our method achieves significantly better results on the overall image than all considered baselines. For instance, our method yields 1.58dB to 2.28dB PSNR improvements on shadow, non-shadow, and all regions over ShadowFormer [13] with a similar number of model parameters. In addition, we observe that our method also achieves noticeably better RMSE than the baselines. Specifically, our method achieves 1.3 $\times$ and 1.2 $\times$ lower RMSE than CRFormer [12] on shadow and all regions, respectively, while using 2.1 $\times$ fewer model parameters.
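The figures quoted in this paragraph can be re-derived from Table 3. The short sketch below is only an arithmetic check on the tabulated values; the variable names are illustrative and nothing is measured anew.

```python
# (PSNR, RMSE) per region, copied from Table 3.
ours = {"S": (37.71, 5.55), "NS": (38.23, 2.98), "All": (34.43, 3.64)}
shadowformer = {"S": (36.13, 6.05), "NS": (35.95, 3.55), "All": (32.38, 4.09)}
# CRFormer reports RMSE only.
crformer_rmse = {"S": 7.14, "NS": 3.15, "All": 4.25}
# Parameter counts in millions.
params_m = {"CRFormer": 4.89, "ours": 2.28}

# PSNR gain over ShadowFormer in dB, per region.
psnr_gain = {r: round(ours[r][0] - shadowformer[r][0], 2) for r in ours}
# How many times lower our RMSE is than CRFormer's, per region.
rmse_ratio = {r: round(crformer_rmse[r] / ours[r][1], 2) for r in ours}
# How many times fewer parameters we use than CRFormer.
param_ratio = round(params_m["CRFormer"] / params_m["ours"], 2)

print("PSNR gain over ShadowFormer (dB):", psnr_gain)
print("RMSE ratio CRFormer / ours:", rmse_ratio)
print("Parameter ratio CRFormer / ours:", param_ratio)
```

The PSNR gains span 1.58 dB (shadow) to 2.28 dB (non-shadow), and the RMSE and parameter ratios round to the 1.3, 1.2, and 2.1 factors stated in the text.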

Figure 4 depicts qualitative comparisons between our method and the baselines. We observe that our method not only better eliminates artifacts while removing shadows but also effectively restores the content and color of the shadow regions.

6 Ablation Studies

In this section, we conduct a detailed analysis of each component of our MAPE framework, with the number of transformer blocks N set to 5. We conduct ablation studies on the ISTD dataset [8] to evaluate the individual impact of these components. We argue that the mask deep fusion approach efficiently utilizes the shadow mask and provides useful and sufficient information about shadow regions.

Ablation Study on MAPE. As summarized in Table 4, we first present the performance of the regular patch embedding [51], which is the de facto choice of image preprocessing adopted by most existing transformer-based works. Our mask-augmented patch embedding (MAPE) leads to significantly better shadow removal performance as measured by RMSE. Compared to the conventional patch embedding, MAPE effectively utilizes the regional information provided by the shadow mask and conveys this information to the entire model. We then quantify the contribution of the proposed $-1/+1$ binarization scheme by replacing it with the regular 0/1 binary mask (i.e., $\textbf{M}_{p}$ $\xrightarrow{}$ $\textbf{M}_{s}$); the resulting degradation shows the effectiveness of $\textbf{M}_{p}$. This result demonstrates that the combination of the two shadow masks designed in our MAPE is significantly more effective for shadow removal than using only a single, common 0/1 mask ($\textbf{M}_{s}$). In other words, the success of MAPE relies on the combined use of the two masks. In contrast, applying shadow masks directly within the patch embedding can degrade performance in shadow regions even relative to the original patch embedding (8.03 vs. 7.59 RMSE), as shown in Table 4.
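To make the distinction between the two masks concrete, the sketch below converts a 0/1 mask into a $-1/+1$ mask. It assumes that shadow pixels are marked 1 in $\textbf{M}_{s}$ and map to $+1$ in $\textbf{M}_{p}$, with non-shadow pixels mapping to $-1$; this polarity and the helper name `to_signed_mask` are assumptions for illustration. How MAPE combines the two masks inside the patch embedding is defined in the method section and is not reproduced here.

```python
def to_signed_mask(mask_s):
    """Map a 0/1 mask (nested lists) to a -1/+1 mask.

    Assumption: 1 marks a shadow pixel, so shadow -> +1 and non-shadow -> -1.
    """
    return [[2 * v - 1 for v in row] for row in mask_s]

# Toy 2x3 shadow mask: 1 = shadow, 0 = non-shadow.
m_s = [[0, 0, 1],
       [0, 1, 1]]
m_p = to_signed_mask(m_s)
print(m_p)  # [[-1, -1, 1], [-1, 1, 1]]
```

Unlike the 0/1 mask, the signed mask carries a non-zero value at every pixel, so non-shadow positions are explicitly marked rather than zeroed out.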

Table 4: Comparison among the de facto patch embedding (PE), our proposed MAPE, and a variant of MAPE on the ISTD dataset. Differences relative to MAPE are shown in parentheses.

| Method | S RMSE $\downarrow$ | NS RMSE $\downarrow$ | All RMSE $\downarrow$ |
|---|---|---|---|
| Original PE | 7.59 (+1.5) | 4.91 (+1.1) | 5.35 (+1.1) |
| MAPE | 6.08 (+0.0) | 3.86 (+0.0) | 4.23 (+0.0) |
| $\textbf{M}_{p}$ $\xrightarrow{}$ $\textbf{M}_{s}$ in MAPE | 8.03 (+2.0) | 4.49 (+0.6) | 4.95 (+0.7) |
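The parenthesized values in Table 4 are consistent with absolute RMSE differences to the MAPE row, rounded to one decimal place. The sketch below reproduces them with exact decimal arithmetic; the half-up rounding mode is an assumption that happens to match every entry, and the helper name `diff_vs_mape` is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# RMSE values (S, NS, All) copied from Table 4, kept as strings for exact decimals.
rows = {
    "Original PE": ("7.59", "4.91", "5.35"),
    "MAPE": ("6.08", "3.86", "4.23"),
    "M_p -> M_s in MAPE": ("8.03", "4.49", "4.95"),
}

def diff_vs_mape(name):
    """Absolute RMSE difference to the MAPE row, rounded half up to one decimal."""
    return tuple(
        (Decimal(v) - Decimal(ref)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        for v, ref in zip(rows[name], rows["MAPE"])
    )

# Print the signed differences for each row, in S / NS / All order.
for name in rows:
    print(name, [f"{d:+}" for d in diff_vs_mape(name)])
```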

The effect of the model’s variants. Our method, MAPE, is built upon the vision transformer model and is sensitive to the model configuration. To investigate the impact of varying transformer block settings, we conducted experiments with three distinct model configurations on the ISTD dataset. The results are summarized in Table 5. We observed that increasing the number of model parameters from 1.94M to 2.52M improves overall shadow removal performance, reducing the whole-image RMSE from 4.37 to 4.20, although the shadow-region RMSE is lowest for the middle configuration. While higher model capacity improves performance, it also increases computational time, which can be burdensome. We therefore selected the middle configuration to strike a balance between performance and computational efficiency.

Table 5: Comparison of the model’s variants on the ISTD dataset [8].

| Model Size | Params (M) | S RMSE $\downarrow$ | NS RMSE $\downarrow$ | All RMSE $\downarrow$ |
|---|---|---|---|---|
| Small | 1.94 | 6.61 | 3.93 | 4.37 |
| Middle | 2.28 | 6.08 | 3.86 | 4.23 |
| Large | 2.52 | 6.27 | 3.80 | 4.20 |
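One way to read the trade-off in Table 5 is to look at the whole-image RMSE reduction obtained per additional million parameters when moving from one configuration to the next. The sketch below performs this calculation on the tabulated values only; the helper name `marginal_gains` is illustrative, and the metric itself is not one the paper reports.

```python
# (name, parameters in millions, whole-image RMSE), copied from Table 5.
variants = [("Small", 1.94, 4.37), ("Middle", 2.28, 4.23), ("Large", 2.52, 4.20)]

def marginal_gains(rows):
    """RMSE reduction per extra million parameters between consecutive variants."""
    gains = []
    for (n0, p0, r0), (n1, p1, r1) in zip(rows, rows[1:]):
        gains.append((f"{n0} -> {n1}", (r0 - r1) / (p1 - p0)))
    return gains

for step, gain in marginal_gains(variants):
    print(f"{step}: {gain:.3f} RMSE reduction per extra million parameters")
```

By this measure the step from Small to Middle yields roughly 0.4 RMSE per additional million parameters, while the step from Middle to Large yields roughly 0.1, which is in line with choosing the middle configuration.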

7 Conclusion

In this paper, we introduce ShadowMaskFormer, a novel transformer-based framework that utilizes the shadow mask early in the pipeline for efficient shadow removal. In ShadowMaskFormer, we explore a simple and effective way of enhancing shadow-region pixels during the patch embedding stage, enabling the model to restore shadow images. Specifically, after analyzing the characteristics of previous methods, we develop the mask-augmented patch embedding through careful use of the shadow mask. This approach encourages the model to explore the optimal shadow pixel gain factors as early as possible. Experimental results on the ISTD, ISTD+, and SRD datasets demonstrate that ShadowMaskFormer achieves outstanding performance compared to state-of-the-art methods while using fewer network parameters. In the future, we will explore further shadow removal methods based on the physical parameters of the shadow model, to enhance the interpretability of the model and its adaptability to different scenes.

References

[1] S. Calderon-Ramirez, S. Yang, and D. Elizondo, “Semisupervised deep learning for image classification with distribution mismatch: A survey,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 6, pp. 1015–1029, 2022.
[2] R. C. Aralikatti, S. J. Pawan, and J. Rajan, “A dual-stage semi-supervised pre-training approach for medical image segmentation,” IEEE Transactions on Artificial Intelligence, vol. 5, no. 2, pp. 556–565, 2024.
[3] S. Huang, C. He, and R. Cheng, “Sologan: Multi-domain multimodal unpaired image-to-image translation via a single generative adversarial network,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 5, pp. 722–737, 2022.
[4] A. Esmaeilzehi, M. O. Ahmad, and M. Swamy, “Ultralight-weight three-prior convolutional neural network for single image super resolution,” IEEE Transactions on Artificial Intelligence, vol. 4, no. 6, pp. 1724–1738, 2023.
[5] L. Qu, J. Tian, S. He, Y. Tang, and R. W. H. Lau, “Deshadownet: A multi-context embedding deep network for shadow removal,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2308–2316.
[6] L. Fu, C. Zhou, Q. Guo, F. Juefei-Xu, H. Yu, W. Feng, Y. Liu, and S. Wang, “Auto-exposure fusion for single-image shadow removal,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10 566–10 575.
[7] X. Hu, C.-W. Fu, L. Zhu, J. Qin, and P.-A. Heng, “Direction-aware spatial context features for shadow detection and removal,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 11, pp. 2795–2808, Nov. 2020.
[8] J. Wang, X. Li, and J. Yang, “Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1788–1797.
[9] B. Ding, C. Long, L. Zhang, and C. Xiao, “Argan: Attentive recurrent generative adversarial network for shadow detection and removal,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 10 212–10 221.
[10] X. Cun, C.-M. Pun, and C. Shi, “Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting gan,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 10 680–10 687.
[11] X. Hu, Y. Jiang, C.-W. Fu, and P.-A. Heng, “Mask-ShadowGAN: Learning to remove shadows from unpaired data,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Oct. 2019.
[12] J. Wan, H. Yin, Z. Wu, X. Wu, Z. Liu, and S. Wang, “Crformer: a cross-region transformer for shadow removal,” arXiv preprint arXiv:2207.01600, 2022.
[13] L. Guo, S. Huang, D. Liu, H. Cheng, and B. Wen, “Shadowformer: Global context helps shadow removal,” in AAAI Conference on Artificial Intelligence, 2023.
[14] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts, and shadows in video streams,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, 2003.
[15] S. Nadimi and B. Bhanu, “Physical models for moving shadow and object detection in video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 1079–1087, 2004.
[16] W. Zhang, X. Zhao, J.-M. Morvan, and L. Chen, “Improving shadow suppression for illumination robust face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 3, pp. 611–624, 2019.
[17] G. Finlayson, M. Drew, and C. Lu, “Entropy minimization for shadow removal,” International Journal of Computer Vision, vol. 85, no. 1, pp. 35–57, 2009.
[18] G. Finlayson, S. Hordley, C. Lu, and M. Drew, “On the removal of shadows from images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 59–68, 2006.
[19] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri, “Automatic shadow detection and removal from a single image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 431–446, 2016.
[20] L. Zhang, Q. Zhang, and C. Xiao, “Shadow remover: Image shadow removal based on illumination recovering optimization,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4623–4636, 2015.
[21] H. Le and D. Samaras, “Shadow removal via shadow image decomposition,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8577–8586.
[22] Z. Chen, C. Long, L. Zhang, and C. Xiao, “Canet: A context-aware network for shadow removal,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 4723–4732.
[23] Y. Zhu, Z. Xiao, Y. Fang, X. Fu, Z. Xiong, and Z.-J. Zha, “Efficient model-driven network for shadow removal,” 2022.
[24] J. Wan, H. Yin, Z. Wu, X. Wu, Y. Liu, and S. Wang, “Style-guided shadow removal,” in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Cham: Springer Nature Switzerland, 2022, pp. 361–378.
[25] H. Le and D. Samaras, “From shadow segmentation to shadow removal,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, pp. 264–281.
[26] Y. Xu, M. Lin, H. Yang, F. Chao, and R. Ji, “Shadow-aware dynamic convolution for shadow removal,” Pattern Recognition, vol. 146, p. 109969, 2024.
[27] L. Zhang, C. Long, X. Zhang, and C. Xiao, “Ris-gan: Explore residual and illumination with generative adversarial networks for shadow removal,” in AAAI Conference on Artificial Intelligence (AAAI), 2020.
[28] Y. Jin, A. Sharma, and R. T. Tan, “Dc-shadownet: Single-image hard and soft shadow removal using unsupervised domain-classifier guided network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5027–5036.
[29] Z. Liu, H. Yin, X. Wu, Z. Wu, Y. Mi, and S. Wang, “From shadow generation to shadow removal,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 4925–4934.
[30] X. Li, Q. Guo, R. Abdelfattah, D. Lin, W. Feng, I. Tsang, and S. Wang, “Leveraging inpainting for single-image shadow removal,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 13 055–13 064.
[31] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Conference and Workshop on Neural Information Processing Systems (NIPS), 2017.
[32] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning. PMLR, 2021, pp. 10 347–10 357.
[33] J. Lanchantin, T. Wang, V. Ordonez, and Y. Qi, “General multi-label image classification with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 16 478–16 488.
[34] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” in Proceedings of the International Conference on Machine Learning (ICML), July 2021.
[35] R. Strudel, R. G. Pinel, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2021, pp. 7242–7252.
[36] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5718–5729.
[37] M. Kumar, D. Weissenborn, and N. Kalchbrenner, “Colorization transformer,” in International Conference on Learning Representations, 2021.
[38] Q. Dong, C. Cao, and Y. Fu, “Incremental transformer structure enhanced image inpainting with masking positional encoding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[39] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[40] S. He, B. Peng, J. Dong, and Y. Du, “Mask-shadownet: Toward shadow removal via masked adaptive instance normalization,” IEEE Signal Processing Letters, vol. 28, pp. 957–961, 2021.
[41] B. A. Maxwell, R. M. Friedhoff, and C. A. Smith, “A bi-illuminant dichromatic reflection model for understanding images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[42] H. Shen, H. Zhang, S. Shao, and J. Xin, “Chromaticity-based separation of reflection components in a single image,” Pattern Recognition, vol. 41, no. 8, pp. 2461–2469, 2008.
[43] Y. Shor and D. Lischinski, “The shadow meets the mask: Pyramid-based shadow removal,” Computer Graphics Forum, vol. 27, no. 2, pp. 577–586, 2008.
[44] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 22–31.
[45] R. Guo, Q. Dai, and D. Hoiem, “Paired regions for shadow detection and removal,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2956–2967, 2012.
[46] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[47] Y. Song, Z. He, H. Qian, and X. Du, “Vision transformers for single image dehazing,” IEEE Transactions on Image Processing, vol. 32, pp. 1927–1941, 2023.
[48] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10 002.
[49] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2017.
[50] ——, “SGDR: stochastic gradient descent with warm restarts,” in 5th International Conference on Learning Representations (ICLR) 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[51] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.