An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation

1st Zijun Gao Northeastern University
Boston, USA
zjg.elaine@gmail.com
   2nd Qi Wang Northeastern University
Boston, USA
bjwq2019@gmail.com
   3rd Taiyuan Mei Northeastern University
Boston, USA
taiyuanmei0824@gmail.com
   4th Xiaohan Cheng Northeastern University
Boston, USA
Cheng.xiaoh@northeastern.edu
   5th Yun Zi Georgia Institute of Technology
Chicago, USA
yzi9@gatech.edu
   6th Haowei Yang University of Houston
Houston, USA
yanghaowei09@gmail.com
Abstract

The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the intricate details across various image scales more effectively, thus minimizing the information loss inherent to down-sampling procedures. Additionally, to enhance the convergence rate of network training and mitigate sample imbalance issues, we have devised a modified cross-entropy loss function incorporating a balancing factor. This modification optimizes the distribution between positive and negative samples, thus improving the efficiency of model training. Experimental evaluations of our model demonstrate a substantial reduction in information loss and improved accuracy in semantic segmentation. Notably, our proposed network architecture demonstrates a substantial improvement in the finely annotated mean Intersection over Union (mIoU) on the dataset compared to the conventional SegNet. The proposed network structure not only reduces operational costs by decreasing manual inspection needs but also scales up the deployment of AI-driven image analysis across different sectors.

Index Terms:
Deep Learning, Semantic Segmentation, Residual Connection, SegNet model, Encoder-Decoder Network

I Introduction

Semantic segmentation is a crucial task in the field of computer vision[1], with applications ranging from e-commerce webpage recommendation[2] to medical image analysis[3][4]. The goal of semantic segmentation is to partition an image into segments that correspond to different object classes. Traditional methods, including the widely used SegNet architecture[5], have laid a robust foundation for addressing these challenges. However, these approaches often suffer from significant information loss during down-sampling in the encoder, which degrades segmentation accuracy.

Recognizing the limitations inherent in existing architectures, this paper introduces an innovative encoder-decoder network structure enhanced with residual connections designed to reduce information loss and improve segmentation accuracy. The core idea revolves around the integration of a multi-residual connection strategy that aids in preserving detailed information across various image scales, which is typically lost in traditional methods. By maintaining these details, our approach enhances the network’s ability to perform accurate segmentation, even in complex visual scenes.

Furthermore, the challenges of network convergence and sample imbalance, which frequently plague the training of deep learning models, are addressed through the introduction of a modified cross-entropy loss function. This novel loss function incorporates a balancing factor that optimizes the distribution between the positive and negative samples, significantly refining the training efficiency and stability of the model.

In this paper, we will detail the development of our enhanced encoder-decoder network architecture, including the rationale behind the design choices and the specific implementations of the multi-residual connections and modified loss function. Experimental results demonstrate that our proposed model not only significantly reduces information loss but also achieves superior segmentation accuracy compared to the conventional SegNet architecture. This improvement is quantified through metrics such as the mean Intersection over Union (mIoU), highlighting our model’s effectiveness in a practical, real-world context.

By reducing the need for manual inspection and facilitating the deployment of AI-driven image analysis, the proposed network structure offers promising prospects for enhancing operational efficiencies across various sectors. Through testing and validation on the PASCAL VOC 2012 dataset, we assess the effectiveness of our approach on image semantic segmentation tasks.

II ALGORITHM AND MODEL

II-A SegNet Model

SegNet is a deep learning model primarily designed for image segmentation tasks, which is crucial in medical imaging[6][7][8] for tasks such as identifying and delineating regions[9][10], like tumors or various anatomical structures in MRI scans[11], CT scans, and other medical images. The SegNet model consists of an encoder network and a corresponding decoder network, as shown in Figure 1.

Figure 1: SegNet Model Architecture

The encoder network comprises convolutional layers, batch normalization layers, and pooling layers. These layers extract features using same-padding convolution, normalize the data, and accelerate convergence with ReLU activation[12]. Max-pooling layers record the positions of maximum values, providing robustness through translation invariance, though they reduce feature map size and spatial information. SegNet addresses this by storing only the max-pooling indices.

The decoder maps the encoded object and position information to specific pixels, up-samples the reduced feature maps, and refines object shapes using convolution, compensating for detail loss from the encoder’s pooling layers[13]. Up-sampling layers double the feature map size based on the max-pooling indices, with other positions set to zero. The final output is fed into a softmax classifier for pixel-level classification[14].
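The index-based up-sampling described above can be illustrated with a minimal sketch in plain Python. It assumes a single-channel feature map with even height and width and a 2×2 pooling window with stride 2; the function names `max_pool_2x2` and `max_unpool_2x2` are hypothetical and are not taken from the SegNet implementation.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling (stride 2) on a 2D list.

    Returns the pooled map and, for each window, the (row, col) position
    of its maximum, i.e. the max-pooling indices.
    """
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, h, 2):
        pooled_row, index_row = [], []
        for j in range(0, w, 2):
            # Candidate positions inside the current 2x2 window.
            window = [(i + di, j + dj) for di in (0, 1) for dj in (0, 1)]
            best = max(window, key=lambda rc: fmap[rc[0]][rc[1]])
            pooled_row.append(fmap[best[0]][best[1]])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices):
    """Place each pooled value back at its recorded position; all other entries are zero."""
    h, w = 2 * len(pooled), 2 * len(pooled[0])
    out = [[0] * w for _ in range(h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 1],
]
pooled, indices = max_pool_2x2(feature_map)
restored = max_unpool_2x2(pooled, indices)
print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [0, 6, 0, 0]]
```

The restored map is sparse: only the positions of the recorded maxima are non-zero, which is why the decoder follows each up-sampling layer with convolutions that densify the result.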

SegNet’s innovation is using max-pooling indices from the encoder for unpooling in the decoder, reducing model parameters compared to FCN’s bilinear interpolation. This approach allows SegNet to achieve robust segmentation performance with balanced memory usage and accuracy, enhancing boundary delineation. SegNet also demonstrates superior inference time and memory efficiency compared to other architectures.

II-B Model Establishment

This study designs a network structure based on the SegNet encoder-decoder architecture, incorporating residual connections for enhanced semantic segmentation[15]. For standard photographs, shallow CNNs capture more boundary and texture information[16], while deep CNNs extract higher-level abstract features. Combining both shallow and deep features is essential for improving semantic segmentation accuracy. While deepening and widening the network can enhance segmentation precision, it also introduces parameter burden and redundancy. Hence, residual connections and concatenation operations are used to effectively integrate shallow visual features with deep semantic features, with minimal additional parameters.

As depicted in Figure 2, an input image of size H×W produces an output of the same dimensions. Blue boxes represent convolutional layers followed by linear activation functions and batch normalization. The numbers within the boxes indicate the size and number of feature maps after each operation. During the encoder stage, the image is down-sampled three times, reducing each spatial dimension to 1/8 of its original size and generating 256 feature maps. During the decoder stage, the feature maps are up-sampled back to H×W, and the semantic category probabilities of each pixel are output via the softmax function.

Figure 2: SegNet Model Architecture

In the SegNet recovery stage, the down-sampled feature maps contain abundant feature information. However, due to network constraints, up-sampling sparse feature maps cannot generate dense feature maps, resulting in the loss of crucial information and suboptimal segmentation accuracy. Our network combines max-pooling indices and residual connections, feeding shallow feature maps from the encoder into the decoder’s nonlinear up-sampling stage. Deconvolution then produces dense feature maps, preserving the original image’s color, texture, and boundaries.

An input image undergoes the following steps in the improved SegNet network:

  • Convolution produces feature maps of H×W×64, denoted as F1.

  • Down-sampling to H/2×W/2×64, followed by convolution to H/2×W/2×128, denoted as F2.

  • Down-sampling to H/4×W/4×128, followed by convolution to H/4×W/4×256, denoted as F3.

  • Down-sampling to H/8×W/8×256, denoted as F4.

  • Up-sampling to H/4×W/4×256, denoted as F’3, calculated as F’3 = Fuse(PI(F3), F4). Deconvolution to H/4×W/4×128, denoted as De(F’3).

  • Up-sampling to H/2×W/2×128, denoted as F’2, calculated as F’2 = Fuse(PI(F2), De(F’3)). Deconvolution to H/2×W/2×64, denoted as De(F’2).

  • Up-sampling De(F’2) to H×W resolution, combined with F1 through concatenation to H×W×128, denoted as F’1, calculated as F’1 = Conc(Fuse(PI(F1), De(F’2)), F1).

  • The softmax function assigns each pixel a category, outputting the semantic segmentation result.
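To make the resolution and channel bookkeeping of the steps above easier to check, the following sketch traces the tensor shape after each stage. It is only a shape calculator, not an implementation of the network: the function name `trace_shapes` is hypothetical, H and W are assumed to be divisible by 8, and the 64-channel shape of the last up-sampled map is inferred from the stated H×W×128 result of concatenating it with the 64-channel F1.

```python
def trace_shapes(h, w):
    """Return (stage, (height, width, channels)) for each stage of the listed pipeline."""
    if h % 8 or w % 8:
        raise ValueError("this sketch assumes H and W are divisible by 8")
    return [
        ("F1: convolution", (h, w, 64)),
        ("F2: down-sample + convolution", (h // 2, w // 2, 128)),
        ("F3: down-sample + convolution", (h // 4, w // 4, 256)),
        ("F4: down-sample", (h // 8, w // 8, 256)),
        ("decoder: up-sample", (h // 4, w // 4, 256)),
        ("decoder: deconvolution", (h // 4, w // 4, 128)),
        ("decoder: up-sample", (h // 2, w // 2, 128)),
        ("decoder: deconvolution", (h // 2, w // 2, 64)),
        ("decoder: up-sample", (h, w, 64)),
        ("decoder: concatenation with F1", (h, w, 128)),
    ]


shapes = trace_shapes(256, 256)
for stage, shape in shapes:
    print(f"{stage:32s} {shape}")
```

For a 256×256 input, the deepest encoder map is 32×32×256 and the map fed to the softmax classifier stage is 256×256×128.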

II-C Model Training

The training process for the designed network model is as follows:

  • Preprocess the dataset and split it into training and validation sets.

  • Input the preprocessed data into the initialized semantic segmentation network model.

  • Iteratively update the model parameters to minimize the cross-entropy loss until the loss converges to its minimum.

  • Output the optimal network model and parameters.

Figure 3 illustrates the network training flowchart.

Figure 3: Training process

II-D Improved Cross-Entropy Loss Function

The standard cross-entropy loss function assigns equal weight to all samples. When positive and negative samples are imbalanced, the many easy negative samples can overshadow the contribution of the few hard, positive samples, leading to reduced accuracy. To address this issue, we introduce a balancing factor β in the range [0, 1]:

$$\beta = \begin{cases} \beta & \text{if } y = 1 \\ 1 - \beta & \text{otherwise} \end{cases} \tag{1}$$

The improved cross-entropy loss function (B-CE) is formulated as follows:

$$CE(p, y) = -\beta \log(p_t) \tag{2}$$

By incorporating the balancing factor, the cross-entropy loss function achieves faster convergence than the standard version. This enhancement is particularly effective in optimizing class-imbalanced pixel distributions, thereby improving overall convergence efficiency.
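As an illustration, the balanced loss of Equations (1) and (2) can be sketched for the binary case as follows. Two points are assumptions rather than statements from the text: p_t is taken to follow the usual convention (p_t = p when y = 1 and 1 − p otherwise), and the value β = 0.75 in the example is purely hypothetical, since no value of β is reported here. The function names are likewise illustrative.

```python
import math


def balanced_cross_entropy(p, y, beta=0.5, eps=1e-12):
    """B-CE for one sample: -beta_t * log(p_t).

    p    : predicted probability of the positive class
    y    : ground-truth label, 1 (positive) or 0 (negative)
    beta : balancing factor in [0, 1]; weights positives by beta, negatives by 1 - beta
    """
    p_t = p if y == 1 else 1.0 - p            # assumed convention for p_t
    beta_t = beta if y == 1 else 1.0 - beta   # Equation (1)
    return -beta_t * math.log(max(p_t, eps))  # Equation (2), clamped to avoid log(0)


def mean_balanced_cross_entropy(probs, labels, beta=0.5):
    """Average B-CE over a batch of (probability, label) pairs."""
    losses = [balanced_cross_entropy(p, y, beta) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


# One positive pixel among three negatives; beta = 0.75 is a hypothetical choice.
probs = [0.2, 0.1, 0.3, 0.2]
labels = [1, 0, 0, 0]
print(mean_balanced_cross_entropy(probs, labels, beta=0.75))
```

With β = 0.5 every sample is scaled by the same constant, so the loss is just half the standard cross-entropy; values above 0.5 increase the relative contribution of the positive class.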

III Experimental Results and Analysis

III-A Experimental Environment

The experiments were conducted on a system running Windows 10 Professional, equipped with 32GB RAM, an Intel(R) Core i9-10980HK processor at 2.40GHz, and an NVIDIA 2080 8GB GPU. Matlab 2020b was used as the experimental platform, with MatConvNet and Visual C++ 2017 used to construct the deep learning network model. Model training and testing were carried out in a CUDA 10.2-based GPU environment.

III-B Evaluation Metrics

Intersection over Union (IoU) [17] is a widely used metric for assessing the performance of object detection and semantic segmentation algorithms. It is computed as the ratio of the area of overlap between the predicted bounding box and the ground truth bounding box to the area of their union. IoU is straightforward and applicable to any task with a predicted range output. Figure 4 illustrates the mathematical concept of IoU.

Ideally, the candidate bounding box set completely overlaps with the ground truth, resulting in an IoU of 1, which indicates perfect prediction accuracy. The IoU formula is:

$$IoU = \frac{area(C) \cap area(G)}{area(C) \cup area(G)} \tag{3}$$

Typically, an IoU threshold of 0.5 is used to determine the accuracy of the predicted bounding box. Higher IoU values indicate more precise bounding boxes.

Figure 4: Mathematical Concept of IoU
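For semantic segmentation, C and G in Equation (3) can be read as the sets of pixels assigned to a given class in the prediction and in the ground truth, respectively. The sketch below computes the per-class IoU and its mean (mIoU) for two small label maps; the function names and the toy data are illustrative assumptions, and classes that appear in neither map are skipped when averaging.

```python
def per_class_iou(pred, truth, num_classes):
    """IoU per class for two equally sized 2D label maps (lists of lists of ints)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                # Pixel belongs to both the predicted and the true region of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel counts toward the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Classes absent from both maps have an empty union and are reported as None.
    return [i / u if u else None for i, u in zip(inter, union)]


def mean_iou(ious):
    """Mean IoU over the classes that occur in at least one of the two maps."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


pred = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 1],
]
truth = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 1, 1],
]
ious = per_class_iou(pred, truth, num_classes=3)
print(ious)            # [0.75, 0.6, 0.5]
print(mean_iou(ious))
```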

III-C Experimental Results and Analysis

This section presents the results of experiments conducted on the PASCAL VOC 2012 dataset[18], including IoU value statistics for each category and visual analysis of random samples. The designed network structure’s effectiveness and advancement are comprehensively evaluated based on both quantitative metrics and visual performance.

III-C1 Dataset Overview and Parameter Settings

PASCAL VOC 2012, a benchmark dataset, includes 17,125 original images and their corresponding annotations. It is widely used for object detection, image segmentation comparisons, and model performance evaluations. The dataset provides labeled data for supervised learning in visual tasks and is divided into four major categories: person, common animals, transportation vehicles, and indoor furniture. We adopt the linked data[19] approach to process the dataset. This method enhances the interoperability and accessibility of the dataset, allowing for a more structured and semantic integration of data sources. By applying this methodology, we aim to facilitate a more robust analysis, enabling enhanced data discovery and reuse across various research domains.

The parameter settings for experiments using this dataset are detailed in Table 1. Test data was randomly selected, and the network training utilized stochastic gradient descent with momentum as the optimizer, with learning rate and momentum parameters set to 0.1 and 0.9, respectively.

Parameter | Value
Epoch limit | 210
Quantity of validation images | 5813
Learning rate | 0.1
Quantity of training images | 5718
Quantity of test images | 1000
Momentum | 0.9
TABLE I: Hyperparameter Settings for PASCAL VOC 2012 Dataset
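The optimizer settings in Table 1 can be illustrated with a minimal sketch of one stochastic gradient descent step with momentum. The classical update form used here (velocity = momentum × velocity − learning rate × gradient) is an assumption, since the exact convention of the training framework is not stated; the function name and the toy objective f(w) = w² are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update (classical form, assumed here).

    lr and momentum default to the values listed in Table 1.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy objective f(w) = w^2 with gradient 2w, starting from w = 1.0.
weights, velocity = [1.0], [0.0]
history = []
for _ in range(2):
    grads = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(weights, grads, velocity)
    history.append(weights[0])
print(history)  # approximately [0.8, 0.46]
```

The second step moves further than the first because the accumulated velocity adds to the new gradient step, which is the effect the momentum term is intended to provide.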

III-C2 Experimental Results Comparison and Analysis

Table 2 shows the IoU for each category in the PASCAL VOC 2012 test set, comparing the proposed method with SegNet.

Category | Proposed Method | SegNet
Keychain | 92.61 | 89.8
Laptop | 60.23 | 39.2
Window | 93.95 | 79.6
Cup | 74.84 | 63.8
Book | 82.76 | 68.1
Backpack | 95.00 | 87.3
Pen | 88.44 | 81.1
Mouse | 94.61 | 86.0
Desk | 45.42 | 28.4
Jacket | 91.28 | 76.9
Clock | 76.22 | 61.9
Phone | 90.48 | 78.9
Hat | 91.66 | 80.2
Sunglasses | 88.03 | 83.5
Shoe | 87.88 | 80.1
Plant | 69.77 | 58.7
Umbrella | 82.73 | 83.3
Pillow | 60.81 | 54.2
Poster | 80.65 | 80.6
Speaker | 66.73 | 64.9
Mean | 80.71 | 72.4
TABLE II: Comparison of IoU (%) per category on the dataset

In summary, the integration of multiple residual connections enhances the fidelity of the features extracted by the semantic segmentation network, maintaining a higher correlation with the original image. This results in superior pixel-level classification and boundary localization compared to SegNet. Both qualitative visual analysis and quantitative IoU analysis for each category demonstrate that this method effectively leverages the efficiency of max-pooling indices and the flexibility of multiple residual connections. Consequently, it achieves higher accuracy in image semantic segmentation, better meeting practical application requirements.

IV Conclusion

The challenges posed by the substantial information loss during the multiple down-sampling and up-sampling processes in the SegNet model have been a significant bottleneck in achieving high accuracy in semantic segmentation[20]. In response, our research introduces a novel encoder-decoder network structure that incorporates multiple residual connections, effectively addressing these limitations. By integrating residual connections, our model harnesses both low-level spatial information and high-level semantic features across various resolutions, thereby preserving essential details that are crucial for accurate segmentation. This strategic use of residual connections not only counters the loss of information but also avoids excessive increases in the number of parameters, maintaining a balance between complexity and performance.

Moreover, recognizing the critical impact of class imbalance on segmentation tasks, we have devised a balanced cross-entropy loss function. This enhancement optimizes the training process, ensuring more stable and efficient model convergence, and significantly reduces the loss at the point of convergence. Consequently, our approach not only enhances the robustness of the model but also substantially improves segmentation accuracy.

To substantiate our claims, we conducted extensive experimental comparisons and analyses on the PASCAL VOC 2012 dataset. Employing rigorous quantitative evaluation metrics alongside thorough visual analysis, our findings clearly demonstrate that our proposed model markedly outstrips the traditional SegNet in terms of segmentation performance. These results underscore the effectiveness of our structural and functional modifications in advancing the field of semantic segmentation.

References

  • [1] S. Lu, Z. Liu, T. Liu, and W. Zhou, “Scaling-up medical vision-and-language representation learning with federated learning,” Engineering Applications of Artificial Intelligence, vol. 126, p. 107037, 2023.
  • [2] W. Zhao, X. Liu, R. Xu, L. Xiao, and M. Li, “E-commerce webpage recommendation scheme base on semantic mining and neural networks,” Journal of Theory and Practice of Engineering Science, vol. 4, no. 03, pp. 207–215, Mar. 2024. [Online]. Available: https://centuryscipub.com/index.php/jtpes/article/view/533
  • [3] M. Xiao, Y. Li, X. Yan, M. Gao, and W. Wang, “Convolutional neural network classification of cancer cytopathology images: taking breast cancer as an example,” arXiv preprint arXiv:2404.08279, 2024.
  • [4] J. Zhang, L. Xiao, Y. Zhang, J. Lai, and Y. Yang, “Optimization and performance evaluation of deep learning algorithm in medical image processing,” Frontiers in Computing and Intelligent Systems, vol. 7, no. 3, pp. 67–71, 2024.
  • [5] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [6] Q. Wang, S. E. Schindler, G. Chen, N. S. Mckay, A. McCullough, S. Flores, J. Liu, Z. Sun, S. Wang, W. Wang et al., “Investigating white matter neuroinflammation in alzheimer disease using diffusion-based neuroinflammation imaging,” Neurology, vol. 102, no. 4, p. e208013, 2024.
  • [7] X. Yan, W. Wang, M. Xiao, Y. Li, and M. Gao, “Survival prediction across diverse cancer types using neural networks,” arXiv preprint arXiv:2404.08713, 2024.
  • [8] B. Zhao, Z. Cao, and S. Wang, “Lung vessel segmentation based on random forests,” Electronics Letters, vol. 53, no. 4, pp. 220–222, 2017.
  • [9] W. Dai, J. Tao, X. Yan, Z. Feng, and J. Chen, “Addressing unintended bias in toxicity detection: An lstm and attention-based approach,” in 2023 5th International Conference on Artificial Intelligence and Computer Applications (ICAICA), 2023, pp. 375–379.
  • [10] Z. Liu and J. Song, “Comparison of tree-based feature selection algorithms on biological omics dataset,” in Proceedings of the 5th International Conference on Advances in Artificial Intelligence, 2021, pp. 165–169.
  • [11] Y. Gong, H. Qiu, X. Liu, Y. Yang, and M. Zhu, “Research and application of deep learning in medical image reconstruction and enhancement,” Frontiers in Computing and Intelligent Systems, vol. 7, no. 3, pp. 72–76, 2024.
  • [12] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova, “Nonlinear approximation and (deep) relu networks,” Constructive Approximation, vol. 55, no. 1, pp. 127–172, 2022.
  • [13] J. Yao, C. Li, K. Sun, Y. Cai, H. Li, W. Ouyang, and H. Li, “Ndc-scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2023, pp. 9421–9431.
  • [14] S. Wang, Z. Liu, and B. Peng, “A self-training framework for automated medical report generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 16443–16449.
  • [15] M. Li, Z. Zhu, R. Xu, Y. Feng, and L. Xiao, “Research on image classification and semantic segmentation model based on convolutional neural network,” Journal of Computing and Electronic Information Management, vol. 12, no. 3, pp. 94–100, 2024.
  • [16] R. Xu, Y. Yang, H. Qiu, X. Liu, and J. Zhang, “Research on multimodal generative adversarial networks in the framework of deep learning,” Journal of Computing and Electronic Information Management, vol. 12, no. 3, pp. 84–88, 2024.
  • [17] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 658–666.
  • [18] S. Vicente, J. Carreira, L. Agapito, and J. Batista, “Reconstructing pascal voc,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 41–48.
  • [19] Y. Li, X. Yan, M. Xiao, W. Wang, and F. Zhang, “Investigation of creating accessibility linked data based on publicly available accessibility datasets,” in Proceedings of the 2023 13th International Conference on Communication and Network Security, 2023, pp. 77–81.
  • [20] A. Sohail, N. A. Nawaz, A. A. Shah, S. Rasheed, S. Ilyas, and M. K. Ehsan, “A systematic literature review on machine learning and deep learning methods for semantic segmentation,” IEEE Access, vol. 10, pp. 134557–134570, 2022.