LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Perception System

1st Shijing Hu
School of Computer Science
Fudan University
Shanghai, China

2nd Ruijun Deng
School of Computer Science
Fudan University
Shanghai, China

3rd Xin Du
School of Computer Science
Fudan University
Shanghai, China

4th Zhihui Lu
School of Computer Science
Fudan University
Shanghai, China

5th Qiang Duan
Information Sciences and Technology Department
Pennsylvania State University
Abington, USA

6th Yi He
Department of Computer Science
Old Dominion University
Virginia, USA

7th Shih-Chia Huang
Department of Electronic Engineering
National Taipei University of Technology
Taipei, Taiwan

8th Jie Wu
School of Computer Science
Fudan University
Shanghai, China

Abstract

Recent large vision models (e.g., SAM) show great potential for enabling intelligent perception with high accuracy. Yet resource constraints in the IoT environment tend to prevent such large vision models from being deployed locally, and serving them remotely incurs considerable inference latency, making it difficult to support real-time applications such as autonomous driving and robotics. Edge-cloud collaboration with large-small model co-inference offers a promising approach to achieving high inference accuracy and low latency. However, existing edge-cloud collaboration methods are tightly coupled with the model architecture and cannot adapt to the dynamic data drifts in heterogeneous IoT environments. To address these issues, we propose LAECIPS, a new edge-cloud collaboration framework. In LAECIPS, both the large vision model on the cloud and the lightweight model on the edge are plug-and-play. We design an edge-cloud collaboration strategy based on hard input mining, optimized for both high accuracy and low latency. We propose to update the edge model and its collaboration strategy with the cloud under the supervision of the large vision model, so as to adapt to the dynamic IoT data streams. Theoretical analysis of LAECIPS proves its feasibility. Experiments conducted in a robotic semantic segmentation system using real-world datasets show that LAECIPS outperforms its state-of-the-art competitors in accuracy, latency, and communication overhead while having better adaptability to dynamic environments.

Index Terms:
Edge-Cloud Collaboration, Large Vision Model, Big/Little Model Cooperation, IoT-based Perception System

I Introduction

Machine Learning (ML) has been widely applied to support intelligent perception in the Internet of Things (IoT) for various applications including robotic surveillance and autonomous driving [1, 2]. IoT-based perception often requires high accuracy and low latency of ML inference to meet application requirements [3]. ML functions in IoT systems are typically deployed on edge devices in user proximity to reduce inference latency. However, on the one hand, the constrained resources on IoT edge devices limit their ability to support complex ML models [4]; on the other hand, the lightweight models on edge devices may suffer from low inference accuracy, especially for corner cases [5]. In addition, data distribution drifts may occur in some perception scenarios (e.g., as a robot moves into an unexpected environment or an auto-pilot vehicle travels to an unexplored area) [6], which makes the pre-trained edge model less accurate for the new task.

Figure 1: Scenario of Edge Cloud Collaborative Inference for IoT-based Perception System

Recently, considerable progress has been made in developing large vision models, for example, the Segment Anything Model (SAM) from Meta [7]. With their strong generalization ability, such large vision models may achieve very high accuracy in handling corner cases and being robust to data distribution drifts in intelligent perception [8]. However, a large vision model can only be deployed in a resource-rich cloud data center, which may cause long latency due to data transmissions between user devices and the cloud server. Therefore, how to fully leverage the advantages of a large vision model for achieving accurate inference while reducing perception latency in the resource-constrained IoT becomes an important research problem.

To answer it, one may consider using edge-cloud collaboration for large-small model co-inference [9, 10]. In particular, with a large vision model hosted on the cloud and a small model deployed at the edge, an edge-cloud collaboration strategy determines for each received input whether the inference can be performed by the small edge model or needs to be processed by the large vision model on the cloud, as illustrated in Fig. 1. However, existing edge-cloud collaboration methods suffer from three main limitations that need to be overcome to support IoT-based intelligent perception. First, the tight coupling between the large and small models limits the flexibility of current methods in fully leveraging large vision models. Second, the collaboration strategy needs to be further optimized for both high accuracy and low latency, while demonstrating its capability to adapt to the dynamic IoT environment. Third, inference outputs from a large vision model (e.g., SAM) may lack semantic labels and thus need to be combined with the edge model inference results.

To address the three issues, we propose LAECIPS, a large vision model-assisted adaptive edge-cloud collaboration framework. In this framework, the large vision model on the cloud and the lightweight model deployed at the edge cooperate in a loosely coupled manner, which makes both models plug-and-play and greatly improves system flexibility. This framework employs an edge-cloud collaboration strategy that is based on hard input mining and optimized for both low response latency and high inference accuracy. This framework also enables online adjustment of the collaboration strategy and continual training of the edge model under the supervision of the large vision model, which makes the system adaptive to data distribution drifts in dynamic IoT environments.

Specific contributions of this paper are as follows.

  1. This is the first study that explores the problem of edge-cloud collaborative perception for dynamic IoT data streams. We propose a novel edge-cloud collaboration framework, LAECIPS, to enable flexible utilization of both large and small models in an online manner to solve this problem.

  2. In the LAECIPS framework, we design a hard input mining-based edge-cloud co-inference strategy that achieves higher accuracy and lower task processing latency.

  3. In the LAECIPS framework, we propose the continual training of the small model to fit in with the dynamic environmental changes in the IoT environment.

  4. We analyze the theoretical generalization capability of LAECIPS to prove the feasibility of incorporating large vision models, edge small models, and edge-cloud co-inference strategies into the LAECIPS framework in a plug-and-play manner.

  5. We implement the proposed LAECIPS framework through a real-world robotic semantic segmentation system in a realistic edge-cloud environment to demonstrate its applicability. Extensive experimental results substantiate that LAECIPS achieves significantly higher accuracy, lower task processing latency and communication overhead than its SoTA competitors.

The rest of the paper is structured as follows. Section II explains the related work. The technical details of LAECIPS are given in Section III. Section IV presents the theoretical proof of the generalization ability of LAECIPS. Experimental results are presented in Section V. Finally, we conclude the paper in Section VI.

II Related Work

Related research on cloud-edge collaborative inference falls into two categories: model partition and big/little model cooperation.

Model Partition

Model partition segments a (big) model into multiple sub-models that are deployed on different hosts including cloud server and edge device(s) based on their resource availability. During inference operation, the model is collaboratively computed across all sub-models to obtain the output result. For example, Neurosurgeon [11] uses a performance prediction model to select the optimal split point for a model. JoinDNN [12] formulates the optimal model layers scheduling as the shortest path problem and solves it using integer linear programming. DADS [13] formulates different model partition optimization problems for lightly and heavily loaded conditions. IONN [14] incrementally builds the model on the server using the arriving model partitions to enable early-stage training. DeepThings [15] fuses grids across layers of DNN to construct a fine-grained model partition.

Although partitioning a complex model across the cloud and edge device(s) reduces computational costs and improves accuracy, it may introduce significant communication overheads for transmitting the intermediate results of the split model, which is often overwhelming to resource-constrained IoTs. Also, the sub-models deployed on the cloud and edge device(s) are tightly coupled thus limiting the flexibility and adaptability of model partitioning to face the dynamic IoT environments. In addition, it is difficult to directly apply the existing model partition methods to the recently developed large vision models due to their highly complex model structures.

Big/Little Model Cooperation

The idea of big/little model cooperation is to deploy a lightweight model on the edge device for simple data inference and use a big model on the cloud for handling difficult data. With an appropriate strategy for model selection, big/little model cooperation may achieve high accuracy and low latency with minimized communication overheads. Also, this approach allows loosely coupled models to be deployed on the edge and cloud for more flexibility and adaptability. Therefore, big/little model cooperation offers a promising approach to intelligent perception in the IoT environment.

Collaborative inference based on big/little model cooperation was first proposed in SM [16], where difficult samples were identified using score margin and uploaded to the cloud for inference. In Cachier [17], the interaction between edge and cloud was modeled as a caching system to minimize inference latency. CeDLD [18] applied big/little model collaborative inference to medical image recognition and identified difficult samples based on image similarity. AppealNet [19] transformed the edge model into a multi-head structure to simultaneously identify difficult samples while outputting the inference results. EdgeCNN [20] proposed a collaborative training method for big/little model cooperation that uses large vision model outputs to supervise the training of the small model on an edge device. The newly reported SOTA work is DCSB [21], which applied big/little model collaborative inference to object detection and adaptively down-sampled some regions of the difficult case to reduce bandwidth consumption. Besides that, works that focused on difficult data detection could also bring insights for big/little model collaboration. MESS [22] proposed an early exit method for semantic segmentation tasks, which could also be used in detecting difficult data. SPP [23] proposed a confidence score-based method to detect out-of-distribution examples which could also be regarded as hard samples.

Although encouraging progress has been made in this area, the SOTA technologies for big/little model cooperation still have some limitations that need to be overcome to effectively support intelligent perception in IoTs. Particularly, the current methods lack the capability of online updating for the edge model and adaptive adjustment of the collaboration strategy in response to the dynamic IoT environments. Also, with the rise of large vision models, current methods need further optimization to apply to large vision models.

Figure 2: Overview Architecture of LAECIPS

III Proposed Method

III-A System Overview

We examine a scenario wherein a robot or autonomous vehicle outfitted with a camera is deployed at the edge. The edge node is tasked with executing real-time semantic segmentation duties, while the cloud node acts as a resource-abundant computing center, offering assistance to the edge node. In response to challenges related to the poor performance of edge models when confronted with corner cases, alongside concerns of data drift and heterogeneity within the edge environment, we have devised the LAECIPS architecture, illustrated in Fig. 2.

In this framework, a small semantic segmentation model is deployed on the edge device. In steps ① and ②, the small model conducts inference on the collected data inputs to yield small model inference results. Subsequently, in steps ③ and ④, the hard input mining module processes these results to categorize the collected data into two groups: hard inputs and easy inputs. In step ⑤, the small model inference results of easy inputs, which have achieved an acceptable level of accuracy, are directly output to reduce processing latency. Conversely, hard inputs that cause low accuracy in edge inference are uploaded to the cloud for further processing to improve inference accuracy.

In step ⑥, both the small model and the SAM large vision model deployed in the cloud perform their respective inference on the uploaded hard inputs. In step ⑦, a fusion of the cloud inference masks with the small model inference results yields co-inference results. In steps ⑧ and ⑨, the cloud node sends the co-inference results to the edge node, which then outputs the co-inference results as the inference results for hard inputs. Additionally, the hard inputs and the co-inference results are stored in the cloud node’s replay buffer. In step ⑨, upon the replay buffer’s sample count exceeding a predetermined threshold or a specified time interval elapsing, the cloud node proceeds to continually train the small model, using the hard inputs and their co-inference results as the ground truth. Finally, in step ⑩, the small model deployed in the edge node is updated by the small model trained in the cloud node.
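The replay-buffer trigger described above (train once the sample count exceeds a threshold or a time interval has elapsed) can be sketched as follows. This is only an illustrative sketch: the class name `ReplayBuffer`, its fields `max_size` and `max_interval`, and the injectable `clock` are hypothetical names chosen for this example and are not taken from the LAECIPS implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ReplayBuffer:
    """Cloud-side store of (hard input, co-inference result) pairs (hypothetical sketch)."""

    max_size: int                               # sample-count threshold (assumed name)
    max_interval: float                         # time threshold in seconds (assumed name)
    clock: Callable[[], float] = time.monotonic # injectable so the trigger can be tested
    samples: List[Tuple[Any, Any]] = field(default_factory=list)
    last_train: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.last_train = self.clock()

    def add(self, hard_input: Any, co_result: Any) -> None:
        # The co-inference result serves as the pseudo ground truth for training.
        self.samples.append((hard_input, co_result))

    def should_train(self) -> bool:
        # An empty buffer never triggers training, even after a long idle period.
        if not self.samples:
            return False
        over_size = len(self.samples) > self.max_size
        over_time = self.clock() - self.last_train >= self.max_interval
        return over_size or over_time

    def drain(self) -> List[Tuple[Any, Any]]:
        # Hand the buffered pairs to the trainer and reset both triggers.
        batch, self.samples = self.samples, []
        self.last_train = self.clock()
        return batch
```

Whether the buffer is cleared after each training round, as `drain` does here, is a design choice of this sketch; the text only specifies when training is triggered.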

III-B Large Vision Model assisted Inference

SAM is one of the most representative large vision models for IoT perception systems developed in recent years. Its prowess lies in its remarkable efficacy in image segmentation tasks, attributed to its strong generalization ability. Nonetheless, as illustrated in Fig. 3c, the SAM model, while adept at producing well-defined contours delineating segmented objects, falls short in providing semantic labels for these segmented images [8]. Consequently, it cannot be directly applied to semantic segmentation tasks. Edge models, in the course of inference, can provide segmented outcomes accompanied by semantic labels. However, these results often bear imperfections, notably exemplified by coarse object edges as depicted in Fig. 3b. In semantic segmentation tasks, the process of image segmentation frequently poses greater challenges than the subsequent labeling of the segmented outcomes [24]. Therefore, a natural idea is to combine the segmentation results from the SAM model with the classification labels from the edge model to obtain a more refined segmentation outcome with classification labels, as depicted in Fig. 3d.

Figure 3: Examples of Inference Samples and their Inference Results: (a) Inference Samples, (b) Edge Small Model Inference Results, (c) Cloud SAM Model Inference Results, (d) SAM Model-assisted Inference Results

We assume that the collected image is $x \in [0,255]^{3 \times H \times W}$, where $H$ is the height of the image and $W$ is the width of the image. The corresponding label of the collected image is $y \in \{0, \dots, M-1\}^{H \times W}$, with $M$ representing the number of classes. The image conforms to the probabilistic distribution $P(x,y)$. Furthermore, we denote the edge model as $f$: $f(x) = y^{*} \in [0,1]^{M \times H \times W}$, and the large vision model in the cloud as $SAM$: $SAM(x) = mask \in \{valid\_mask\}^{ann}$. Notably, the output produced by the $SAM$ model is the mask of image segmentation. During the process of SAM model-assisted semantic segmentation inference, a sample is inferred by the edge model, resulting in a labeled albeit inaccurate segmentation outcome. Concurrently, the same sample is transmitted to the cloud and inferred by the $SAM$ model, generating an unlabeled yet accurate segmentation outcome. The combination of the $SAM$ model’s inference outcome and that of the edge model forms the $SAM$ model-assisted inference outcome, defined as follows with its description in Algorithm 1.

$$F(x) = Assisted\_Inference(f(x), SAM(x)) \tag{1}$$
Algorithm 1 Assisted_Inference algorithm

Input: Edge Inference Result: $pred$, Large Vision Model Inference Mask: $mask$, class num: $M$
Output: Large Vision Model-assisted Inference Result: $semantic\_mask$

1:  $semantic\_mask \leftarrow pred$
2:  for $valid\_mask \in mask$ do
3:     $scores \leftarrow [0, \dots, 0]$
4:     for $i \in [1, \dots, M]$ do
5:        $scores[i] \leftarrow \sum_{p \in valid\_mask} pred[i][p]$
6:     end for
7:     $Top\_1\_class \leftarrow \arg\max_{i \in [1, \dots, M]} scores[i]$
8:     $semantic\_mask[valid\_mask] \leftarrow Top\_1\_class$
9:  end for
10:  return $semantic\_mask$
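A minimal NumPy sketch of Algorithm 1 is given below for illustration. It rests on several assumptions that the text does not fix: `pred` is an array of per-class scores with shape (M, H, W), each SAM mask is a boolean (H, W) array, the initial label map is taken as the per-pixel argmax of `pred`, and classes are indexed from 0 to M-1 to match the label definition. The function name `assisted_inference` is chosen for this example only.

```python
import numpy as np
from typing import Sequence


def assisted_inference(pred: np.ndarray, masks: Sequence[np.ndarray]) -> np.ndarray:
    """Fuse edge class scores with class-agnostic region masks (sketch of Algorithm 1).

    pred  : float array of shape (M, H, W), per-class scores from the edge model.
    masks : boolean arrays of shape (H, W), one per region proposed by the large model.
    Returns an integer label map of shape (H, W).
    """
    # Assumption: start from the edge model's own per-pixel labels.
    semantic_mask = pred.argmax(axis=0)
    for valid_mask in masks:
        if not valid_mask.any():
            continue  # an empty region carries no votes
        # Sum each class's scores over the pixels of this region: shape (M,).
        scores = pred[:, valid_mask].sum(axis=1)
        # The whole region takes the class with the largest accumulated score.
        semantic_mask[valid_mask] = int(scores.argmax())
    return semantic_mask
```

As in Algorithm 1, regions are processed in order, so where masks overlap the later mask overwrites the earlier one; pixels covered by no mask keep the edge model's label.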

The large vision model-assisted co-inference method can significantly improve semantic segmentation accuracy by refining edge model-generated results. However, it faces two challenges. On the one hand, it entails uploading samples to the cloud, potentially leading to increased latency. Therefore, the edge-cloud collaboration strategy is essential for solving this problem, as will be explained in Section III-C. On the other hand, the effectiveness of edge-cloud joint inference relies on the labels produced by the edge model; if the edge device moves into a new environment and the edge model therefore faces data drift and heterogeneity, the accuracy of large vision model-assisted co-inference will diminish accordingly. Therefore, we need to adaptively update the edge model and the edge-cloud collaboration strategy, which will be elaborated in Section III-D.

III-C Hard Input Mining Strategy

The hard input mining strategy is pivotal in the LAECIPS architecture. Too many identified hard inputs will result in high inference latency while too few ones will lead to a decrease in the accuracy of handling corner cases. Existing methods rely on loss values [25] or confidence scores [23] to identify hard inputs. However, calculating loss values during inference is challenging due to unknown labels, and confidence score-based methods lack adaptability to changing environments. To address this, we propose a neural network-based hard input mining model, denoted as $h$. This model determines if a data input for edge inference is a hard or easy input, represented as $h(f(x)) \in [0,1]$. Based on this hard input mining model, the result of cloud-edge collaborative inference is:

$$(F,f,h)(x) = \begin{cases} f(x) & \text{if } h(f(x)) > \delta \\ F(x) & \text{otherwise.} \end{cases} \tag{2}$$
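The routing rule in (2) can be sketched as a small dispatcher. The function and parameter names below (`co_infer`, `edge_model`, `hard_input_miner`, `cloud_assisted`) are placeholders for $f$, $h$, and $F$ and are assumptions of this sketch, not names from the LAECIPS code. The returned tag is added only to make the routing decision visible.

```python
from typing import Any, Callable, Tuple


def co_infer(
    x: Any,
    edge_model: Callable[[Any], Any],            # f: runs on the edge device
    hard_input_miner: Callable[[Any], float],    # h: score in [0, 1] computed from f(x)
    cloud_assisted: Callable[[Any], Any],        # F: large-model-assisted inference
    delta: float,                                # decision threshold from (2)
) -> Tuple[Any, str]:
    """Return the inference result and which path ("edge" or "cloud") produced it."""
    y_edge = edge_model(x)
    # A score above delta marks an easy input: keep the edge result.
    if hard_input_miner(y_edge) > delta:
        return y_edge, "edge"
    # Otherwise the input is treated as hard and is sent to the cloud.
    return cloud_assisted(x), "cloud"
```

Note that the cloud path is invoked only for inputs that fail the threshold test, which is what keeps communication overhead low for easy inputs.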

We denote the loss of the model’s output as

$$L(F,f,h,x,y) = \begin{cases} l(f(x),y) & \text{if } h(f(x)) > \delta \\ l(F(x),y) & \text{otherwise.} \end{cases} \tag{3}$$

and the inference latency of the model as

$$delay(F,f,h,x) = \begin{cases} d(f(x)) & \text{if } h(f(x)) > \delta \\ d(F(x)) & \text{otherwise.} \end{cases} \tag{4}$$

Then, the loss of cloud-edge collaborative inference is:

$$\begin{aligned} &\mathbb{E}_{P(x,y)}\mathbb{E}_{h(f(x))}[L(F,f,h,x,y)] = \\ &\mathbb{E}_{P(x,y)}[h(f(x)) * l(f(x),y) + (1-h(f(x))) * l(F(x),y)], \end{aligned} \tag{5}$$

and the latency of cloud-edge collaborative inference is:

$$\begin{aligned} &\mathbb{E}_{P(x,y)}\mathbb{E}_{h(f(x))}[delay(F,f,h,x)] = \\ &\mathbb{E}_{P(x,y)}[h(f(x)) * d(f(x)) + (1-h(f(x))) * d(F(x))]. \end{aligned} \tag{6}$$

We aim to improve inference accuracy while meeting the requirements of task process latency. Therefore, the overall optimization objective is:

$$\begin{aligned} \min_{F, f \in \mathbb{F}, h \in \mathbb{H}} \ & \mathbb{E}_{P(x,y)}\mathbb{E}_{h(f(x))}[L(F,f,h,x,y)] \\ \text{s.t.} \ & \mathbb{E}_{P(x,y)}\mathbb{E}_{h(f(x))}[delay(F,f,h,x)] < delay_{max}. \end{aligned} \tag{7}$$

Since the inference delay is independent of the inference inputs, we can simplify the inference delay as follows:

$$\begin{aligned} d(f(x)) &= d(f) = d_1 \\ d(F(x)) &= d(F) = d_0. \end{aligned} \tag{8}$$

Thus, the cloud-edge collaborative inference delay in (6) can be simplified as:

$$\begin{aligned} &\mathbb{E}_{P(x,y)}\mathbb{E}_{h(f(x))}[delay(F,f,h,x)] = \\ &(d_1 - d_0) * \mathbb{E}_{P(x,y)}[h(f(x))] + d_0. \end{aligned} \tag{9}$$

Since edge inference is faster than cloud-assisted inference ($d_1 < d_0$), dividing by $d_1 - d_0 < 0$ reverses the inequality, and the constraint in (7) can be simplified as:

$$\mathbb{E}_{P(x,y)}[h(f(x))] > \frac{d_0 - delay_{max}}{d_0 - d_1}. \tag{10}$$
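To make the latency budget concrete, the sketch below evaluates the expected delay in (9) and solves it against the budget in (7) for the smallest admissible share of inputs handled at the edge. It assumes $d_1 < d_0$ (edge inference is faster than cloud-assisted inference); the function names and the numbers in the usage note are hypothetical and are not measurements from the paper.

```python
def expected_delay(edge_fraction: float, d1: float, d0: float) -> float:
    """Expected latency from (9): (d1 - d0) * E[h] + d0."""
    return (d1 - d0) * edge_fraction + d0


def min_edge_fraction(d1: float, d0: float, delay_max: float) -> float:
    """Strict lower bound on E[h] so that the expected delay stays below delay_max.

    Obtained by rearranging (d1 - d0) * E[h] + d0 < delay_max under d1 < d0.
    """
    if not d1 < d0:
        raise ValueError("this sketch assumes edge latency d1 < cloud latency d0")
    if delay_max <= d1:
        raise ValueError("budget cannot be met even if every input stays on the edge")
    if delay_max >= d0:
        return 0.0  # the bound is not binding: any edge share above 0 meets the budget
    return (d0 - delay_max) / (d0 - d1)
```

For instance, with made-up values `d1 = 0.05`, `d0 = 0.50`, and `delay_max = 0.20` (in seconds), `min_edge_fraction` gives a bound of two thirds: more than two thirds of the inputs must be classified as easy for the expected latency to stay within the budget.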

Thus, the optimization objective in (7), which satisfies the KKT conditions [26], can be rewritten as:

$$\min_{F, f \in \mathbb{F}, h \in \mathbb{H}} \mathbb{E}_{P(x,y)}\mathbb{E}_{h(f(x))}[L(F,f,h,x,y)] + \beta * \mathbb{E}_{P(x,y)}[-\log(h(f(x)))]. \tag{11}$$

III-D Adaptive Update Process

Since $F$ represents the large vision model-assisted inference function, there is no need to optimize $F$. Therefore, the optimization targets are $f$ and $h$. Additionally, since it is difficult to obtain the true label $y$ of a sample $x$ in a real environment, we optimize $f$ and $h$ using the large vision model-assisted inference result $F(x)$. The optimization objective in (11) can be further rewritten as:

$$\min_{f \in \mathbb{F}, h \in \mathbb{H}} \mathbb{E}_{P(x,y)}[h(f(x)) * l(f(x),F(x))] + \beta * \mathbb{E}_{P(x,y)}[-\log(h(f(x)))]. \tag{12}$$

The model update process can be divided into two steps. In the first step, we freeze $h$ and update $f$:

$$\begin{aligned}
L_f &= l(f(x), F(x)), \\
\theta_f &= \theta_f - \eta \nabla L_f.
\end{aligned} \tag{13}$$

Then, we freeze $f$ and update $h$:

$$\begin{aligned}
L_h &= h(f(x)) \cdot l(f(x), F(x)) + \beta \cdot \big(-\log(h(f(x)))\big), \\
\theta_h &= \theta_h - \eta \nabla L_h.
\end{aligned} \tag{14}$$
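To see what the objective in (14) asks of the hard input mining model, consider a single sample with a fixed segmentation loss $l$. The following minimal Python sketch is an illustration under simplifying assumptions, not the authors' implementation: the confidence is modeled as a sigmoid of one scalar parameter, and the names `loss_h`, `update_theta_h`, and `train_confidence`, as well as the default values of `beta`, `lr`, and `steps`, are hypothetical. Setting the derivative of $c \cdot l - \beta \log c$ with respect to the confidence $c$ to zero gives $c = \beta / l$ (capped at 1), so samples on which the edge model disagrees more with $F(x)$ are pushed toward lower confidence.

```python
import math


def sigmoid(z):
    """Map a real-valued parameter to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def loss_h(conf, seg_loss, beta):
    """L_h from (14) for one sample: conf * l + beta * (-log conf)."""
    return conf * seg_loss + beta * (-math.log(conf))


def update_theta_h(theta, seg_loss, beta, lr):
    """One gradient step on L_h with f frozen (seg_loss is a constant).

    Assumption for this toy example: h is a sigmoid of a single scalar theta.
    """
    conf = sigmoid(theta)
    # Chain rule: dL/dconf = l - beta / conf, dconf/dtheta = conf * (1 - conf).
    grad = (seg_loss - beta / conf) * conf * (1.0 - conf)
    return theta - lr * grad


def train_confidence(seg_loss, beta=0.1, lr=0.5, steps=2000):
    """Run repeated updates and return the learned confidence."""
    theta = 0.0
    for _ in range(steps):
        theta = update_theta_h(theta, seg_loss, beta, lr)
    return sigmoid(theta)
```

In this toy setting, a sample with loss 0.5 and `beta=0.1` should settle near the stationary point $\beta / l = 0.2$, whereas a sample whose loss is smaller than `beta` keeps moving toward confidence 1. This is one way to read the role of $\beta$ as the trade-off between trusting the edge result and uploading to the cloud.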

The overall workflow of LAECIPS combines the large vision model-assisted inference, hard input mining, and the adaptive update process, as shown in Algorithm 2.

Algorithm 2 Adaptive Edge-Cloud Collaboration algorithm

Initialize: Pretrained Edge Model: $\theta_f$, Cloud Model: $SAM$, Pretrained Hard Input Mining Model: $\theta_h$.
Parameters: confidence threshold: $threshold$, max replay buffer size: $maxsize$, max continual training interval: $maxtime$.
Output: Edge inference result: $edge\_result$, Co-inference result: $assisted\_result$.

1:  $replay\_buffer \leftarrow \emptyset$
2:  for $img \in samples$ do
3:     Edge Node Collect $img$
4:     $edge\_result \leftarrow f(img)$
5:     /* perform hard input mining */
6:     $confidence \leftarrow h(edge\_result)$
7:     if $confidence > threshold$ then
8:        output $edge\_result$
9:        Continue
10:     else
11:        Upload $img$ to Cloud Node
12:        /* perform large vision model assisted inference */
13:        $mask \leftarrow SAM(img)$
14:        Edge node Download $mask$ from Cloud Node
15:        $assisted\_result \leftarrow Assisted\_Inference(edge\_result, mask)$ by Algorithm 1
16:        output $assisted\_result$
17:        $replay\_buffer$ append $(img, assisted\_result)$
18:     end if
19:     /* perform adaptive update process */
20:     if $size(replay\_buffer) > maxsize$ or $timeinterval > maxtime$ then
21:        Train $\theta_f$ with $replay\_buffer$ by Equation (13)
22:        Train $\theta_h$ with $replay\_buffer$ by Equation (14)
23:        $replay\_buffer \leftarrow \emptyset$
24:     end if
25:  end for
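The control flow of Algorithm 2 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function name `run_collaboration` and all of its parameters are hypothetical, the models are passed in as plain callables, and the time interval is approximated by the number of samples seen since the last update instead of wall-clock time.

```python
def run_collaboration(samples, edge_model, hard_input_miner, cloud_model,
                      assisted_inference, train, threshold, max_size,
                      max_interval):
    """Route each sample to the edge or the cloud and trigger updates.

    All callables are placeholders supplied by the caller:
      edge_model(img)               -> edge_result        (f)
      hard_input_miner(edge_result) -> confidence in [0,1] (h)
      cloud_model(img)              -> mask               (e.g. SAM)
      assisted_inference(edge_result, mask) -> assisted_result (Algorithm 1)
      train(replay_buffer)          -> updates f and h via (13) and (14)
    """
    replay_buffer = []
    outputs = []
    since_update = 0  # assumption: sample count stands in for the time interval

    for img in samples:
        since_update += 1
        edge_result = edge_model(img)
        confidence = hard_input_miner(edge_result)

        if confidence > threshold:
            # Easy input: keep the edge result, skip the cloud and the update check.
            outputs.append(("edge", edge_result))
            continue

        # Hard input: ask the cloud model and fuse its mask with the edge result.
        mask = cloud_model(img)
        assisted_result = assisted_inference(edge_result, mask)
        outputs.append(("cloud", assisted_result))
        replay_buffer.append((img, assisted_result))

        # Adaptive update, reached only for uploaded samples as in Algorithm 2.
        if len(replay_buffer) > max_size or since_update > max_interval:
            train(replay_buffer)
            replay_buffer = []
            since_update = 0

    return outputs
```

Passing the models as callables mirrors the plug-and-play intent of the framework: any edge model, cloud model, or hard input mining strategy with these call shapes could be swapped in without touching the loop.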

IV Theoretical Analysis of LAECIPS

The system’s generalization ability greatly affects its effectiveness when deployed in a real-world dynamic IoT environment. In this section, we theoretically analyze the generalization bound of the proposed LAECIPS system to establish its feasibility.

Based on the optimization objective in (12), the expected loss for the semantic segmentation function $f$ and hard input mining strategy $h$ is defined as follows:

$$\begin{aligned}
R(f,h) &= \mathbb{E}_{P(x,y)}[h(f(x)) \cdot l(f(x),F(x))] \\
&\quad + \beta \cdot \mathbb{E}_{P(x,y)}[-\log(h(f(x)))].
\end{aligned} \tag{15}$$
Theorem 1

Let $\mathbb{F}$ be the family of semantic segmentation functions taking values in $[0,1]^{M\times H\times W}$, and let $\mathbb{H}$ be the family of hard input mining functions taking values in $[0,1]$. We denote by $\widehat{R}_S(f,h)$ the empirical loss of the pair $(f,h)$ over the sample $S$. Then, for any $\delta>0$, with probability at least $1-\delta$ over the draw of a sample $S$ of size $m$, the following holds for all $(f,h)\in\mathbb{F}\times\mathbb{H}$, where $\mathcal{R}_m$ represents the Rademacher complexity [27]:

$$R(f,h) \le \widehat{R}_S(f,h) + (1+\beta)\,\mathcal{R}_m(\mathbb{H}) + \mathcal{R}_m(\mathbb{F}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{16}$$
Proof 1

Let $l_{\mathbb{F},\mathbb{H}}$ be the family of functions $l_{\mathbb{F},\mathbb{H}}=\{(x,y)\rightarrow L(f,h,x,y) : f\in\mathbb{F}, h\in\mathbb{H}\}$. By the general Rademacher complexity bound [28], with probability at least $1-\delta$, the following holds for all $(f,h)\in\mathbb{F}\times\mathbb{H}$:

$$R(f,h) \le \widehat{R}_S(f,h) + 2\,\mathcal{R}_m(l_{\mathbb{F},\mathbb{H}}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{17}$$

Now, the Rademacher complexity can be bounded as follows:

$$\begin{aligned}
\mathcal{R}_m(l_{\mathbb{F},\mathbb{H}}) &= \mathbb{E}_{\sigma}\Big[\sup_{(f,h)\in\mathbb{F}\times\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\big(\sigma_i \cdot h(f(x_i)) \cdot l(f(x_i),F(x_i)) + \sigma_i \cdot \beta \cdot (-\log(h(f(x_i))))\big)\Big] \\
&\le \mathbb{E}_{\sigma}\Big[\sup_{(f,h)\in\mathbb{F}\times\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot h(f(x_i)) \cdot l(f(x_i),F(x_i))\Big] \\
&\quad + \beta \cdot \mathbb{E}_{\sigma}\Big[\sup_{(f,h)\in\mathbb{F}\times\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot (-\log(h(f(x_i))))\Big].
\end{aligned} \tag{18}$$
Lemma 1

Let $\mathbb{F}_1$ and $\mathbb{F}_2$ be two families of functions mapping $X$ to $[0,1]$. Let $\mathbb{F}=\{f_1 \cdot f_2 : f_1\in\mathbb{F}_1, f_2\in\mathbb{F}_2\}$. Then, for any sample $S$ of size $m$, the empirical Rademacher complexity of $\mathbb{F}$ is bounded as follows:

$$\widehat{\mathcal{R}}_S(\mathbb{F}) \le 2\big(\widehat{\mathcal{R}}_S(\mathbb{F}_1) + \widehat{\mathcal{R}}_S(\mathbb{F}_2)\big). \tag{19}$$

The proof of Lemma 1 can be found in [29].

By Lemma 1, the Rademacher complexity of products of indicator functions can be bounded by the sum of the Rademacher complexities of each indicator function class; thus:

$$\begin{aligned}
&\mathbb{E}_{\sigma}\Big[\sup_{(f,h)\in\mathbb{F}\times\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot h(f(x_i)) \cdot l(f(x_i),F(x_i))\Big] \\
&\le \mathbb{E}_{\sigma}\Big[\sup_{(f,h)\in\mathbb{F}\times\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot h(f(x_i))\Big] \\
&\quad + \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathbb{F}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot l(f(x_i),F(x_i))\Big].
\end{aligned} \tag{20}$$

So, the Rademacher complexity can be bounded as follows:

$$\begin{aligned}
\mathcal{R}_m(l_{\mathbb{F},\mathbb{H}}) &\le (1+\beta) \cdot \mathbb{E}_{\sigma}\Big[\sup_{(f,h)\in\mathbb{F}\times\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot h(f(x_i))\Big] \\
&\quad + \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathbb{F}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot l(f(x_i),F(x_i))\Big] \\
&\le (1+\beta) \cdot \mathbb{E}_{\sigma}\Big[\sup_{h\in\mathbb{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot h(f(x_i))\Big] \\
&\quad + \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathbb{F}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i \cdot l(f(x_i),F(x_i))\Big] \\
&\le (1+\beta)\,\mathcal{R}_m(\mathbb{H}) + \mathcal{R}_m(\mathbb{F}).
\end{aligned} \tag{21}$$

This theorem gives generalization guarantees for learning the semantic segmentation function $f$ and hard input mining function $h$ that admit Rademacher complexities in $O(\frac{1}{\sqrt{m}})$.

Theorem 1 indicates that the maximum generalization error of LAECIPS is bounded provided that the maximum generalization error of the semantic segmentation models and hard input mining strategy deployed in LAECIPS is controllable. Therefore, it is theoretically feasible to deploy large vision models, small models, and hard input mining strategies in the LAECIPS framework for co-inference in a plug-and-play manner.
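To get a feel for how the right-hand side of (16) behaves, the bound can be evaluated numerically. The sketch below is only an illustration: the function names `confidence_term` and `generalization_bound` are hypothetical, the Rademacher complexities are supplied by the caller as plain numbers (the paper does not give concrete values), and the logarithm is assumed to be the natural logarithm.

```python
import math


def confidence_term(m, delta):
    """Last term of (16): sqrt(log(1/delta) / (2m)), natural log assumed."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))


def generalization_bound(empirical_loss, beta, rad_h, rad_f, m, delta):
    """Right-hand side of (16).

    rad_h and rad_f stand for R_m(H) and R_m(F); they are inputs here
    because their values depend on the chosen model families.
    """
    return (empirical_loss
            + (1.0 + beta) * rad_h
            + rad_f
            + confidence_term(m, delta))
```

Since the last term scales as $1/\sqrt{m}$, quadrupling the sample size halves it, which matches the $O(\frac{1}{\sqrt{m}})$ rate mentioned above when the two complexity terms decay at the same rate.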

V Experiments

V-A Experimental Setup

Hardware and Software Systems

We implemented a system prototype of the proposed LAECIPS framework for real-world robotic semantic segmentation and conducted experiments on it for performance evaluation. In the hardware setup, we use the Nvidia Jetson Nano [30], which is commonly used in real-world robotic devices, as the edge node. For the cloud node, we use a Dell R750 server with a 48-core Intel Xeon Silver 4310 CPU @ 2.10GHz, 256GB of memory, and 2 Nvidia GeForce 3090 GPUs. The cloud node and edge node are connected via WLAN with a network bandwidth of 4 Mbps. We implemented LAECIPS using the distributed AI testing framework Ianvs [31] based on KubeEdge, deploying small models on the Jetson Nano and the large vision model on the Dell R750 server, as shown in Fig. 4.

Datasets

Semantic segmentation is a typical task in the IoT perception system and also a fundamental task in the fields of robotics and autonomous driving. To validate the effectiveness of our proposed LAECIPS in the real-world IoT perception environment, we selected four typical real-world semantic segmentation datasets:

  • The Cloud-Robotics dataset [32] contains 2600 semantic segmentation images collected by intelligent robotic dogs in the Shenzhen Industrial zone, mainly applicable to robot scenes in semi-enclosed areas.

  • The Cityscapes dataset [33] contains 5000 semantic segmentation images collected by smart cars in multiple cities in Germany, mainly applicable to autonomous driving scenes in open-world environments.

  • The ADE20K dataset [34] contains 20,000 semantic segmentation images, covering various scenes from indoor to outdoor, natural to urban, and can be used for tasks like scene understanding and image segmentation in robotics and autonomous driving.

  • The SYNTHIA dataset [35] contains 9,000 semantic segmentation images, which are photo-realistic frames rendered from a virtual city and come with precise pixel-level semantic annotations.

Figure 4: Experimental setup for LAECIPS

Compared Methods

We first compared three different baseline frameworks:

  • CLOUD: Upload all inputs to the cloud node for processing by the large vision model.

  • EDGE: Process all inputs on the edge node using the small model.

  • DCSB: DCSB [21] is the SOTA method for big/little model cooperation. The difference between this framework and our proposed LAECIPS framework is that DCSB does not dynamically update the small model.

Besides that, we also employed three typical hard input mining strategies, MESS [22], SM [16], and SPP [23], to evaluate the effectiveness and generalization of our LAECIPS method.

  • MESS is the SOTA method proposed for early exit semantic segmentation, which could also be used in hard input mining. It calculates the confidence score of an inference result by counting the proportion of pixels with a maximum probability distribution greater than a certain threshold:

    $$Confidence = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\mathbb{1}\big(c_{h,w}^{top1}(f(x)) \ge thre^{pix}\big) \tag{22}$$
  • SM is the classic method used in edge-cloud collaboration. It calculates the confidence score based on the difference between the maximum probability distribution and the second maximum probability distribution in the inference result:

    $$Confidence = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\big(c_{h,w}^{top1}(f(x)) - c_{h,w}^{top2}(f(x))\big) \tag{23}$$
  • SPP is the baseline method for hard input mining. It calculates the confidence score based on the maximum probability distribution in the inference result:

    $$Confidence = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} c_{h,w}^{top1}(f(x)) \tag{24}$$
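The three confidence scores in (22), (23), and (24) can be written compactly as follows. This is a minimal sketch, assuming the edge model output is available as a nested list `prob_map[h][w]` of per-class probabilities with at least two classes per pixel; the function names and this data layout are choices made for illustration and are not taken from the cited methods' code.

```python
def _pixels(prob_map):
    """Flatten an H x W map into a list of per-pixel class-probability lists."""
    return [probs for row in prob_map for probs in row]


def mess_confidence(prob_map, pixel_threshold):
    """Eq. (22): share of pixels whose top-1 probability reaches the threshold."""
    pixels = _pixels(prob_map)
    confident = sum(1 for probs in pixels if max(probs) >= pixel_threshold)
    return confident / len(pixels)


def sm_confidence(prob_map):
    """Eq. (23): mean gap between the top-1 and top-2 probabilities."""
    pixels = _pixels(prob_map)
    total = 0.0
    for probs in pixels:
        top1, top2 = sorted(probs, reverse=True)[:2]
        total += top1 - top2
    return total / len(pixels)


def spp_confidence(prob_map):
    """Eq. (24): mean top-1 probability over all pixels."""
    pixels = _pixels(prob_map)
    return sum(max(probs) for probs in pixels) / len(pixels)
```

All three scores return a value in $[0,1]$, so each can be compared against the same kind of confidence threshold when deciding whether to upload an input.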

For a fair comparison, the above three algorithms for hard input mining will be applied in the proposed framework in an online manner during the experimental process.

Evaluation Metrics

The metrics we test in the experiment include mIoU, Cloud Upload Rate (CUR), and latency. mIoU measures the model’s inference accuracy in semantic segmentation tasks. CUR represents the proportion of images uploaded to the cloud, reflecting the communication overhead of edge-cloud co-inference. The latency is the average time for completing the co-inference process for image inputs.

The calculation of the inference mIoU accuracy is as follows:

$$\begin{aligned}
mIoU(F,f,h) = \frac{1}{N}\sum_{i=1}^{N}\big[&\mathbb{1}(h(f(x_i)) \ge \delta) \cdot IoU(f(x_i), y_i) \\
+\,&\mathbb{1}(h(f(x_i)) < \delta) \cdot IoU(F(x_i), y_i)\big].
\end{aligned} \tag{25}$$

The calculation of the Cloud Upload Rate (CUR) is as follows:

$$CUR = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(h(f(x_i)) < \delta). \tag{26}$$

The calculation of the latency is as follows:

$$latency = \frac{1}{N}\sum_{i=1}^{N} delay(x_i). \tag{27}$$
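The three metrics in (25), (26), and (27) can be computed from per-sample records as sketched below. The function name `evaluate` and its argument names are hypothetical; the sketch assumes that the per-sample IoU of the edge result and of the cloud-assisted result, the confidence $h(f(x_i))$, and the measured delay are already available for each of the $N$ inputs.

```python
def evaluate(confidences, edge_ious, cloud_ious, delays, delta):
    """Return (mIoU, CUR, latency) over N samples.

    confidences[i] : h(f(x_i))
    edge_ious[i]   : IoU(f(x_i), y_i)
    cloud_ious[i]  : IoU(F(x_i), y_i)
    delays[i]      : measured delay of sample i
    delta          : confidence threshold
    """
    n = len(confidences)
    # A sample is uploaded when its confidence falls below the threshold.
    uploaded = [conf < delta for conf in confidences]

    # Eq. (25): use the cloud-assisted IoU for uploaded samples, edge IoU otherwise.
    miou = sum(cloud if up else edge
               for up, edge, cloud in zip(uploaded, edge_ious, cloud_ious)) / n
    # Eq. (26): fraction of uploaded samples.
    cur = sum(uploaded) / n
    # Eq. (27): mean per-sample delay.
    latency = sum(delays) / n
    return miou, cur, latency
```

Raising `delta` uploads more samples, which tends to increase both mIoU and latency; this is the accuracy-latency trade-off that the hard input mining strategy is meant to balance.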

To test the algorithms’ performance in dynamically changing environments, we divide each dataset into 5 tasks in chronological order. Fig. 5 shows the class frequency changes across the tasks of the four datasets, which effectively reflect the data drift and heterogeneity commonly seen in the real world. We uniformly use the pre-trained RFNet semantic segmentation model [36] for all algorithms to train and test on these 5 tasks. The learning rate and the number of continual training epochs of the RFNet model are set to 0.001 and 50, respectively.
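A chronological split of this kind can be sketched as follows. The exact splitting rule is not specified here, so this example assumes contiguous tasks of near-equal size; the names `split_into_tasks` and `class_frequency` are hypothetical, and for simplicity each sample carries one class label, whereas in semantic segmentation the frequencies would be counted over pixels.

```python
from collections import Counter


def split_into_tasks(samples, num_tasks=5):
    """Split a chronologically ordered list into contiguous, near-equal tasks."""
    base, extra = divmod(len(samples), num_tasks)
    tasks, start = [], 0
    for t in range(num_tasks):
        # The first `extra` tasks receive one additional sample.
        end = start + base + (1 if t < extra else 0)
        tasks.append(samples[start:end])
        start = end
    return tasks


def class_frequency(labels):
    """Relative frequency of each class in one task (labels must be non-empty)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}
```

Comparing `class_frequency` across consecutive tasks gives a simple view of the kind of drift that Fig. 5 visualizes.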

Figure 5: Change of class frequency in different tasks

V-B Experimental Result

Tables I and II and Figures 6, 7, 8, and 9 show the experimental results of LAECIPS and the other frameworks and algorithms on different datasets. Through these experimental results, we aim to answer the following research questions.

Q1.

How effective is the edge-cloud collaboration in our proposed LAECIPS framework?

Figure 6: Effectiveness of LAECIPS in Different Tasks.
TABLE I: Comparison of average mIoU, CUR, and inference latency for edge-cloud collaboration of LAECIPS
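| Dataset | Method | mIoU | CUR | latency |
|---|---|---|---|---|
| Cloud-Robotics | LAECIPS | 0.683 | 37.12% | 2.60 |
| Cloud-Robotics | Cloud | 0.735 | 100% | 5.11 |
| Cloud-Robotics | Edge | 0.442 | 0% | 1.12 |
| Cloud-Robotics | DCSB | 0.624 | 36.22% | 2.56 |
| Cityscapes | LAECIPS | 0.597 | 34.98% | 2.74 |
| Cityscapes | Cloud | 0.647 | 100% | 5.83 |
| Cityscapes | Edge | 0.396 | 0% | 1.09 |
| Cityscapes | DCSB | 0.537 | 35.86% | 2.79 |
| ADE20K | LAECIPS | 0.467 | 38.52% | 2.52 |
| ADE20K | Cloud | 0.504 | 100% | 4.88 |
| ADE20K | Edge | 0.352 | 0% | 1.05 |
| ADE20K | DCSB | 0.441 | 33.12% | 2.32 |
| SYNTHIA | LAECIPS | 0.591 | 31.71% | 2.33 |
| SYNTHIA | Cloud | 0.627 | 100% | 5.07 |
| SYNTHIA | Edge | 0.438 | 0% | 1.06 |
| SYNTHIA | DCSB | 0.547 | 31.81% | 2.34 |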

To answer this question, we make two observations from Fig. 6 and Table I, covering accuracy and latency. First, Fig. 6 shows the results of training and inference with the LAECIPS framework on the different datasets. Combined with the average mIoU in Table I, it can be observed that on the Cloud-Robotics dataset, LAECIPS improves the inference mIoU by 22.1% and 5.9% compared to edge inference and the DCSB framework, respectively, with only a 5.1% gap to cloud inference (all differences are absolute mIoU differences, i.e., percentage points). On the Cityscapes dataset, LAECIPS improves the inference mIoU by 20.1% and 6.0% compared to edge inference and DCSB, with only a 5.0% gap to cloud inference. On the ADE20K dataset, the improvements are 12.5% and 2.6%, with only a 3.7% gap to cloud inference. On the SYNTHIA dataset, the improvements are 15.3% and 4.4%, with only a 3.6% gap to cloud inference. These results demonstrate that LAECIPS effectively improves inference accuracy over edge-only inference and DCSB while staying close to cloud-only inference.

Second, Table I shows the average inference latency and CUR. Compared to performing all inference in the cloud, LAECIPS reduces the communication overhead by over 60% (a CUR of 31.71% to 38.52% instead of 100%) and roughly halves the average inference latency (2.33 to 2.74 versus 4.88 to 5.83). Compared to the current SOTA DCSB framework, LAECIPS has very similar inference latency and communication overhead. This shows that LAECIPS effectively reduces inference latency and communication overhead.
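One way to read the latency column of Table I is as a mixture of the edge-only and cloud-only latencies weighted by the CUR. The sketch below checks this reading against the reported numbers. The mixture relation is an assumption used here as a consistency check, not a formula stated in this section; the helper name `mixture_latency` is hypothetical, and the values are copied from Table I.

```python
# Edge-only and cloud-only latencies plus (CUR, latency) pairs copied from Table I.
TABLE_I = {
    "Cloud-Robotics": {"Edge": 1.12, "Cloud": 5.11,
                       "LAECIPS": (0.3712, 2.60), "DCSB": (0.3622, 2.56)},
    "Cityscapes":     {"Edge": 1.09, "Cloud": 5.83,
                       "LAECIPS": (0.3498, 2.74), "DCSB": (0.3586, 2.79)},
    "ADE20K":         {"Edge": 1.05, "Cloud": 4.88,
                       "LAECIPS": (0.3852, 2.52), "DCSB": (0.3312, 2.32)},
    "SYNTHIA":        {"Edge": 1.06, "Cloud": 5.07,
                       "LAECIPS": (0.3171, 2.33), "DCSB": (0.3181, 2.34)},
}


def mixture_latency(cur, edge_latency, cloud_latency):
    """Assumed model: a sample costs the edge latency if kept, the cloud latency if uploaded."""
    return (1.0 - cur) * edge_latency + cur * cloud_latency


rows = []
for dataset, row in TABLE_I.items():
    for method in ("LAECIPS", "DCSB"):
        cur, reported = row[method]
        estimate = mixture_latency(cur, row["Edge"], row["Cloud"])
        rows.append((dataset, method, reported, estimate))
        print(f"{dataset:15s} {method:8s} reported={reported:.2f} estimate={estimate:.3f}")
```

Under this assumption, every estimate lands within 0.01 of the reported latency, which suggests that the latency differences between LAECIPS and DCSB in Table I are driven almost entirely by their CURs.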

Q2.

Is our cloud-edge collaboration method more effective in identifying hard inputs compared to other hard input mining algorithms?

Figure 7: Hard input and easy input histogram of different algorithms in Cloud-Robotics Dataset
Figure 8: mIoU under different CUR
TABLE II: Average mIoU, CUR, and inference latency of different tasks.
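| Dataset | Method | Task 1 mIoU | Task 1 CUR | Task 2 mIoU | Task 2 CUR | Task 3 mIoU | Task 3 CUR | Task 4 mIoU | Task 4 CUR | Task 5 mIoU | Task 5 CUR | Avg mIoU | Avg CUR | Avg latency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cloud-Robotics | LAECIPS | 0.682 | 41.94% | 0.647 | 32.85% | 0.676 | 39.29% | 0.677 | 36.04% | 0.733 | 35.55% | 0.683 | 37.12% | 2.60 |
| Cloud-Robotics | DCSB | 0.639 | 35.70% | 0.610 | 37.83% | 0.664 | 29.15% | 0.534 | 38.17% | 0.674 | 40.35% | 0.624 | 36.22% | 2.56 |
| Cloud-Robotics | MESS | 0.654 | 61.38% | 0.505 | 30.83% | 0.545 | 55.76% | 0.497 | 33.14% | 0.679 | 32.72% | 0.576 | 42.93% | 2.83 |
| Cloud-Robotics | SM | 0.637 | 64.17% | 0.465 | 25.41% | 0.602 | 47.08% | 0.475 | 34.23% | 0.601 | 48.45% | 0.556 | 43.86% | 2.87 |
| Cloud-Robotics | SPP | 0.548 | 38.5% | 0.492 | 39.1% | 0.585 | 49.75% | 0.425 | 28.72% | 0.644 | 29.77% | 0.538 | 37.86% | 2.63 |
| Cityscapes | LAECIPS | 0.623 | 35.45% | 0.587 | 37.37% | 0.576 | 37.29% | 0.607 | 35.05% | 0.593 | 29.46% | 0.597 | 34.98% | 2.74 |
| Cityscapes | DCSB | 0.532 | 34.99% | 0.546 | 42.47% | 0.493 | 36.04% | 0.539 | 30.58% | 0.579 | 35.23% | 0.537 | 35.86% | 2.79 |
| Cityscapes | MESS | 0.564 | 56.38% | 0.475 | 40.83% | 0.545 | 49.26% | 0.497 | 43.14% | 0.529 | 41.22% | 0.522 | 46.33% | 3.28 |
| Cityscapes | SM | 0.557 | 54.67% | 0.535 | 50.41% | 0.522 | 49.08% | 0.475 | 37.23% | 0.551 | 53.45% | 0.528 | 49.76% | 3.44 |
| Cityscapes | SPP | 0.518 | 43.5% | 0.492 | 39.1% | 0.485 | 54.75% | 0.525 | 38.72% | 0.464 | 31.27% | 0.496 | 41.46% | 3.05 |
| ADE20K | LAECIPS | 0.473 | 39.45% | 0.457 | 37.37% | 0.476 | 41.29% | 0.441 | 35.05% | 0.493 | 39.46% | 0.468 | 38.52% | 2.50 |
| ADE20K | DCSB | 0.431 | 31.84% | 0.455 | 28.84% | 0.415 | 37.7% | 0.438 | 34.37% | 0.464 | 32.87% | 0.441 | 33.12% | 2.32 |
| ADE20K | MESS | 0.434 | 52.38% | 0.425 | 41.83% | 0.445 | 47.76% | 0.417 | 40.14% | 0.449 | 42.22% | 0.434 | 44.86% | 2.77 |
| ADE20K | SM | 0.387 | 34.67% | 0.405 | 47.91% | 0.412 | 51.58% | 0.435 | 39.23% | 0.43 | 48.45% | 0.414 | 44.36% | 2.75 |
| ADE20K | SPP | 0.408 | 45.0% | 0.392 | 37.1% | 0.385 | 44.75% | 0.425 | 43.72% | 0.434 | 36.27% | 0.408 | 41.37% | 2.64 |
| SYNTHIA | LAECIPS | 0.59 | 37.72% | 0.584 | 30.96% | 0.592 | 29.02% | 0.603 | 28.3% | 0.585 | 32.54% | 0.591 | 31.71% | 2.33 |
| SYNTHIA | DCSB | 0.523 | 29.07% | 0.546 | 32.83% | 0.566 | 27.69% | 0.564 | 37.16% | 0.538 | 32.33% | 0.547 | 31.81% | 2.34 |
| SYNTHIA | MESS | 0.534 | 40.71% | 0.468 | 39.33% | 0.523 | 38.68% | 0.493 | 33.75% | 0.478 | 36.82% | 0.499 | 37.86% | 2.58 |
| SYNTHIA | SM | 0.478 | 36.91% | 0.46 | 40.9% | 0.538 | 36.52% | 0.525 | 33.01% | 0.524 | 35.71% | 0.505 | 36.61% | 2.52 |
| SYNTHIA | SPP | 0.482 | 37.12% | 0.495 | 34.05% | 0.517 | 36.51% | 0.536 | 37.02% | 0.48 | 32.8% | 0.502 | 35.49% | 2.48 |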
Figure 9: mIoU and CUR under different tasks

We answer this question by making two observations from Fig. 7 and Fig. 8. First, we classify the samples $x$ that satisfy $mIoU(F(x))-mIoU(f(x))\geq 0.1$ as hard inputs. Fig. 7 shows how well the confidence scores of the different algorithms separate hard inputs from easy inputs. The MESS, SM, and SPP methods cannot clearly distinguish hard inputs from easy inputs based on the confidence score, whereas with LAECIPS most inputs with a confidence score greater than 0.75 are easy and most inputs with a confidence score less than 0.75 are hard, indicating that LAECIPS is more effective at distinguishing hard inputs from easy inputs.
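The hard-input labelling rule and the confidence-based separation discussed here can be sketched as follows. The 0.1 mIoU gap and the 0.75 confidence boundary come from the text; the function names and the four sample values are hypothetical and only illustrate the bookkeeping.

```python
import numpy as np

HARD_GAP = 0.1         # a sample is hard if mIoU(F(x)) - mIoU(f(x)) >= 0.1
CONF_THRESHOLD = 0.75  # confidence boundary reported for LAECIPS in Fig. 7


def label_hard_inputs(miou_cloud, miou_edge, gap=HARD_GAP):
    """Ground-truth hard/easy labels from the cloud-edge accuracy gap."""
    return (np.asarray(miou_cloud) - np.asarray(miou_edge)) >= gap


def separation_accuracy(confidence, is_hard, threshold=CONF_THRESHOLD):
    """Fraction of samples whose confidence falls on the correct side of the threshold."""
    predicted_hard = np.asarray(confidence) < threshold
    return float(np.mean(predicted_hard == is_hard))


# Hypothetical per-sample values (not taken from the paper).
miou_cloud = [0.80, 0.75, 0.70, 0.90]
miou_edge = [0.78, 0.50, 0.65, 0.55]
confidence = [0.92, 0.40, 0.70, 0.60]

is_hard = label_hard_inputs(miou_cloud, miou_edge)
accuracy = separation_accuracy(confidence, is_hard)
print(is_hard.tolist(), accuracy)
```

Here the third sample is easy (gap of 0.05) but receives a confidence below 0.75, so it is the one misclassified case; a histogram such as Fig. 7 shows the same information over a whole dataset.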

Second, as shown in Fig. 8, we tested the inference mIoU under different CURs by adjusting the threshold $\delta$ while keeping the same edge model. The inference accuracy of LAECIPS is higher than that of the other methods under different CURs. These results indicate that LAECIPS incurs less communication overhead than the other methods to reach the same level of inference accuracy, further validating its effectiveness in identifying hard inputs.
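A curve of the kind plotted in Fig. 8 can be traced by sweeping the threshold and recording the resulting CUR and mIoU at each setting. The following minimal sketch assumes per-sample confidence scores and IoU values for a fixed edge model are available; the function name `sweep_threshold` and the sample values are hypothetical.

```python
import numpy as np


def sweep_threshold(conf, iou_edge, iou_cloud, deltas):
    """Return one (CUR, mIoU) point per threshold for a fixed edge model."""
    conf = np.asarray(conf, dtype=float)
    iou_edge = np.asarray(iou_edge, dtype=float)
    iou_cloud = np.asarray(iou_cloud, dtype=float)
    curve = []
    for delta in deltas:
        uploaded = conf < delta  # samples sent to the cloud at this threshold
        miou = np.where(uploaded, iou_cloud, iou_edge).mean()
        curve.append((float(uploaded.mean()), float(miou)))
    return curve


# Hypothetical samples whose confidence tracks the edge model's accuracy.
curve = sweep_threshold(
    conf=[0.2, 0.5, 0.7, 0.9],
    iou_edge=[0.3, 0.4, 0.6, 0.8],
    iou_cloud=[0.7, 0.7, 0.7, 0.8],
    deltas=[0.0, 0.4, 0.6, 0.8, 1.0],
)
for cur, miou in curve:
    print(f"CUR={cur:.2f}  mIoU={miou:.3f}")
```

When the confidence score ranks the hardest inputs lowest, as in this toy case, most of the accuracy gain arrives at small CURs, which is the behaviour that a good hard input mining method should show.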

Q3.

Is the LAECIPS algorithm more adaptable to dynamic environmental changes?

We make two observations from Fig. 9 and Table II to answer this question. First, Fig. 9 shows the inference mIoU and CURs of the various algorithms on different tasks. As shown in Fig. 5, the data distributions of different tasks from the same dataset differ significantly, which affects both the semantic segmentation models and the hard input mining algorithms and leads to fluctuations in inference accuracy and CUR across tasks. Therefore, the performance variance of each evaluated method across tasks reflects its adaptability to dynamic environments.

The results indicate that the DCSB, MESS, SM, and SPP methods are noticeably affected by environmental changes in inference accuracy, CUR, or both, while LAECIPS remains relatively stable across tasks. LAECIPS achieves the highest mIoU in every task of all 4 datasets in the experiment, and its average mIoU exceeds that of the next-best method, DCSB, by 2.7 to 6.0 percentage points depending on the dataset. Second, Table II lists the accuracy and CUR for each task. Across tasks, the CUR of LAECIPS varies relatively little, while the MESS, SM, and SPP methods show significant fluctuations. DCSB also has a fairly stable CUR, but because it lacks adaptive updates for the small model, its accuracy still trails that of LAECIPS, further highlighting the importance of the adaptive update process used in the LAECIPS framework.
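The stability argument can be made concrete by computing the spread of each method's per-task results. The sketch below does this for the Cloud-Robotics rows of Table II, using the population standard deviation as the stability measure; that choice of measure is an assumption of this sketch, not a metric defined in the paper.

```python
from statistics import pstdev

# Per-task (mIoU, CUR in %) pairs for the Cloud-Robotics dataset, copied from Table II.
CLOUD_ROBOTICS = {
    "LAECIPS": [(0.682, 41.94), (0.647, 32.85), (0.676, 39.29), (0.677, 36.04), (0.733, 35.55)],
    "DCSB":    [(0.639, 35.70), (0.610, 37.83), (0.664, 29.15), (0.534, 38.17), (0.674, 40.35)],
    "MESS":    [(0.654, 61.38), (0.505, 30.83), (0.545, 55.76), (0.497, 33.14), (0.679, 32.72)],
    "SM":      [(0.637, 64.17), (0.465, 25.41), (0.602, 47.08), (0.475, 34.23), (0.601, 48.45)],
    "SPP":     [(0.548, 38.50), (0.492, 39.10), (0.585, 49.75), (0.425, 28.72), (0.644, 29.77)],
}

spread = {}
for method, tasks in CLOUD_ROBOTICS.items():
    mious, curs = zip(*tasks)
    # A smaller standard deviation across the 5 tasks means more stable behaviour.
    spread[method] = (pstdev(mious), pstdev(curs))
    print(f"{method:8s} mIoU std={spread[method][0]:.3f}  CUR std={spread[method][1]:.2f}")

most_stable_miou = min(spread, key=lambda m: spread[m][0])
most_stable_cur = min(spread, key=lambda m: spread[m][1])
print(most_stable_miou, most_stable_cur)
```

On these rows, LAECIPS has the smallest spread in both mIoU and CUR, DCSB is close in CUR but not in mIoU, and MESS, SM, and SPP fluctuate several times more in CUR, in line with the discussion above.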

VI Conclusion

This paper delves into the new problem of online cloud-edge collaborative training and inference in dynamic environments, underscored by large vision models in the IoT perception landscape. The crux of this problem lies in discerning optimal collaboration strategies that cater to the real-time demands of edge sensing and computing while bolstering inference accuracy. Our solution, the LAECIPS framework, decouples its primary constituents – a large vision model hosted on the cloud and a small model deployed at the edge – and employs a hard input mining-based co-inference strategy to optimize their collaboration. With LAECIPS, only the hard inputs are deferred to the cloud, and the edge model is adaptively updated, learning from the pre-trained large vision model outputs to ensure resilience to dynamic environmental shifts. The generalization error bound of LAECIPS has been derived, and comprehensive evaluations on real-world robotic semantic segmentation benchmarks have been conducted. Both theoretical and empirical results substantiate the viability and effectiveness of our proposed framework. We believe that our work lays a solid foundation for large vision model-assisted edge-cloud collaboration and facilitates the development of IoT perception systems. In future research, we will further extend the application of LAECIPS from IoT perception systems to other multimodal scenarios.

References

  • [1] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in CVPR 2021, 2021, pp. 7077–7087.
  • [2] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh, “Anomalynet: An anomaly detection network for video surveillance,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2537–2550, 2019.
  • [3] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
  • [4] M. M. H. Shuvo, S. K. Islam, J. Cheng, and B. I. Morshed, “Efficient acceleration of deep learning inference on resource-constrained edge devices: A review,” Proceedings of the IEEE, 2022.
  • [5] Y. Zhang, Y. Yao, P. Ram, P. Zhao, T. Chen, M. Hong, Y. Wang, and S. Liu, “Advancing model pruning via bi-level optimization,” Advances in Neural Information Processing Systems, vol. 35, pp. 18309–18326, 2022.
  • [6] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3366–3385, 2021.
  • [7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [8] T. Chen, Z. Mai, R. Li, and W. lun Chao, “Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation,” 2023.
  • [9] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
  • [10] S. Duan, D. Wang, J. Ren, F. Lyu, Y. Zhang, H. Wu, and X. Shen, “Distributed artificial intelligence empowered by end-edge-cloud computing: A survey,” IEEE Communications Surveys & Tutorials, 2022.
  • [11] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
  • [12] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “Jointdnn: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Transactions on Mobile Computing, vol. 20, no. 2, pp. 565–576, 2019.
  • [13] C. Hu, W. Bao, D. Wang, and F. Liu, “Dynamic adaptive dnn surgery for inference acceleration on the edge,” in IEEE INFOCOM 2019.   IEEE, 2019, pp. 1423–1431.
  • [14] H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, “Ionn: Incremental offloading of neural network computations from mobile devices to edge servers,” in SoCC 2018, 2018, pp. 401–411.
  • [15] Z. Zhao, K. M. Barijough, and A. Gerstlauer, “Deepthings: Distributed adaptive deep learning inference on resource-constrained iot edge clusters,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2348–2359, 2018.
  • [16] E. Park, D. Kim, S. Kim, Y.-D. Kim, G. Kim, S. Yoon, and S. Yoo, “Big/little deep neural network for ultra low power inference,” in CODES+ISSS 2015, 2015, pp. 124–132.
  • [17] U. Drolia, K. Guo, J. Tan, R. Gandhi, and P. Narasimhan, “Cachier: Edge-caching for recognition applications,” in ICDCS 2017, 2017, pp. 276–286.
  • [18] S. Ding, L. Li, Z. Li, H. Wang, and Y. Zhang, “Smart electronic gastroscope system using a cloud–edge collaborative framework,” Future Generation Computer Systems, vol. 100, pp. 395–407, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X18324324
  • [19] M. Li, Y. Li, Y. Tian, L. Jiang, and Q. Xu, “Appealnet: An efficient and highly-accurate edge/cloud collaborative architecture for dnn inference,” in DAC 2021, 2021, pp. 409–414.
  • [20] C. Ding, A. Zhou, Y. Liu, R. N. Chang, C.-H. Hsu, and S. Wang, “A cloud-edge collaboration framework for cognitive service,” IEEE Transactions on Cloud Computing, vol. 10, no. 3, pp. 1489–1499, 2022.
  • [21] Z. Cao, Z. Li, Y. Chen, H. Pan, Y. Hu, and J. Liu, “Edge-cloud collaborated object detection via difficult-case discriminator,” in 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS).   IEEE, 2023, pp. 259–270.
  • [22] A. Kouris, S. I. Venieris, S. Laskaridis, and N. Lane, “Multi-exit semantic segmentation networks,” in ECCV 2022.   Cham: Springer Nature Switzerland, 2022, pp. 330–349.
  • [23] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” 2018.
  • [24] Y. Du, Z. Fu, Q. Liu, and Y. Wang, “Weakly supervised semantic segmentation by pixel-to-prototype contrast,” in CVPR 2022, June 2022, pp. 4320–4329.
  • [25] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in CVPR 2016, June 2016.
  • [26] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” Traces and emergence of nonlinear programming, pp. 247–258, 2014.
  • [27] M. Mohri and A. Rostamizadeh, “Rademacher complexity bounds for non-i.i.d. processes,” in NIPS 2008, vol. 21.   Curran Associates, Inc., 2008.
  • [28] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” The Annals of Statistics, vol. 30, no. 1, pp. 1–50, 2002.
  • [29] G. DeSalvo, M. Mohri, and U. Syed, “Learning with deep cascades,” in Algorithmic Learning Theory: 26th International Conference, ALT 2015, Banff, AB, Canada, October 4-6, 2015, Proceedings 26.   Springer, 2015, pp. 254–269.
  • [30] Nvidia jetson nano. [Online]. Available: https://developer.nvidia.com/embedded/jetsonnano-developer-kit
  • [31] Kubeedge ianvs: Distributed synergy ai benchmarking. [Online]. Available: https://github.com/kubeedge/ianvs
  • [32] S. Hu, S. Mao, S. Luo, Z. Huang, Z. Zheng, J. Pu, and F. Wang, “Cloud robotics: a robotic semantic segmentation benchmark for lifelong learning,” [Online]. Available: https://kubeedge-ianvs.github.io/, 2023.
  • [33] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR 2016, 2016, pp. 3213–3223.
  • [34] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
  • [35] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [36] L. Sun, K. Yang, X. Hu, W. Hu, and K. Wang, “Real-time fusion network for rgb-d semantic segmentation incorporating unexpected obstacle detection for road-driving images,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5558–5565, 2020.