LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Perception System
Abstract
Recent large vision models (e.g., SAM) have great potential to facilitate intelligent perception with high accuracy. Yet the resource constraints of the IoT environment tend to prevent such large vision models from being deployed locally, and serving them remotely incurs considerable inference latency, making it difficult to support real-time applications such as autonomous driving and robotics. Edge-cloud collaboration with large-small model co-inference offers a promising approach to achieving high inference accuracy and low latency. However, existing edge-cloud collaboration methods are tightly coupled with the model architecture and cannot adapt to the dynamic data drifts in heterogeneous IoT environments. To address these issues, we propose LAECIPS, a new edge-cloud collaboration framework. In LAECIPS, both the large vision model on the cloud and the lightweight model on the edge are plug-and-play. We design an edge-cloud collaboration strategy based on hard input mining, optimized for both high accuracy and low latency. We propose to update the edge model and its collaboration strategy with the cloud under the supervision of the large vision model, so as to adapt to the dynamic IoT data streams. Theoretical analysis of LAECIPS proves its feasibility. Experiments conducted in a robotic semantic segmentation system using real-world datasets show that LAECIPS outperforms its state-of-the-art competitors in accuracy, latency, and communication overhead while having better adaptability to dynamic environments.
Index Terms:
Edge-Cloud Collaboration, Large Vision Model, Big/Little Model Cooperation, IoT-based Perception System
I Introduction
Machine Learning (ML) has been widely applied to support intelligent perception in the Internet of Things (IoTs) for various applications including robotic surveillance and autonomous driving [1, 2]. IoT-based perception often requires high accuracy and low latency of ML inference for meeting application requirements [3]. ML functions in IoTs are typically deployed on edge devices in user proximity to reduce inference latency. However, on the one hand, the constrained resources on IoT edge devices limit their abilities to support complex ML models [4]; on the other hand, the lightweight models on edge devices may suffer low inference accuracy, especially for corner cases [5]. In addition, data distribution drifts may occur in some perception scenarios (e.g., as a robot moves into an unexpected environment or an auto-pilot vehicle travels to an unexplored area) [6], which makes the pre-trained edge model less accurate for the new task.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/senario.jpg)
Recently, considerable progress has been made in developing large vision models, for example, the Segment Anything Model (SAM) from Meta [7]. With their strong generalization ability, such large vision models may achieve very high accuracy in handling corner cases and being robust to data distribution drifts in intelligent perception [8]. However, a large vision model can only be deployed in a resource-rich cloud data center, which may cause long latency due to data transmissions between user devices and the cloud server. Therefore, how to fully leverage the advantages of a large vision model for achieving accurate inference while reducing perception latency in the resource-constrained IoT becomes an important research problem.
To answer this question, one may consider edge-cloud collaboration for large-small model co-inference [9, 10]. In particular, with a large vision model hosted on the cloud and a small model deployed at the edge, an edge-cloud collaboration strategy determines for each received input whether the inference can be performed by the small edge model or needs to be processed by the large vision model on the cloud, as illustrated in Fig. 1. However, existing edge-cloud collaboration methods suffer from three main limitations that need to be overcome to support IoT-based intelligent perception. First, the tight coupling between the large and small models limits the flexibility of current methods in fully leveraging large vision models. Second, the collaboration strategy needs to be further optimized for both high accuracy and low latency, while remaining able to adapt to the dynamic IoT environment. Third, inference outputs from a large vision model (e.g., SAM) may lack semantic labels and thus need to be combined with the edge model inference results.
To address these three issues, we propose LAECIPS, a large vision model-assisted adaptive edge-cloud collaboration framework. In this framework, the large vision model on the cloud and the lightweight model deployed at the edge cooperate in a loosely coupled manner, which makes both models plug-and-play and greatly improves system flexibility. The framework employs an edge-cloud collaboration strategy that is based on hard input mining and optimized for both low response latency and high inference accuracy. It also enables online adjustment of the collaboration strategy and continual training of the edge model under the supervision of the large vision model, which makes the system adaptive to data distribution drifts in dynamic IoT environments.
Specific contributions of this paper are as follows.
1. This is the first study that explores the problem of edge-cloud collaborative perception for dynamic IoT data streams. We propose a novel edge-cloud collaboration framework, LAECIPS, to enable flexible utilization of both large and small models in an online manner to solve this problem.
2. In the LAECIPS framework, we design a hard input mining-based edge-cloud co-inference strategy that achieves higher accuracy and lower task processing latency.
3. In the LAECIPS framework, we propose the continual training of the small model to fit in with the dynamic environmental changes in the IoT environment.
4. We analyze the theoretical generalization capability of LAECIPS to prove the feasibility of incorporating large vision models, edge small models, and edge-cloud co-inference strategies into the LAECIPS framework in a plug-and-play manner.
5. We implement the proposed LAECIPS framework through a real-world robotic semantic segmentation system in a realistic edge-cloud environment to demonstrate its applicability. Extensive experimental results substantiate that LAECIPS achieves significantly higher accuracy, lower task processing latency, and lower communication overhead than its SOTA competitors.
The rest of the paper is structured as follows. Section II explains the related work. The technical details of LAECIPS are given in Section III. Section IV presents the theoretical proof of the generalization ability of LAECIPS. Experimental results are presented in Section V. Finally, we conclude the paper in Section VI.
II Related Work
Related research on cloud-edge collaborative inference falls into two categories: model partition and big/little model cooperation.
Model Partition
Model partition segments a (big) model into multiple sub-models that are deployed on different hosts, including the cloud server and edge device(s), based on their resource availability. During inference, the model is collaboratively computed across all sub-models to obtain the output result. For example, Neurosurgeon [11] uses a performance prediction model to select the optimal split point for a model. JointDNN [12] formulates the optimal scheduling of model layers as a shortest path problem and solves it using integer linear programming. DADS [13] formulates different model partition optimization problems for lightly and heavily loaded conditions. IONN [14] incrementally builds the model on the server using the arriving model partitions to enable early-stage training. DeepThings [15] fuses grids across layers of a DNN to construct a fine-grained model partition.
Although partitioning a complex model across the cloud and edge device(s) reduces computational costs and improves accuracy, it may introduce significant communication overheads for transmitting the intermediate results of the split model, which is often overwhelming to resource-constrained IoTs. Also, the sub-models deployed on the cloud and edge device(s) are tightly coupled thus limiting the flexibility and adaptability of model partitioning to face the dynamic IoT environments. In addition, it is difficult to directly apply the existing model partition methods to the recently developed large vision models due to their highly complex model structures.
Big/Little Model Cooperation
The idea of big/little model cooperation is to deploy a lightweight model on the edge device for simple data inference and use a big model on the cloud for handling difficult data. With an appropriate strategy for model selection, big/little model cooperation may achieve high accuracy and low latency with minimized communication overheads. Also, this approach allows loosely coupled models to be deployed on the edge and cloud for more flexibility and adaptability. Therefore, big/little model cooperation offers a promising approach to intelligent perception in the IoT environment.
Collaborative inference based on big/little model cooperation was first proposed in SM [16], where difficult samples were identified using score margin and uploaded to the cloud for inference. In Cachier [17], the interaction between edge and cloud was modeled as a caching system to minimize inference latency. CeDLD [18] applied big/little model collaborative inference to medical image recognition and identified difficult samples based on image similarity. AppealNet [19] transformed the edge model into a multi-head structure to simultaneously identify difficult samples while outputting the inference results. EdgeCNN [20] proposed a collaborative training method for big/little model cooperation that uses large vision model outputs to supervise the training of the small model on an edge device. The newly reported SOTA work is DCSB [21], which applied big/little model collaborative inference to object detection and adaptively down-sampled some regions of the difficult case to reduce bandwidth consumption. Besides that, works that focused on difficult data detection could also bring insights for big/little model collaboration. MESS [22] proposed an early exit method for semantic segmentation tasks, which could also be used in detecting difficult data. SPP [23] proposed a confidence score-based method to detect out-of-distribution examples which could also be regarded as hard samples.
Although encouraging progress has been made in this area, the SOTA technologies for big/little model cooperation still have some limitations that need to be overcome to effectively support intelligent perception in IoTs. Particularly, the current methods lack the capability of online updating for the edge model and adaptive adjustment of the collaboration strategy in response to the dynamic IoT environments. Also, with the rise of large vision models, current methods need further optimization to apply to large vision models.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/system-architecture.jpg)
III Proposed Method
III-A System Overview
We examine a scenario wherein a robot or autonomous vehicle outfitted with a camera is deployed at the edge. The edge node is tasked with executing real-time semantic segmentation duties, while the cloud node acts as a resource-abundant computing center, offering assistance to the edge node. In response to challenges related to the poor performance of edge models when confronted with corner cases, alongside concerns of data drift and heterogeneity within the edge environment, we have devised the LAECIPS architecture, illustrated in Fig. 2.
In this framework, a small semantic segmentation model is deployed on the edge device. Following the numbered steps in Fig. 2, the small model first conducts inference on the collected data inputs to yield small model inference results. Subsequently, the hard input mining module processes these results to categorize the collected data into two groups: hard inputs and easy inputs. The small model inference results of easy inputs, which have achieved an acceptable level of accuracy, are output directly to reduce processing latency. Conversely, hard inputs that cause low accuracy in edge inference are uploaded to the cloud for further processing to improve inference accuracy.
Next, both the small model and the SAM large vision model deployed in the cloud perform their respective inference on the uploaded hard inputs. The cloud inference masks are then fused with the small model inference results to yield co-inference results. The cloud node sends the co-inference results to the edge node, which outputs them as the inference results for hard inputs. Additionally, the hard inputs and the co-inference results are stored in the cloud node's replay buffer. When the replay buffer's sample count exceeds a predetermined threshold or a specified time interval elapses, the cloud node continually trains the small model, using the hard inputs paired with their co-inference results as the ground truth. Finally, the small model deployed on the edge node is updated with the small model trained on the cloud node.
III-B Large Vision Model assisted Inference
SAM is one of the most representative large vision models for IoT perception systems developed in recent years. Its prowess lies in its remarkable efficacy in image segmentation tasks, attributed to its strong generalization ability. Nonetheless, as illustrated in Fig. 3c, the SAM model, while adept at producing well-defined contours delineating segmented objects, falls short in providing semantic labels for these segmented images [8]. Consequently, it cannot be directly applied to semantic segmentation tasks. Edge models, in the course of inference, can provide segmented outcomes accompanied by semantic labels. However, these results often bear imperfections, notably exemplified by coarse object edges as depicted in Fig. 3b. In semantic segmentation tasks, the process of image segmentation frequently poses greater challenges than the subsequent labeling of the segmented outcomes [24]. Therefore, a natural idea is to combine the segmentation results from the SAM model with the classification labels from the edge model to obtain a more refined segmentation outcome with classification labels, as depicted in Fig. 3d.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/examples.png)
We consider a collected image of a given height and width, together with its corresponding label over a fixed number of classes. The images are assumed to follow an underlying probability distribution. We further distinguish the edge model from the large vision model in the cloud. Notably, the output produced by the large vision model is the mask of image segmentation. During SAM-assisted semantic segmentation inference, a sample is inferred by the edge model, resulting in a labeled albeit inaccurate segmentation outcome. Concurrently, the same sample is transmitted to the cloud and inferred by the large vision model, generating an unlabeled yet accurate segmentation outcome. The combination of the large vision model's inference outcome and that of the edge model forms the large vision model-assisted inference outcome, defined as follows with its description in Algorithm 1.
(1) |
Input: edge inference result, large vision model inference mask, number of classes
Output: large vision model-assisted inference result
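The body of Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of one plausible fusion rule, not the authors' procedure. It assumes the edge model returns per-pixel class scores of shape `(H, W, K)` and the large vision model returns a list of boolean region masks of shape `(H, W)`; every pixel inside a region then receives the majority edge label of that region. The function name `fuse_masks_with_labels` and the majority-vote rule are assumptions made for illustration.

```python
import numpy as np


def fuse_masks_with_labels(edge_probs, region_masks):
    """Assign one semantic label per large-model region (hypothetical fusion rule).

    edge_probs:   (H, W, K) per-pixel class scores from the edge model.
    region_masks: iterable of (H, W) boolean masks from the large vision model.
    Returns an (H, W) integer label map.
    """
    num_classes = edge_probs.shape[-1]
    edge_labels = edge_probs.argmax(axis=-1)  # labeled but coarse edge result
    fused = edge_labels.copy()                # pixels in no region keep the edge label
    for mask in region_masks:
        if not mask.any():
            continue
        # Majority vote of the edge labels inside this region.
        votes = np.bincount(edge_labels[mask], minlength=num_classes)
        fused[mask] = votes.argmax()
    return fused
```

With this rule, a region whose contour comes from the large vision model inherits a single class from the edge model, which smooths isolated mislabeled pixels inside the region. If regions overlap, later masks overwrite earlier ones in this sketch.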
The large vision model-assisted co-inference method can significantly improve semantic segmentation accuracy by refining the results generated by the edge model. However, it faces two challenges. On the one hand, it requires uploading samples to the cloud, potentially increasing latency. An edge-cloud collaboration strategy is therefore essential, as will be explained in Section III-C. On the other hand, the effectiveness of edge-cloud joint inference relies on the labels produced by the edge model. If the environment changes as the edge device moves, resulting in data drift and heterogeneity, the accuracy of the edge model degrades and the accuracy of large vision model-assisted co-inference diminishes accordingly. Therefore, we need to adaptively update the edge model and the edge-cloud collaboration strategy, which will be elaborated in Section III-D.
III-C Hard Input Mining Strategy
The hard input mining strategy is pivotal in the LAECIPS architecture. Identifying too many hard inputs results in high inference latency, while identifying too few decreases the accuracy of handling corner cases. Existing methods rely on loss values [25] or confidence scores [23] to identify hard inputs. However, calculating loss values during inference is challenging because the labels are unknown, and confidence score-based methods lack adaptability to changing environments. To address this, we propose a neural network-based hard input mining model, which determines whether a data input for edge inference is a hard or an easy input. Based on this hard input mining model, the result of cloud-edge collaborative inference is:
(2) |
We denote the loss of the model’s output as
(3) |
and the inference latency of the model as
(4) |
Then, the loss of cloud-edge collaborative inference is:
|
(5) |
and the latency of cloud-edge collaborative inference is:
|
(6) |
We aim to improve inference accuracy while meeting the requirements on task processing latency. Therefore, the overall optimization objective is:
(7) | |||
Since the inference delay is independent of the inference inputs, we can simplify the inference delay as follows:
(8) | |||
Thus, the cloud-edge collaborative inference delay in (6) can be simplified as:
(9) | |||
Therefore, the constraint in (7) can be simplified as:
(10) |
Thus, the optimization objective in (7), which satisfies the KKT conditions [26], can be rewritten as:
|
(11) |
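The routing rule and the two quantities being traded off in this subsection can be sketched empirically as follows. This is an illustrative sketch under stated assumptions, not the paper's formulation: the names `collaborative_inference`, `summarize`, and `RoutedResult` are hypothetical, the per-path latencies are treated as input-independent constants (as the text assumes), and `cloud_latency` is assumed to already include the upload and edge inference time for a hard input.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RoutedResult:
    prediction: Any
    used_cloud: bool
    latency: float


def collaborative_inference(x, edge_infer, cloud_assisted_infer, is_hard,
                            edge_latency, cloud_latency):
    """Route one input: easy inputs stay on the edge, hard inputs go to the cloud."""
    edge_pred = edge_infer(x)
    if is_hard(x, edge_pred):
        # Hard input: return the large vision model-assisted result.
        return RoutedResult(cloud_assisted_infer(x, edge_pred), True, cloud_latency)
    return RoutedResult(edge_pred, False, edge_latency)


def summarize(results, losses, latency_budget):
    """Empirical mean loss, mean latency, upload rate, and constraint check."""
    n = len(results)
    mean_latency = sum(r.latency for r in results) / n
    return {
        "mean_loss": sum(losses) / n,
        "mean_latency": mean_latency,
        "upload_rate": sum(r.used_cloud for r in results) / n,
        "within_budget": mean_latency <= latency_budget,
    }
```

Raising the fraction of inputs flagged as hard lowers the mean loss but raises the mean latency, which is the trade-off the constrained objective expresses.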
III-D Adaptive Update Process
Since the large vision model-assisted inference function is treated as given, there is no need to optimize it. Therefore, the optimization targets are the edge model and the hard input mining model. Additionally, since it is difficult to obtain the true label of a sample in a real environment, we optimize both of them using the large vision model-assisted inference result in place of the true label. The optimization objective in (11) can be further rewritten as:
|
(12) |
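The model update process alternates between the two optimization targets and can be divided into two steps. In the first step, we freeze one of the two models and update the other:
(13)
Then, we freeze the model that was just updated and update the one that was frozen in the first step:
(14)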
The overall workflow of LAECIPS combines the large vision model-assisted inference, hard input mining, and the adaptive update process, as shown in Algorithm 2.
Initialize: pretrained edge model, cloud model, pretrained hard input mining model.
Parameters: confidence threshold, max replay buffer size, max continual training interval.
Output: edge inference result, co-inference result.
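The body of Algorithm 2 is not reproduced in this text. The sketch below strings together the pieces named in its header (confidence threshold, replay buffer size, training interval) in the order described in Section III-A. It is a single-process illustration under assumptions: `CloudReplayBuffer`, `run_stream`, and the callables passed in are hypothetical names, inputs whose confidence is at or above the threshold are treated as easy, and the retraining trigger fires when the buffer reaches its size limit or the interval has elapsed.

```python
import time


class CloudReplayBuffer:
    """Stores (hard input, co-inference result) pairs on the cloud node."""

    def __init__(self, max_size, max_interval_s, clock=time.monotonic):
        self.max_size = max_size
        self.max_interval_s = max_interval_s
        self.clock = clock
        self.samples = []
        self.last_train = clock()

    def add(self, hard_input, co_result):
        self.samples.append((hard_input, co_result))

    def should_train(self):
        if not self.samples:
            return False
        full = len(self.samples) >= self.max_size
        stale = self.clock() - self.last_train >= self.max_interval_s
        return full or stale

    def drain(self):
        samples, self.samples = self.samples, []
        self.last_train = self.clock()
        return samples


def run_stream(inputs, edge_model, confidence, cloud_co_infer, retrain,
               buffer, threshold):
    """Process a stream; returns the outputs and the (possibly updated) edge model."""
    outputs = []
    for x in inputs:
        edge_pred = edge_model(x)
        if confidence(x, edge_pred) >= threshold:
            outputs.append(edge_pred)              # easy input: answer on the edge
        else:
            co_pred = cloud_co_infer(x, edge_pred)  # hard input: cloud co-inference
            outputs.append(co_pred)
            buffer.add(x, co_pred)                  # co-inference result acts as label
        if buffer.should_train():
            edge_model = retrain(edge_model, buffer.drain())
    return outputs, edge_model
```

In a real deployment the buffer and `retrain` would live on the cloud node and the updated weights would be pushed back to the edge; the sketch keeps everything in one loop only to make the control flow visible.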
IV Theoretical Analysis of LAECIPS
The system's generalization ability greatly affects its actual effectiveness when deployed in a real-world dynamic IoT environment. In this section, we theoretically analyze the generalization bound of the proposed LAECIPS system to prove its feasibility.
Based on the optimization objective from equation (12), the expected loss for the semantic segmentation function and hard input mining strategy is defined as follows:
(15) | ||||
Theorem 1
Let be the family of semantic segmentation functions taking values in , be the family of hard input mining functions taking values in . We denote by the empirical loss of function over the sample . Then, for any , with probability at least over the draw of a sample of size , the following holds for all , where represents the Rademacher complexity [27]:
(16) |
Proof 1
Let be the family of functions . By the general Rademacher complexity bound [28], with probability at least , the following holds for all :
(17) |
Now, the Rademacher complexity can be bounded as follows:
(18) | ||||
Lemma 1
By Lemma 1, the Rademacher complexity of products of indicator functions can be bounded by the sum of the Rademacher complexities of each indicator function class, thus:
(20) | ||||
So, the Rademacher complexity can be bounded as follows:
(21) | ||||
This theorem gives generalization guarantees for learning the semantic segmentation function and hard input mining function that admit Rademacher complexities in .
Theorem 1 indicates that the maximum generalization error of LAECIPS is bounded provided that the maximum generalization error of the semantic segmentation models and hard input mining strategy deployed in LAECIPS is controllable. Therefore, it is theoretically feasible to deploy large vision models, small models, and hard input mining strategies in the LAECIPS framework for co-inference in a plug-and-play manner.
V Experiments
V-A Experimental Setup
Hardware and Software Systems
We implemented a system prototype of the proposed LAECIPS framework for real-world robotic semantic segmentation and conducted experiments on it for performance evaluation. In the hardware setup, we use the Nvidia Jetson Nano [30], which is commonly used in real-world robotic devices, as the edge node. For the cloud node, we use a Dell R750 server with a 48-core Intel Xeon Silver 4310 CPU @ 2.10GHz, 256GB of memory, and 2 Nvidia GeForce 3090 GPUs. The cloud node and edge node are connected via WLAN with a network bandwidth of 4Mbps. We have implemented LAECIPS using the distributed AI testing framework Ianvs [31] based on KubeEdge, deploying small models on the Jetson Nano and the large vision model on the Dell R750 server, as shown in Fig. 4.
Datasets
Semantic segmentation is a typical task in the IoT perception system and also a fundamental task in the fields of robotics and autonomous driving. To validate the effectiveness of our proposed LAECIPS in the real-world IoT perception environment, we selected four typical real-world semantic segmentation datasets:
- The Cloud-Robotics dataset [32] contains 2,600 semantic segmentation images collected by intelligent robotic dogs in the Shenzhen Industrial zone, mainly applicable to robot scenes in semi-enclosed areas.
- The Cityscapes dataset [33] contains 5,000 semantic segmentation images collected by smart cars in multiple cities in Germany, mainly applicable to autonomous driving scenes in open-world environments.
- The ADE20K dataset [34] contains 20,000 semantic segmentation images, covering various scenes from indoor to outdoor and natural to urban, and can be used for tasks like scene understanding and image segmentation in robotics and autonomous driving.
- The SYNTHIA dataset [35] contains 9,000 semantic segmentation images, consisting of photo-realistic frames rendered from a virtual city with precise pixel-level semantic annotations.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/experiment-architecture.jpg)
Compared Methods
We first compared three different baseline frameworks:
- CLOUD: Upload all inputs to the cloud node for processing by the large vision model.
- EDGE: Process all inputs on the edge node using the small model.
- DCSB: DCSB [21] is the SOTA method for big/little model cooperation. The difference between this framework and our proposed LAECIPS framework is that DCSB does not dynamically update the small model.
Besides that, we also employed three typical hard input mining strategies, MESS [22], SM [16], and SPP [23], to evaluate the effectiveness and generalization of our LAECIPS method.
- MESS is the SOTA method proposed for early-exit semantic segmentation, which can also be used for hard input mining. It calculates the confidence score of an inference result as the proportion of pixels whose maximum class probability exceeds a certain threshold:
  (22)
- SM is the classic method used in edge-cloud collaboration. It calculates the confidence score based on the difference between the largest and the second-largest class probabilities in the inference result:
  (23)
- SPP is the baseline method for hard input mining. It calculates the confidence score based on the maximum class probability in the inference result:
  (24)
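Since the formulas (22) to (24) are not reproduced in this text, the three confidence scores can be sketched directly from their verbal descriptions. This is an illustration under assumptions: the input `probs` is taken to be an `(H, W, K)` array of per-pixel class probabilities, the function names are hypothetical, and averaging the per-pixel values of SM and SPP over the image is an assumed aggregation rather than one stated in the text.

```python
import numpy as np


def mess_confidence(probs, pixel_threshold=0.5):
    """Fraction of pixels whose top class probability exceeds pixel_threshold."""
    return float((probs.max(axis=-1) > pixel_threshold).mean())


def sm_confidence(probs):
    """Score margin: top-1 minus top-2 class probability, averaged over pixels."""
    top2 = np.sort(probs, axis=-1)[..., -2:]  # ascending order: [second, first]
    return float((top2[..., 1] - top2[..., 0]).mean())


def spp_confidence(probs):
    """Maximum class probability, averaged over pixels."""
    return float(probs.max(axis=-1).mean())
```

All three scores depend only on the edge model's output, so an input is flagged as hard when its score falls below a chosen threshold; none of them is trained, which is the adaptability limitation the learned hard input mining model is meant to address.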
For a fair comparison, the above three algorithms for hard input mining will be applied in the proposed framework in an online manner during the experimental process.
Evaluation Metrics
The metrics we test in the experiment include mIoU, Cloud Upload Rate (CUR), and latency. mIoU measures the model’s inference accuracy in semantic segmentation tasks. CUR represents the proportion of images uploaded to the cloud, reflecting the communication overhead of edge-cloud co-inference. The latency is the average time for completing the co-inference process for image inputs.
The calculation of the inference mIoU accuracy is as follows:
(25) | ||||
The calculation of the Cloud Upload Rate(CUR) is as follows:
(26) |
The calculation of the latency is as follows:
(27) |
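The three metrics can be sketched from their verbal definitions as follows. This is a minimal illustration, not the paper's exact formulas (25) to (27), which are not reproduced in this text: the function names are hypothetical, and the mIoU here is computed on a single pair of label maps while skipping classes that appear in neither, which is one common convention among several.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


def cloud_upload_rate(uploaded_flags):
    """Proportion of inputs that were uploaded to the cloud."""
    return sum(bool(f) for f in uploaded_flags) / len(uploaded_flags)


def average_latency(latencies):
    """Average time to complete co-inference per input."""
    return sum(latencies) / len(latencies)
```

For example, an input stream in which two of four images are uploaded has a CUR of 50%.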
To test the algorithms' performance in dynamically changing environments, we divide each dataset into 5 tasks in chronological order. Fig. 5 shows how the class frequencies change across the tasks of the four datasets, which effectively reflects common data drift and heterogeneity phenomena in the real world. We uniformly use the pre-trained RFNet semantic segmentation model [36] for the different algorithms to train and test on these 5 tasks. The learning rate and the number of continual training epochs of the RFNet model are set to 0.001 and 50, respectively.
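A chronological task split and the per-task class frequencies of the kind plotted in Fig. 5 can be sketched as follows. The text does not state how task boundaries were chosen, so the near-equal contiguous chunks used here are an assumption, and the names `split_into_tasks` and `class_frequencies` are hypothetical.

```python
from collections import Counter


def split_into_tasks(samples, num_tasks=5):
    """Split an already chronologically ordered list into contiguous chunks."""
    base, extra = divmod(len(samples), num_tasks)
    tasks, start = [], 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder over early tasks
        tasks.append(samples[start:start + size])
        start += size
    return tasks


def class_frequencies(label_lists):
    """Relative frequency of each class label across a task's samples."""
    counts = Counter(label for labels in label_lists for label in labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}
```

Comparing the dictionaries returned by `class_frequencies` for consecutive tasks gives a simple view of how much the class distribution drifts from one task to the next.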
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/class-frequency-all.jpg)
V-B Experimental Result
Tables I and II and Figs. 6, 7, 8, and 9 show the experimental results of LAECIPS and the other frameworks and algorithms on different datasets. Through these experimental results, we aim to answer the following research questions.
- Q1. How effective is the edge-cloud collaboration in our proposed LAECIPS framework?
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/ablation-result.jpg)
Dataset | Method | mIoU | CUR | latency |
---|---|---|---|---|
Cloud-Robotics | LAECIPS | 0.683 | 37.12% | 2.60 |
Cloud-Robotics | Cloud | 0.735 | 100% | 5.11 |
Cloud-Robotics | Edge | 0.442 | 0% | 1.12 |
Cloud-Robotics | DCSB | 0.624 | 36.22% | 2.56 |
Cityscapes | LAECIPS | 0.597 | 34.98% | 2.74 |
Cityscapes | Cloud | 0.647 | 100% | 5.83 |
Cityscapes | Edge | 0.396 | 0% | 1.09 |
Cityscapes | DCSB | 0.537 | 35.86% | 2.79 |
ADE20K | LAECIPS | 0.467 | 38.52% | 2.52 |
ADE20K | Cloud | 0.504 | 100% | 4.88 |
ADE20K | Edge | 0.352 | 0% | 1.05 |
ADE20K | DCSB | 0.441 | 33.12% | 2.32 |
SYNTHIA | LAECIPS | 0.591 | 31.71% | 2.33 |
SYNTHIA | Cloud | 0.627 | 100% | 5.07 |
SYNTHIA | Edge | 0.438 | 0% | 1.06 |
SYNTHIA | DCSB | 0.547 | 31.81% | 2.34 |
To answer this question, we make two observations from Fig. 6 and Table I, based on accuracy and latency. First, Fig. 6 shows the results of training and inference using the LAECIPS framework on different datasets. Combined with the average mIoU values in Table I, it can be observed that on the Cloud-Robotics dataset, LAECIPS improves the inference mIoU by 24.1 and 5.9 percentage points over edge inference and the DCSB framework, respectively, while trailing cloud inference by only 5.2 percentage points. On the Cityscapes dataset, the improvements are 20.1 and 6.0 percentage points, with a gap of only 5.0 percentage points to cloud inference. On the ADE20K dataset, the improvements are 11.5 and 2.6 percentage points, with a gap of only 3.7 percentage points to cloud inference. On the SYNTHIA dataset, the improvements are 15.3 and 4.4 percentage points, with a gap of only 3.6 percentage points to cloud inference. These results demonstrate that LAECIPS can effectively improve the model's inference accuracy.
Second, Table I shows the average inference latency and CURs. Compared with performing all inference in the cloud, LAECIPS reduces the average inference latency by roughly half (48% to 54% across the four datasets) and the communication overhead, measured by CUR, by over 60%. Compared with the current SOTA DCSB framework, LAECIPS has very similar inference latency and communication overhead. This indicates that LAECIPS can effectively reduce inference latency and communication overhead.
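As a worked check on Table I, the reported LAECIPS latencies are numerically consistent with a simple mixture of the edge-only and cloud-only latencies weighted by the CUR. This is an observation about the reported numbers rather than a formula stated by the authors; the dictionary `TABLE_I` copies the values from Table I and the function names are hypothetical.

```python
# LAECIPS CUR and reported latency, plus Edge and Cloud latency, from Table I.
TABLE_I = {
    "Cloud-Robotics": {"cur": 0.3712, "edge": 1.12, "cloud": 5.11, "reported": 2.60},
    "Cityscapes":     {"cur": 0.3498, "edge": 1.09, "cloud": 5.83, "reported": 2.74},
    "ADE20K":         {"cur": 0.3852, "edge": 1.05, "cloud": 4.88, "reported": 2.52},
    "SYNTHIA":        {"cur": 0.3171, "edge": 1.06, "cloud": 5.07, "reported": 2.33},
}


def mixture_latency(cur, edge, cloud):
    """Average latency if a fraction `cur` of inputs takes the cloud path."""
    return (1.0 - cur) * edge + cur * cloud


def latency_saving(reported, cloud):
    """Relative latency reduction versus sending everything to the cloud."""
    return 1.0 - reported / cloud


for name, row in TABLE_I.items():
    predicted = mixture_latency(row["cur"], row["edge"], row["cloud"])
    saving = latency_saving(row["reported"], row["cloud"])
    print(f"{name}: mixture={predicted:.2f}, reported={row['reported']:.2f}, "
          f"saving vs cloud={saving:.1%}")
```

Under this reading, the latency of a collaborative method is driven almost entirely by its upload rate, which is why LAECIPS and DCSB, with similar CURs, also show similar latencies.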
- Q2. Is our cloud-edge collaboration method more effective in identifying hard inputs compared to other hard input mining algorithms?
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/hard-example-result.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/Comparision-of-Different-Cloud-Update-Rate.jpg)
Dataset | Method | Task 1 mIoU | Task 1 CUR | Task 2 mIoU | Task 2 CUR | Task 3 mIoU | Task 3 CUR | Task 4 mIoU | Task 4 CUR | Task 5 mIoU | Task 5 CUR | avg mIoU | avg CUR | avg latency |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cloud-Robotics | LAECIPS | 0.682 | 41.94% | 0.647 | 32.85% | 0.676 | 39.29% | 0.677 | 36.04% | 0.733 | 35.55% | 0.683 | 37.12% | 2.60 |
Cloud-Robotics | DCSB | 0.639 | 35.70% | 0.610 | 37.83% | 0.664 | 29.15% | 0.534 | 38.17% | 0.674 | 40.35% | 0.624 | 36.22% | 2.56 |
Cloud-Robotics | MESS | 0.654 | 61.38% | 0.505 | 30.83% | 0.545 | 55.76% | 0.497 | 33.14% | 0.679 | 32.72% | 0.576 | 42.93% | 2.83 |
Cloud-Robotics | SM | 0.637 | 64.17% | 0.465 | 25.41% | 0.602 | 47.08% | 0.475 | 34.23% | 0.601 | 48.45% | 0.556 | 43.86% | 2.87 |
Cloud-Robotics | SPP | 0.548 | 38.5% | 0.492 | 39.1% | 0.585 | 49.75% | 0.425 | 28.72% | 0.644 | 29.77% | 0.538 | 37.86% | 2.63 |
Cityscapes | LAECIPS | 0.623 | 35.45% | 0.587 | 37.37% | 0.576 | 37.29% | 0.607 | 35.05% | 0.593 | 29.46% | 0.597 | 34.98% | 2.74 |
Cityscapes | DCSB | 0.532 | 34.99% | 0.546 | 42.47% | 0.493 | 36.04% | 0.539 | 30.58% | 0.579 | 35.23% | 0.537 | 35.86% | 2.79 |
Cityscapes | MESS | 0.564 | 56.38% | 0.475 | 40.83% | 0.545 | 49.26% | 0.497 | 43.14% | 0.529 | 41.22% | 0.522 | 46.33% | 3.28 |
Cityscapes | SM | 0.557 | 54.67% | 0.535 | 50.41% | 0.522 | 49.08% | 0.475 | 37.23% | 0.551 | 53.45% | 0.528 | 49.76% | 3.44 |
Cityscapes | SPP | 0.518 | 43.5% | 0.492 | 39.1% | 0.485 | 54.75% | 0.525 | 38.72% | 0.464 | 31.27% | 0.496 | 41.46% | 3.05 |
ADE20K | LAECIPS | 0.473 | 39.45% | 0.457 | 37.37% | 0.476 | 41.29% | 0.441 | 35.05% | 0.493 | 39.46% | 0.468 | 38.52% | 2.50 |
ADE20K | DCSB | 0.431 | 31.84% | 0.455 | 28.84% | 0.415 | 37.7% | 0.438 | 34.37% | 0.464 | 32.87% | 0.441 | 33.12% | 2.32 |
ADE20K | MESS | 0.434 | 52.38% | 0.425 | 41.83% | 0.445 | 47.76% | 0.417 | 40.14% | 0.449 | 42.22% | 0.434 | 44.86% | 2.77 |
ADE20K | SM | 0.387 | 34.67% | 0.405 | 47.91% | 0.412 | 51.58% | 0.435 | 39.23% | 0.43 | 48.45% | 0.414 | 44.36% | 2.75 |
ADE20K | SPP | 0.408 | 45.0% | 0.392 | 37.1% | 0.385 | 44.75% | 0.425 | 43.72% | 0.434 | 36.27% | 0.408 | 41.37% | 2.64 |
SYNTHIA | LAECIPS | 0.59 | 37.72% | 0.584 | 30.96% | 0.592 | 29.02% | 0.603 | 28.3% | 0.585 | 32.54% | 0.591 | 31.71% | 2.33 |
SYNTHIA | DCSB | 0.523 | 29.07% | 0.546 | 32.83% | 0.566 | 27.69% | 0.564 | 37.16% | 0.538 | 32.33% | 0.547 | 31.81% | 2.34 |
SYNTHIA | MESS | 0.534 | 40.71% | 0.468 | 39.33% | 0.523 | 38.68% | 0.493 | 33.75% | 0.478 | 36.82% | 0.499 | 37.86% | 2.58 |
SYNTHIA | SM | 0.478 | 36.91% | 0.46 | 40.9% | 0.538 | 36.52% | 0.525 | 33.01% | 0.524 | 35.71% | 0.505 | 36.61% | 2.52 |
SYNTHIA | SPP | 0.482 | 37.12% | 0.495 | 34.05% | 0.517 | 36.51% | 0.536 | 37.02% | 0.48 | 32.8% | 0.502 | 35.49% | 2.48 |
![Refer to caption](https://cdn.statically.io/img/arxiv.org/extracted/2404.10498v1/Comparision-of-Different-Algorithms.jpg)
We answer this question by making two observations from Fig. 7 and Fig. 8. Firstly, we classify the samples that satisfy the condition as hard inputs. Fig. 7 shows the differentiation between hard inputs and easy inputs based on the confidence scores of different algorithms. It can be seen that MESS, SM, and SPP methods are unable to clearly distinguish hard inputs from easy inputs based on confidence score, while the LAECIPS method can identify most inputs with a confidence score greater than 0.75 as easy and most inputs with a confidence score less than 0.75 as hard, indicating that the LAECIPS method is more effective in distinguishing hard inputs from easy inputs.
Secondly, as shown in Fig. 8, we measured the inference accuracy (mIoU) under different CURs by adjusting the threshold while keeping the edge model fixed. The inference accuracy of LAECIPS is higher than that of the other methods across the different CURs. Equivalently, LAECIPS incurs less communication overhead than the other methods to reach the same level of inference accuracy, which further validates its effectiveness in identifying hard inputs.
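The threshold sweep behind this kind of accuracy-versus-CUR curve can be sketched in Python as follows. This is a minimal sketch under stated assumptions: CUR is taken here to be the fraction of inputs sent to the cloud, per-input quality scores are simply averaged as a stand-in for mIoU, and the function name `sweep_thresholds` and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    confidences: Sequence[float],
    edge_scores: Sequence[float],
    cloud_scores: Sequence[float],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, CUR, mean score) for each threshold.

    An input is uploaded when its confidence is below the threshold; its
    score then comes from the cloud model, otherwise from the edge model.
    """
    n = len(confidences)
    if n == 0 or n != len(edge_scores) or n != len(cloud_scores):
        raise ValueError("inputs must be non-empty and of equal length")
    results = []
    for t in thresholds:
        uploaded = [c < t for c in confidences]
        cur = sum(uploaded) / n  # share of inputs sent to the cloud
        total = sum(
            cloud_scores[i] if uploaded[i] else edge_scores[i] for i in range(n)
        )
        results.append((t, cur, total / n))
    return results


if __name__ == "__main__":
    # Toy data (hypothetical): four inputs with edge and cloud quality scores.
    conf = [0.9, 0.8, 0.6, 0.4]
    edge = [0.7, 0.6, 0.3, 0.2]
    cloud = [0.8, 0.7, 0.7, 0.6]
    for t, cur, score in sweep_thresholds(conf, edge, cloud, [0.5, 0.75]):
        print(f"threshold={t:.2f}  CUR={cur:.2f}  mean score={score:.3f}")
```

Raising the threshold uploads more inputs, so CUR and accuracy both grow; a better hard input mining score reaches a given accuracy at a lower CUR, which is the comparison Fig. 8 reports.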
- Q3. Is the LAECIPS algorithm more adaptable to dynamic environmental changes?
We make two observations from Fig. 9 and Table II to answer this question. Firstly, Fig. 9 shows the inference accuracy (mIoU) and CUR of each algorithm on different tasks. As shown in Fig. 5, the data distributions of different tasks from the same dataset differ significantly. These differences affect both the semantic segmentation models and the hard input mining algorithms, causing inference accuracy and CUR to fluctuate across tasks. Therefore, the variance in a method's performance across tasks reflects its adaptability to dynamic environments.
The results indicate that the DCSB, MESS, SM, and SPP methods are strongly affected by environmental changes in both inference accuracy and CUR, while LAECIPS remains relatively stable across tasks. LAECIPS outperforms the other algorithms on the tasks across the 4 datasets in the experiment, with an average inference mIoU accuracy that is more than 5% higher than that of the other algorithms. Secondly, Table II shows the accuracy and CUR under different tasks. Across tasks, the CUR of LAECIPS varies relatively little, while the MESS, SM, and SPP methods show significant fluctuations. DCSB also has a stable CUR, but because it does not adaptively update the small model, its accuracy still trails that of LAECIPS, which further highlights the importance of the adaptive update process used in the LAECIPS framework.
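To make the notion of stability across tasks concrete, the following Python sketch computes the mean and spread of per-task mIoU and CUR for four methods. The numbers are copied from the rows of the results table that appear to belong to ADE20K, assuming the columns alternate between mIoU and CUR for five tasks. Using the population standard deviation as the spread measure is a choice made for this illustration; it is not necessarily the statistic the table itself reports.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Per-task (mIoU, CUR in %) pairs, copied from the ADE20K rows of the table.
# The row-to-dataset assignment is inferred from the table layout.
ADE20K: Dict[str, List[Tuple[float, float]]] = {
    "LAECIPS": [(0.473, 39.45), (0.457, 37.37), (0.476, 41.29),
                (0.441, 35.05), (0.493, 39.46)],
    "DCSB": [(0.431, 31.84), (0.455, 28.84), (0.415, 37.7),
             (0.438, 34.37), (0.464, 32.87)],
    "MESS": [(0.434, 52.38), (0.425, 41.83), (0.445, 47.76),
             (0.417, 40.14), (0.449, 42.22)],
    "SM": [(0.387, 34.67), (0.405, 47.91), (0.412, 51.58),
           (0.435, 39.23), (0.43, 48.45)],
}


def summarize(rows: List[Tuple[float, float]]) -> Dict[str, float]:
    """Mean and population standard deviation of mIoU and CUR over tasks."""
    miou = [m for m, _ in rows]
    cur = [c for _, c in rows]
    return {
        "miou_mean": mean(miou),
        "miou_std": pstdev(miou),
        "cur_mean": mean(cur),
        "cur_std": pstdev(cur),
    }


if __name__ == "__main__":
    for method, rows in ADE20K.items():
        s = summarize(rows)
        print(
            f"{method:8s} mIoU {s['miou_mean']:.3f} (std {s['miou_std']:.3f})  "
            f"CUR {s['cur_mean']:.2f}% (std {s['cur_std']:.2f})"
        )
```

Under these assumptions, a smaller CUR spread for a method indicates that its upload behavior changes less from task to task, which is the sense of adaptability used in the discussion above.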
VI Conclusion
This paper delves into the new problem of online cloud-edge collaborative training and inference in dynamic environments, underscored by large vision models in the IoT perception landscape. The crux of this problem lies in discerning optimal collaboration strategies that cater to the real-time demands of edge sensing and computing while bolstering inference accuracy. Our solution, the LAECIPS framework, decouples its primary constituents – a large vision model hosted on the cloud and a small model deployed at the edge – and employs a hard input mining-based co-inference strategy to optimize their collaboration. With LAECIPS, only the hard inputs are deferred to the cloud, and the edge model is adaptively updated, learning from the pre-trained large vision model outputs to ensure resilience to dynamic environmental shifts. The generalization error bound of LAECIPS has been derived, and comprehensive evaluations on real-world robotic semantic segmentation benchmarks have been conducted. Both theoretical and empirical results substantiate the viability and effectiveness of our proposed framework. We believe that our work lays a solid foundation for large vision model-assisted edge-cloud collaboration and facilitates the development of IoT perception systems. In future research, we will further extend the application of LAECIPS from IoT perception systems to other multimodal scenarios.
References
- [1] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in CVPR 2021, 2021, pp. 7077–7087.
- [2] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh, “Anomalynet: An anomaly detection network for video surveillance,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2537–2550, 2019.
- [3] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
- [4] M. M. H. Shuvo, S. K. Islam, J. Cheng, and B. I. Morshed, “Efficient acceleration of deep learning inference on resource-constrained edge devices: A review,” Proceedings of the IEEE, 2022.
- [5] Y. Zhang, Y. Yao, P. Ram, P. Zhao, T. Chen, M. Hong, Y. Wang, and S. Liu, “Advancing model pruning via bi-level optimization,” Advances in Neural Information Processing Systems, vol. 35, pp. 18 309–18 326, 2022.
- [6] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 7, pp. 3366–3385, 2021.
- [7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
- [8] T. Chen, Z. Mai, R. Li, and W.-L. Chao, “Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation,” 2023.
- [9] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
- [10] S. Duan, D. Wang, J. Ren, F. Lyu, Y. Zhang, H. Wu, and X. Shen, “Distributed artificial intelligence empowered by end-edge-cloud computing: A survey,” IEEE Communications Surveys & Tutorials, 2022.
- [11] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
- [12] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “Jointdnn: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Transactions on Mobile Computing, vol. 20, no. 2, pp. 565–576, 2019.
- [13] C. Hu, W. Bao, D. Wang, and F. Liu, “Dynamic adaptive dnn surgery for inference acceleration on the edge,” in IEEE INFOCOM 2019. IEEE, 2019, pp. 1423–1431.
- [14] H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, “Ionn: Incremental offloading of neural network computations from mobile devices to edge servers,” in SoCC 2018, 2018, pp. 401–411.
- [15] Z. Zhao, K. M. Barijough, and A. Gerstlauer, “Deepthings: Distributed adaptive deep learning inference on resource-constrained iot edge clusters,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2348–2359, 2018.
- [16] E. Park, D. Kim, S. Kim, Y.-D. Kim, G. Kim, S. Yoon, and S. Yoo, “Big/little deep neural network for ultra low power inference,” in CODES+ISSS 2015, 2015, pp. 124–132.
- [17] U. Drolia, K. Guo, J. Tan, R. Gandhi, and P. Narasimhan, “Cachier: Edge-caching for recognition applications,” in ICDCS 2017, 2017, pp. 276–286.
- [18] S. Ding, L. Li, Z. Li, H. Wang, and Y. Zhang, “Smart electronic gastroscope system using a cloud–edge collaborative framework,” Future Generation Computer Systems, vol. 100, pp. 395–407, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X18324324
- [19] M. Li, Y. Li, Y. Tian, L. Jiang, and Q. Xu, “Appealnet: An efficient and highly-accurate edge/cloud collaborative architecture for dnn inference,” in DAC 2021, 2021, pp. 409–414.
- [20] C. Ding, A. Zhou, Y. Liu, R. N. Chang, C.-H. Hsu, and S. Wang, “A cloud-edge collaboration framework for cognitive service,” IEEE Transactions on Cloud Computing, vol. 10, no. 3, pp. 1489–1499, 2022.
- [21] Z. Cao, Z. Li, Y. Chen, H. Pan, Y. Hu, and J. Liu, “Edge-cloud collaborated object detection via difficult-case discriminator,” in 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS). IEEE, 2023, pp. 259–270.
- [22] A. Kouris, S. I. Venieris, S. Laskaridis, and N. Lane, “Multi-exit semantic segmentation networks,” in ECCV 2022. Cham: Springer Nature Switzerland, 2022, pp. 330–349.
- [23] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” 2018.
- [24] Y. Du, Z. Fu, Q. Liu, and Y. Wang, “Weakly supervised semantic segmentation by pixel-to-prototype contrast,” in CVPR 2022, June 2022, pp. 4320–4329.
- [25] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in CVPR 2016, June 2016.
- [26] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” Traces and emergence of nonlinear programming, pp. 247–258, 2014.
- [27] M. Mohri and A. Rostamizadeh, “Rademacher complexity bounds for non-i.i.d. processes,” in NIPS 2008, vol. 21. Curran Associates, Inc., 2008.
- [28] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” The Annals of Statistics, vol. 30, no. 1, pp. 1–50, 2002.
- [29] G. DeSalvo, M. Mohri, and U. Syed, “Learning with deep cascades,” in Algorithmic Learning Theory: 26th International Conference, ALT 2015, Banff, AB, Canada, October 4-6, 2015, Proceedings 26. Springer, 2015, pp. 254–269.
- [30] Nvidia jetson nano. [Online]. Available: https://developer.nvidia.com/embedded/jetsonnano-developer-kit
- [31] Kubeedge ianvs: Distributed synergy ai benchmarking. [Online]. Available: https://github.com/kubeedge/ianvs
- [32] S. Hu, S. Mao, S. Luo, Z. Huang, Z. Zheng, J. Pu, and F. Wang, “Cloud robotics: a robotic semantic segmentation benchmark for lifelong learning,” [Online]. Available: https://kubeedge-ianvs.github.io/, 2023.
- [33] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR 2016, 2016, pp. 3213–3223.
- [34] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
- [35] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [36] L. Sun, K. Yang, X. Hu, W. Hu, and K. Wang, “Real-time fusion network for rgb-d semantic segmentation incorporating unexpected obstacle detection for road-driving images,” IEEE robotics and automation letters, vol. 5, no. 4, pp. 5558–5565, 2020.