subscribe to arXiv mailings

Speech Understanding on Tiny Devices with A Learning Cache

Authors: Afsara Benazir, Zhiming Xu, Felix Xiaozhu Lin

Abstract: This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to… ▽ More This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users. △ Less

Submitted 8 May, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: accepted at MobiSys'24

arXiv:2311.17065 [pdf, other]

Efficient Deep Speech Understanding at the Edge

Authors: Rongxiang Wang, Felix Xiaozhu Lin

Abstract: In contemporary speech understanding (SU), a sophisticated pipeline is employed, encompassing the ingestion of streaming voice input. The pipeline executes beam search iteratively, invoking a deep neural network to generate tentative outputs (referred to as hypotheses) in an autoregressive manner. Periodically, the pipeline assesses attention and Connectionist Temporal Classification (CTC) scores.… ▽ More In contemporary speech understanding (SU), a sophisticated pipeline is employed, encompassing the ingestion of streaming voice input. The pipeline executes beam search iteratively, invoking a deep neural network to generate tentative outputs (referred to as hypotheses) in an autoregressive manner. Periodically, the pipeline assesses attention and Connectionist Temporal Classification (CTC) scores. This paper aims to enhance SU performance on edge devices with limited resources. Adopting a hybrid strategy, our approach focuses on accelerating on-device execution and offloading inputs surpassing the device's capacity. While this approach is established, we tackle SU's distinctive challenges through innovative techniques: (1) Late Contextualization: This involves the parallel execution of a model's attentive encoder during input ingestion. (2) Pilot Inference: Addressing temporal load imbalances in the SU pipeline, this technique aims to mitigate them effectively. (3) Autoregression Offramps: Decisions regarding offloading are made solely based on hypotheses, presenting a novel approach. These techniques are designed to seamlessly integrate with existing speech models, pipelines, and frameworks, offering flexibility for independent or combined application. Collectively, they form a hybrid solution for edge SU. Our prototype, named XYZ, has undergone testing on Arm platforms featuring 6 to 8 cores, demonstrating state-of-the-art accuracy. Notably, it achieves a 2x reduction in end-to-end latency and a corresponding 2x decrease in offloading requirements. △ Less

Submitted 4 December, 2023; v1 submitted 22 November, 2023; originally announced November 2023.

arXiv:2310.02373 [pdf, other]

Secure and Effective Data Appraisal for Machine Learning

Authors: Xu Ouyang, Changhong Yang, Felix Xiaozhu Lin, Yangfeng Ji

Abstract: Essential for an unfettered data market is the ability to discreetly select and evaluate training data before finalizing a transaction between the data owner and model owner. To safeguard the privacy of both data and model, this process involves scrutinizing the target model through Multi-Party Computation (MPC). While prior research has posited that the MPC-based evaluation of Transformer models… ▽ More Essential for an unfettered data market is the ability to discreetly select and evaluate training data before finalizing a transaction between the data owner and model owner. To safeguard the privacy of both data and model, this process involves scrutinizing the target model through Multi-Party Computation (MPC). While prior research has posited that the MPC-based evaluation of Transformer models is excessively resource-intensive, this paper introduces an innovative approach that renders data selection practical. The contributions of this study encompass three pivotal elements: (1) a groundbreaking pipeline for confidential data selection using MPC, (2) replicating intricate high-dimensional operations with simplified low-dimensional MLPs trained on a limited subset of pertinent data, and (3) implementing MPC in a concurrent, multi-phase manner. The proposed method is assessed across an array of Transformer models and NLP/CV benchmarks. In comparison to the direct MPC-based evaluation of the target model, our approach substantially reduces the time required, from thousands of hours to mere tens of hours, with only a nominal 0.20% dip in accuracy when training with the selected data. △ Less

Submitted 24 January, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

arXiv:2212.05974 [pdf, other]

doi 10.1145/3570361.3613277

Federated Few-Shot Learning for Mobile NLP

Authors: Dongqi Cai, Shangguang Wang, Yaozong Wu, Felix Xiaozhu Lin, Mengwei Xu

Abstract: Natural language processing (NLP) sees rich mobile applications. To support various language understanding tasks, a foundation NLP model is often fine-tuned in a federated, privacy-preserving setting (FL). This process currently relies on at least hundreds of thousands of labeled training samples from mobile clients; yet mobile users often lack willingness or knowledge to label their data. Such an… ▽ More Natural language processing (NLP) sees rich mobile applications. To support various language understanding tasks, a foundation NLP model is often fine-tuned in a federated, privacy-preserving setting (FL). This process currently relies on at least hundreds of thousands of labeled training samples from mobile clients; yet mobile users often lack willingness or knowledge to label their data. Such an inadequacy of data labels is known as a few-shot scenario; it becomes the key blocker for mobile NLP applications. For the first time, this work investigates federated NLP in the few-shot scenario (FedFSL). By retrofitting algorithmic advances of pseudo labeling and prompt learning, we first establish a training pipeline that delivers competitive accuracy when only 0.05% (fewer than 100) of the training data is labeled and the remaining is unlabeled. To instantiate the workflow, we further present a system FeS, addressing the high execution cost with novel designs. (1) Curriculum pacing, which injects pseudo labels to the training workflow at a rate commensurate to the learning progress; (2) Representational diversity, a mechanism for selecting the most learnable data, only for which pseudo labels will be generated; (3) Co-planning of a model's training depth and layer capacity. Together, these designs reduce the training delay, client energy, and network traffic by up to 46.0$\times$, 41.2$\times$ and 3000.0$\times$, respectively. Through algorithm/system co-design, FFNLP demonstrates that FL can apply to challenging settings where most training samples are unlabeled. △ Less

Submitted 19 August, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: MobiCom 2023

arXiv:2212.00192 [pdf, other]

doi 10.1145/3578356.3592575

Towards Practical Few-shot Federated NLP

Authors: Dongqi Cai, Yaozong Wu, Haitao Yuan, Shangguang Wang, Felix Xiaozhu Lin, Mengwei Xu

Abstract: Transformer-based pre-trained models have emerged as the predominant solution for natural language processing (NLP). Fine-tuning such pre-trained models for downstream tasks often requires a considerable amount of labeled private data. In practice, private data is often distributed across heterogeneous mobile devices and may be prohibited from being uploaded. Moreover, well-curated labeled data is… ▽ More Transformer-based pre-trained models have emerged as the predominant solution for natural language processing (NLP). Fine-tuning such pre-trained models for downstream tasks often requires a considerable amount of labeled private data. In practice, private data is often distributed across heterogeneous mobile devices and may be prohibited from being uploaded. Moreover, well-curated labeled data is often scarce, presenting an additional challenge. To address these challenges, we first introduce a data generator for federated few-shot learning tasks, which encompasses the quantity and skewness of scarce labeled data in a realistic setting. Subsequently, we propose AUG-FedPrompt, a prompt-based federated learning system that exploits abundant unlabeled data for data augmentation. Our experiments indicate that AUG-FedPrompt can perform on par with full-set fine-tuning with a limited amount of labeled data. However, such competitive performance comes at a significant system cost. △ Less

Submitted 19 August, 2023; v1 submitted 30 November, 2022; originally announced December 2022.

Comments: EuroSys23 workshop

arXiv:2207.14386 [pdf, other]

Efficient NLP Model Finetuning via Multistage Data Filtering

Authors: Xu Ouyang, Shahina Mohd Azam Ansari, Felix Xiaozhu Lin, Yangfeng Ji

Abstract: As model finetuning is central to the modern NLP, we set to maximize its efficiency. Motivated by redundancy in training examples and the sheer sizes of pretrained models, we exploit a key opportunity: training only on important data. To this end, we set to filter training examples in a streaming fashion, in tandem with training the target model. Our key techniques are two: (1) automatically deter… ▽ More As model finetuning is central to the modern NLP, we set to maximize its efficiency. Motivated by redundancy in training examples and the sheer sizes of pretrained models, we exploit a key opportunity: training only on important data. To this end, we set to filter training examples in a streaming fashion, in tandem with training the target model. Our key techniques are two: (1) automatically determine a training loss threshold for skipping backward training passes; (2) run a meta predictor for further skipping forward training passes. We integrate the above techniques in a holistic, three-stage training process. On a diverse set of benchmarks, our method reduces the required training examples by up to 5.3$\times$ and training time by up to 6.8$\times$, while only seeing minor accuracy degradation. Our method is effective even when training one epoch, where each training example is encountered only once. It is simple to implement and is compatible with the existing finetuning techniques. Code is available at: https://github.com/xo28/efficient- NLP-multistage-training △ Less

Submitted 18 May, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

arXiv:2207.05022 [pdf, other]

doi 10.1145/3575693.3575698

STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

Authors: Liwei Guo, Wonkyo Choe, Felix Xiaozhu Lin

Abstract: Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the wh… ▽ More Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases one app's memory footprints by several times, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO as long as a few seconds, far exceeding the delay range satisfying to a user; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency v.s. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer for preload shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1-2 orders of magnitude lower memory, outperforming competitive baselines. △ Less

Submitted 31 January, 2023; v1 submitted 11 July, 2022; originally announced July 2022.

Comments: ASPLOS'23

arXiv:2205.10963 [pdf, other]

Protecting File Activities via Deception for ARM TrustZone

Authors: Liwei Guo, Kaiyang Zhao, Yiying Zhang, Felix Xiaozhu Lin

Abstract: A TrustZone TEE often invokes an external filesystem. While filedata can be encrypted, the revealed file activities can leak secrets. To hide the file activities from the filesystem and its OS, we propose Enigma, a deception-based defense injecting sybil file activities as the cover of the actual file activities. Enigma contributes three new designs. (1) To make the deception credible, the TEE g… ▽ More A TrustZone TEE often invokes an external filesystem. While filedata can be encrypted, the revealed file activities can leak secrets. To hide the file activities from the filesystem and its OS, we propose Enigma, a deception-based defense injecting sybil file activities as the cover of the actual file activities. Enigma contributes three new designs. (1) To make the deception credible, the TEE generates sybil calls by replaying file calls from the TEE code under protection. (2) To make sybil activities cheap, the TEE requests the OS to run K filesystem images simultaneously. Concealing the disk, the TEE backs only one image with the actual disk while backing other images by only storing their metadata. (3) To protect filesystem image identities, the TEE shuffles the images frequently, preventing the OS from observing any image for long. Enigma works with unmodified filesystems shipped withLinux. On a low-cost Arm SoC with EXT4 and F2FS, our system can concurrently run as many as 50 filesystem images with 1% of disk overhead per additional image. Compared to common obfuscation for hiding addresses in a flat space, Enigma hides file activities with richer semantics. Its cost is lower by one order of magnitude while achieving the same level of probabilistic security guarantees. △ Less

Submitted 24 May, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

Comments: Under submission

arXiv:2205.10162 [pdf, other]

FedAdapter: Efficient Federated Learning for Modern NLP

Authors: Dongqi Cai, Yaozong Wu, Shangguang Wang, Felix Xiaozhu Lin, Mengwei Xu

Abstract: Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towa… ▽ More Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify as the key building blocks adapters, small bottleneck modules inserted at a variety of model layers. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency is highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and mobile resources. To automate adapter configuration, we propose FedAdapter, a framework that enhances the existing FedNLP with two novel designs. First, FedAdapter progressively upgrades the adapter configuration throughout a training session; the principle is to quickly learn shallow knowledge by only training fewer and smaller adapters at the model's top layers, and incrementally learn deep knowledge by incorporating deeper and larger adapters. Second, FedAdapter continuously profiles future adapter configurations by allocating participant devices to trial groups. Extensive experiments show that FedAdapter can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5$\times$ faster compared to vanilla FedNLP and 48$\times$ faster compared to strong baselines. △ Less

Submitted 8 May, 2023; v1 submitted 20 May, 2022; originally announced May 2022.

Comments: Accepted by MobiCom 2023

arXiv:2111.03065 [pdf, other]

Safe and Practical GPU Acceleration in TrustZone

Authors: Heejin Park, Felix Xiaozhu Lin

Abstract: We present a holistic design for GPU-accelerated computation in TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the interactions in the TEE at run time. This paper addresses the approach's key missing piece -- the recording environment, which needs both strong security and access to d… ▽ More We present a holistic design for GPU-accelerated computation in TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the interactions in the TEE at run time. This paper addresses the approach's key missing piece -- the recording environment, which needs both strong security and access to diverse mobile GPUs. To this end, we present a novel architecture called CODY, in which a mobile device (which possesses the GPU hardware) and a trustworthy cloud service (which runs the GPU software) exercise the GPU hardware/software in a collaborative, distributed fashion. To overcome numerous network round trips and long delays, CODY contributes optimizations specific to mobile GPUs: register access deferral, speculation, and metastate-only synchronization. With these optimizations, recording a compute workload takes only tens of seconds, which is up to 95% less than a naive approach; replay incurs 25% lower delays compared to insecure, native execution. △ Less

Submitted 4 November, 2021; originally announced November 2021.

arXiv:2110.08303 [pdf, other]

doi 10.1145/3492321.3519565

Minimum Viable Device Drivers for ARM TrustZone

Authors: Liwei Guo, Felix Xiaozhu Lin

Abstract: While TrustZone can isolate IO hardware, it lacks drivers for modern IO devices. Rather than porting drivers, we propose a novel approach to deriving minimum viable drivers: developers exercise a full driver and record the driver/device interactions; the processed recordings, dubbed driverlets, are replayed in the TEE at run time to access IO devices. Driverlets address two key challenges: corre… ▽ More While TrustZone can isolate IO hardware, it lacks drivers for modern IO devices. Rather than porting drivers, we propose a novel approach to deriving minimum viable drivers: developers exercise a full driver and record the driver/device interactions; the processed recordings, dubbed driverlets, are replayed in the TEE at run time to access IO devices. Driverlets address two key challenges: correctness and expressiveness, for which they build on a key construct called interaction template. The interaction template ensures faithful reproduction of recorded IO jobs (albeit on new IO data); it accepts dynamic input values; it tolerates nondeterministic device behaviors. We demonstrate driverlets on a series of sophisticated devices, making them accessible to TrustZone for the first time to our knowledge. Our experiments show that driverlets are secure, easy to build, and incur acceptable overhead (1.4x -2.7x compared to native drivers). Driverlets fill a critical gap in the TrustZone TEE, realizing its long-promised vision of secure IO. △ Less

Submitted 15 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

Comments: Eurosys 2022

arXiv:2105.05085 [pdf, other]

GPUReplay: A 50-KB GPU Stack for Client ML

Authors: Heejin Park, Felix Xiaozhu Lin

Abstract: GPUReplay (GR) is a novel way for deploying GPU-accelerated computation on mobile and embedded devices. It addresses high complexity of a modern GPU stack for deployment ease and security. The idea is to record GPU executions on the full GPU stack ahead of time and replay the executions on new input at run time. We address key challenges towards making GR feasible, sound, and practical to use. The… ▽ More GPUReplay (GR) is a novel way for deploying GPU-accelerated computation on mobile and embedded devices. It addresses high complexity of a modern GPU stack for deployment ease and security. The idea is to record GPU executions on the full GPU stack ahead of time and replay the executions on new input at run time. We address key challenges towards making GR feasible, sound, and practical to use. The resultant replayer is a drop-in replacement of the original GPU stack. It is tiny (50 KB of executable), robust (replaying long executions without divergence), portable (running in a commodity OS, in TEE, and baremetal), and quick to launch (speeding up startup by up to two orders of magnitude). We show that GPUReplay works with a variety of integrated GPU hardware, GPU APIs, ML frameworks, and 33 neural network (NN) implementations for inference or training. The code is available at https://github.com/bakhi/GPUReplay. △ Less

Submitted 3 April, 2022; v1 submitted 4 May, 2021; originally announced May 2021.

Comments: in Proc. ASPLOS, Mar. 2022

arXiv:2101.08744 [pdf, other]

Enabling Large Neural Networks on Tiny Microcontrollers with Swapping

Authors: Hongyu Miao, Felix Xiaozhu Lin

Abstract: Running neural networks (NNs) on microcontroller units (MCUs) is becoming increasingly important, but is very difficult due to the tiny SRAM size of MCU. Prior work proposes many algorithm-level techniques to reduce NN memory footprints, but all at the cost of sacrificing accuracy and generality, which disqualifies MCUs for many important use cases. We investigate a system solution for MCUs to exe… ▽ More Running neural networks (NNs) on microcontroller units (MCUs) is becoming increasingly important, but is very difficult due to the tiny SRAM size of MCU. Prior work proposes many algorithm-level techniques to reduce NN memory footprints, but all at the cost of sacrificing accuracy and generality, which disqualifies MCUs for many important use cases. We investigate a system solution for MCUs to execute NNs out of core: dynamically swapping NN data chunks between an MCU's tiny SRAM and its large, low-cost external flash. Out-of-core NNs on MCUs raise multiple concerns: execution slowdown, storage wear out, energy consumption, and data security. We present a study showing that none is a showstopper; the key benefit -- MCUs being able to run large NNs with full accuracy and generality -- triumphs the overheads. Our findings suggest that MCUs can play a much greater role in edge intelligence. △ Less

Submitted 1 September, 2021; v1 submitted 14 January, 2021; originally announced January 2021.

arXiv:2012.09329 [pdf, other]

Clique: Spatiotemporal Object Re-identification at the City Scale

Authors: Tiantu Xu, Kaiwen Shen, Yang Fu, Humphrey Shi, Felix Xiaozhu Lin

Abstract: Object re-identification (ReID) is a key application of city-scale cameras. While classic ReID tasks are often considered as image retrieval, we treat them as spatiotemporal queries for locations and times in which the target object appeared. Spatiotemporal reID is challenged by the accuracy limitation in computer vision algorithms and the colossal videos from city cameras. We present Clique, a pr… ▽ More Object re-identification (ReID) is a key application of city-scale cameras. While classic ReID tasks are often considered as image retrieval, we treat them as spatiotemporal queries for locations and times in which the target object appeared. Spatiotemporal reID is challenged by the accuracy limitation in computer vision algorithms and the colossal videos from city cameras. We present Clique, a practical ReID engine that builds upon two new techniques: (1) Clique assesses target occurrences by clustering fuzzy object features extracted by ReID algorithms, with each cluster representing the general impression of a distinct object to be matched against the input; (2) to search in videos, Clique samples cameras to maximize the spatiotemporal coverage and incrementally adds cameras for processing on demand. Through evaluation on 25 hours of videos from 25 cameras, Clique reached a high accuracy of 0.87 (recall at 5) across 70 queries and runs at 830x of video realtime in achieving high accuracy. △ Less

Submitted 16 December, 2020; originally announced December 2020.

arXiv:1912.11598 [pdf, other]

Grand Challenges in Resilience: Autonomous System Resilience through Design and Runtime Measures

Authors: Saurabh Bagchi, Vaneet Aggarwal, Somali Chaterji, Fred Douglis, Aly El Gamal, Jiawei Han, Brian J. Henz, Hank Hoffmann, Suman Jana, Milind Kulkarni, Felix Xiaozhu Lin, Karen Marais, Prateek Mittal, Shaoshuai Mou, Xiaokang Qiu, Gesualdo Scutari

Abstract: A set of about 80 researchers, practitioners, and federal agency program managers participated in the NSF-sponsored Grand Challenges in Resilience Workshop held on Purdue campus on March 19-21, 2019. The workshop was divided into three themes: resilience in cyber, cyber-physical, and socio-technical systems. About 30 attendees in all participated in the discussions of cyber resilience. This articl… ▽ More A set of about 80 researchers, practitioners, and federal agency program managers participated in the NSF-sponsored Grand Challenges in Resilience Workshop held on Purdue campus on March 19-21, 2019. The workshop was divided into three themes: resilience in cyber, cyber-physical, and socio-technical systems. About 30 attendees in all participated in the discussions of cyber resilience. This article brings out the substantive parts of the challenges and solution approaches that were identified in the cyber resilience theme. In this article, we put forward the substantial challenges in cyber resilience in a few representative application domains and outline foundational solutions to address these challenges. These solutions fall into two broad themes: resilience-by-design and resilience-by-reaction. We use examples of autonomous systems as the application drivers motivating cyber resilience. We focus on some autonomous systems in the near horizon (autonomous ground and aerial vehicles) and also a little more distant (autonomous rescue and relief). For resilience-by-design, we focus on design methods in software that are needed for our cyber systems to be resilient. In contrast, for resilience-by-reaction, we discuss how to make systems resilient by responding, reconfiguring, or recovering at runtime when failures happen. We also discuss the notion of adaptive execution to improve resilience, execution transparently and adaptively among available execution platforms (mobile/embedded, edge, and cloud). For each of the two themes, we survey the current state, and the desired state and ways to get there. We conclude the paper by looking at the research challenges we will have to solve in the short and the mid-term to make the vision of resilient autonomous systems a reality. △ Less

Submitted 9 May, 2020; v1 submitted 25 December, 2019; originally announced December 2019.

ACM Class: C.4; D.4.5

Journal ref: IEEE Open Journal of the Computer Society, 2020

arXiv:1909.00841 [pdf, other]

doi 10.1145/3386901.3388948

Approximate Query Service on Autonomous IoT Cameras

Authors: Mengwei Xu, Xiwen Zhang, Yunxin Liu, Gang Huang, Xuanzhe Liu, Felix Xiaozhu Lin

Abstract: Elf is a runtime for an energy-constrained camera to continuously summarize video scenes as approximate object counts. Elf's novelty centers on planning the camera's count actions under energy constraint. (1) Elf explores the rich action space spanned by the number of sample image frames and the choice of per-frame object counters; it unifies errors from both sources into one single bounded error.… ▽ More Elf is a runtime for an energy-constrained camera to continuously summarize video scenes as approximate object counts. Elf's novelty centers on planning the camera's count actions under energy constraint. (1) Elf explores the rich action space spanned by the number of sample image frames and the choice of per-frame object counters; it unifies errors from both sources into one single bounded error. (2) To decide count actions at run time, Elf employs a learning-based planner, jointly optimizing for past and future videos without delaying result materialization. Tested with more than 1,000 hours of videos and under realistic energy constraints, Elf continuously generates object counts within only 11% of the true counts on average. Alongside the counts, Elf presents narrow errors shown to be bounded and up to 3.4x smaller than competitive baselines. At a higher level, Elf makes a case for advancing the geographic frontier of video analytics. △ Less

Submitted 5 May, 2020; v1 submitted 2 September, 2019; originally announced September 2019.

arXiv:1904.12342 [pdf, other]

Video Analytics with Zero-streaming Cameras

Authors: Mengwei Xu, Tiantu Xu, Yunxin Liu, Felix Xiaozhu Lin

Abstract: Low-cost cameras enable powerful analytics. An unexploited opportunity is that most captured videos remain "cold" without being queried. For efficiency, we advocate for these cameras to be zero streaming: capturing videos to local storage and communicating with the cloud only when analytics is requested. How to query zero-streaming cameras efficiently? Our response is a camera/cloud runtime system… ▽ More Low-cost cameras enable powerful analytics. An unexploited opportunity is that most captured videos remain "cold" without being queried. For efficiency, we advocate for these cameras to be zero streaming: capturing videos to local storage and communicating with the cloud only when analytics is requested. How to query zero-streaming cameras efficiently? Our response is a camera/cloud runtime system called DIVA. It addresses two key challenges: to best use limited camera resource during video capture; to rapidly explore massive videos during query execution. DIVA contributes two unconventional techniques. (1) When capturing videos, a camera builds sparse yet accurate landmark frames, from which it learns reliable knowledge for accelerating future queries. (2) When executing a query, a camera processes frames in multiple passes with increasingly more expensive operators. As such, DIVA presents and keeps refining inexact query results throughout the query's execution. On diverse queries over 15 videos lasting 720 hours in total, DIVA runs at more than 100x video realtime and outperforms competitive alternative designs. To our knowledge, DIVA is the first system for querying large videos stored on low-cost remote cameras. △ Less

Submitted 17 June, 2021; v1 submitted 28 April, 2019; originally announced April 2019.

Comments: Mengwei Xu and Tiantu Xu contributed equally to the paper

arXiv:1902.06327 [pdf, other]

Let the Cloud Watch Over Your IoT File Systems

Authors: Liwei Guo, Yiying Zhang, Felix Xiaozhu Lin

Abstract: Smart devices produce security-sensitive data and keep them in on-device storage for persistence. The current storage stack on smart devices, however, offers weak security guarantees: not only because the stack depends on a vulnerable commodity OS, but also because smart device deployment is known weak on security measures. To safeguard such data on smart devices, we present a novel storage stac… ▽ More Smart devices produce security-sensitive data and keep them in on-device storage for persistence. The current storage stack on smart devices, however, offers weak security guarantees: not only because the stack depends on a vulnerable commodity OS, but also because smart device deployment is known weak on security measures. To safeguard such data on smart devices, we present a novel storage stack architecture that i) protects file data in a trusted execution environment (TEE); ii) outsources file system logic and metadata out of TEE; iii) running a metadata-only file system replica in the cloud for continuously verifying the on-device file system behaviors. To realize the architecture, we build Overwatch, aTrustZone-based storage stack. Overwatch addresses unique challenges including discerning metadata at fine grains, hiding network delays, and coping with cloud disconnection. On a suite of three real-world applications, Overwatch shows moderate security overheads. △ Less

Submitted 17 February, 2019; originally announced February 2019.

arXiv:1901.01328 [pdf, other]

doi 10.1145/3297858.3304031

StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory

Authors: Hongyu Miao, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, Felix Xiaozhu Lin

Abstract: Stream analytics have an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacity-limited and (2) HBM boosts performance best for sequential access and high parallelism workloads. At first glance, stream analytic… ▽ More Stream analytics have an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacity-limited and (2) HBM boosts performance best for sequential access and high parallelism workloads. At first glance, stream analytics appear a particularly poor match for HBM because they have high capacity demands and data grouping operations, their most demanding computations, use random access. This paper presents the design and implementation of StreamBox-HBM, a stream analytics engine that exploits hybrid memories to achieve scalable high performance. StreamBox-HBM performs data grouping with sequential access sorting algorithms in HBM, in contrast to random access hashing algorithms commonly used in DRAM. StreamBox-HBM solely uses HBM to store Key Pointer Array (KPA) data structures that contain only partial records (keys and pointers to full records) for grouping operations. It dynamically creates and manages prodigious data and pipeline parallelism, choosing when to allocate KPAs in HBM. It dynamically optimizes for both the high bandwidth and limited capacity of HBM, and the limited bandwidth and high capacity of standard DRAM. StreamBox-HBM achieves 110 million records per second and 238 GB/s memory bandwidth while effectively utilizing all 64 cores of Intel's Knights Landing, a commercial server with hybrid memory. It outperforms stream engines with sequential access algorithms without KPAs by 7x and stream engines with random access algorithms by an order of magnitude in throughput. To the best of our knowledge, StreamBox-HBM is the first stream engine optimized for hybrid memories. △ Less

Submitted 28 January, 2019; v1 submitted 4 January, 2019; originally announced January 2019.

arXiv:1812.05448 [pdf, other]

A First Look at Deep Learning Apps on Smartphones

Authors: Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, Xuanzhe Liu

Abstract: We are in the dawn of deep learning explosion for smartphones. To bridge the gap between research and practice, we present the first empirical study on 16,500 the most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers threefold questions:… ▽ More We are in the dawn of deep learning explosion for smartphones. To bridge the gap between research and practice, we present the first empirical study on 16,500 the most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers threefold questions: what are the early adopter apps of deep learning, what do they use deep learning for, and how do their deep learning models look like. Our study has strong implications for app developers, smartphone vendors, and deep learning R\&D. On one hand, our findings paint a promising picture of deep learning for smartphones, showing the prosperity of mobile deep learning frameworks as well as the prosperity of apps building their cores atop deep learning. On the other hand, our findings urge optimizations on deep learning models deployed on smartphones, the protection of these models, and validation of research ideas on these models. △ Less

Submitted 12 January, 2021; v1 submitted 8 November, 2018; originally announced December 2018.

arXiv:1811.05000 [pdf, other]

Transkernel: Bridging Monolithic Kernels to Peripheral Cores

Authors: Liwei Guo, Shuang Zhai, Yi Qiao, Felix Xiaozhu Lin

Abstract: Smart devices see a large number of ephemeral tasks driven by background activities. In order to execute such a task, the OS kernel wakes up the platform beforehand and puts it back to sleep afterwards. In doing so, the kernel operates various IO devices and orchestrates their power state transitions. Such kernel executions are inefficient as they mismatch typical CPU hardware. They are better off… ▽ More Smart devices see a large number of ephemeral tasks driven by background activities. In order to execute such a task, the OS kernel wakes up the platform beforehand and puts it back to sleep afterwards. In doing so, the kernel operates various IO devices and orchestrates their power state transitions. Such kernel executions are inefficient as they mismatch typical CPU hardware. They are better off running on a low-power, microcontroller-like core, i.e., peripheral core, relieving CPU from the inefficiency. We therefore present a new OS structure, in which a lightweight virtual executor called transkernel offloads specific phases from a monolithic kernel. The transkernel translates stateful kernel execution through cross-ISA, dynamic binary translation (DBT); it emulates a small set of stateless kernel services behind a narrow, stable binary interface; it specializes for hot paths; it exploits ISA similarities for lowering DBT cost. Through an ARM-based prototype, we demonstrate transkernel's feasibility and benefit. We show that while cross-ISA DBT is typically used under the assumption of efficiency loss, it can enable efficiency gain, even on off-the-shelf hardware. △ Less

Submitted 5 June, 2019; v1 submitted 12 November, 2018; originally announced November 2018.

Comments: The camera-ready version of this paper, will appear at USENIX ATC'19

arXiv:1810.01794 [pdf, other]

doi 10.1145/3302424.3303971

VStore: A Data Store for Analytics on Large Videos

Authors: Tiantu Xu, Luis Materon Botelho, Felix Xiaozhu Lin

Abstract: We present VStore, a data store for supporting fast, resource-efficient analytics over large archival videos. VStore manages video ingestion, storage, retrieval, and consumption. It controls video formats along the video data path. It is challenged by i) the huge combinatorial space of video format knobs; ii) the complex impacts of these knobs and their high profiling cost; iii) optimizing for mul… ▽ More We present VStore, a data store for supporting fast, resource-efficient analytics over large archival videos. VStore manages video ingestion, storage, retrieval, and consumption. It controls video formats along the video data path. It is challenged by i) the huge combinatorial space of video format knobs; ii) the complex impacts of these knobs and their high profiling cost; iii) optimizing for multiple resource types. It explores an idea called backward derivation of configuration: in the opposite direction along the video data path, VStore passes the video quantity and quality expected by analytics backward to retrieval, to storage, and to ingestion. In this process, VStore derives an optimal set of video formats, optimizing for different resources in a progressive manner. VStore automatically derives large, complex configurations consisting of more than one hundred knobs over tens of video formats. In response to queries, VStore selects video formats catering to the executed operators and the target accuracy. It streams video data from disks through decoder to operators. It runs queries as fast as 362x of video realtime. △ Less

Submitted 17 February, 2019; v1 submitted 3 October, 2018; originally announced October 2018.

arXiv:1808.05078 [pdf, other]

StreamBox-TZ: Secure Stream Analytics at the Edge with TrustZone

Authors: Heejin Park, Shuang Zhai, Long Lu, Felix Xiaozhu Lin

Abstract: While it is compelling to process large streams of IoT data on the cloud edge, doing so exposes the data to a sophisticated, vulnerable software stack on the edge and hence security threats. To this end, we advocate isolating the data and its computations in a trusted execution environment (TEE) on the edge, shielding them from the remaining edge software stack which we deem untrusted. This approa… ▽ More While it is compelling to process large streams of IoT data on the cloud edge, doing so exposes the data to a sophisticated, vulnerable software stack on the edge and hence security threats. To this end, we advocate isolating the data and its computations in a trusted execution environment (TEE) on the edge, shielding them from the remaining edge software stack which we deem untrusted. This approach faces two major challenges: (1) executing high-throughput, low-delay stream analytics in a single TEE, which is constrained by a low trusted computing base (TCB) and limited physical memory; (2) verifying execution of stream analytics as the execution involves untrusted software components on the edge. In response, we present StreamBox-TZ (SBT), a stream analytics engine for an edge platform that offers strong data security, verifiable results, and good performance. SBT contributes a data plane designed and optimized for a TEE based on ARM TrustZone. It supports continuous remote attestation for analytics correctness and result freshness while incurring low overhead. SBT only adds 42.5 KB executable to the TCB (16% of the entire TCB). On an octa core ARMv8 platform, it delivers the state-of-the-art performance by processing input events up to 140 MB/sec (12M events/sec) with sub-second delay. The overhead incurred by SBT's security mechanism is less than 25%. △ Less

Submitted 5 June, 2019; v1 submitted 2 August, 2018; originally announced August 2018.

arXiv:1712.01670 [pdf, other]

doi 10.1145/3241539.3241563

DeepCache: Principled Cache for Mobile Deep Vision

Authors: Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, Xuanzhe Liu

Abstract: We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision. DeepCache benefits model execution efficiency by exploiting temporal locality in input video streams. It addresses a key challenge raised by mobile vision: the cache must operate under video scene variation, while trading off among cacheability, overhead, and loss in model accuracy. At the inpu… ▽ More We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision. DeepCache benefits model execution efficiency by exploiting temporal locality in input video streams. It addresses a key challenge raised by mobile vision: the cache must operate under video scene variation, while trading off among cacheability, overhead, and loss in model accuracy. At the input of a model, DeepCache discovers video temporal locality by exploiting the video's internal structure, for which it borrows proven heuristics from video compression; into the model, DeepCache propagates regions of reusable results by exploiting the model's internal structure. Notably, DeepCache eschews applying video heuristics to model internals which are not pixels but high-dimensional, difficult-to-interpret data. Our implementation of DeepCache works with unmodified deep learning models, requires zero developer's manual effort, and is therefore immediately deployable on off-the-shelf mobile devices. Our experiments show that DeepCache saves inference execution time by 18% on average and up to 47%. DeepCache reduces system energy consumption by 20% on average. △ Less

Submitted 30 March, 2020; v1 submitted 1 December, 2017; originally announced December 2017.

Comments: Accepted for publication in the MobiCom 2018, copyright the ACM, posted with permission

arXiv:1404.1320 [pdf, other]

Draining our Glass: An Energy and Heat Characterization of Google Glass

Authors: Robert LiKamWa, Zhen Wang, Aaron Carroll, Felix Xiaozhu Lin, Lin Zhong

Abstract: The Google Glass is a mobile device designed to be worn as eyeglasses. This form factor enables new usage possibilities, such as hands-free video chats and instant web search. However, its shape also hampers its potential: (1) battery size, and therefore lifetime, is limited by a need for the device to be lightweight, and (2) high-power processing leads to significant heat, which should be limited… ▽ More The Google Glass is a mobile device designed to be worn as eyeglasses. This form factor enables new usage possibilities, such as hands-free video chats and instant web search. However, its shape also hampers its potential: (1) battery size, and therefore lifetime, is limited by a need for the device to be lightweight, and (2) high-power processing leads to significant heat, which should be limited, due to the Glass' compact form factor and close proximity to the user's skin. We use the Glass in a case study of the power and thermal characteristics of optical head-mounted display devices. We share insights and implications to limit power consumption to increase the safety and utility of head-mounted devices. △ Less

Submitted 26 March, 2014; originally announced April 2014.

Report number: Rice University ECE Technical Report 2014-03-23

arXiv:1212.5170 [pdf, other]

Guadalupe: a browser design for heterogeneous hardware

Authors: Zhen Wang, Felix Xiaozhu Lin, Lin Zhong, Mansoor Chishtie

Abstract: Mobile systems are embracing heterogeneous architectures by getting more types of cores and more specialized cores, which allows applications to be faster and more efficient. We aim at exploiting the hardware heterogeneity from the browser without requiring any changes to either the OS or the web applications. Our design, Guadalupe, can use hardware processing units with different degrees of capab… ▽ More Mobile systems are embracing heterogeneous architectures by getting more types of cores and more specialized cores, which allows applications to be faster and more efficient. We aim at exploiting the hardware heterogeneity from the browser without requiring any changes to either the OS or the web applications. Our design, Guadalupe, can use hardware processing units with different degrees of capability for matched browser services. It starts with a weak hardware unit, determines if and when a strong unit is needed, and seamlessly migrates to the strong one when necessary. Guadalupe not only makes more computing resources available to mobile web browsing but also improves its energy proportionality. Based on Chrome for Android and TI OMAP4, We provide a prototype browser implementation for resource loading and rendering. Compared to Chrome for Android, we show that Guadalupe browser for rendering can increase other 3D application's frame rate by up to 767% and save 4.7% of the entire system's energy consumption. More importantly, by using the two cases, we demonstrate that Guadalupe creates the great opportunity for many browser services to get better resource utilization and energy proportionality by exploiting hardware heterogeneity. △ Less

Submitted 19 December, 2012; originally announced December 2012.

Report number: Rice University ECE Technical Report 2012-12-19

arXiv:1112.3691 [pdf]

How Far Can Client-Only Solutions Go for Mobile Browser Speed?

Authors: Zhen Wang, Felix Xiaozhu Lin, Lin Zhong, Mansoor Chishtie

Abstract: Mobile browser is known to be slow because of the bottleneck in resource loading. Client-only solutions to improve resource loading are attractive because they are immediately deployable, scalable, and secure. We present the first publicly known treatment of client-only solutions to understand how much they can improve mobile browser speed without infrastructure support. Leveraging an unprecedente… ▽ More Mobile browser is known to be slow because of the bottleneck in resource loading. Client-only solutions to improve resource loading are attractive because they are immediately deployable, scalable, and secure. We present the first publicly known treatment of client-only solutions to understand how much they can improve mobile browser speed without infrastructure support. Leveraging an unprecedented set of web usage data collected from 24 iPhone users continuously over one year, we examine the three fundamental, orthogonal approaches a client-only solution can take: caching, prefetching, and speculative loading, which is first proposed and studied in this work. Speculative loading predicts and speculatively loads the subresources needed to open a web page once its URL is given. We show that while caching and prefetching are highly limited for mobile browsing, speculative loading can be significantly more effective. Empirically, we show that client-only solutions can improve the browser speed by about 1.4 second on average for web sites visited by the 24 iPhone users. We also report the design, realization, and evaluation of speculative loading in a WebKit-based browser called Tempo. On average, Tempo can reduce browser delay by 1 second (~20%). △ Less

Submitted 15 December, 2011; originally announced December 2011.

Report number: TR1215-2011, Rice University and Texas Instruments

arXiv:1103.2348 [pdf]

Transparent Programming of Heterogeneous Smartphones for Sensing

Authors: Felix Xiaozhu Lin, Zhen Wang, Robert LiKamWa, Lin Zhong

Abstract: Sensing on smartphones is known to be power-hungry. It has been shown that this problem can be solved by adding an ultra low-power processor to execute simple, frequent sensor data processing. While very effective in saving energy, this resulting heterogeneous, distributed architecture poses a significant challenge to application development. We present Reflex, a suite of runtime and compilation… ▽ More Sensing on smartphones is known to be power-hungry. It has been shown that this problem can be solved by adding an ultra low-power processor to execute simple, frequent sensor data processing. While very effective in saving energy, this resulting heterogeneous, distributed architecture poses a significant challenge to application development. We present Reflex, a suite of runtime and compilation techniques to conceal the heterogeneous, distributed nature from developers. The Reflex automatically transforms the developer's code for distributed execution with the help of the Reflex runtime. To create a unified system illusion, Reflex features a novel software distributed shared memory (DSM) design that leverages the extreme architectural asymmetry between the low-power processor and the powerful central processor to achieve both energy efficiency and performance. We report a complete realization of Reflex for heterogeneous smartphones with Maemo/Linux as the central kernel. Using a tri-processor hardware prototype and sensing applications reported in recent literature, we evaluate the Reflex realization for programming transparency, energy efficiency, and performance. We show that Reflex supports a programming style that is very close to contemporary smartphone programming. It allows existing sensing applications to be ported with minor source code changes. Reflex reduces the system power in sensing by up to 83%, and its runtime system only consumes 10% local memory on a typical ultra-low power processor. △ Less

Submitted 11 March, 2011; originally announced March 2011.

Showing 1–28 of 28 results for author: Lin, F X