Distributed, Parallel, and Cluster Computing
- [1] arXiv:2407.11260 [pdf, other]
Title: Quality Scalable Quantization Methodology for Deep Learning on Edge
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Deep learning architectures are computationally heavy, and the bulk of the computational energy is consumed by the convolution operations in convolutional neural networks (CNNs). The objective of our proposed work is to reduce the energy consumption and size of CNNs so that machine learning techniques can be used for edge computing on ubiquitous computing devices. We propose a Systematic Quality Scalable Design Methodology consisting of Quality Scalable Quantization at a higher abstraction level and Quality Scalable Multipliers at a lower abstraction level. The first component consists of parameter compression, where we approximate the representation of values in the filters of deep learning models by encoding them in 3 bits. A shift-and-scale-based on-chip decoding hardware is proposed which can decode these 3-bit representations to recover approximate filter values. The size of the DNN model is reduced this way, and the model can be sent over a communication channel to be decoded on the edge computing devices. This way, power is reduced by limiting data bits through approximation. In the second component, we propose a quality scalable multiplier which reduces the number of partial products by converting numbers to canonical signed digit representation and further approximating the number by reducing least significant bits. These quantized CNNs provide almost the same accuracy as the network with original weights, with little or no fine-tuning. The hardware for the adaptive multipliers utilizes gate clocking to reduce energy consumption during multiplications. The proposed methodology greatly reduces the memory and power requirements of DNN models, making it a feasible approach to deploying deep learning on edge computing devices. Experiments on LeNet and ConvNets show an increase of up to 6% in zeros and memory savings of up to 82.4919% while keeping the accuracy near the state of the art.
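The partial-product reduction mentioned in this abstract can be illustrated with a minimal Python sketch. It converts a non-negative integer to canonical signed digit form (digits in {-1, 0, 1}, no two adjacent nonzero digits) and then zeroes a chosen number of least significant digits. The function names `to_csd`, `from_csd`, `truncate_low_digits`, and `partial_products` are illustrative assumptions; this is not the paper's multiplier hardware or its 3-bit encoding.

```python
def to_csd(n: int) -> list[int]:
    """Return the canonical signed digit form of n, least significant digit first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while n != 0:
        if n & 1:
            # Pick +1 when n % 4 == 1 and -1 when n % 4 == 3,
            # which forces the next digit to be 0.
            digit = 2 - (n & 3)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def from_csd(digits: list[int]) -> int:
    """Evaluate a signed digit list (least significant digit first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def truncate_low_digits(digits: list[int], dropped: int) -> list[int]:
    """Approximate the value by zeroing the `dropped` least significant digits."""
    kept = digits[dropped:]
    return [0] * (len(digits) - len(kept)) + kept


def partial_products(digits: list[int]) -> int:
    """Each nonzero digit costs one shifted add or subtract in a multiplier."""
    return sum(1 for d in digits if d != 0)


weight = 119                      # binary 1110111 has six 1 bits
csd = to_csd(weight)              # 128 - 8 - 1
approx = truncate_low_digits(csd, 2)
print(csd, partial_products(csd))
print(from_csd(approx), partial_products(approx))
```

Here the exact value needs three partial products instead of six, and dropping the two lowest digits trades a small error (120 instead of 119) for one fewer partial product.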
- [2] arXiv:2407.11302 [pdf, html, other]
Title: Edge-Mapping of Service Function Trees for Sensor Event Processing
Comments: 11 pages, 7 figures. This is an accepted paper and it is going to appear in the proceedings of IEEE International Conference on Web Services (ICWS 2024)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Fog computing offers increased performance and efficiency for Industrial Internet of Things (IIoT) applications through distributed data processing in close proximity to sensors. Given resource constraints and their contended use in IoT networks, current strategies strive to optimise which data processing tasks should be selected to run on fog devices. In this paper, we advance a more effective data processing architecture for optimisation purposes. Specifically, we consider the distinct functions of sensor data streaming, multi-stream data aggregation, and event handling required by IoT applications for identifying actionable events. We retrofit this event processing pipeline into a logical architecture, structured as a service function tree (SFT), comprising service function chains. We present a novel algorithm for mapping the SFT into a fog network topology in which nodes selected to process SFT functions (microservices) have the requisite resource capacity and network speed to meet their event processing deadlines. We used simulations to validate the algorithm's effectiveness in finding a successful SFT mapping to a physical network. Overall, our approach overcomes the bottlenecks of single service placement strategies for fog computing through composite service placements of SFTs.
- [3] arXiv:2407.11388 [pdf, html, other]
Title: Paralleling and Accelerating Arc Consistency Enforcement with Recurrent Tensor Computations
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
We propose a new arc consistency enforcement paradigm that transforms arc consistency enforcement into recurrent tensor operations. In each iteration of the recurrence, all involved processes can be fully parallelized with tensor operations, and the number of iterations is quite small. Based on these benefits, the resulting algorithm fully leverages the power of parallelization and GPUs, and is therefore extremely efficient on large and densely connected constraint networks.
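As a rough illustration of arc consistency as a recurrence, the sketch below recomputes every (variable, value) entry of a Boolean domain table from the previous table until nothing changes. Within one iteration the entries are independent of each other, which is the property that makes a tensor formulation possible. It is written in plain Python for readability; the data layout (`domains`, `allowed`) and the function name are assumptions, and this is not the paper's algorithm.

```python
def enforce_arc_consistency(domains, allowed):
    """Synchronous arc consistency fixpoint.

    domains[x][a] is True while value a is still possible for variable x.
    allowed[(x, y)][a][b] is True when x=a together with y=b is permitted.
    """
    while True:
        new_domains = [
            [
                alive and all(
                    # Value a of x needs at least one surviving support b in y.
                    any(rel[a][b] and domains[y][b] for b in range(len(domains[y])))
                    for (x2, y), rel in allowed.items()
                    if x2 == x
                )
                for a, alive in enumerate(domains[x])
            ]
            for x in range(len(domains))
        ]
        if new_domains == domains:
            return domains
        domains = new_domains


# Three variables over {0, 1, 2} with constraints x0 < x1 and x1 < x2.
size = 3
less = [[a < b for b in range(size)] for a in range(size)]
greater = [[a > b for b in range(size)] for a in range(size)]
allowed = {(0, 1): less, (1, 0): greater, (1, 2): less, (2, 1): greater}
start = [[True] * size for _ in range(3)]
print(enforce_arc_consistency(start, allowed))
```

In a tensor version, the inner `any` and `all` would become reductions over whole arrays, so one iteration is a handful of bulk operations.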
- [4] arXiv:2407.11432 [pdf, html, other]
Title: Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing
Authors: Haochen Pan, Ryan Chard, Sicheng Zhou, Alok Kamatar, Rafael Vescovi, Valerie Hayot-Sasson, André Bauer, Maxime Gonthier, Kyle Chard, Ian Foster
Comments: 12 pages, 8 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Scientific research increasingly relies on distributed computational resources, storage systems, networks, and instruments, ranging from HPC and cloud systems to edge devices. Event-driven architecture (EDA) benefits applications targeting distributed research infrastructures by enabling the organization, communication, processing, reliability, and security of events generated from many sources. To support the development of scientific EDA, we introduce Octopus, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers. Octopus can be scaled to meet demand, permits the deployment of highly available Triggers for automatic event processing, and enforces fine-grained access control. We identify requirements in self-driving laboratories, scientific data automation, online task scheduling, epidemic modeling, and dynamic workflow management use cases, and present results demonstrating Octopus' ability to meet those requirements. Octopus supports producing and consuming events at a rate of over 4.2 M and 9.6 M events per second, respectively, from distributed clients.
- [5] arXiv:2407.11488 [pdf, html, other]
Title: Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs
Comments: 30th International European Conference on Parallel and Distributed Computing (Euro-PAR 2024) (Best Paper Candidates Session)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD's HIP. We do so by extending Kernel Tuner, an open-source Python library for auto-tuning GPU programs. We analyze the performance impact and tuning difficulty for four highly-tunable benchmark kernels on four different GPUs: two from Nvidia and two from AMD. Our results demonstrate that auto-tuning has a significantly higher impact on performance on AMD compared to Nvidia (10x vs 2x). Additionally, we show that applications tuned for Nvidia do not perform optimally on AMD, underscoring the importance of auto-tuning specifically for AMD to achieve high performance on these GPUs.
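For readers unfamiliar with auto-tuning, the idea can be sketched as an exhaustive search over a space of kernel parameters, keeping the configuration with the lowest measured cost. The sketch below is generic Python and does not use Kernel Tuner's API; `tune`, the parameter names, and the synthetic cost model are all hypothetical stand-ins. A real tuner would compile and time the GPU kernel inside `measure`.

```python
import itertools


def tune(search_space, measure):
    """Try every combination in search_space and return the cheapest one.

    search_space maps a parameter name to the list of values to try.
    measure(config) returns a cost (for example, a kernel run time).
    """
    names = list(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def synthetic_cost(config):
    # Stand-in for timing a kernel: pretend the optimum is 128 threads, tile 4.
    return abs(config["block_size_x"] - 128) + 10 * abs(config["tile_size"] - 4)


space = {"block_size_x": [32, 64, 128, 256], "tile_size": [1, 2, 4, 8]}
print(tune(space, synthetic_cost))
```

Because the best configuration depends on the cost function, a search tuned against one device's timings need not be optimal on another, which is consistent with the abstract's observation about Nvidia-tuned applications on AMD.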
- [6] arXiv:2407.11582 [pdf, html, other]
Title: Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Application tail latency is a key metric for many services, with high latencies being linked directly to loss of revenue. Modern deeply-nested micro-service architectures exacerbate tail latencies, increasing the likelihood of users experiencing them. In this work, we show how CPU overcommitment by OS threads leads to high tail latencies when applications are under heavy load. CPU overcommitment can arise from two operational factors: incorrectly determining the number of CPUs available when under a CPU quota, and the ignorance of neighbour applications and their CPU usage. We discuss different languages' solutions to obtaining the CPUs available, evaluating the impact, and discuss opportunities for a more unified language-independent interface to obtain the number of CPUs available. We then evaluate the impact of neighbour usage on tail latency and introduce a new neighbour-aware threadpool, the friendlypool, that dynamically avoids overcommitment. In our evaluation, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$.
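One common way to determine the CPUs available under a quota on Linux is to read the cgroup v2 `cpu.max` file, which holds a quota and a period in microseconds (or the word `max` when unlimited). The sketch below assumes cgroup v2 and that the process sees its own limit at `/sys/fs/cgroup/cpu.max`; the function names and the choice to round up are assumptions, and this is not the friendlypool implementation.

```python
import math
import os


def cpus_from_cgroup_v2(cpu_max_text: str, host_cpus: int) -> int:
    """Translate the contents of a cgroup v2 cpu.max file into a CPU count."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        # No quota configured: fall back to the host CPU count.
        return host_cpus
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(host_cpus, limit))


def available_cpus(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Best-effort CPU count that respects a CPU quota when one is visible."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(cpu_max_path) as handle:
            return cpus_from_cgroup_v2(handle.read(), host_cpus)
    except (OSError, ValueError):
        # File missing (for example cgroup v1 or non-Linux) or malformed.
        return host_cpus


print(available_cpus())
```

Sizing a thread pool with `available_cpus()` rather than the raw host count avoids starting, say, 64 worker threads inside a container limited to 2 CPUs.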
- [7] arXiv:2407.11830 [pdf, other]
Title: Personalized Conversational Travel Assistant powered by Generative AI
Comments: 13 pages, 4 Figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
The Tourism and Destination Management Organization (DMO) industry is rapidly evolving to adapt to new technologies and traveler expectations. Generative Artificial Intelligence (AI) offers an innovative opportunity to enhance the tourism experience by providing personalized, interactive, and engaging assistance. In this article, we propose a generative AI-based chatbot for tourism assistance. The chatbot leverages AI's ability to generate realistic and creative text, adopting the friendly persona of the well-known Italian all-knowledgeable aunties, to provide tourists with personalized information, tailored and dynamic recommendations before, during, and after the trip, trip plans, and personalized itineraries. It accepts both text and voice commands and supports different languages to satisfy the expectations of Italian and foreign tourists. This work is under development in the Molise CTE research project, funded by the Italian Minister of the Economic Growth (MIMIT), with the aim of leveraging the best emerging technologies available, such as Cloud and AI, to produce state-of-the-art solutions in the Smart City environment.
- [8] arXiv:2407.11967 [pdf, html, other]
Title: Hydra: Brokering Cloud and HPC Resources to Support the Execution of Heterogeneous Workloads at Scale
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Scientific discovery increasingly depends on middleware that enables the execution of heterogeneous workflows on heterogeneous platforms. One of the main challenges is to design software components that integrate within the existing ecosystem to enable scale and performance across cloud and high-performance computing (HPC) platforms. Researchers are met with a varied computing landscape, which includes services available on commercial cloud platforms, data and network capabilities specifically designed for scientific discovery on government-sponsored cloud platforms, and scale and performance on HPC platforms. We present Hydra, an intra- and cross-cloud/HPC brokering system capable of concurrently acquiring resources from commercial and private cloud and HPC platforms and managing the execution of heterogeneous workflow applications on those resources. This paper offers four main contributions: (1) the design of brokering capabilities in the presence of task, platform, resource, and middleware heterogeneity; (2) a reference implementation of that design with Hydra; (3) an experimental characterization of Hydra's overheads and strong and weak scaling with heterogeneous workloads and platforms; and (4) the implementation of a workflow that models sea rise with Hydra and its scaling on cloud and HPC platforms.
New submissions for Wednesday, 17 July 2024 (showing 8 of 8 entries)
- [9] arXiv:2407.10985 (cross-list from cs.NI) [pdf, other]
Title: Strategies for Tracking Individual IP Packets Towards DDoS
Comments: arXiv admin note: substantial text overlap with arXiv:2004.09327
Journal-ref: PIK - Praxis der Informationsverarbeitung und Kommunikation 2015
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)
Identifying the exact path along which packets are routed through a network is quite a challenge. This paper presents a novel, efficient traceback strategy in combination with a defence system against distributed denial of service (DDoS) attacks, named Tracemax. A single packet can be directly traced over many more hops than existing techniques allow. The system lets good connections pass while bad ones are thwarted. Initiated by the victim, the routers in the network cooperate in tracing and become automatically self-organised and self-managed. The novel concept supports analyses of packet flows and transmission paths in a network infrastructure. It can effectively reduce the effect of common bandwidth and resource consumption attacks and, in addition, foster early warning and prevention.
- [10] arXiv:2407.11061 (cross-list from cs.LG) [pdf, html, other]
Title: Exploring the Boundaries of On-Device Inference: When Tiny Falls Short, Go Hierarchical
Authors: Adarsh Prasad Behera, Paulius Daubaris, Iñaki Bravo, José Gallego, Roberto Morabito, Joerg Widmer, Jaya Prakash Varma Champati
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
On-device inference holds great potential for increased energy efficiency, responsiveness, and privacy in edge ML systems. However, due to less capable ML models that can be embedded in resource-limited devices, use cases are limited to simple inference tasks such as visual keyword spotting, gesture recognition, and predictive analytics. In this context, the Hierarchical Inference (HI) system has emerged as a promising solution that augments the capabilities of the local ML by offloading selected samples to an edge server or cloud for remote ML inference. Existing works demonstrate through simulation that HI improves accuracy. However, they do not account for the latency and energy consumption on the device, nor do they consider three key heterogeneous dimensions that characterize ML systems: hardware, network connectivity, and models. In contrast, this paper systematically compares the performance of HI with on-device inference based on measurements of accuracy, latency, and energy for running embedded ML models on five devices with different capabilities and three image classification datasets. For a given accuracy requirement, the HI systems we designed achieved up to 73% lower latency and up to 77% lower device energy consumption than an on-device inference system. The key to building an efficient HI system is the availability of small-size, reasonably accurate on-device models whose outputs can be effectively differentiated for samples that require remote inference. Despite the performance gains, HI requires on-device inference for all samples, which adds a fixed overhead to its latency and energy consumption. Therefore, we design a hybrid system, Early Exit with HI (EE-HI), and demonstrate that compared to HI, EE-HI reduces the latency by up to 59.7% and lowers the device's energy consumption by up to 60.4%.
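The offloading decision at the heart of hierarchical inference can be sketched as follows: run the small on-device model first and send the sample to the remote model only when the local prediction looks unreliable. The confidence-threshold rule, the function name, and the toy models below are assumptions for illustration; the paper's actual decision rule and its EE-HI design may differ.

```python
def hierarchical_inference(sample, local_model, remote_model, threshold=0.8):
    """Return (label, where) using the local model when it is confident enough.

    local_model(sample) returns a list of class probabilities.
    remote_model(sample) returns a class label from the larger model.
    """
    probabilities = local_model(sample)
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[label] >= threshold:
        return label, "local"
    # Low confidence: pay the network cost and ask the larger model.
    return remote_model(sample), "remote"


# Toy stand-ins: the local model is only sure about samples below 0.5.
def toy_local(sample):
    return [0.9, 0.1] if sample < 0.5 else [0.45, 0.55]


def toy_remote(sample):
    return 1 if sample >= 0.5 else 0


print(hierarchical_inference(0.2, toy_local, toy_remote))
print(hierarchical_inference(0.7, toy_local, toy_remote))
```

Note that the local model runs for every sample, which is the fixed overhead the abstract attributes to HI.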
- [11] arXiv:2407.11091 (cross-list from eess.SP) [pdf, other]
Title: SENTINEL: Securing Indoor Localization against Adversarial Attacks with Capsule Neural Networks
Subjects: Signal Processing (eess.SP); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
With the increasing demand for edge device powered location-based services in indoor environments, Wi-Fi received signal strength (RSS) fingerprinting has become popular, given the unavailability of GPS indoors. However, achieving robust and efficient indoor localization faces several challenges, due to RSS fluctuations from dynamic changes in indoor environments and heterogeneity of edge devices, leading to diminished localization accuracy. While advances in machine learning (ML) have shown promise in mitigating these phenomena, it remains an open problem. Additionally, emerging threats from adversarial attacks on ML-enhanced indoor localization systems, especially those introduced by malicious or rogue access points (APs), can deceive ML models to further increase localization errors. To address these challenges, we present SENTINEL, a novel embedded ML framework utilizing modified capsule neural networks to bolster the resilience of indoor localization solutions against adversarial attacks, device heterogeneity, and dynamic RSS fluctuations. We also introduce RSSRogueLoc, a novel dataset capturing the effects of rogue APs from several real-world indoor environments. Experimental evaluations demonstrate that SENTINEL achieves significant improvements, with up to 3.5x reduction in mean error and 3.4x reduction in worst-case error compared to state-of-the-art frameworks using simulated adversarial attacks. SENTINEL also achieves improvements of up to 2.8x in mean error and 2.7x in worst-case error compared to state-of-the-art frameworks when evaluated with the real-world RSSRogueLoc dataset.
- [12] arXiv:2407.11308 (cross-list from cs.LG) [pdf, html, other]
Title: Detection of Global Anomalies on Distributed IoT Edges with Device-to-Device Communication
Comments: 6 pages, 3 figures, ACM MobiHoc AIoT 2023 (accepted)
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Anomaly detection is an important function in IoT applications for finding outliers caused by abnormal events. Anomaly detection sometimes requires high-frequency data sampling, which should be carried out on edge devices rather than in the cloud. In this paper, we consider the case in which multiple IoT devices are installed at a single remote site and collaboratively detect anomalies from their observations using device-to-device communication. For this, we propose a fully distributed collaborative scheme for training distributed anomaly detectors with Wireless Ad Hoc Federated Learning, namely "WAFL-Autoencoder". We introduce the concept of a Global Anomaly: a sample that is rare not only to the local device but to all devices in the target domain. We also propose a distributed threshold-finding algorithm for Global Anomaly detection. With our standard benchmark-based evaluation, we have confirmed that our scheme trained anomaly detectors perfectly across the devices. We have also confirmed that the devices collaboratively found thresholds for Global Anomaly detection with low false positive rates while achieving high true positive rates, with few exceptions.
- [13] arXiv:2407.11454 (cross-list from quant-ph) [pdf, html, other]
Title: Cloud-based Semi-Quantum Money
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
In the 1970s, Wiesner introduced the concept of quantum money, where quantum states generated according to specific rules function as currency. These states circulate among users with quantum resources through quantum channels or face-to-face interactions. Quantum mechanics grants quantum money physical-level unforgeability but also makes minting, storing, and circulating it significantly challenging. Currently, quantum computers capable of minting and preserving quantum money have not yet emerged, and existing quantum channels are not stable enough to support the efficient transmission of quantum states for quantum money, limiting its practicality. Semi-quantum money schemes support fully classical transactions and complete classical banks, reducing dependence on quantum resources and enhancing feasibility. To further minimize the system's reliance on quantum resources, we propose a cloud-based semi-quantum money (CSQM) scheme. This scheme relies only on semi-honest third-party quantum clouds, while the rest of the system remains entirely classical. We also discuss estimating the computational power required by the quantum cloud for the scheme and conduct a security analysis.
- [14] arXiv:2407.11483 (cross-list from cs.NI) [pdf, html, other]
Title: Performance Analysis of Internet of Vehicles Mesh Networks Based on Actual Switch Models
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)
The rapid growth of the automotive industry has exacerbated the conflict between the complex traffic environment, increasing communication demands, and limited resources. Given the imperative to mitigate traffic and network congestion, analyzing the performance of Internet of Vehicles (IoV) mesh networks is of great practical significance. Most studies focus solely on individual performance metrics and influencing factors, and the adopted simulation tools, such as OPNET, cannot achieve the dynamic link generation of IoV mesh networks. To address these problems, a network performance analysis model based on actual switches is proposed. First, a typical IoV mesh network architecture is constructed and abstracted into a mathematical model that describes how the links and topology change over time. Then, the task generation model and the task forwarding model based on actual switches are proposed to obtain the real traffic distribution of the network. Finally, a scientific network performance indicator system is constructed. Simulation results demonstrate that, with rising task traffic and decreasing node caching capacity, the packet loss rate increases and the task arrival rate decreases in the network. The proposed model can effectively evaluate the network performance across various traffic states and provide valuable insights for network construction and enhancement.
- [15] arXiv:2407.11531 (cross-list from eess.SY) [pdf, html, other]
Title: Finite State Machines-Based Path-Following Collaborative Computing Strategy for Emergency UAV Swarms
Subjects: Systems and Control (eess.SY); Distributed, Parallel, and Cluster Computing (cs.DC)
Offloading services to UAV swarms for delay-sensitive tasks in Emergency UAV Networks (EUN) can greatly enhance rescue efficiency. Most task-offloading strategies assumed that UAVs were location-fixed and capable of handling all tasks. However, in complex disaster environments, UAV locations often change dynamically, and the heterogeneity of on-board resources presents a significant challenge in optimizing task scheduling in EUN to minimize latency. To address these problems, a Finite state machines-based Path-following Collaborative computation strategy (FPC) for emergency UAV swarms is proposed. First, an Extended Finite State Machine Space-time Graph (EFSMSG) model is constructed to accurately characterize on-board resources and state transitions while shielding the EUN dynamic characteristic. Based on the EFSMSG, a mathematical model is formulated for the FPC strategy to minimize task processing delay while facilitating computation during transmission. Finally, the Constraint Selection Adaptive Binary Particle Swarm Optimization (CSABPSO) algorithm is proposed for the solution. Simulation results demonstrate that the proposed FPC strategy effectively reduces task processing delay, meeting the requirements of delay-sensitive tasks in emergency situations.
- [16] arXiv:2407.11742 (cross-list from physics.med-ph) [pdf, html, other]
Title: Revolutionizing MRI Data Processing Using FSL: Preliminary Findings with the Fugaku Supercomputer
Authors: Tianxiang Lyu, Wataru Uchida, Zhe Sun, Christina Andica, Keita Tokuda, Rui Zou, Jie Mao, Keigo Shimoji, Koji Kamagata, Mitsuhisa Sato, Ryutaro Himeno, Shigeki Aoki
Subjects: Medical Physics (physics.med-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Quantitative Methods (q-bio.QM)
The amount of Magnetic resonance imaging data has grown tremendously recently, creating an urgent need to accelerate data processing, which requires substantial computational resources and time. In this preliminary study, we applied FMRIB Software Library commands on T1-weighted and diffusion-weighted images of a single young adult using the Fugaku supercomputer. The tensor-based measurements and subcortical structure segmentations performed on Fugaku supercomputer were highly consistent with those from conventional systems, demonstrating its reliability and significantly reduced processing time.
- [17] arXiv:2407.11762 (cross-list from cs.LG) [pdf, other]
Title: Self-Duplicating Random Walks for Resilient Decentralized Learning on Graphs
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Applications (stat.AP)
Consider the setting of multiple random walks (RWs) on a graph executing a certain computational task. For instance, in decentralized learning via RWs, a model is updated at each iteration based on the local data of the visited node and then passed to a randomly chosen neighbor. RWs can fail due to node or link failures. The goal is to maintain a desired number of RWs to ensure failure resilience. Achieving this is challenging due to the lack of a central entity to track which RWs have failed to replace them with new ones by forking (duplicating) surviving ones. Without duplications, the number of RWs will eventually go to zero, causing a catastrophic failure of the system. We propose a decentralized algorithm called DECAFORK that can maintain the number of RWs in the graph around a desired value even in the presence of arbitrary RW failures. Nodes continuously estimate the number of surviving RWs by estimating their return time distribution and fork the RWs when failures are likely to happen. We present extensive numerical simulations that show the performance of DECAFORK regarding fast detection and reaction to failures. We further present theoretical guarantees on the performance of this algorithm.
- [18] arXiv:2407.11763 (cross-list from cs.LG) [pdf, html, other]
Title: Enhancing Split Computing and Early Exit Applications through Predefined Sparsity
Authors: Luigi Capogrosso, Enrico Fraccaroli, Giulio Petrozziello, Francesco Setti, Samarjit Chakraborty, Franco Fummi, Marco Cristani
Comments: Accepted at the 27th Forum on specification and Design Languages (FDL 2024)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
In the past decade, Deep Neural Networks (DNNs) achieved state-of-the-art performance in a broad range of problems, spanning from object classification and action recognition to smart building and healthcare. The flexibility that makes DNNs such a pervasive technology comes at a price: the computational requirements preclude their deployment on most of the resource-constrained edge devices available today to solve real-time and real-world tasks. This paper introduces a novel approach to address this challenge by combining the concept of predefined sparsity with Split Computing (SC) and Early Exit (EE). In particular, SC aims at splitting a DNN, with a part of it deployed on an edge device and the rest on a remote server. EE, in contrast, allows the system to stop using the remote server and rely solely on the edge device's computation if the answer is already good enough. How to apply such predefined sparsity to an SC and EE paradigm has never been studied. This paper studies this problem and shows how predefined sparsity significantly reduces the computational, storage, and energy burdens during the training and inference phases, regardless of the hardware platform. This makes it a valuable approach for enhancing the performance of SC and EE applications. Experimental results showcase reductions exceeding 4x in storage and computational complexity without compromising performance. The source code is available at this https URL.
- [19] arXiv:2407.11798 (cross-list from cs.CL) [pdf, html, other]
Title: PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
Comments: 11 pages, submitted to SC24 conference
Subjects: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but also increase end-to-end latency per inference run, requiring high speculation acceptance rates to improve performance. Combined with a variable rate of acceptance across tasks, speculative inference techniques can result in reduced performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15$\times$ improvement in generation speed over standard speculative inference. PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation, the former improving latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.
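The "standard speculative inference" that PipeInfer is compared against can be sketched in its simplest greedy form: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with plus one token of its own. The sketch below is sequential and uses hypothetical callables `draft_next` and `target_next`; real systems verify all drafted positions in one batched target pass, and PipeInfer's asynchronous pipelining and early cancellation are not modelled here.

```python
def speculative_step(prefix, draft_next, target_next, num_drafted):
    """Return the tokens accepted in one greedy speculative decoding step."""
    # 1. The draft model proposes num_drafted tokens autoregressively.
    drafted = []
    context = list(prefix)
    for _ in range(num_drafted):
        token = draft_next(context)
        drafted.append(token)
        context.append(token)

    # 2. The target model checks each proposal in order.
    accepted = []
    context = list(prefix)
    for token in drafted:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


# Toy models over digits: the target counts upward, the draft errs after a 3.
def toy_target(context):
    return (context[-1] + 1) % 10


def toy_draft(context):
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, 4))
print(speculative_step([5], toy_draft, toy_target, 2))
```

When the draft is often wrong, most drafted tokens are discarded, which is why the abstract stresses the dependence on speculation acceptance rates.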
- [20] arXiv:2407.11807 (cross-list from cs.IT) [pdf, html, other]
Title: Scalable and Reliable Over-the-Air Federated Edge Learning
Subjects: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Federated edge learning (FEEL) has emerged as a core paradigm for large-scale optimization. However, FEEL still suffers from a communication bottleneck due to the transmission of high-dimensional model updates from the clients to the federator. Over-the-air computation (AirComp) leverages the additive property of multiple-access channels by aggregating the clients' updates over the channel to save communication resources. While analog uncoded transmission can benefit from the increased signal-to-noise ratio (SNR) due to the simultaneous transmission of many clients, potential errors may severely harm the learning process for small SNRs. To alleviate this problem, channel coding approaches were recently proposed for AirComp in FEEL. However, their error-correction capability degrades with an increasing number of clients. We propose a digital lattice-based code construction with constant error-correction capabilities in the number of clients, and compare to nested-lattice codes, well-known for their optimal rate and power efficiency in the point-to-point AWGN channel.
Cross submissions for Wednesday, 17 July 2024 (showing 12 of 12 entries)
- [21] arXiv:2306.08601 (replaced) [pdf, html, other]
Title: Capturing Periodic I/O Using Frequency Techniques
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Many HPC applications perform their I/O in bursts that follow a periodic pattern. This allows for making predictions as to when a burst occurs. System providers can take advantage of such knowledge to reduce file-system contention by actively scheduling I/O bandwidth. The effectiveness of this approach, however, depends on the ability to detect and quantify the periodicity of I/O patterns online. In this paper, we introduce FTIO, an online method to detect periodic I/O phases, which is based on discrete Fourier transform (DFT), combined with outlier detection. We provide metrics that gauge the confidence in the output and tell how far from being periodic the signal is. We validate our approach with large-scale experiments on a production system and examine its limitations extensively. Our experiments show that FTIO has a mean error below 11%. Finally, we demonstrate that FTIO allowed the I/O scheduler Set-10 to boost system utilization by 26% and reduce I/O slowdown by 56%.
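The basic idea of finding a period with a DFT can be sketched in a few lines: sample the I/O bandwidth at a fixed rate, transform the mean-removed signal, and read the period off the strongest frequency bin. The function name and the synthetic burst signal are assumptions, the DFT is a direct O(n^2) evaluation for clarity, and FTIO's outlier detection and confidence metrics are not reproduced here.

```python
import cmath


def dominant_period(samples, sampling_rate_hz):
    """Return the period (in seconds) of the strongest frequency, or None."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [value - mean for value in samples]  # drop the constant part

    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(centered)
        )
        power = abs(coefficient) ** 2
        if power > best_power:
            best_bin, best_power = k, power

    if best_bin == 0:
        return None  # flat signal: no periodic component found
    # Bin k corresponds to frequency k * rate / n, so the period is its inverse.
    return n / (best_bin * sampling_rate_hz)


# Synthetic trace sampled at 1 Hz: a 3-second I/O burst every 8 seconds.
trace = [1.0 if t % 8 < 3 else 0.0 for t in range(64)]
print(dominant_period(trace, sampling_rate_hz=1.0))
```

A production implementation would use an FFT and would need to judge whether the strongest bin actually stands out from the rest before trusting the result.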
- [22] arXiv:2307.05740 (replaced) [pdf, html, other]
Title: Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network
Comments: 15 pages, 7 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Performance (cs.PF); Programming Languages (cs.PL)
Sparse tensor decomposition and completion are common in numerous applications, ranging from machine learning to computational quantum chemistry. Typically, the main bottleneck in optimizing these models is the contraction of a single large sparse tensor with a network of several dense matrices or tensors (SpTTN). Prior works on high-performance tensor decomposition and completion have focused on performance and scalability optimizations for specific SpTTN kernels. We present algorithms and a runtime system for identifying and executing the most efficient loop nest for any SpTTN kernel. We consider both enumeration of such loop nests for autotuning and efficient algorithms for finding the lowest-cost loop nest for simpler metrics, such as buffer size or cache miss models. Our runtime system identifies the best choice of loop nest without user guidance, and also provides a distributed-memory parallelization of SpTTN kernels. We evaluate our framework using both real-world and synthetic tensors. Our results demonstrate that our approach outperforms available generalized state-of-the-art libraries and matches the performance of specialized codes.
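To show why the choice of loop nest matters, the sketch below evaluates one example kernel, A[i,r,s] = sum over j,k of T[i,j,k] * B[j,r] * C[k,s], with two different loop nests. The kernel, the dictionary-based sparse format, and the function names are assumptions chosen for illustration; the paper's enumeration algorithms and runtime system are not reproduced here.

```python
from collections import defaultdict

def ttmc_unfactored(nonzeros, B, C, R, S):
    """Loop nest 1: every nonzero of T performs R * S multiply-adds."""
    A = defaultdict(float)
    for (i, j, k), t in nonzeros.items():
        for r in range(R):
            for s in range(S):
                A[i, r, s] += t * B[j][r] * C[k][s]
    return A

def ttmc_factored(nonzeros, B, C, R, S):
    """Loop nest 2: contract over k into a length-S buffer per (i, j) fiber."""
    fibers = defaultdict(list)
    for (i, j, k), t in nonzeros.items():
        fibers[i, j].append((k, t))
    A = defaultdict(float)
    for (i, j), entries in fibers.items():
        buffer = [0.0] * S  # intermediate of size S, reused across r
        for k, t in entries:
            for s in range(S):
                buffer[s] += t * C[k][s]
        for r in range(R):
            for s in range(S):
                A[i, r, s] += B[j][r] * buffer[s]
    return A

# Hypothetical inputs: a sparse 2 x 3 x 2 tensor with three nonzeros.
nonzeros = {(0, 0, 0): 1.0, (0, 0, 1): 2.0, (1, 2, 1): 3.0}
B = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]  # shape (3, R)
C = [[1.0, 0.0], [2.0, 1.0]]              # shape (2, S)
result_unfactored = ttmc_unfactored(nonzeros, B, C, R=2, S=2)
result_factored = ttmc_factored(nonzeros, B, C, R=2, S=2)
```

Both nests compute the same output, but the second trades a buffer of size S for less work whenever several nonzeros share an (i, j) fiber. This is the kind of trade-off between operation count and buffer size that a cost model has to weigh.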
- [23] arXiv:2407.08584 (replaced) [pdf, html, other]
-
Title: Data-Locality-Aware Task Assignment and Scheduling for Distributed Job Executions
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
This paper investigates a data-locality-aware task assignment and scheduling problem aimed at minimizing job completion times for distributed job executions. Without prior knowledge of future job arrivals, we propose an optimal balanced task assignment algorithm (OBTA) that minimizes the completion time of each arriving job. We significantly reduce OBTA's computational overhead by narrowing the search space of potential solutions. Additionally, we extend an approximate algorithm known as water-filling (WF) and nontrivially prove that its approximation factor equals the number of task groups in the job assignment. We also design a novel heuristic, replica-deletion (RD), which outperforms WF. To further reduce the completion time of each job, we expand the problem to include job reordering, where we adjust the order of outstanding jobs following the shortest-estimated-time-first policy. Extensive trace-driven evaluations validate the performance and efficiency of the proposed algorithms.
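The effect of reordering outstanding jobs by the shortest-estimated-time-first policy can be seen in a toy example. This is a simplified sketch that assumes jobs run one after another on a single resource; it does not model data locality, task groups, OBTA, WF, or RD, and the job names and estimates are hypothetical.

```python
from itertools import accumulate

def mean_completion_time(estimated_times):
    """Mean completion time when jobs run back to back in the given order."""
    finish_times = list(accumulate(estimated_times))
    return sum(finish_times) / len(finish_times)

def shortest_estimated_time_first(jobs):
    """Reorder outstanding jobs by ascending estimated time (stable for ties)."""
    return sorted(jobs, key=lambda job: job[1])

# Hypothetical outstanding jobs as (name, estimated time) pairs in arrival order.
outstanding = [("job-a", 8.0), ("job-b", 2.0), ("job-c", 5.0)]
reordered = shortest_estimated_time_first(outstanding)
fifo_mean = mean_completion_time([t for _, t in outstanding])
setf_mean = mean_completion_time([t for _, t in reordered])
```

In arrival order the jobs finish at times 8, 10, and 15, whereas after reordering they finish at 2, 7, and 15, so the mean completion time drops from 11.0 to 8.0.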
- [24] arXiv:2312.13795 (replaced) [pdf, html, other]
-
Title: Sparse Training for Federated Learning with Regularized Error Correction
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated Learning (FL) has attracted much interest due to the significant advantages it brings to training deep neural network (DNN) models. However, since communication and computation resources are limited, training DNN models in FL systems faces challenges such as elevated computational and communication costs in complex tasks. Sparse training schemes are gaining increasing attention as a way to scale down the dimensionality of each client (i.e., node) transmission. Specifically, sparsification with error correction methods is a promising technique, where only important updates are sent to the parameter server (PS) and the rest are accumulated locally. While error correction methods have been shown to achieve a significant sparsification level of the client-to-PS message without harming convergence, pushing sparsity further remains unresolved due to the staleness effect. In this paper, we propose a novel algorithm, dubbed Federated Learning with Accumulated Regularized Embeddings (FLARE), to overcome this challenge. FLARE presents a novel sparse training approach via accumulated pulling of the updated models with regularization on the embeddings in the FL process, providing a powerful solution to the staleness effect, and pushing sparsity to an exceptional level. The performance of FLARE is validated through extensive experiments on diverse and complex models, achieving a remarkable sparsity level (10 times and more beyond the current state-of-the-art) along with significantly improved accuracy. Additionally, an open-source software package has been developed for the benefit of researchers and developers in related fields.
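The baseline that the abstract builds on, sparsification with error correction, can be sketched as top-k selection with a locally accumulated residual. This illustrates only that generic baseline and not FLARE; the function name, the choice of magnitude-based top-k, and the example values are assumptions.

```python
def sparsify_with_error_feedback(update, residual, k):
    """Send the k largest-magnitude entries; accumulate the rest locally."""
    # Add back what was withheld in earlier rounds (the error correction step).
    corrected = [u + r for u, r in zip(update, residual)]
    top = sorted(range(len(corrected)),
                 key=lambda i: abs(corrected[i]), reverse=True)[:k]
    keep = set(top)
    message = {i: corrected[i] for i in top}  # sparse message for the PS
    new_residual = [0.0 if i in keep else c for i, c in enumerate(corrected)]
    return message, new_residual

# Two hypothetical rounds for one client, sending a single entry per round.
residual = [0.0, 0.0, 0.0, 0.0]
msg1, residual = sparsify_with_error_feedback([0.5, -0.1, 0.25, 0.0], residual, k=1)
msg2, residual = sparsify_with_error_feedback([0.0, -0.1, 0.25, 0.0], residual, k=1)
```

In the second round the entry at index 2 is sent only because its withheld value from the first round was added back. Entries that stay in the residual for many rounds are applied late, which is one way to picture the staleness effect the abstract refers to.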
- [25] arXiv:2403.02694 (replaced) [pdf, html, other]
-
Title: MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services
Waris Gill (1), Mohamed Elidrisi (2), Pallavi Kalapatapu (2), Ammar Ahmed (3), Ali Anwar (3), Muhammad Ali Gulzar (1) ((1) Virginia Tech, USA, (2) Cisco, USA, (3) University of Minnesota, Minneapolis, USA)
Comments: This study presents the first privacy aware semantic cache for LLMs based on Federated Learning. MeanCache is the first cache that can handle contextual queries efficiently. Total pages 14
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters, where inference demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries, which constitute about 31% of the total queries. However, existing caching methods can neither find semantic similarities among LLM queries nor operate on contextual queries, leading to unacceptable false hit-and-miss rates. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user's semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model without violating user privacy. By placing a local cache in each user's device and using FL, MeanCache reduces the latency and costs and enhances model performance, resulting in lower false hit rates. MeanCache also encodes context chains for every cached query, offering a simple yet highly effective mechanism to discern contextual query responses from standalone ones. Our experiments, benchmarked against the state-of-the-art caching method, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision during semantic cache hit-and-miss decisions while performing even better on contextual queries. It also reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%.
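A semantic cache hit-or-miss decision can be sketched as a similarity lookup over query embeddings. This is a generic illustration, not MeanCache: the FL-trained similarity model and the context chains are not reproduced, the embeddings are hand-written placeholders, and the class name, cosine similarity measure, and 0.9 threshold are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Local cache keyed by query embeddings instead of exact query strings."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed cutoff for declaring a hit
        self.entries = []           # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the most similar cached response, or None on a miss."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.0], "cached answer")  # placeholder embedding
hit = cache.get([0.9, 0.1, 0.0])             # near-duplicate query
miss = cache.get([0.0, 1.0, 0.0])            # unrelated query
```

The threshold controls the trade-off between false hits and false misses: a lower value serves more queries from the cache at the risk of returning a response to a query that only looks similar.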