-
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
Authors:
Feiyang Kang,
Hoang Anh Just,
Yifan Sun,
Himanshu Jahagirdar,
Yuanzhi Zhang,
Rongxing Du,
Anit Kumar Sahu,
Ruoxi Jia
Abstract:
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerg…
▽ More
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/ ). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
Towards Realistic Mechanisms That Incentivize Federated Participation and Contribution
Authors:
Marco Bornstein,
Amrit Singh Bedi,
Anit Kumar Sahu,
Furqan Khan,
Furong Huang
Abstract:
Edge device participation in federating learning (FL) is typically studied through the lens of device-server communication (e.g., device dropout) and assumes an undying desire from edge devices to participate in FL. As a result, current FL frameworks are flawed when implemented in realistic settings, with many encountering the free-rider dilemma. In a step to push FL towards realistic settings, we…
▽ More
Edge device participation in federating learning (FL) is typically studied through the lens of device-server communication (e.g., device dropout) and assumes an undying desire from edge devices to participate in FL. As a result, current FL frameworks are flawed when implemented in realistic settings, with many encountering the free-rider dilemma. In a step to push FL towards realistic settings, we propose RealFM: the first federated mechanism that (1) realistically models device utility, (2) incentivizes data contribution and device participation, (3) provably removes the free-rider dilemma, and (4) relaxes assumptions on data homogeneity and data sharing. Compared to previous FL mechanisms, RealFM allows for a non-linear relationship between model accuracy and utility, which improves the utility gained by the server and participating devices. On real-world data, RealFM improves device and server utility, as well as data contribution, by over 3 and 4 magnitudes respectively compared to baselines.
△ Less
Submitted 22 May, 2024; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Federated Representation Learning for Automatic Speech Recognition
Authors:
Guruprasad V Ramesh,
Gopinath Chennupati,
Milind Rao,
Anit Kumar Sahu,
Ariya Rastrow,
Jasha Droppo
Abstract:
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respec…
▽ More
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new language, French, and show a 20% (WER) improvement over no pre-training.
△ Less
Submitted 7 August, 2023; v1 submitted 3 August, 2023;
originally announced August 2023.
-
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
Authors:
Feiyang Kang,
Hoang Anh Just,
Anit Kumar Sahu,
Ruoxi Jia
Abstract:
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size a…
▽ More
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, or/and difficult to optimize for data selection. This paper proposes a framework called <projektor>, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications demonstrates that <projektor> significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor. Also, <projektor> outperforms by a wide margin in data selection effectiveness compared to a range of other off-the-shelf solutions.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
Federated Self-Learning with Weak Supervision for Speech Recognition
Authors:
Milind Rao,
Gopinath Chennupati,
Gautam Tiwari,
Anit Kumar Sahu,
Anirudh Raju,
Ariya Rastrow,
Jasha Droppo
Abstract:
Automatic speech recognition (ASR) models with low-footprint are increasingly being deployed on edge devices for conversational agents, which enhances privacy. We study the problem of federated continual incremental learning for recurrent neural network-transducer (RNN-T) ASR models in the privacy-enhancing scheme of learning on-device, without access to ground truth human transcripts or machine t…
▽ More
Automatic speech recognition (ASR) models with low-footprint are increasingly being deployed on edge devices for conversational agents, which enhances privacy. We study the problem of federated continual incremental learning for recurrent neural network-transducer (RNN-T) ASR models in the privacy-enhancing scheme of learning on-device, without access to ground truth human transcripts or machine transcriptions from a stronger ASR model. In particular, we study the performance of a self-learning based scheme, with a paired teacher model updated through an exponential moving average of ASR models. Further, we propose using possibly noisy weak-supervision signals such as feedback scores and natural language understanding semantics determined from user behavior across multiple turns in a session of interactions with the conversational agent. These signals are leveraged in a multi-task policy-gradient training approach to improve the performance of self-learning for ASR. Finally, we show how catastrophic forgetting can be mitigated by combining on-device learning with a memory-replay approach using selected historical datasets. These innovations allow for 10% relative improvement in WER on new use cases with minimal degradation on other test sets in the absence of strong-supervision signals such as ground-truth transcriptions.
△ Less
Submitted 21 June, 2023;
originally announced June 2023.
-
Learning When to Trust Which Teacher for Weakly Supervised ASR
Authors:
Aakriti Agrawal,
Milind Rao,
Anit Kumar Sahu,
Gopinath Chennupati,
Andreas Stolcke
Abstract:
Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may be not be known or their training cadence is different from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by t…
▽ More
Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may be not be known or their training cadence is different from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio, and then trains the student model in an unsupervised setting. We show the efficacy of our approach using LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25\% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
△ Less
Submitted 21 June, 2023;
originally announced June 2023.
-
ILASR: Privacy-Preserving Incremental Learning for Automatic Speech Recognition at Production Scale
Authors:
Gopinath Chennupati,
Milind Rao,
Gurpreet Chadha,
Aaron Eakin,
Anirudh Raju,
Gautam Tiwari,
Anit Kumar Sahu,
Ariya Rastrow,
Jasha Droppo,
Andy Oberlin,
Buddha Nandanoor,
Prahalad Venkataramanan,
Zheng Wu,
Pankaj Sitpure
Abstract:
Incremental learning is one paradigm to enable model building and updating at scale with streaming data. For end-to-end automatic speech recognition (ASR) tasks, the absence of human annotated labels along with the need for privacy preserving policies for model building makes it a daunting challenge. Motivated by these challenges, in this paper we use a cloud based framework for production systems…
▽ More
Incremental learning is one paradigm to enable model building and updating at scale with streaming data. For end-to-end automatic speech recognition (ASR) tasks, the absence of human annotated labels along with the need for privacy preserving policies for model building makes it a daunting challenge. Motivated by these challenges, in this paper we use a cloud based framework for production systems to demonstrate insights from privacy preserving incremental learning for automatic speech recognition (ILASR). By privacy preserving, we mean, usage of ephemeral data which are not human annotated. This system is a step forward for production levelASR models for incremental/continual learning that offers near real-time test-bed for experimentation in the cloud for end-to-end ASR, while adhering to privacy-preserving policies. We show that the proposed system can improve the production models significantly(3%) over a new time period of six months even in the absence of human annotated labels with varying levels of weak supervision and large batch sizes in incremental learning. This improvement is 20% over test sets with new words and phrases in the new time period. We demonstrate the effectiveness of model building in a privacy-preserving incremental fashion for ASR while further exploring the utility of having an effective teacher model and use of large batch sizes.
△ Less
Submitted 22 July, 2022; v1 submitted 19 July, 2022;
originally announced July 2022.
-
FedBC: Calibrating Global and Local Models via Federated Learning Beyond Consensus
Authors:
Amrit Singh Bedi,
Chen Fan,
Alec Koppel,
Anit Kumar Sahu,
Brian M. Sadler,
Furong Huang,
Dinesh Manocha
Abstract:
In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considerin…
▽ More
In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop a novel primal-dual method called Federated Learning Beyond Consensus (\texttt{FedBC}). Theoretically, we establish that \texttt{FedBC} converges to a first-order stationary point at rates that matches the state of the art, up to an additional error term that depends on a tolerance parameter introduced to scalarize the multi-criterion formulation. Finally, we demonstrate that \texttt{FedBC} balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with state-of-the-art.
△ Less
Submitted 1 February, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Self-Aware Personalized Federated Learning
Authors:
Huili Chen,
Jie Ding,
Eric Tramel,
Shuang Wu,
Anit Kumar Sahu,
Salman Avestimehr,
Tao Zhang
Abstract:
In the context of personalized federated learning (FL), the critical challenge is to balance local model improvement and global model tuning when the personal and global objectives may not be exactly aligned. Inspired by Bayesian hierarchical models, we develop a self-aware personalized FL method where each client can automatically balance the training of its local personal model and the global mo…
▽ More
In the context of personalized federated learning (FL), the critical challenge is to balance local model improvement and global model tuning when the personal and global objectives may not be exactly aligned. Inspired by Bayesian hierarchical models, we develop a self-aware personalized FL method where each client can automatically balance the training of its local personal model and the global model that implicitly contributes to other clients' training. Such a balance is derived from the inter-client and intra-client uncertainty quantification. A larger inter-client variation implies more personalization is needed. Correspondingly, our method uses uncertainty-driven local training steps and aggregation rule instead of conventional local fine-tuning and sample size-based aggregation. With experimental studies on synthetic data, Amazon Alexa audio data, and public datasets such as MNIST, FEMNIST, CIFAR10, and Sent140, we show that our proposed method can achieve significantly improved personalization performance compared with the existing counterparts.
△ Less
Submitted 17 April, 2022;
originally announced April 2022.
-
Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise
Authors:
Dusan Jakovetic,
Dragana Bajovic,
Anit Kumar Sahu,
Soummya Kar,
Nemanja Milosevic,
Dusan Stamenkovic
Abstract:
We introduce a general framework for nonlinear stochastic gradient descent (SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, like clipped, normalized, signed or quantized gradient, but we also consider novel nonlinearity choices. We establish for the considered class of methods strong convergence guarantees assum…
▽ More
We introduce a general framework for nonlinear stochastic gradient descent (SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, like clipped, normalized, signed or quantized gradient, but we also consider novel nonlinearity choices. We establish for the considered class of methods strong convergence guarantees assuming a strongly convex cost function with Lipschitz continuous gradients under very general assumptions on the gradient noise. Most notably, we show that, for a nonlinearity with bounded outputs and for the gradient noise that may not have finite moments of order greater than one, the nonlinear SGD's mean squared error (MSE), or equivalently, the expected cost function's optimality gap, converges to zero at rate~$O(1/t^ζ)$, $ζ\in (0,1)$. In contrast, for the same noise setting, the linear SGD generates a sequence with unbounded variances. Furthermore, for the nonlinearities that can be decoupled component wise, like, e.g., sign gradient or component-wise clipping, we show that the nonlinear SGD asymptotically (locally) achieves a $O(1/t)$ rate in the weak convergence sense and explicitly quantify the corresponding asymptotic variance. Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state of the art alternatives on real data sets with heavy tail noises.
△ Less
Submitted 6 April, 2022;
originally announced April 2022.
-
Federated Learning Challenges and Opportunities: An Outlook
Authors:
Jie Ding,
Eric Tramel,
Anit Kumar Sahu,
Shuang Wu,
Salman Avestimehr,
Tao Zhang
Abstract:
Federated learning (FL) has been developed as a promising framework to leverage the resources of edge devices, enhance customers' privacy, comply with regulations, and reduce development costs. Although many methods and applications have been developed for FL, several critical challenges for practical FL systems remain unaddressed. This paper provides an outlook on FL development, categorized into…
▽ More
Federated learning (FL) has been developed as a promising framework to leverage the resources of edge devices, enhance customers' privacy, comply with regulations, and reduce development costs. Although many methods and applications have been developed for FL, several critical challenges for practical FL systems remain unaddressed. This paper provides an outlook on FL development, categorized into five emerging directions of FL, namely algorithm foundation, personalization, hardware and security constraints, lifelong learning, and nonstandard data. Our unique perspectives are backed by practical observations from large-scale federated systems for edge devices.
△ Less
Submitted 1 February, 2022;
originally announced February 2022.
-
Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits
Authors:
Sunwoo Lee,
Anit Kumar Sahu,
Chaoyang He,
Salman Avestimehr
Abstract:
Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discrepancy across the workers making the global loss c…
▽ More
Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discrepancy across the workers making the global loss converge slowly. While recent advanced optimization methods tackle the issue focused on non-IID settings, there still exists the model discrepancy issue due to the underlying periodic model averaging. We propose a partial model averaging framework that mitigates the model discrepancy issue in Federated Learning. The partial averaging encourages the local models to stay close to each other on parameter space, and it enables to more effectively minimize the global loss. Given a fixed number of iterations and a large number of workers (128), the partial averaging achieves up to 2.2% higher validation accuracy than the periodic full averaging.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
You Only Query Once: Effective Black Box Adversarial Attacks with Minimal Repeated Queries
Authors:
Devin Willmott,
Anit Kumar Sahu,
Fatemeh Sheikholeslami,
Filipe Condessa,
Zico Kolter
Abstract:
Researchers have repeatedly shown that it is possible to craft adversarial attacks on deep classifiers (small perturbations that significantly change the class label), even in the "black-box" setting where one only has query access to the classifier. However, all prior work in the black-box setting attacks the classifier by repeatedly querying the same image with minor modifications, usually thous…
▽ More
Researchers have repeatedly shown that it is possible to craft adversarial attacks on deep classifiers (small perturbations that significantly change the class label), even in the "black-box" setting where one only has query access to the classifier. However, all prior work in the black-box setting attacks the classifier by repeatedly querying the same image with minor modifications, usually thousands of times or more, making it easy for defenders to detect an ensuing attack. In this work, we instead show that it is possible to craft (universal) adversarial perturbations in the black-box setting by querying a sequence of different images only once. This attack prevents detection from high number of similar queries and produces a perturbation that causes misclassification when applied to any input to the classifier. In experiments, we show that attacks that adhere to this restriction can produce untargeted adversarial perturbations that fool the vast majority of MNIST and CIFAR-10 classifier inputs, as well as in excess of $60-70\%$ of inputs on ImageNet classifiers. In the targeted setting, we exhibit targeted black-box universal attacks on ImageNet classifiers with success rates above $20\%$ when only allowed one query per image, and $66\%$ when allowed two queries per image.
△ Less
Submitted 29 January, 2021;
originally announced February 2021.
-
Gaussian MRF Covariance Modeling for Efficient Black-Box Adversarial Attacks
Authors:
Anit Kumar Sahu,
Satya Narayan Shukla,
J. Zico Kolter
Abstract:
We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth order oracle, providing us with loss function evaluations. Although this setting has been investigated in previous work, most past approaches using zeroth order optimization implicitly assume that the gradients of the loss function with respect to the input images are \emph{unstruc…
▽ More
We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth order oracle, providing us with loss function evaluations. Although this setting has been investigated in previous work, most past approaches using zeroth order optimization implicitly assume that the gradients of the loss function with respect to the input images are \emph{unstructured}. In this work, we show that in fact substantial correlations exist within these gradients, and we propose to capture these correlations via a Gaussian Markov random field (GMRF). Given the intractability of the explicit covariance structure of the MRF, we show that the covariance structure can be efficiently represented using the Fast Fourier Transform (FFT), along with low-rank updates to perform exact posterior estimation under this model. We use this modeling technique to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method~(FGSM), and show that the method uses fewer queries and achieves higher attack success rates than the current state of the art. We also highlight the general applicability of this gradient modeling setup.
△ Less
Submitted 8 October, 2020;
originally announced October 2020.
-
Simple and Efficient Hard Label Black-box Adversarial Attacks in Low Query Budget Regimes
Authors:
Satya Narayan Shukla,
Anit Kumar Sahu,
Devin Willmott,
J. Zico Kolter
Abstract:
We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples for deep learning models solely based on information limited to output label~(hard label) to a queried data input. We propose a simple and efficient Bayesian Optimization~(BO) based approach for developing black-box adversarial attacks. Issues with BO's performance in high dimensions are avo…
▽ More
We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples for deep learning models solely based on information limited to output label~(hard label) to a queried data input. We propose a simple and efficient Bayesian Optimization~(BO) based approach for developing black-box adversarial attacks. Issues with BO's performance in high dimensions are avoided by searching for adversarial examples in a structured low-dimensional subspace. We demonstrate the efficacy of our proposed attack method by evaluating both $\ell_\infty$ and $\ell_2$ norm constrained untargeted and targeted hard label black-box attacks on three standard datasets - MNIST, CIFAR-10 and ImageNet. Our proposed approach consistently achieves 2x to 10x higher attack success rate while requiring 10x to 20x fewer queries compared to the current state-of-the-art black-box adversarial attacks.
△ Less
Submitted 11 June, 2021; v1 submitted 13 July, 2020;
originally announced July 2020.
-
Data-driven Thermal Model Inference with ARMAX, in Smart Environments, based on Normalized Mutual Information
Authors:
Zhanhong Jiang,
Jonathan Francis,
Anit Kumar Sahu,
Sirajum Munir,
Charles Shelton,
Anthony Rowe,
Mario Bergés
Abstract:
Understanding the models that characterize the thermal dynamics in a smart building is important for the comfort of its occupants and for its energy optimization. A significant amount of research has attempted to utilize thermodynamics (physical) models for smart building control, but these approaches remain challenging due to the stochastic nature of the intermittent environmental disturbances. T…
▽ More
Understanding the models that characterize the thermal dynamics in a smart building is important for the comfort of its occupants and for its energy optimization. A significant amount of research has attempted to utilize thermodynamics (physical) models for smart building control, but these approaches remain challenging due to the stochastic nature of the intermittent environmental disturbances. This paper presents a novel data-driven approach for indoor thermal model inference, which combines an Autoregressive Moving Average with eXogenous inputs model (ARMAX) with a Normalized Mutual Information scheme (NMI). Based on this information-theoretic method, NMI, causal dependencies between the indoor temperature and exogenous inputs are explicitly obtained as a guideline for the ARMAX model to find the dominating inputs. For validation, we use three datasets based on building energy systems-against which we compare our method to an autoregressive model with exogenous inputs (ARX), a regularized ARMAX model, and state-space models.
△ Less
Submitted 10 June, 2020;
originally announced June 2020.
-
FedDANE: A Federated Newton-Type Method
Authors:
Tian Li,
Anit Kumar Sahu,
Manzil Zaheer,
Maziar Sanjabi,
Ameet Talwalkar,
Virginia Smith
Abstract:
Federated learning aims to jointly learn statistical models over massively distributed remote devices. In this work, we propose FedDANE, an optimization method that we adapt from DANE, a method for classical distributed optimization, to handle the practical constraints of federated learning. We provide convergence guarantees for this method when learning over both convex and non-convex functions.…
▽ More
Federated learning aims to jointly learn statistical models over massively distributed remote devices. In this work, we propose FedDANE, an optimization method that we adapt from DANE, a method for classical distributed optimization, to handle the practical constraints of federated learning. We provide convergence guarantees for this method when learning over both convex and non-convex functions. Despite encouraging theoretical results, we find that the method has underwhelming performance empirically. In particular, through empirical simulations on both synthetic and real-world datasets, FedDANE consistently underperforms baselines of FedAvg and FedProx in realistic federated settings. We identify low device participation and statistical device heterogeneity as two underlying causes of this underwhelming performance, and conclude by suggesting several directions of future work.
△ Less
Submitted 7 January, 2020;
originally announced January 2020.
-
Black-box Adversarial Attacks with Bayesian Optimization
Authors:
Satya Narayan Shukla,
Anit Kumar Sahu,
Devin Willmott,
J. Zico Kolter
Abstract:
We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using information limited to loss function evaluations of input-output pairs. We use Bayesian optimization~(BO) to specifically cater to scenarios involving low query budgets to develop query efficient adversarial attacks. We alleviate the issues surrounding BO in regards to optimizing high…
▽ More
We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using information limited to loss function evaluations of input-output pairs. We use Bayesian optimization~(BO) to specifically cater to scenarios involving low query budgets to develop query efficient adversarial attacks. We alleviate the issues surrounding BO in regards to optimizing high dimensional deep learning models by effective dimension upsampling techniques. Our proposed approach achieves performance comparable to the state of the art black-box adversarial attacks albeit with a much lower average query count. In particular, in low query budget regimes, our proposed method reduces the query count up to $80\%$ with respect to the state of the art methods.
△ Less
Submitted 30 September, 2019;
originally announced September 2019.
-
Noisy Batch Active Learning with Deterministic Annealing
Authors:
Gaurav Gupta,
Anit Kumar Sahu,
Wan-Yi Lin
Abstract:
We study the problem of training machine learning models incrementally with batches of samples annotated with noisy oracles. We select each batch of samples that are important and also diverse via clustering and importance sampling. More importantly, we incorporate model uncertainty into the sampling probability to compensate for poor estimation of the importance scores when the training data is t…
▽ More
We study the problem of training machine learning models incrementally with batches of samples annotated with noisy oracles. We select each batch of samples that are important and also diverse via clustering and importance sampling. More importantly, we incorporate model uncertainty into the sampling probability to compensate for poor estimation of the importance scores when the training data is too small to build a meaningful model. Experiments on benchmark image classification datasets (MNIST, SVHN, CIFAR10, and EMNIST) show improvement over existing active learning strategies. We introduce an extra denoising layer to deep networks to make active learning robust to label noises and show significant improvements.
△ Less
Submitted 28 October, 2020; v1 submitted 26 September, 2019;
originally announced September 2019.
-
Federated Learning: Challenges, Methods, and Future Directions
Authors:
Tian Li,
Anit Kumar Sahu,
Ameet Talwalkar,
Virginia Smith
Abstract:
Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving da…
▽ More
Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.
△ Less
Submitted 21 August, 2019;
originally announced August 2019.
-
MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling
Authors:
Jianyu Wang,
Anit Kumar Sahu,
Zhouyi Yang,
Gauri Joshi,
Soummya Kar
Abstract:
This paper studies the problem of error-runtime trade-off, typically encountered in decentralized training based on stochastic gradient descent (SGD) using a given network. While a denser (sparser) network topology results in faster (slower) error convergence in terms of iterations, it incurs more (less) communication time/delay per iteration. In this paper, we propose MATCHA, an algorithm that ca…
▽ More
This paper studies the problem of error-runtime trade-off, typically encountered in decentralized training based on stochastic gradient descent (SGD) using a given network. While a denser (sparser) network topology results in faster (slower) error convergence in terms of iterations, it incurs more (less) communication time/delay per iteration. In this paper, we propose MATCHA, an algorithm that can achieve a win-win in this error-runtime trade-off for any arbitrary network topology. The main idea of MATCHA is to parallelize inter-node communication by decomposing the topology into matchings. To preserve fast error convergence speed, it identifies and communicates more frequently over critical links, and saves communication time by using other links less frequently. Experiments on a suite of datasets and deep neural networks validate the theoretical analyses and demonstrate that MATCHA takes up to $5\times$ less time than vanilla decentralized SGD to reach the same training loss.
△ Less
Submitted 18 November, 2019; v1 submitted 22 May, 2019;
originally announced May 2019.
-
Distributed stochastic optimization with gradient tracking over strongly-connected networks
Authors:
Ran Xin,
Anit Kumar Sahu,
Usman A. Khan,
Soummya Kar
Abstract:
In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxilia…
▽ More
In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The $\mathcal{S}$-$\mathcal{AB}$ algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, $\mathcal{S}$-$\mathcal{AB}$ is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.
△ Less
Submitted 9 April, 2019; v1 submitted 18 March, 2019;
originally announced March 2019.
-
Federated Optimization in Heterogeneous Networks
Authors:
Tian Li,
Anit Kumar Sahu,
Manzil Zaheer,
Maziar Sanjabi,
Ameet Talwalkar,
Virginia Smith
Abstract:
Federated Learning is a distributed learning paradigm with two key challenges that differentiate it from traditional distributed optimization: (1) significant variability in terms of the systems characteristics on each device in the network (systems heterogeneity), and (2) non-identically distributed data across the network (statistical heterogeneity). In this work, we introduce a framework, FedPr…
▽ More
Federated Learning is a distributed learning paradigm with two key challenges that differentiate it from traditional distributed optimization: (1) significant variability in terms of the systems characteristics on each device in the network (systems heterogeneity), and (2) non-identically distributed data across the network (statistical heterogeneity). In this work, we introduce a framework, FedProx, to tackle heterogeneity in federated networks. FedProx can be viewed as a generalization and re-parametrization of FedAvg, the current state-of-the-art method for federated learning. While this re-parameterization makes only minor modifications to the method itself, these modifications have important ramifications both in theory and in practice. Theoretically, we provide convergence guarantees for our framework when learning over data from non-identical distributions (statistical heterogeneity), and while adhering to device-level systems constraints by allowing each participating device to perform a variable amount of work (systems heterogeneity). Practically, we demonstrate that FedProx allows for more robust convergence than FedAvg across a suite of realistic federated datasets. In particular, in highly heterogeneous settings, FedProx demonstrates significantly more stable and accurate convergence behavior relative to FedAvg---improving absolute test accuracy by 22% on average.
△ Less
Submitted 21 April, 2020; v1 submitted 14 December, 2018;
originally announced December 2018.
-
Managing App Install Ad Campaigns in RTB: A Q-Learning Approach
Authors:
Anit Kumar Sahu,
Shaunak Mishra,
Narayan Bhamidipati
Abstract:
Real time bidding (RTB) enables demand side platforms (bidders) to scale ad campaigns across multiple publishers affiliated to an RTB ad exchange. While driving multiple campaigns for mobile app install ads via RTB, the bidder typically has to: (i) maintain each campaign's efficiency (i.e., meet advertiser's target cost-per-install), (ii) be sensitive to advertiser's budget, and (iii) make profit…
▽ More
Real time bidding (RTB) enables demand side platforms (bidders) to scale ad campaigns across multiple publishers affiliated to an RTB ad exchange. While driving multiple campaigns for mobile app install ads via RTB, the bidder typically has to: (i) maintain each campaign's efficiency (i.e., meet advertiser's target cost-per-install), (ii) be sensitive to advertiser's budget, and (iii) make profit after payouts to the ad exchange. In this process, there is a sense of delayed rewards for the bidder's actions; the exchange charges the bidder right after the ad is shown, but the bidder gets to know about resultant installs after considerable delay. This makes it challenging for the bidder to decide beforehand the bid (and corresponding cost charged to advertiser) for each ad display opportunity. To jointly handle the objectives mentioned above, we propose a state space based policy which decides the exchange bid and advertiser cost for each opportunity. The state space captures the current efficiency, budget utilization and profit. The policy based on this state space is trained on past decisions and outcomes via a novel Q-learning algorithm which accounts for the delay in install notifications. In our experiments based on data from app install campaigns managed by Yahoo's Gemini advertising platform, the Q-learning based policy led to a significant increase in the profit and number of efficient campaigns.
△ Less
Submitted 11 November, 2018;
originally announced November 2018.
-
Towards Gradient Free and Projection Free Stochastic Optimization
Authors:
Anit Kumar Sahu,
Manzil Zaheer,
Soummya Kar
Abstract:
This paper focuses on the problem of \emph{constrained} \emph{stochastic} optimization. A zeroth order Frank-Wolfe algorithm is proposed, which in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm makes it gradient free. Under convexity and smoothness assumption, we show that the proposed algorithm converges to the optimal objective function at a rate…
▽ More
This paper focuses on the problem of \emph{constrained} \emph{stochastic} optimization. A zeroth order Frank-Wolfe algorithm is proposed, which in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm makes it gradient free. Under convexity and smoothness assumption, we show that the proposed algorithm converges to the optimal objective function at a rate $O\left(1/T^{1/3}\right)$, where $T$ denotes the iteration count. In particular, the primal sub-optimality gap is shown to have a dimension dependence of $O\left(d^{1/3}\right)$, which is the best known dimension dependence among all zeroth order optimization algorithms with one directional derivative per iteration. For non-convex functions, we obtain the \emph{Frank-Wolfe} gap to be $O\left(d^{1/3}T^{-1/4}\right)$. Experiments on black-box optimization setups demonstrate the efficacy of the proposed algorithm.
△ Less
Submitted 18 February, 2019; v1 submitted 7 October, 2018;
originally announced October 2018.
-
$\mathcal{CIRFE}$: A Distributed Random Fields Estimator
Authors:
Anit Kumar Sahu,
Dusan Jakovetic,
Soummya Kar
Abstract:
This paper presents a communication efficient distributed algorithm, $\mathcal{CIRFE}$ of the \emph{consensus}+\emph{innovations} type, to estimate a high-dimensional parameter in a multi-agent network, in which each agent is interested in reconstructing only a few components of the parameter. This problem arises for example when monitoring the high-dimensional distributed state of a large-scale i…
▽ More
This paper presents a communication efficient distributed algorithm, $\mathcal{CIRFE}$ of the \emph{consensus}+\emph{innovations} type, to estimate a high-dimensional parameter in a multi-agent network, in which each agent is interested in reconstructing only a few components of the parameter. This problem arises for example when monitoring the high-dimensional distributed state of a large-scale infrastructure with a network of limited capability sensors and where each sensor is tasked with estimating some local components of the state. At each observation sampling epoch, each agent updates its local estimate of the parameter components in its interest set by simultaneously processing the latest locally sensed information~(\emph{innovations}) and the parameter estimates from agents~(\emph{consensus}) in its communication neighborhood given by a time-varying possibly sparse graph. Under minimal conditions on the inter-agent communication network and the sensing models, almost sure convergence of the estimate sequence at each agent to the components of the true parameter in its interest set is established. Furthermore, the paper establishes the performance of $\mathcal{CIRFE}$ in terms of asymptotic covariance of the estimate sequences and specifically characterizes the dependencies of the component wise asymptotic covariance in terms of the number of agents tasked with estimating it. Finally, simulation experiments demonstrate the efficacy of $\mathcal{CIRFE}$.
△ Less
Submitted 11 June, 2018; v1 submitted 13 February, 2018;
originally announced February 2018.
-
Automatic Pill Reminder for Easy Supervision
Authors:
A. Jabeena,
Animesh Kumar Sahu,
Rohit Roy,
N. Sardar Basha
Abstract:
In this paper we present a working model of an automatic pill reminder and dispenser setup that can alleviate irregularities in taking prescribed dosage of medicines at the right time dictated by the medical practitioner and switch from approaches predominantly dependent on human memory to automation with negligible supervision, thus relieving persons from error-prone tasks of giving wrong medicin…
▽ More
In this paper we present a working model of an automatic pill reminder and dispenser setup that can alleviate irregularities in taking prescribed dosage of medicines at the right time dictated by the medical practitioner and switch from approaches predominantly dependent on human memory to automation with negligible supervision, thus relieving persons from error-prone tasks of giving wrong medicine at the wrong time in the wrong amount.
△ Less
Submitted 17 November, 2017;
originally announced November 2017.
-
Distributed Constrained Recursive Nonlinear Least-Squares Estimation: Algorithms and Asymptotics
Authors:
Anit Kumar Sahu,
Soummya Kar,
Jose' M. F. Moura,
H. Vincent Poor
Abstract:
This paper focuses on the problem of recursive nonlinear least squares parameter estimation in multi-agent networks, in which the individual agents observe sequentially over time an independent and identically distributed (i.i.d.) time-series consisting of a nonlinear function of the true but unknown parameter corrupted by noise. A distributed recursive estimator of the \emph{consensus} + \emph{in…
▽ More
This paper focuses on the problem of recursive nonlinear least squares parameter estimation in multi-agent networks, in which the individual agents observe sequentially over time an independent and identically distributed (i.i.d.) time-series consisting of a nonlinear function of the true but unknown parameter corrupted by noise. A distributed recursive estimator of the \emph{consensus} + \emph{innovations} type, namely $\mathcal{CIWNLS}$, is proposed, in which the agents update their parameter estimates at each observation sampling epoch in a collaborative way by simultaneously processing the latest locally sensed information~(\emph{innovations}) and the parameter estimates from other agents~(\emph{consensus}) in the local neighborhood conforming to a pre-specified inter-agent communication topology. Under rather weak conditions on the connectivity of the inter-agent communication and a \emph{global observability} criterion, it is shown that at every network agent, the proposed algorithm leads to consistent parameter estimates. Furthermore, under standard smoothness assumptions on the local observation functions, the distributed estimator is shown to yield order-optimal convergence rates, i.e., as far as the order of pathwise convergence is concerned, the local parameter estimates at each agent are as good as the optimal centralized nonlinear least squares estimator which would require access to all the observations across all the agents at all times. In order to benchmark the performance of the proposed distributed $\mathcal{CIWNLS}$ estimator with that of the centralized nonlinear least squares estimator, the asymptotic normality of the estimate sequence is established and the asymptotic covariance of the distributed estimator is evaluated. Finally, simulation results are presented which illustrate and verify the analytical findings.
△ Less
Submitted 19 October, 2016; v1 submitted 31 January, 2016;
originally announced February 2016.
-
Recursive Distributed Detection for Composite Hypothesis Testing: Nonlinear Observation Models in Additive Gaussian Noise
Authors:
Anit Kumar Sahu,
Soummya Kar
Abstract:
This paper studies recursive composite hypothesis testing in a network of sparsely connected agents. The network objective is to test a simple null hypothesis against a composite alternative concerning the state of the field, modeled as a vector of (continuous) unknown parameters determining the parametric family of probability measures induced on the agents' observation spaces under the hypothese…
▽ More
This paper studies recursive composite hypothesis testing in a network of sparsely connected agents. The network objective is to test a simple null hypothesis against a composite alternative concerning the state of the field, modeled as a vector of (continuous) unknown parameters determining the parametric family of probability measures induced on the agents' observation spaces under the hypotheses. Specifically, under the alternative hypothesis, each agent sequentially observes an independent and identically distributed time-series consisting of a (nonlinear) function of the true but unknown parameter corrupted by Gaussian noise, whereas, under the null, they obtain noise only. Two distributed recursive generalized likelihood ratio test type algorithms of the \emph{consensus+innovations} form are proposed, namely $\mathcal{CIGLRT-L}$ and $\mathcal{CIGLRT-NL}$, in which the agents estimate the underlying parameter and in parallel also update their test decision statistics by simultaneously processing the latest local sensed information and information obtained from neighboring agents. For $\mathcal{CIGLRT-NL}$, for a broad class of nonlinear observation models and under a global observability condition, algorithm parameters which ensure asymptotically decaying probabilities of errors~(probability of miss and probability of false detection) are characterized. For $\mathcal{CIGLRT-L}$, a linear observation model is considered and upper bounds on large deviations decay exponent for the error probabilities are obtained.
△ Less
Submitted 21 February, 2017; v1 submitted 18 January, 2016;
originally announced January 2016.