Showing 1–50 of 59 results for author: Rawat, A S

  1. arXiv:2407.10005  [pdf, other]

    cs.LG cs.AI cs.CL math.OC

    Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond

    Authors: Yingcong Li, Ankit Singh Rawat, Samet Oymak

    Abstract: Recent research has shown that Transformers with linear attention are capable of in-context learning (ICL) by implementing a linear estimator through gradient descent steps. However, the existing results on the optimization landscape apply under stylized settings where task and feature vectors are assumed to be IID and the attention weights are fully parameterized. In this work, we develop a stron…

    Submitted 13 July, 2024; originally announced July 2024.

  2. arXiv:2406.17968  [pdf, other]

    cs.IR cs.AI cs.LG stat.ML

    Efficient Document Ranking with Learnable Late Interactions

    Authors: Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar

    Abstract: Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been p…

    Submitted 25 June, 2024; originally announced June 2024.

  3. arXiv:2406.00060  [pdf, other]

    cs.CL cs.LG

    Cascade-Aware Training of Language Models

    Authors: Congchao Wang, Sean Augenstein, Keith Rush, Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Aditya Krishna Menon, Alec Go

    Abstract: Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employ smaller models for simpler queries. Cascaded systems are typically built with independently trained models, neglecting the advantages of considering inference-time interactions of the…

    Submitted 29 May, 2024; originally announced June 2024.

    Comments: 22 pages, 13 figures

  4. arXiv:2405.19261  [pdf, other]

    cs.CL cs.AI cs.LG

    Faster Cascades via Speculative Decoding

    Authors: Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in p…

    Submitted 29 May, 2024; originally announced May 2024.

  5. arXiv:2404.10136  [pdf, other]

    cs.CL cs.AI cs.LG

    Language Model Cascades: Token-level uncertainty and beyond

    Authors: Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning c…

    Submitted 15 April, 2024; originally announced April 2024.

  6. arXiv:2403.08081  [pdf, other]

    cs.LG cs.AI cs.CL math.OC

    Mechanics of Next Token Prediction with Self-Attention

    Authors: Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

    Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: Accepted to AISTATS 2024

  7. arXiv:2402.13512  [pdf, other]

    cs.LG cs.AI cs.CL

    From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

    Authors: M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak

    Abstract: Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: 30 pages

  8. arXiv:2310.10636  [pdf, other]

    cs.LG

    Dual-Encoders for Extreme Multi-Label Classification

    Authors: Nilesh Gupta, Devvrit Khatri, Ankit S Rawat, Srinadh Bhojanapalli, Prateek Jain, Inderjit Dhillon

    Abstract: Dual-encoder (DE) models are widely used in retrieval tasks, most commonly studied on open QA benchmarks that are often characterized by multi-class and limited training data. In contrast, their performance in multi-label and data-rich retrieval settings like extreme multi-label classification (XMC), remains under-explored. Current empirical evidence indicates that DE models fall significantly sho…

    Submitted 17 March, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: 27 pages, 8 figures

    Journal ref: ICLR 2024 camera-ready publication

  9. arXiv:2310.08461  [pdf, other]

    cs.CL cs.AI cs.LG

    DistillSpec: Improving Speculative Decoding via Knowledge Distillation

    Authors: Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

    Abstract: Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, w…

    Submitted 30 March, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

  10. arXiv:2310.05337  [pdf, other]

    cs.LG cs.CV

    What do larger image classifiers memorise?

    Authors: Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels. To carefully study this issue, Feldman proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the correspondi…

    Submitted 8 October, 2023; originally announced October 2023.

    MSC Class: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

  11. arXiv:2310.02226  [pdf, other]

    cs.CL cs.AI cs.LG

    Think before you speak: Training Language Models With Pause Tokens

    Authors: Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

    Abstract: Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on lan…

    Submitted 20 April, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: Published at ICLR 2024

  12. arXiv:2307.02764  [pdf, other]

    cs.LG stat.ML

    When Does Confidence-Based Cascade Deferral Suffice?

    Authors: Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sanjiv Kumar

    Abstract: Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite…

    Submitted 23 January, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  13. arXiv:2306.03435  [pdf, other]

    cs.LG cs.CL stat.ML

    On the Role of Attention in Prompt-tuning

    Authors: Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

    Abstract: Prompt-tuning is an emerging strategy to adapt large language models (LLM) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mi…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: Published at ICML 2023

  14. arXiv:2302.01576  [pdf, other]

    cs.LG cs.AI stat.ME stat.ML

    ResMem: Learn what you can and memorize the rest

    Authors: Zitong Yang, Michal Lukasik, Vaishnavh Nagarajan, Zonglin Li, Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a ne…

    Submitted 20 October, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

  15. arXiv:2301.12245  [pdf, other]

    cs.LG

    Supervision Complexity and its Role in Knowledge Distillation

    Authors: Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar

    Abstract: Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate inte…

    Submitted 28 January, 2023; originally announced January 2023.

    Comments: Published at ICLR 2023

  16. arXiv:2301.12005  [pdf, other]

    cs.LG

    EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

    Authors: Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, Sanjiv Kumar

    Abstract: Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages…

    Submitted 3 July, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

  17. arXiv:2211.05110  [pdf, other]

    cs.CL cs.AI cs.LG

    Large Language Models with Controllable Working Memory

    Authors: Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

    Abstract: Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP), owing to their excellent understanding and generation abilities. Remarkably, what further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. While many downstream applications provide the model with an informational context to aid its performa…

    Submitted 9 November, 2022; originally announced November 2022.

  18. arXiv:2210.06313  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

    Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

    Abstract: This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP…

    Submitted 9 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: A short version was presented at ICLR 2023. Previous title: Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

  19. arXiv:2210.02617  [pdf, other]

    cs.LG

    Generalization Properties of Retrieval-based Models

    Authors: Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

    Abstract: Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training…

    Submitted 5 October, 2022; originally announced October 2022.

  20. arXiv:2210.02415  [pdf, other]

    cs.LG cs.DS stat.ML

    A Fourier Approach to Mixture Learning

    Authors: Mingda Qiao, Guru Guruganesh, Ankit Singh Rawat, Avinava Dubey, Manzil Zaheer

    Abstract: We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghava…

    Submitted 5 October, 2022; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: To appear at NeurIPS 2022; v2 corrected author information

  21. arXiv:2208.06825  [pdf, other]

    cs.LG

    Teacher Guided Training: An Efficient Framework for Knowledge Transfer

    Authors: Manzil Zaheer, Ankit Singh Rawat, Seungyeon Kim, Chong You, Himanshu Jain, Andreas Veit, Rob Fergus, Sanjiv Kumar

    Abstract: The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we propose the teacher-guided training (TGT) framework for training a…

    Submitted 14 August, 2022; originally announced August 2022.

  22. arXiv:2204.13208  [pdf, other]

    cs.LG stat.ML

    ELM: Embedding and Logit Margins for Long-Tail Learning

    Authors: Wittawat Jitkrittum, Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

    Abstract: Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural m…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: 24 pages

  23. arXiv:2201.11865  [pdf, other]

    cs.LG cs.DC

    FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

    Authors: Jianyu Wang, Hang Qi, Ankit Singh Rawat, Sashank Reddi, Sagar Waghmare, Felix X. Yu, Gauri Joshi

    Abstract: In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solut…

    Submitted 16 February, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

  24. arXiv:2110.10305  [pdf, other]

    cs.LG

    When in Doubt, Summon the Titans: Efficient Inference with Large Models

    Authors: Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar

    Abstract: Scaling neural networks to "large" sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on distillation that realizes the modelling benefits of the large models, while la…

    Submitted 19 October, 2021; originally announced October 2021.

  25. arXiv:2105.05736  [pdf, other]

    cs.LG stat.ML

    Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

    Authors: Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar

    Abstract: Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off pe…

    Submitted 12 May, 2021; originally announced May 2021.

    Comments: To appear in ICML 2021

  26. arXiv:2102.06849  [pdf, other]

    cs.LG cs.AI stat.ML

    Distilling Double Descent

    Authors: Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

    Abstract: Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that student is provided with soft labels, e.g., probabilities or confidences, from the teacher model. In this work, we show, that,…

    Submitted 12 February, 2021; originally announced February 2021.

  27. arXiv:2102.03349  [pdf, other]

    cs.LG

    On the Reproducibility of Neural Network Predictions

    Authors: Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, Sanjiv Kumar

    Abstract: Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause churn -- for the same input, disagreements between predictions of the two models independently trained by the same algorithm, con…

    Submitted 5 February, 2021; originally announced February 2021.

    Comments: 19 pages, 7 figures

  28. arXiv:2012.00363  [pdf, other]

    cs.CL cs.LG

    Modifying Memories in Transformer Models

    Authors: Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar

    Abstract: Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transfor…

    Submitted 1 December, 2020; originally announced December 2020.

  29. arXiv:2007.07314  [pdf, other]

    cs.LG stat.ML

    Long-tail learning via logit adjustment

    Authors: Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv Kumar

    Abstract: Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naïve learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these chall…

    Submitted 9 July, 2021; v1 submitted 14 July, 2020; originally announced July 2020.

    Comments: Published as a conference paper in ICLR 2021

  30. arXiv:2007.06555  [pdf, other]

    cs.LG cs.DS stat.ML

    Adversarial robustness via robust low rank representations

    Authors: Pranjal Awasthi, Himanshu Jain, Ankit Singh Rawat, Aravindan Vijayaraghavan

    Abstract: Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time. In this work we highlight the benefits of natural low rank representations that often exist for real data such as images, for training neural networks with certified robustness guarantees. Our first contribution is for certified robustness to perturbations measured i…

    Submitted 1 August, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

    Comments: fixed a bug in the proof of Proposition B.2

  31. arXiv:2006.04862  [pdf, other]

    cs.LG stat.ML

    $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

    Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental…

    Submitted 19 December, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 31 pages, NeurIPS 2020 Camera-ready

  32. arXiv:2005.10419  [pdf, other]

    cs.LG stat.ML

    Why distillation helps: a statistical perspective

    Authors: Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

    Abstract: Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation w…

    Submitted 20 May, 2020; originally announced May 2020.

  33. arXiv:2004.10915  [pdf, other]

    cs.LG stat.ML

    Doubly-stochastic mining for heterogeneous retrieval

    Authors: Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge. The first challenge concerns scalability: with a large number of labels, standard losses are difficult to optimise even on a single example.…

    Submitted 22 April, 2020; originally announced April 2020.

  34. arXiv:2004.10342  [pdf, ps, other]

    cs.LG stat.ML

    Federated Learning with Only Positive Labels

    Authors: Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

    Abstract: We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class. As a result, during each federated learning round, the users need to locally update the classifier without having access to the features and the model parameters for the negative classes. Thus, naively employing conventional decentra…

    Submitted 21 April, 2020; originally announced April 2020.

  35. arXiv:2004.05465  [pdf, other]

    cs.LG stat.ML

    Robust Large-Margin Learning in Hyperbolic Space

    Authors: Melanie Weber, Manzil Zaheer, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar

    Abstract: Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first the…

    Submitted 1 November, 2022; v1 submitted 11 April, 2020; originally announced April 2020.

    Comments: Revision corrects error in section 3.1

  36. arXiv:2002.08892  [pdf, other]

    cs.DC cs.DS cs.IT cs.LG

    Reliable Distributed Clustering with Redundant Data Assignment

    Authors: Venkata Gandikota, Arya Mazumdar, Ankit Singh Rawat

    Abstract: In this paper, we present distributed generalized clustering algorithms that can handle large scale data across multiple machines in spite of straggling or unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations. The assignment scheme…

    Submitted 20 February, 2020; originally announced February 2020.

  37. arXiv:2002.07028  [pdf, other]

    cs.LG stat.ML

    Low-Rank Bottleneck in Multi-head Attention Models

    Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in the downstream tasks. In this paper we identify one…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 17 pages, 4 figures

  38. arXiv:2001.09599  [pdf, other]

    cs.AR

    Achieving Multi-Port Memory Performance on Single-Port Memory with Coding Techniques

    Authors: Hardik Jain, Matthew Edwards, Ethan Elenberg, Ankit Singh Rawat, Sriram Vishwanath

    Abstract: Many performance critical systems today must rely on performance enhancements, such as multi-port memories, to keep up with the increasing demand of memory-access capacity. However, the large area footprints and complexity of existing multi-port memory designs limit their applicability. This paper explores a coding theoretic framework to address this problem. In particular, this paper introduces a…

    Submitted 27 January, 2020; originally announced January 2020.

    Comments: 10 pages, 20 figures, ICICT 2020 conference

  39. arXiv:1912.10077  [pdf, other]

    cs.LG stat.ML

    Are Transformers universal approximators of sequence-to-sequence functions?

    Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

    Abstract: Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using…

    Submitted 24 February, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

    Comments: 23 pages, ICLR 2020 camera-ready version

  40. arXiv:1907.10747  [pdf, other]

    cs.LG stat.ML

    Sampled Softmax with Random Fourier Features

    Authors: Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar

    Abstract: The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method. However, the sampled softmax provides a biased esti…

    Submitted 31 December, 2019; v1 submitted 24 July, 2019; originally announced July 2019.

    Comments: In NeurIPS 2019

  41. arXiv:1807.06976  [pdf, ps, other]

    cs.IT eess.SP math.ST

    The Generalized Lasso for Sub-gaussian Measurements with Dithered Quantization

    Authors: Christos Thrampoulidis, Ankit Singh Rawat

    Abstract: In the problem of structured signal recovery from high-dimensional linear observations, it is commonly assumed that full-precision measurements are available. Under this assumption, the recovery performance of the popular Generalized Lasso (G-Lasso) is by now well-established. In this paper, we extend these types of results to the practically relevant settings with quantized measurements. We study…

    Submitted 18 July, 2018; originally announced July 2018.

  42. arXiv:1805.08327  [pdf, ps, other]

    stat.ML cs.DC cs.IT cs.LG

    Robust Gradient Descent via Moment Encoding with LDPC Codes

    Authors: Raj Kumar Maity, Ankit Singh Rawat, Arya Mazumdar

    Abstract: This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of straggling processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the secon…

    Submitted 2 January, 2019; v1 submitted 21 May, 2018; originally announced May 2018.

  43. arXiv:1803.04304  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    Representation Learning and Recovery in the ReLU Model

    Authors: Arya Mazumdar, Ankit Singh Rawat

    Abstract: Rectified linear units, or ReLUs, have become the preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a ReLU-network -- a neural network with ReLU activations. As a primarily theoretical study, we limit ourselves to a single-layer network. The first problem we study…

    Submitted 12 March, 2018; originally announced March 2018.

  44. arXiv:1709.08216  [pdf, other]

    cs.IT cs.CC

    MDS Code Constructions with Small Sub-packetization and Near-optimal Repair Bandwidth

    Authors: Ankit Singh Rawat, Itzhak Tamo, Venkatesan Guruswami, Klim Efremenko

    Abstract: This paper addresses the problem of constructing MDS codes that enable exact repair of each code block with small repair bandwidth, which refers to the total amount of information flow from the remaining code blocks during the repair process. This problem naturally arises in the context of distributed storage systems as the node repair problem [7]. The constructions of exact-repairable MDS codes w…

    Submitted 24 September, 2017; originally announced September 2017.

    Comments: Significant overlap with arXiv:1608.00191

  45. arXiv:1611.09621  [pdf, other]

    stat.ML cs.IT cs.LG

    Associative Memory using Dictionary Learning and Expander Decoding

    Authors: Arya Mazumdar, Ankit Singh Rawat

    Abstract: An associative memory is a framework of content-addressable memory that stores a collection of message vectors (or a dataset) over a neural network while enabling a neurally feasible mechanism to recover any message in the dataset from its noisy version. Designing an associative memory requires addressing two main tasks: 1) learning phase: given a dataset, learn a concise representation of the dat…

    Submitted 29 November, 2016; originally announced November 2016.

    Comments: To appear in AAAI 2017

  46. arXiv:1608.01732  [pdf, ps, other]

    cs.IT

    A Note on Secure Minimum Storage Regenerating Codes

    Authors: Ankit Singh Rawat

    Abstract: This short note revisits the problem of designing secure minimum storage regenerating (MSR) codes for distributed storage systems. A secure MSR code ensures that a distributed storage system does not reveal the stored information to a passive eavesdropper. The eavesdropper is assumed to have access to the content stored on $\ell_1$ number of storage nodes in the system and the data downloaded duri…

    Submitted 4 August, 2016; originally announced August 2016.

  47. arXiv:1608.00191  [pdf, ps, other]

    cs.IT cs.DS

    New MDS codes with small sub-packetization and near-optimal repair bandwidth

    Authors: Venkatesan Guruswami, Ankit Singh Rawat

    Abstract: An $(n, M)$ vector code $\mathcal{C} \subseteq \mathbb{F}^n$ is a collection of $M$ codewords where $n$ elements (from the field $\mathbb{F}$) in each of the codewords are referred to as code blocks. Assuming that $\mathbb{F} \cong \mathbb{B}^{\ell}$, the code blocks are treated as $\ell$-length vectors over the base field $\mathbb{B}$. Equivalently, the code is said to have the sub-packetization…

    Submitted 31 July, 2016; originally announced August 2016.

  48. arXiv:1603.04822  [pdf, other]

    cs.IT

    Centralized Repair of Multiple Node Failures with Applications to Communication Efficient Secret Sharing

    Authors: Ankit Singh Rawat, O. Ozan Koyluoglu, Sriram Vishwanath

    Abstract: This paper considers a distributed storage system, where multiple storage nodes can be reconstructed simultaneously at a centralized location. This centralized multi-node repair (CMR) model is a generalization of regenerating codes that allow for bandwidth-efficient repair of a single failed node. This work focuses on the trade-off between the amount of data stored and repair bandwidth in this CMR…

    Submitted 15 March, 2016; originally announced March 2016.

  49. arXiv:1601.06362  [pdf, ps, other]

    cs.IT

    Progress on High-rate MSR Codes: Enabling Arbitrary Number of Helper Nodes

    Authors: Ankit Singh Rawat, O. Ozan Koyluoglu, Sriram Vishwanath

    Abstract: This paper presents a construction for high-rate MDS codes that enable bandwidth-efficient repair of a single node. Such MDS codes are also referred to as minimum storage regenerating (MSR) codes in the distributed storage literature. The construction presented in this paper generates MSR codes for every possible number of helper nodes $d$, as $d$ is a design parameter in the construction. Furthe…

    Submitted 24 January, 2016; originally announced January 2016.

  50. arXiv:1410.2920  [pdf, other]

    cs.IT

    Batch Codes through Dense Graphs without Short Cycles

    Authors: Alexandros G. Dimakis, Anna Gal, Ankit Singh Rawat, Zhao Song

    Abstract: Consider a large database of $n$ data items that need to be stored using $m$ servers. We study how to encode information so that a large number $k$ of read requests can be performed in parallel while the rate remains constant (and ideally approaches one). This problem is equivalent to the design of multiset Batch Codes introduced by Ishai, Kushilevitz, Ostrovsky and Sahai [17]. We give families…

    Submitted 10 October, 2014; originally announced October 2014.