-
Text Quality-Based Pruning for Efficient Training of Language Models
Authors:
Vasu Sharma,
Karthik Padthe,
Newsha Ardalani,
Kushal Tirumala,
Russell Howes,
Hu Xu,
Po-Yao Huang,
Shang-Wen Li,
Armen Aghajanyan,
Gargi Ghosh,
Luke Zettlemoyer
Abstract:
In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, which makes the training process extremely laborious. In this paper, we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner to assign the text instances a "quality score".
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LMs. Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training.
For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LMs while using 40% less data and training 42% faster when training on the OpenWebText dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset.
Submitted 10 May, 2024; v1 submitted 26 April, 2024;
originally announced May 2024.
-
Construction of CCC and ZCCS Through Additive Characters Over Galois Field
Authors:
Gobinda Ghosh,
Sudhan Majhi,
Subhabrata Paul
Abstract:
With the rapid progression in wireless communication technologies, especially in multicarrier code-division multiple access (MC-CDMA), there is a need for advanced code construction methods. Traditional approaches, mainly based on generalized Boolean functions, have limitations in code length versatility. This paper introduces a novel approach to constructing complete complementary codes (CCC) and Z-complementary code sets (ZCCS) for reducing interference in MC-CDMA systems. The proposed construction, distinct from Boolean function-based approaches, employs additive characters over Galois fields GF($p^{r}$), where $p$ is prime and $r$ is a positive integer. First, we develop CCCs with lengths of $p^{r}$, which are then extended to construct ZCCS with both unreported lengths and sizes of $np^{r}$, where $n$ is an arbitrary positive integer. The versatility of this method is further highlighted as it includes the lengths of ZCCS reported in prior studies as special cases, underscoring the method's comprehensive nature and superiority.
Submitted 18 March, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Demystifying CLIP Data
Authors:
Hu Xu,
Saining Xie,
Xiaoqing Ellen Tan,
Po-Yao Huang,
Russell Howes,
Vasu Sharma,
Shang-Wen Li,
Gargi Ghosh,
Luke Zettlemoyer,
Christoph Feichtenhofer
Abstract:
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and, in our pursuit of making it open to the community, introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata are made available at https://github.com/facebookresearch/MetaCLIP.
Submitted 7 April, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Authors:
Lili Yu,
Bowen Shi,
Ramakanth Pasunuru,
Benjamin Muller,
Olga Golovneva,
Tianlu Wang,
Arun Babu,
Binh Tang,
Brian Karrer,
Shelly Sheynin,
Candace Ross,
Adam Polyak,
Russell Howes,
Vasu Sharma,
Puxin Xu,
Hovhannes Tamoyan,
Oron Ashual,
Uriel Singer,
Shang-Wen Li,
Susan Zhang,
Richard James,
Gargi Ghosh,
Yaniv Taigman,
Maryam Fazel-Zarandi,
Asli Celikyilmaz
, et al. (2 additional authors not shown)
Abstract:
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
Submitted 5 September, 2023;
originally announced September 2023.
-
LIMA: Less Is More for Alignment
Authors:
Chunting Zhou,
Pengfei Liu,
Puxin Xu,
Srini Iyer,
Jiao Sun,
Yuning Mao,
Xuezhe Ma,
Avia Efrat,
Ping Yu,
Lili Yu,
Susan Zhang,
Gargi Ghosh,
Mike Lewis,
Luke Zettlemoyer,
Omer Levy
Abstract:
Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
Submitted 18 May, 2023;
originally announced May 2023.
-
Construction of Optimal Binary Z-Complementary Code Sets with New Lengths
Authors:
Gobinda Ghosh,
Sudhan Majhi,
Shubabrata Paul
Abstract:
Z-complementary code sets (ZCCSs) are used in multicarrier code-division multiple access (MC-CDMA) systems, for interference-free communication over multiuser and quasi-asynchronous environments.
In this letter, we propose three new constructions of optimal binary $\left(R2^{k+1},2^{k+1}, Rγ,γ\right)$-ZCCS, $\left(R2^{k+1},2^{k+1}, R2^{m_{2}},2^{m_{2}}\right)$-ZCCS and $\left(2^{k+1},2^{k+1},3γ,2γ\right)$-ZCCS
based on generalized Boolean functions (GBFs), where $γ=2^{m_{1}-1}+2^{m_{1}-3}, m_{1}\geq 5, k\geq 1,m_{2}\geq 1$ and $R$ is any even number. The proposed ZCCSs cover many unreported lengths and large set sizes.
Submitted 22 February, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
A Direct Construction of Optimal 2D-ZCACS with Flexible Array Size and Large Set Size
Authors:
Gobinda Ghosh,
Sudhan Majhi,
Shubhabrata Paul
Abstract:
In this paper, we propose a direct construction of optimal two-dimensional Z-complementary array code sets (2D-ZCACS) using multivariable functions (MVFs). In contrast to earlier works, the proposed construction allows for a flexible array size and a large set size. Additionally, the proposed design can be transformed into a one-dimensional Z-complementary code set (1D-ZCCS). Many of the 1D-ZCCSs described in the literature appear to be special cases of this proposed construction. Finally, we compare our work with the current state of the art and then draw our conclusions.
Submitted 6 January, 2023;
originally announced January 2023.
-
CiT: Curation in Training for Effective Vision-Language Data
Authors:
Hu Xu,
Saining Xie,
Po-Yao Huang,
Licheng Yu,
Russell Howes,
Gargi Ghosh,
Luke Zettlemoyer,
Christoph Feichtenhofer
Abstract:
Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed-up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternatively selects relevant training data from the pool by measuring the similarity of their text embeddings and embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
Submitted 5 January, 2023;
originally announced January 2023.
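The data-selection step described in the CiT abstract above (comparing text embeddings of image-text pairs with embeddings of task metadata) can be illustrated with a minimal sketch. The abstract does not give the similarity measure or the selection rule, so the cosine similarity, the fixed `threshold`, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_pairs(text_embeddings, metadata_embeddings, threshold):
    """Return indices of pool items whose text embedding is close to any metadata embedding.

    Hypothetical rule: keep an item when its best cosine similarity to the
    metadata reaches `threshold`. The real CiT criterion may differ.
    """
    selected = []
    for idx, text_vec in enumerate(text_embeddings):
        best = max(cosine(text_vec, meta_vec) for meta_vec in metadata_embeddings)
        if best >= threshold:
            selected.append(idx)
    return selected


# Toy pool of three caption embeddings and one metadata (class-name) embedding.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
metadata = [[1.0, 0.0]]
curated = select_pairs(pool, metadata, threshold=0.9)
```

In a curation loop of the kind the abstract describes, a step like this would be rerun as the text encoder is updated, so the selected subset changes during training.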
-
ALERT: Adapting Language Models to Reasoning Tasks
Authors:
Ping Yu,
Tianlu Wang,
Olga Golovneva,
Badr AlKhamissi,
Siddharth Verma,
Zhijing Jin,
Gargi Ghosh,
Mona Diab,
Asli Celikyilmaz
Abstract:
Current large language models can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills they have learnt during pre-training and reasoning outside of their training context, or are they simply memorizing their training corpus at finer granularity and have learnt to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability, comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any language model on fine-grained reasoning skills, which spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. With extensive empirical analysis, we find that language models learn more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage compared to the pretraining stage. We also find that when language models are finetuned they tend to overfit to the prompt template, which hurts the robustness of models, causing generalization problems.
Submitted 7 July, 2023; v1 submitted 16 December, 2022;
originally announced December 2022.
-
MAViL: Masked Audio-Video Learners
Authors:
Po-Yao Huang,
Vasu Sharma,
Hu Xu,
Chaitanya Ryali,
Haoqi Fan,
Yanghao Li,
Shang-Wen Li,
Gargi Ghosh,
Jitendra Malik,
Christoph Feichtenhofer
Abstract:
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
Submitted 17 July, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
A Direct Construction of 2D-CCC with Arbitrary Array Size and Flexible Set Size Using Multivariable Function
Authors:
Gobinda Ghosh,
Sudhan Majhi
Abstract:
Recently, two-dimensional (2D) array codes have been found to have applications in wireless communication. In this paper, we propose a direct construction of 2D complete complementary codes (2D-CCCs) with arbitrary array size and flexible set size using multivariable functions (MVFs). The peak-to-mean envelope power ratio (PMEPR) properties of row and column sequences of the constructed 2D-CCC arrays are investigated. The proposed construction generalizes many existing state-of-the-art constructions, such as the Golay complementary pair (GCP), one-dimensional (1D)-CCC, 2D Golay complementary array set (2D-GCAS), and 2D-CCC, with better parameters compared to the existing work.
Submitted 28 February, 2024; v1 submitted 27 July, 2022;
originally announced July 2022.
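The abstract above names the Golay complementary pair (GCP) as a special case of the proposed construction. As background, the defining GCP property is that the aperiodic autocorrelations of the two sequences sum to zero at every nonzero shift. The sketch below checks that property for a well-known length-4 binary pair; it illustrates the definition only and is not the multivariable-function construction from the paper. The function names are chosen here for illustration.

```python
def aperiodic_autocorrelation(seq, shift):
    """Aperiodic autocorrelation of a real-valued sequence at a non-negative shift."""
    return sum(seq[i] * seq[i + shift] for i in range(len(seq) - shift))


def is_golay_pair(a, b):
    """True if a and b have equal length and their autocorrelations cancel at all nonzero shifts."""
    if len(a) != len(b):
        return False
    return all(
        aperiodic_autocorrelation(a, shift) + aperiodic_autocorrelation(b, shift) == 0
        for shift in range(1, len(a))
    )


# A standard binary Golay complementary pair of length 4.
a = [1, 1, 1, -1]
b = [1, 1, -1, 1]
pair_ok = is_golay_pair(a, b)

# At shift 0 the autocorrelations add up to the total energy, 2 * length.
energy = aperiodic_autocorrelation(a, 0) + aperiodic_autocorrelation(b, 0)
```

Complementary code sets and their 2D array versions generalize this cancellation from a pair of sequences to larger sets and to two-dimensional shifts.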
-
CM3: A Causal Masked Multimodal Model of the Internet
Authors:
Armen Aghajanyan,
Bernie Huang,
Candace Ross,
Vladimir Karpukhin,
Hu Xu,
Naman Goyal,
Dmytro Okhonko,
Mandar Joshi,
Gargi Ghosh,
Mike Lewis,
Luke Zettlemoyer
Abstract:
We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set the new state-of-the-art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL-E), and do captioning, all in a zero-shot setting with a single model.
Submitted 19 January, 2022;
originally announced January 2022.
-
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Authors:
Hu Xu,
Gargi Ghosh,
Po-Yao Huang,
Dmytro Okhonko,
Armen Aghajanyan,
Florian Metze,
Luke Zettlemoyer,
Christoph Feichtenhofer
Abstract:
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
Submitted 1 October, 2021; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Direct Construction of Optimal Z-Complementary Code Sets for all Possible Even Length by Using Pseudo-Boolean Functions
Authors:
Gobinda Ghosh,
Sudhan Majhi,
Palash Sarkar,
Ashish Kumar Upadhyay
Abstract:
Z-complementary code sets (ZCCSs) are well known to be used in multicarrier code-division multiple access (MC-CDMA) systems to provide an interference-free environment. Based on the existing literature, the direct construction of optimal ZCCSs is limited in the lengths it can achieve. In this paper, we are interested in constructing optimal ZCCSs of all possible even lengths using pseudo-Boolean functions. The maximum column sequence peak-to-mean envelope power ratio (PMEPR) of the proposed ZCCSs is upper-bounded by two, which may give an extra benefit in managing PMEPR in a ZCCS-based MC-CDMA system, as well as the ability to handle a large number of users.
Submitted 5 August, 2021;
originally announced August 2021.
-
HTLM: Hyper-Text Pre-Training and Prompting of Language Models
Authors:
Armen Aghajanyan,
Dmytro Okhonko,
Mike Lewis,
Mandar Joshi,
Hu Xu,
Gargi Ghosh,
Luke Zettlemoyer
Abstract:
We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.
Submitted 14 July, 2021;
originally announced July 2021.
-
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Authors:
Hu Xu,
Gargi Ghosh,
Po-Yao Huang,
Prahal Arora,
Masoumeh Aminzadeh,
Christoph Feichtenhofer,
Florian Metze,
Luke Zettlemoyer
Abstract:
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific, adopting either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pretraining masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
Submitted 30 September, 2021; v1 submitted 20 May, 2021;
originally announced May 2021.
-
Multi-task Retrieval for Knowledge-Intensive Tasks
Authors:
Jean Maillard,
Vladimir Karpukhin,
Fabio Petroni,
Wen-tau Yih,
Barlas Oğuz,
Veselin Stoyanov,
Gargi Ghosh
Abstract:
Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data.
Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.
Submitted 31 December, 2020;
originally announced January 2021.
-
A stabilized finite element method for delamination analysis of composites using cohesive elements
Authors:
Gourab Ghosh,
Ravindra Duddu,
Chandrasekhar Annavarapu
Abstract:
We demonstrate the ability of a stabilized finite element method, inspired by the weighted Nitsche approach, to alleviate spurious traction oscillations at interlaminar interfaces in multi-ply multi-directional composite laminates. In contrast with the standard (penalty-like) method, the stabilized method allows the use of arbitrarily large values of cohesive stiffness and obviates the need for engineering approaches to estimate minimum cohesive stiffness necessary for accurate delamination analysis. This is achieved by defining a weighted interface traction in the stabilized method, which allows a gradual transition from penalty-like method for soft elastic contact to Nitsche-like method for rigid contact. We conducted several simulation studies involving constant strain patch tests and benchmark delamination tests under mode-I, mode-II and mixed-mode loadings. Our results show clear evidence of traction oscillations with the standard method with structured and perturbed finite element meshes, and that the stabilized method alleviates these oscillations, thus illustrating its robustness.
Submitted 20 August, 2020;
originally announced August 2020.
-
Pre-training via Paraphrasing
Authors:
Mike Lewis,
Marjan Ghazvininejad,
Gargi Ghosh,
Armen Aghajanyan,
Sida Wang,
Luke Zettlemoyer
Abstract:
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an alternative to the dominant masked language modeling paradigm, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize the likelihood of generating the original. We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization. The objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance on several tasks. For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation. We further show that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.
Submitted 26 June, 2020;
originally announced June 2020.
-
Optimizing Query Evaluations using Reinforcement Learning for Web Search
Authors:
Corby Rosset,
Damien Jose,
Gargi Ghosh,
Bhaskar Mitra,
Saurabh Tiwary
Abstract:
In web search, typically a candidate generation step selects a small set of documents---from collections containing as many as billions of web pages---that are subsequently ranked and pruned before being presented to the user. In Bing, the candidate generation involves scanning the index using statically designed match plans that prescribe sequences of different match criteria and stopping conditions. In this work, we pose match planning as a reinforcement learning task and observe up to 20% reduction in index blocks accessed, with small or no degradation in the quality of the candidate sets.
Submitted 18 August, 2018; v1 submitted 12 April, 2018;
originally announced April 2018.
-
Local Community Detection in Dynamic Networks
Authors:
Daniel J. DiTursi,
Gaurav Ghosh,
Petko Bogdanov
Abstract:
Given a time-evolving network, how can we detect communities over periods of high internal and low external interactions? To address this question we generalize traditional local community detection in graphs to the setting of dynamic networks. Adopting existing static-network approaches in an "aggregated" graph of all temporal interactions is not appropriate for the problem as dynamic communities may be short-lived and thus lost when mixing interactions over long periods. Hence, dynamic community mining requires the detection of both the community nodes and an optimal time interval in which they are actively interacting.
We propose a filter-and-verify framework for dynamic community detection. To scale to long intervals of graph evolution, we employ novel spectral bounds for dynamic community conductance and employ them to filter suboptimal periods in near-linear time. We also design a time-and-graph-aware locality sensitive hashing family to effectively spot promising community cores. Our method PHASR discovers communities of consistently higher quality (2 to 67 times better) than those of baselines. At the same time, our bounds allow for pruning between $55\%$ and $95\%$ of the search space, resulting in significant savings in running time compared to exhaustive alternatives for even modest time intervals of graph evolution.
Submitted 12 September, 2017;
originally announced September 2017.