Skip to main content

Showing 1–15 of 15 results for author: Karpukhin, V

  1. arXiv:2403.13257  [pdf, other

    cs.CL cs.AI cs.LG

    Arcee's MergeKit: A Toolkit for Merging Large Language Models

    Authors: Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vlad Karpukhin, Brian Benedict, Mark McQuade, Jacob Solawetz

    Abstract: The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, has resulted in the development of vast amounts of task-specific models, typically specialized in individual tasks and unable to uti… ▽ More

    Submitted 20 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: 11 pages, 4 figures

  2. arXiv:2210.02068  [pdf, other

    cs.IR cs.AI

    Nonparametric Decoding for Generative Retrieval

    Authors: Hyunji Lee, Jaeyoung Kim, Hoyeon Chang, Hanseok Oh, Sohee Yang, Vlad Karpukhin, Yi Lu, Minjoon Seo

    Abstract: The generative retrieval model depends solely on the information encoded in its model parameters without external memory, its information capacity is limited and fixed. To overcome the limitation, we propose Nonparametric Decoding (Np Decoding) which can be applied to existing generative retrieval models. Np Decoding uses nonparametric contextualized vocab embeddings (external memory) rather than… ▽ More

    Submitted 28 May, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: published at Findings of ACL 2023

  3. arXiv:2201.07520  [pdf, other

    cs.CL

    CM3: A Causal Masked Multimodal Model of the Internet

    Authors: Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer

    Abstract: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The casual masking obje… ▽ More

    Submitted 19 January, 2022; originally announced January 2022.

  4. arXiv:2112.09924  [pdf, other

    cs.CL cs.AI cs.IR cs.LG

    The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

    Authors: Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel

    Abstract: In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus t… ▽ More

    Submitted 24 May, 2022; v1 submitted 18 December, 2021; originally announced December 2021.

  5. arXiv:2112.05717  [pdf, other

    cs.CL cs.LG stat.ML

    Discourse-Aware Soft Prompting for Text Generation

    Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Vera Gor, Asli Celikyilmaz

    Abstract: Current efficient fine-tuning methods (e.g., adapters, prefix-tuning, etc.) have optimized conditional text generation via training a small set of extra parameters of the neural language model, while freezing the rest for efficiency. While showing strong performance on some generation tasks, they don't generalize across all generation tasks. We show that soft-prompt based conditional text generati… ▽ More

    Submitted 23 May, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

  6. arXiv:2107.13602  [pdf, other

    cs.CL cs.IR

    Domain-matched Pre-training Tasks for Dense Retrieval

    Authors: Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, Yashar Mehdad

    Abstract: Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder m… ▽ More

    Submitted 28 July, 2021; originally announced July 2021.

  7. arXiv:2101.00133  [pdf, other

    cs.CL cs.AI

    NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

    Authors: Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini , et al. (28 additional authors not shown)

    Abstract: We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage conte… ▽ More

    Submitted 19 September, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

    Comments: 26 pages; Published in Proceedings of Machine Learning Research (PMLR), NeurIPS 2020 Competition and Demonstration Track

  8. arXiv:2101.00117  [pdf, other

    cs.CL

    Multi-task Retrieval for Knowledge-Intensive Tasks

    Authors: Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh

    Abstract: Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide va… ▽ More

    Submitted 31 December, 2020; originally announced January 2021.

  9. Joint Verification and Reranking for Open Fact Checking Over Tables

    Authors: Michael Schlichtkrull, Vladimir Karpukhin, Barlas Oğuz, Mike Lewis, Wen-tau Yih, Sebastian Riedel

    Abstract: Structured information is an important knowledge source for automatic verification of factual claims. Nevertheless, the majority of existing research into this task has focused on textual data, and the few recent inquiries into structured data have been for the closed-domain setting where appropriate evidence for each claim is assumed to have already been retrieved. In this paper, we investigate v… ▽ More

    Submitted 20 August, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

  10. arXiv:2012.14610  [pdf, other

    cs.CL

    UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering

    Authors: Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, Scott Yih

    Abstract: We study open-domain question answering with structured, unstructured and semi-structured knowledge sources, including text, tables, lists and knowledge bases. Departing from prior work, we propose a unifying approach that homogenizes all sources by reducing them to text and applies the retriever-reader model which has so far been limited to text sources only. Our approach greatly improves the res… ▽ More

    Submitted 3 May, 2022; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: NAACL-HLT 2022 Findings

  11. arXiv:2009.02252  [pdf, other

    cs.CL cs.AI cs.IR cs.LG

    KILT: a Benchmark for Knowledge Intensive Language Tasks

    Authors: Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel

    Abstract: Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research… ▽ More

    Submitted 27 May, 2021; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: accepted at NAACL 2021

  12. arXiv:2005.11401  [pdf, other

    cs.CL cs.LG

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

    Abstract: Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for… ▽ More

    Submitted 12 April, 2021; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Accepted at NeurIPS 2020

  13. arXiv:2004.04906  [pdf, other

    cs.CL

    Dense Passage Retrieval for Open-Domain Question Answering

    Authors: Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih

    Abstract: Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder fra… ▽ More

    Submitted 30 September, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020

  14. arXiv:2004.01655  [pdf, other

    cs.CL cs.LG stat.ML

    Aligned Cross Entropy for Non-Autoregressive Machine Translation

    Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy

    Abstract: Non-autoregressive machine translation models significantly speed up decoding by allowing for parallel prediction of the entire target sequence. However, modeling word order is more challenging due to the lack of autoregressive factors in the model. This difficultly is compounded during training with cross entropy loss, which can highly penalize small shifts in word order. In this paper, we propos… ▽ More

    Submitted 3 April, 2020; originally announced April 2020.

  15. arXiv:1902.01509  [pdf, ps, other

    cs.CL cs.LG stat.ML

    Training on Synthetic Noise Improves Robustness to Natural Noise in Machine Translation

    Authors: Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, Marjan Ghazvininejad

    Abstract: We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and character-level encoders, but these methods are highly sensitive to spelling mistakes. We show how training on a mild amount of random synthetic noise can dramatica… ▽ More

    Submitted 4 February, 2019; originally announced February 2019.