Showing 1–6 of 6 results for author: Bharti, T

  1. arXiv:2106.09889  [pdf, other]

    cs.CL cs.CV cs.MM

    GEM: A General Evaluation Benchmark for Multimodal Tasks

    Authors: Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

    Abstract: In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME that mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Comparing with existing multimodal datasets such as MSCOCO an…

    Submitted 17 June, 2021; originally announced June 2021.

    Comments: Accepted by Findings of ACL 2021

  2. arXiv:2006.02635  [pdf, other]

    cs.CL cs.CV

    M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

    Authors: Minheng Ni, Haoyang Huang, Lin Su, Edward Cui, Taroon Bharti, Lijuan Wang, Jianfeng Gao, Dongdong Zhang, Nan Duan

    Abstract: We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training. Our goal is to learn universal representations that can map objects occurred in different modalities or texts expressed in different languages into a common semantic space. In addition, to explicitly encourage…

    Submitted 31 March, 2021; v1 submitted 3 June, 2020; originally announced June 2020.

    Comments: Accepted to CVPR 2021

  3. arXiv:2004.01401  [pdf, ps, other]

    cs.CL

    XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

    Authors: Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, Ming Zhou

    Abstract: In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora and evaluate their performance across a diverse set of cross-lingual tasks. Comparing to GLUE (Wang et al., 2019), which is labeled in English for natural language understanding tasks only, XGLUE has two main advantages: (1) it pr…

    Submitted 22 May, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

  4. arXiv:2003.01473  [pdf, ps, other]

    cs.CL cs.CV cs.LG

    XGPT: Cross-modal Generative Pre-Training for Image Captioning

    Authors: Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou

    Abstract: While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation…

    Submitted 4 March, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

    Comments: 12 pages, 3 figures, 7 tables

  5. arXiv:2002.06353  [pdf, other]

    cs.CV cs.CL cs.LG eess.AS eess.IV

    UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

    Authors: Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

    Abstract: With the recent success of the pre-training technique for NLP and image-linguistic tasks, some video-linguistic pre-training works are gradually developed to improve video-text related downstream tasks. However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video…

    Submitted 15 September, 2020; v1 submitted 15 February, 2020; originally announced February 2020.

  6. arXiv:2001.07966  [pdf, other]

    cs.CV

    ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

    Authors: Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti

    Abstract: In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRF…

    Submitted 23 January, 2020; v1 submitted 22 January, 2020; originally announced January 2020.