-
Improving Alignment and Robustness with Circuit Breakers
Authors:
Andy Zou,
Long Phan,
Justin Wang,
Derek Duenas,
Maxwell Lin,
Maksym Andriushchenko,
Rowan Wang,
Zico Kolter,
Matt Fredrikson,
Dan Hendrycks
Abstract:
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to…
▽ More
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, circuit breakers allow the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.
△ Less
Submitted 12 July, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Stochastic Learning of Computational Resource Usage as Graph Structured Multimarginal Schrödinger Bridge
Authors:
Georgiy A. Bondar,
Robert Gifford,
Linh Thi Xuan Phan,
Abhishek Halder
Abstract:
We propose to learn the time-varying stochastic computational resource usage of software as a graph structured Schrödinger bridge problem. In general, learning the computational resource usage from data is challenging because resources such as the number of CPU instructions and the number of last level cache requests are both time-varying and statistically correlated. Our proposed method enables l…
▽ More
We propose to learn the time-varying stochastic computational resource usage of software as a graph structured Schrödinger bridge problem. In general, learning the computational resource usage from data is challenging because resources such as the number of CPU instructions and the number of last level cache requests are both time-varying and statistically correlated. Our proposed method enables learning the joint time-varying stochasticity in computational resource usage from the measured profile snapshots in a nonparametric manner. The method can be used to predict the most-likely time-varying distribution of computational resource availability at a desired time. We provide detailed algorithms for stochastic learning in both single and multi-core cases, discuss the convergence guarantees, computational complexities, and demonstrate their practical use in two case studies: a single-core nonlinear model predictive controller, and a synthetic multi-core software.
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Authors:
Nathaniel Li,
Alexander Pan,
Anjali Gopal,
Summer Yue,
Daniel Berrios,
Alice Gatti,
Justin D. Li,
Ann-Kathrin Dombrowski,
Shashwat Goel,
Long Phan,
Gabriel Mukobi,
Nathan Helm-Burger,
Rassin Lababidi,
Lennart Justen,
Andrew B. Liu,
Michael Chen,
Isabelle Barrass,
Oliver Zhang,
Xiaoyuan Zhu,
Rishub Tamirisa,
Bhrugu Bharathi,
Adam Khoja,
Zhenqi Zhao,
Ariel Herbert-Voss,
Cort B. Breuer
, et al. (32 additional authors not shown)
Abstract:
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe…
▽ More
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
△ Less
Submitted 15 May, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation
Authors:
Phuc Phan,
Hieu Tran,
Long Phan
Abstract:
We propose a straightforward approach called Distillation Contrastive Decoding (DCD) to enhance the reasoning capabilities of Large Language Models (LLMs) during inference. In contrast to previous approaches that relied on smaller amateur models or analysis of hidden state differences, DCD employs Contrastive Chain-of-thought Prompting and advanced distillation techniques, including Dropout and Qu…
▽ More
We propose a straightforward approach called Distillation Contrastive Decoding (DCD) to enhance the reasoning capabilities of Large Language Models (LLMs) during inference. In contrast to previous approaches that relied on smaller amateur models or analysis of hidden state differences, DCD employs Contrastive Chain-of-thought Prompting and advanced distillation techniques, including Dropout and Quantization. This approach effectively addresses the limitations of Contrastive Decoding (CD), which typically requires both an expert and an amateur model, thus increasing computational resource demands. By integrating contrastive prompts with distillation, DCD obviates the need for an amateur model and reduces memory usage. Our evaluations demonstrate that DCD significantly enhances LLM performance across a range of reasoning benchmarks, surpassing both CD and existing methods in the GSM8K and StrategyQA datasets.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Authors:
Mantas Mazeika,
Long Phan,
Xuwang Yin,
Andy Zou,
Zifan Wang,
Norman Mu,
Elham Sakhaee,
Nathaniel Li,
Steven Basart,
Bo Li,
David Forsyth,
Dan Hendrycks
Abstract:
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties prev…
▽ More
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.
△ Less
Submitted 26 February, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training
Authors:
Michael Benington,
Leo Phan,
Chris Pierre Paul,
Evan Shoemaker,
Priyanka Ranade,
Torstein Collett,
Grant Hodgson Perez,
Christopher Krieger
Abstract:
AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art, transformer-based model today requires use of GPU-accelerated high performance computers with high-speed interconnects. As datasets and models continue to increase in si…
▽ More
AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art, transformer-based model today requires use of GPU-accelerated high performance computers with high-speed interconnects. As datasets and models continue to increase in size, computational requirements and memory demands for AI also continue to grow. These challenges have inspired the development of distributed algorithm and circuit-based optimization techniques that enable the ability to progressively scale models in multi-node environments, efficiently minimize neural network cost functions for faster convergence, and store more parameters into a set number of available resources. In our research project, we focus on parallel and distributed machine learning algorithm development, specifically for optimizing the data processing and pre-training of a set of 5 encoder-decoder LLMs, ranging from 580 million parameters to 13 billion parameters. We performed a fine-grained study to quantify the relationships between three ML parallelism methods, specifically exploring Microsoft DeepSpeed Zero Redundancy Optimizer (ZeRO) stages.
△ Less
Submitted 10 October, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
Representation Engineering: A Top-Down Approach to AI Transparency
Authors:
Andy Zou,
Long Phan,
Sarah Chen,
James Campbell,
Phillip Guo,
Richard Ren,
Alexander Pan,
Xuwang Yin,
Mantas Mazeika,
Ann-Kathrin Dombrowski,
Shashwat Goel,
Nathaniel Li,
Michael J. Byun,
Zifan Wang,
Alex Mallen,
Steven Basart,
Sanmi Koyejo,
Dawn Song,
Matt Fredrikson,
J. Zico Kolter,
Dan Hendrycks
Abstract:
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive p…
▽ More
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
△ Less
Submitted 10 October, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Path Structured Multimarginal Schrödinger Bridge for Probabilistic Learning of Hardware Resource Usage by Control Software
Authors:
Georgiy A. Bondar,
Robert Gifford,
Linh Thi Xuan Phan,
Abhishek Halder
Abstract:
The solution of the path structured multimarginal Schrödinger bridge problem (MSBP) is the most-likely measure-valued trajectory consistent with a sequence of observed probability measures or distributional snapshots. We leverage recent algorithmic advances in solving such structured MSBPs for learning stochastic hardware resource usage by control software. The solution enables predicting the time…
▽ More
The solution of the path structured multimarginal Schrödinger bridge problem (MSBP) is the most-likely measure-valued trajectory consistent with a sequence of observed probability measures or distributional snapshots. We leverage recent algorithmic advances in solving such structured MSBPs for learning stochastic hardware resource usage by control software. The solution enables predicting the time-varying distribution of hardware resource availability at a desired time with guaranteed linear convergence. We demonstrate the efficacy of our probabilistic learning approach in a model predictive control software execution case study. The method exhibits rapid convergence to an accurate prediction of hardware resource utilization of the controller. The method can be broadly applied to any software to predict cyber-physical context-dependent performance at arbitrary time.
△ Less
Submitted 3 October, 2023; v1 submitted 1 October, 2023;
originally announced October 2023.
-
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Authors:
Zheng-Xin Yong,
Ruochen Zhang,
Jessica Zosa Forde,
Skyler Wang,
Arjun Subramonian,
Holy Lovenia,
Samuel Cahyawijaya,
Genta Indra Winata,
Lintang Sutawika,
Jan Christian Blaise Cruz,
Yin Lin Tan,
Long Phan,
Rowena Garcia,
Thamar Solorio,
Alham Fikri Aji
Abstract:
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero…
▽ More
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
△ Less
Submitted 12 September, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Authors:
Hugo Laurençon,
Lucile Saulnier,
Thomas Wang,
Christopher Akiki,
Albert Villanova del Moral,
Teven Le Scao,
Leandro Von Werra,
Chenghao Mou,
Eduardo González Ponferrada,
Huu Nguyen,
Jörg Frohberg,
Mario Šaško,
Quentin Lhoest,
Angelina McMillan-Major,
Gerard Dupont,
Stella Biderman,
Anna Rogers,
Loubna Ben allal,
Francesco De Toni,
Giada Pistilli,
Olivier Nguyen,
Somaieh Nikpoor,
Maraim Masoud,
Pierre Colombo,
Javier de la Rosa
, et al. (29 additional authors not shown)
Abstract:
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the f…
▽ More
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
△ Less
Submitted 7 March, 2023;
originally announced March 2023.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Authors:
BigScience Workshop,
:,
Teven Le Scao,
Angela Fan,
Christopher Akiki,
Ellie Pavlick,
Suzana Ilić,
Daniel Hesslow,
Roman Castagné,
Alexandra Sasha Luccioni,
François Yvon,
Matthias Gallé,
Jonathan Tow,
Alexander M. Rush,
Stella Biderman,
Albert Webson,
Pawan Sasanka Ammanamanchi,
Thomas Wang,
Benoît Sagot,
Niklas Muennighoff,
Albert Villanova del Moral,
Olatunji Ruwase,
Rachel Bawden,
Stas Bekman,
Angelina McMillan-Major
, et al. (369 additional authors not shown)
Abstract:
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…
▽ More
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
△ Less
Submitted 27 June, 2023; v1 submitted 9 November, 2022;
originally announced November 2022.
-
MTet: Multi-domain Translation for English and Vietnamese
Authors:
Chinh Ngo,
Trieu H. Trinh,
Long Phan,
Hieu Tran,
Tai Dang,
Hieu Nguyen,
Minh Nguyen,
Minh-Thang Luong
Abstract:
We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combining with previous works on English-Vietnamese translation, we grow the existing parallel dataset to 6.2M sentence pairs. We also release the first pretrained m…
▽ More
We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combining with previous works on English-Vietnamese translation, we grow the existing parallel dataset to 6.2M sentence pairs. We also release the first pretrained model EnViT5 for English and Vietnamese languages. Combining both resources, our model significantly outperforms previous state-of-the-art results by up to 2 points in translation BLEU score, while being 1.6 times smaller.
△ Less
Submitted 19 October, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation
Authors:
Long Phan,
Tai Dang,
Hieu Tran,
Trieu H. Trinh,
Vy Phan,
Lam D. Chau,
Minh-Thang Luong
Abstract:
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English such as Vietnamese. In this paper, we make use of a state-of-the-art translation model in English-Vietnamese to translate and produce both pretrained as well as supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encod…
▽ More
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English such as Vietnamese. In this paper, we make use of a state-of-the-art translation model in English-Vietnamese to translate and produce both pretrained as well as supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubMedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI - a new NLP task in Vietnamese translated from MedNLI using the recently public En-vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.
△ Less
Submitted 29 January, 2023; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Federated Learning for Non-IID Data via Client Variance Reduction and Adaptive Server Update
Authors:
Hiep Nguyen,
Lam Phan,
Harikrishna Warrier,
Yogesh Gupta
Abstract:
Federated learning (FL) is an emerging technique used to collaboratively train a global machine learning model while keeping the data localized on the user devices. The main obstacle to FL's practical implementation is the Non-Independent and Identical (Non-IID) data distribution across users, which slows convergence and degrades performance. To tackle this fundamental issue, we propose a method (…
▽ More
Federated learning (FL) is an emerging technique used to collaboratively train a global machine learning model while keeping the data localized on the user devices. The main obstacle to FL's practical implementation is the Non-Independent and Identical (Non-IID) data distribution across users, which slows convergence and degrades performance. To tackle this fundamental issue, we propose a method (ComFed) that enhances the whole training process on both the client and server sides. The key idea of ComFed is to simultaneously utilize client-variance reduction techniques to facilitate server aggregation and global adaptive update techniques to accelerate learning. Our experiments on the Cifar-10 classification task show that ComFed can improve state-of-the-art algorithms dedicated to Non-IID data.
△ Less
Submitted 29 July, 2022; v1 submitted 18 July, 2022;
originally announced July 2022.
-
Improving Domain Generalization by Learning without Forgetting: Application in Retail Checkout
Authors:
Thuy C. Nguyen,
Nam LH. Phan,
Son T. Nguyen
Abstract:
Designing an automatic checkout system for retail stores at the human level accuracy is challenging due to similar appearance products and their various poses. This paper addresses the problem by proposing a method with a two-stage pipeline. The first stage detects class-agnostic items, and the second one is dedicated to classify product categories. We also track the objects across video frames to…
▽ More
Designing an automatic checkout system for retail stores at the human level accuracy is challenging due to similar appearance products and their various poses. This paper addresses the problem by proposing a method with a two-stage pipeline. The first stage detects class-agnostic items, and the second one is dedicated to classify product categories. We also track the objects across video frames to avoid duplicated counting. One major challenge is the domain gap because the models are trained on synthetic data but tested on the real images. To reduce the error gap, we adopt domain generalization methods for the first-stage detector. In addition, model ensemble is used to enhance the robustness of the 2nd-stage classifier. The method is evaluated on the AI City challenge 2022 -- Track 4 and gets the F1 score $40\%$ on the test A set. Code is released at the link https://github.com/cybercore-co-ltd/aicity22-track4.
△ Less
Submitted 12 July, 2022;
originally announced July 2022.
-
ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
Authors:
Long Phan,
Hieu Tran,
Hieu Nguyen,
Trieu H. Trinh
Abstract:
We present ViT5, a pretrained Transformer-based encoder-decoder model for the Vietnamese language. With T5-style self-supervised pretraining, ViT5 is trained on a large corpus of high-quality and diverse Vietnamese texts. We benchmark ViT5 on two downstream text generation tasks, Abstractive Text Summarization and Named Entity Recognition. Although Abstractive Text Summarization has been widely st…
▽ More
We present ViT5, a pretrained Transformer-based encoder-decoder model for the Vietnamese language. With T5-style self-supervised pretraining, ViT5 is trained on a large corpus of high-quality and diverse Vietnamese texts. We benchmark ViT5 on two downstream text generation tasks, Abstractive Text Summarization and Named Entity Recognition. Although Abstractive Text Summarization has been widely studied for the English language thanks to its rich and large source of data, there has been minimal research into the same task in Vietnamese, a much lower resource language. In this work, we perform exhaustive experiments on both Vietnamese Abstractive Summarization and Named Entity Recognition, validating the performance of ViT5 against many other pretrained Transformer-based encoder-decoder models. Our experiments show that ViT5 significantly outperforms existing models and achieves state-of-the-art results on Vietnamese Text Summarization. On the task of Named Entity Recognition, ViT5 is competitive against previous best results from pretrained encoder-based Transformer models. Further analysis shows the importance of context length during the self-supervised pretraining on downstream performance across different settings.
△ Less
Submitted 26 May, 2022; v1 submitted 13 May, 2022;
originally announced May 2022.
-
Span Detection for Aspect-Based Sentiment Analysis in Vietnamese
Authors:
Kim Thi-Thanh Nguyen,
Sieu Khai Huynh,
Luong Luc Phan,
Phuc Huynh Pham,
Duc-Vu Nguyen,
Kiet Van Nguyen
Abstract:
Aspect-based sentiment analysis plays an essential role in natural language processing and artificial intelligence. Recently, researchers only focused on aspect detection and sentiment classification but ignoring the sub-task of detecting user opinion span, which has enormous potential in practical applications. In this paper, we present a new Vietnamese dataset (UIT-ViSD4SA) consisting of 35,396…
▽ More
Aspect-based sentiment analysis plays an essential role in natural language processing and artificial intelligence. Recently, researchers only focused on aspect detection and sentiment classification but ignoring the sub-task of detecting user opinion span, which has enormous potential in practical applications. In this paper, we present a new Vietnamese dataset (UIT-ViSD4SA) consisting of 35,396 human-annotated spans on 11,122 feedback comments for evaluating the span detection in aspect-based sentiment analysis. Besides, we also propose a novel system using Bidirectional Long Short-Term Memory (BiLSTM) with a Conditional Random Field (CRF) layer (BiLSTM-CRF) for the span detection task in Vietnamese aspect-based sentiment analysis. The best result is a 62.76% F1 score (macro) for span detection using BiLSTM-CRF with embedding fusion of syllable embedding, character embedding, and contextual embedding from XLM-RoBERTa. In future work, span detection will be extended in many NLP tasks such as constructive detection, emotion recognition, complaint analysis, and opinion mining. Our dataset is freely available at https://github.com/kimkim00/UIT-ViSD4SA for research purposes.
△ Less
Submitted 14 October, 2021;
originally announced October 2021.
-
VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization?
Authors:
Hieu Nguyen,
Long Phan,
James Anibal,
Alec Peltekian,
Hieu Tran
Abstract:
Text summarization is a challenging task within natural language processing that involves text generation from lengthy input sequences. While this task has been widely studied in English, there is very limited research on summarization for Vietnamese text. In this paper, we investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization. Lever…
▽ More
Text summarization is a challenging task within natural language processing that involves text generation from lengthy input sequences. While this task has been widely studied in English, there is very limited research on summarization for Vietnamese text. In this paper, we investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization. Leveraging transfer learning and self-supervised learning, we validate the performance of the methods on two Vietnamese datasets.
△ Less
Submitted 8 October, 2021;
originally announced October 2021.
-
Improving Object Detection by Label Assignment Distillation
Authors:
Chuong H. Nguyen,
Thuy C. Nguyen,
Tuan N. Tang,
Nam L. H. Phan
Abstract:
Label assignment in object detection aims to assign targets, foreground or background, to sampled regions in an image. Unlike labeling for image classification, this problem is not well defined due to the object's bounding box. In this paper, we investigate the problem from a perspective of distillation, hence we call Label Assignment Distillation (LAD). Our initial motivation is very simple, we u…
▽ More
Label assignment in object detection aims to assign targets, foreground or background, to sampled regions in an image. Unlike labeling for image classification, this problem is not well defined due to the object's bounding box. In this paper, we investigate the problem from a perspective of distillation, hence we call Label Assignment Distillation (LAD). Our initial motivation is very simple, we use a teacher network to generate labels for the student. This can be achieved in two ways: either using the teacher's prediction as the direct targets (soft label), or through the hard labels dynamically assigned by the teacher (LAD). Our experiments reveal that: (i) LAD is more effective than soft-label, but they are complementary. (ii) Using LAD, a smaller teacher can also improve a larger student significantly, while soft-label can't. We then introduce Co-learning LAD, in which two networks simultaneously learn from scratch and the role of teacher and student are dynamically interchanged. Using PAA-ResNet50 as a teacher, our LAD techniques can improve detectors PAA-ResNet101 and PAA-ResNeXt101 to $46 \rm AP$ and $47.5\rm AP$ on the COCO test-dev set. With a stronger teacher PAA-SwinB, we improve the students PAA-ResNet50 to $43.7\rm AP$ by only 1x schedule training and standard setting, and PAA-ResNet101 to $47.9\rm AP$, significantly surpassing the current methods. Our source code and checkpoints are released at https://git.io/JrDZo.
△ Less
Submitted 19 October, 2021; v1 submitted 24 August, 2021;
originally announced August 2021.
-
SPBERT: An Efficient Pre-training BERT on SPARQL Queries for Question Answering over Knowledge Graphs
Authors:
Hieu Tran,
Long Phan,
James Anibal,
Binh T. Nguyen,
Truong-Son Nguyen
Abstract:
In this paper, we propose SPBERT, a transformer-based language model pre-trained on massive SPARQL query logs. By incorporating masked language modeling objectives and the word structural objective, SPBERT can learn general-purpose representations in both natural language and SPARQL query language. We investigate how SPBERT and encoder-decoder architecture can be adapted for Knowledge-based QA cor…
▽ More
In this paper, we propose SPBERT, a transformer-based language model pre-trained on massive SPARQL query logs. By incorporating masked language modeling objectives and the word structural objective, SPBERT can learn general-purpose representations in both natural language and SPARQL query language. We investigate how SPBERT and encoder-decoder architecture can be adapted for Knowledge-based QA corpora. We conduct exhaustive experiments on two additional tasks, including SPARQL Query Construction and Answer Verbalization Generation. The experimental results show that SPBERT can obtain promising results, achieving state-of-the-art BLEU scores on several of these tasks.
△ Less
Submitted 30 June, 2021; v1 submitted 18 June, 2021;
originally announced June 2021.
-
1st Place Solution for YouTubeVOS Challenge 2021:Video Instance Segmentation
Authors:
Thuy C. Nguyen,
Tuan N. Tang,
Nam LH. Phan,
Chuong H. Nguyen,
Masayuki Yamazaki,
Masao Yamanaka
Abstract:
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously. Extended from image set applications, video data additionally induces the temporal information, which, if handled appropriately, is very useful to identify and predict object motions. In this work, we design a unified model to mutually learn these tasks. Specifically, we propo…
▽ More
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously. Extended from image set applications, video data additionally induces the temporal information, which, if handled appropriately, is very useful to identify and predict object motions. In this work, we design a unified model to mutually learn these tasks. Specifically, we propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack), to take the benefit of the temporal correlation between the object's instance masks across adjacent frames. On the other hand, video data is often redundant due to the frame's overlap. Our analysis shows that this problem is particularly severe for the YoutubeVOS-VIS2021 data. Therefore, we propose a Multi-Source Data (MSD) training mechanism to compensate for the data deficiency. By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline, and outperforms other methods by a considerable margin on the YoutubeVOS-VIS 2019 and 2021 datasets.
△ Less
Submitted 8 July, 2021; v1 submitted 11 June, 2021;
originally announced June 2021.
-
SciFive: a text-to-text transformer model for biomedical literature
Authors:
Long N. Phan,
James T. Anibal,
Hieu Tran,
Shaurya Chanana,
Erol Bahadroglu,
Alec Peltekian,
Grégoire Altan-Bonnet
Abstract:
In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity relation, relation extraction, natural language inference, and question-answering. We show that text-generation methods have significant potential in a broad array of biomedical…
▽ More
In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity relation, relation extraction, natural language inference, and question-answering. We show that text-generation methods have significant potential in a broad array of biomedical NLP tasks, particularly those requiring longer, more complex outputs. Our results support the exploration of more difficult text generation tasks and the development of new methods in this area
△ Less
Submitted 28 May, 2021;
originally announced June 2021.
-
SA2SL: From Aspect-Based Sentiment Analysis to Social Listening System for Business Intelligence
Authors:
Luong Luc Phan,
Phuc Huynh Pham,
Kim Thi-Thanh Nguyen,
Tham Thi Nguyen,
Sieu Khai Huynh,
Luan Thanh Nguyen,
Tin Van Huynh,
Kiet Van Nguyen
Abstract:
In this paper, we present a process of building a social listening system based on aspect-based sentiment analysis in Vietnamese from creating a dataset to building a real application. Firstly, we create UIT-ViSFD, a Vietnamese Smartphone Feedback Dataset as a new benchmark corpus built based on a strict annotation schemes for evaluating aspect-based sentiment analysis, consisting of 11,122 human-…
▽ More
In this paper, we present a process of building a social listening system based on aspect-based sentiment analysis in Vietnamese from creating a dataset to building a real application. Firstly, we create UIT-ViSFD, a Vietnamese Smartphone Feedback Dataset as a new benchmark corpus built based on a strict annotation schemes for evaluating aspect-based sentiment analysis, consisting of 11,122 human-annotated comments for mobile e-commerce, which is freely available for research purposes. We also present a proposed approach based on the Bi-LSTM architecture with the fastText word embeddings for the Vietnamese aspect based sentiment task. Our experiments show that our approach achieves the best performances with the F1-score of 84.48% for the aspect task and 63.06% for the sentiment task, which performs several conventional machine learning and deep learning systems. Last but not least, we build SA2SL, a social listening system based on the best performance model on our dataset, which will inspire more social listening systems in future.
△ Less
Submitted 10 June, 2021; v1 submitted 31 May, 2021;
originally announced May 2021.
-
Hierarchical Transformer Encoders for Vietnamese Spelling Correction
Authors:
Hieu Tran,
Cuong V. Dinh,
Long Phan,
Son T. Nguyen
Abstract:
In this paper, we propose a Hierarchical Transformer model for Vietnamese spelling correction problem. The model consists of multiple Transformer encoders and utilizes both character-level and word-level to detect errors and make corrections. In addition, to facilitate future work in Vietnamese spelling correction tasks, we propose a realistic dataset collected from real-life texts for the problem…
▽ More
In this paper, we propose a Hierarchical Transformer model for Vietnamese spelling correction problem. The model consists of multiple Transformer encoders and utilizes both character-level and word-level to detect errors and make corrections. In addition, to facilitate future work in Vietnamese spelling correction tasks, we propose a realistic dataset collected from real-life texts for the problem. We compare our method with other methods and publicly available systems. The proposed method outperforms all of the contemporary methods in terms of recall, precision, and f1-score. A demo version is publicly available.
△ Less
Submitted 28 May, 2021;
originally announced May 2021.
-
CoTexT: Multi-task Learning with Code-Text Transformer
Authors:
Long Phan,
Hieu Tran,
Daniel Le,
Hieu Nguyen,
James Anibal,
Alec Peltekian,
Yanfang Ye
Abstract:
We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarizing/documentation, code gen…
▽ More
We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpus including both "bimodal" and "unimodal" data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data is merely code snippets. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on 6 different programming languages and Code Refinement on both small and medium size featured in the CodeXGLUE dataset. We further conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGlue dataset, including Code Generation and Defect Detection. We consistently achieve SOTA results in these tasks, demonstrating the versatility of our models.
△ Less
Submitted 21 June, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.
-
Leveraging Transfer Learning for Reliable Intelligence Identification on Vietnamese SNSs (ReINTEL)
Authors:
Trung-Hieu Tran,
Long Phan,
Truong-Son Nguyen,
Tien-Huy Nguyen
Abstract:
This paper proposed several transformer-based approaches for Reliable Intelligence Identification on Vietnamese social network sites at VLSP 2020 evaluation campaign. We exploit both of monolingual and multilingual pre-trained models. Besides, we utilize the ensemble method to improve the robustness of different approaches. Our team achieved a score of 0.9378 at ROC-AUC metric in the private test…
▽ More
This paper proposed several transformer-based approaches for Reliable Intelligence Identification on Vietnamese social network sites at VLSP 2020 evaluation campaign. We exploit both of monolingual and multilingual pre-trained models. Besides, we utilize the ensemble method to improve the robustness of different approaches. Our team achieved a score of 0.9378 at ROC-AUC metric in the private test set which is competitive to other participants.
△ Less
Submitted 16 December, 2020; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Multi-Site Infant Brain Segmentation Algorithms: The iSeg-2019 Challenge
Authors:
Yue Sun,
Kun Gao,
Zhengwang Wu,
Zhihao Lei,
Ying Wei,
Jun Ma,
Xiaoping Yang,
Xue Feng,
Li Zhao,
Trung Le Phan,
Jitae Shin,
Tao Zhong,
Yu Zhang,
Lequan Yu,
Caizi Li,
Ramesh Basnet,
M. Omair Ahmad,
M. N. S. Swamy,
Wenao Ma,
Qi Dou,
Toan Duc Bui,
Camilo Bermudez Noguera,
Bennett Landman,
Ian H. Gotlib,
Kathryn L. Humphreys
, et al. (8 additional authors not shown)
Abstract:
To better understand early brain growth patterns in health and disorder, it is critical to accurately segment infant brain magnetic resonance (MR) images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Deep learning-based methods have achieved state-of-the-art performance; however, one of major limitations is that the learning-based methods may suffer from the multi-site i…
▽ More
To better understand early brain growth patterns in health and disorder, it is critical to accurately segment infant brain magnetic resonance (MR) images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Deep learning-based methods have achieved state-of-the-art performance; however, one of major limitations is that the learning-based methods may suffer from the multi-site issue, that is, the models trained on a dataset from one site may not be applicable to the datasets acquired from other sites with different imaging protocols/scanners. To promote methodological development in the community, iSeg-2019 challenge (http://iseg2019.web.unc.edu) provides a set of 6-month infant subjects from multiple sites with different protocols/scanners for the participating methods. Training/validation subjects are from UNC (MAP) and testing subjects are from UNC/UMN (BCP), Stanford University, and Emory University. By the time of writing, there are 30 automatic segmentation methods participating in iSeg-2019. We review the 8 top-ranked teams by detailing their pipelines/implementations, presenting experimental results and evaluating performance in terms of the whole brain, regions of interest, and gyral landmark curves. We also discuss their limitations and possible future directions for the multi-site issue. We hope that the multi-site dataset in iSeg-2019 and this review article will attract more researchers on the multi-site issue.
△ Less
Submitted 11 July, 2020; v1 submitted 4 July, 2020;
originally announced July 2020.
-
Visualization techniques for the developing chicken heart
Authors:
Ly Phan,
Sandra Rugonyi,
Cindy Grimm
Abstract:
We present a geometric surface parameterization algorithm and several visualization techniques adapted to the problem of understanding the 4D peristaltic-like motion of the outflow tract (OFT) in an embryonic chick heart. We illustrated the techniques using data from hearts under normal conditions (four embryos), and hearts in which blood flow conditions are altered through OFT banding (four embry…
▽ More
We present a geometric surface parameterization algorithm and several visualization techniques adapted to the problem of understanding the 4D peristaltic-like motion of the outflow tract (OFT) in an embryonic chick heart. We illustrated the techniques using data from hearts under normal conditions (four embryos), and hearts in which blood flow conditions are altered through OFT banding (four embryos). The overall goal is to create quantitative measures of the temporal heart-shape change both within a single subject and between multiple subjects. These measures will help elucidate how altering hemodynamic conditions changes the shape and motion of the OFT walls, which in turn influence the stresses and strains on the developing heart, causing it to develop differently. We take advantage of the tubular shape and periodic motion of the OFT to produce successively lower dimensional visualizations of the cardiac motion (e.g. curvature, volume, and cross-section) over time, and quantifications of such visualizations.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.