subscribe to arXiv mailings

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Authors: Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, Kaizhu Huang

Abstract: Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks predominantly concentrate on problem-solving capabilities, which presents a s… ▽ More Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks predominantly concentrate on problem-solving capabilities, which presents a substantial risk of model overfitting and fails to accurately represent genuine mathematical reasoning abilities. In this paper, we argue that if a model really understands a problem, it should be robustly and readily applied across a diverse array of tasks. Motivated by this, we introduce MATHCHECK, a well-designed checklist for testing task generalization and reasoning robustness, as well as an automatic tool to generate checklists efficiently. MATHCHECK includes multiple mathematical reasoning tasks and robustness test types to facilitate a comprehensive evaluation of both mathematical reasoning ability and behavior testing. Utilizing MATHCHECK, we develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual reasoning and multi-modal reasoning capabilities, respectively, serving as upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K. We adopt MATHCHECK-GSM and MATHCHECK-GEO to evaluate over 20 LLMs and 11 MLLMs, assessing their comprehensive mathematical reasoning abilities. Our results demonstrate that while frontier LLMs like GPT-4o continue to excel in various abilities on the checklist, many other model families exhibit a significant decline. Further experiments indicate that, compared to traditional math benchmarks, MATHCHECK better reflects true mathematical abilities and represents mathematical intelligence more linearly, thereby supporting our design. On our MATHCHECK, we can easily conduct detailed behavior analysis to deeply investigate models. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 35 pages, 10 figures, preprint

arXiv:2407.04879 [pdf, other]

All Neural Low-latency Directional Speech Extraction

Authors: Ashutosh Pandey, Sanha Lee, Juan Azcarreta, Daniel Wong, Buye Xu

Abstract: We introduce a novel all neural model for low-latency directional speech extraction. The model uses direction of arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent neural network based speech extraction model. This process enables the model to effectively extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted… ▽ More We introduce a novel all neural model for low-latency directional speech extraction. The model uses direction of arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent neural network based speech extraction model. This process enables the model to effectively extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains DOA embeddings from scratch using speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate, taking in DOA with each input frame, which brings in the capability of quickly adapting to changing scene in highly dynamic real-world scenarios. We provide extensive evaluation to demonstrate the model's efficacy in directional speech extraction, robustness to DOA mismatch, and its capability to quickly adapt to abrupt changes in DOA. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: Accepted for publication at INTERSPEECH 2024

arXiv:2406.16405 [pdf, ps, other]

doi 10.4204/EPTCS.403.23

Greedy Gray Codes for some Restricted Classes of Binary Words

Authors: Nathanaël Hassler, Vincent Vajnovszki, Dennis Wong

Abstract: We investigate the existence of greedy Gray codes, based on the choice of the first element in the code, for two classes of binary words: generalized Fibonacci words and generalized Dyck words. We investigate the existence of greedy Gray codes, based on the choice of the first element in the code, for two classes of binary words: generalized Fibonacci words and generalized Dyck words. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: In Proceedings GASCom 2024, arXiv:2406.14588

Journal ref: EPTCS 403, 2024, pp. 108-112

arXiv:2406.11432 [pdf, other]

AnyTrans: Translate AnyText in the Image with Large Scale Models

Authors: Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji

Abstract: This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during tr… ▽ More This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.07054 [pdf, other]

CoEvol: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation

Authors: Renhao Li, Minghuan Tan, Derek F. Wong, Min Yang

Abstract: In recent years, instruction fine-tuning (IFT) on large language models (LLMs) has garnered considerable attention to enhance model performance on unseen tasks. Attempts have been made on automatic construction and effective selection for IFT data. However, we posit that previous methods have not fully harnessed the potential of LLMs for enhancing data quality. The responses within IFT data could… ▽ More In recent years, instruction fine-tuning (IFT) on large language models (LLMs) has garnered considerable attention to enhance model performance on unseen tasks. Attempts have been made on automatic construction and effective selection for IFT data. However, we posit that previous methods have not fully harnessed the potential of LLMs for enhancing data quality. The responses within IFT data could be further enhanced by leveraging the capabilities of LLMs themselves. In this paper, we propose CoEvol, an LLM-based multi-agent cooperation framework for the improvement of responses to instructions. To effectively refine the responses, we develop an iterative framework following a debate-advise-edit-judge paradigm. A two-stage multi-agent debate strategy is further devised to ensure the diversity and reliability of editing suggestions within the framework. Empirically, models equipped with CoEvol outperform competitive baselines evaluated by MT-Bench and AlpacaEval, demonstrating its effectiveness in enhancing instruction-following capabilities for LLMs. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.05221 [pdf, other]

GCAPS: GPU Context-Aware Preemptive Priority-based Scheduling for Real-Time Tasks

Authors: Yidi Wang, Cong Liu, Daniel Wong, Hyoseung Kim

Abstract: Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of GPU-level preemption, ex… ▽ More Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of GPU-level preemption, extended blocking times, and/or the need for extensive modifications to program code. In this paper, we propose GCAPS, a GPU Context-Aware Preemptive Scheduling approach for real-time GPU tasks. Our approach exerts control over GPU context scheduling at the device driver level and enables preemption of GPU execution based on task priorities by simply adding one-line macros to GPU segment boundaries. In addition, we provide a comprehensive response time analysis of GPU-using tasks for both our proposed approach as well as the default Nvidia GPU driver scheduling that follows a work-conserving round-robin policy. Through empirical evaluations and case studies, we demonstrate the effectiveness of the proposed approaches in improving taskset schedulability and response time. The results highlight significant improvements over prior work as well as the default scheduling approach, with up to 40% higher schedulability, while also achieving predictable worst-case behavior on Nvidia Jetson embedded platforms. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by ECRTS 2024. arXiv admin note: substantial text overlap with arXiv:2401.16529

arXiv:2406.03450 [pdf, other]

What is the Best Way for ChatGPT to Translate Poetry?

Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao

Abstract: Machine translation (MT) has historically faced significant challenges when applied to literary works, particularly in the domain of poetry translation. The advent of Large Language Models such as ChatGPT holds potential for innovation in this field. This study examines ChatGPT's capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to asce… ▽ More Machine translation (MT) has historically faced significant challenges when applied to literary works, particularly in the domain of poetry translation. The advent of Large Language Models such as ChatGPT holds potential for innovation in this field. This study examines ChatGPT's capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to ascertain optimal performance. Despite promising outcomes, our analysis reveals persistent issues in the translations generated by ChatGPT that warrant attention. To address these shortcomings, we propose an Explanation-Assisted Poetry Machine Translation (EAPMT) method, which leverages monolingual poetry explanation as a guiding information for the translation process. Furthermore, we refine existing evaluation criteria to better suit the nuances of modern poetry translation. We engaged a panel of professional poets for assessments, complemented evaluations by using GPT-4. The results from both human and machine evaluations demonstrate that our EAPMT method outperforms traditional translation methods of ChatGPT and the existing online systems. This paper validates the efficacy of our method and contributes a novel perspective to machine-assisted literary translation. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: 19 pages, 1 figure. The paper has been accepted by ACL 2024(Main Conference)

arXiv:2406.00839 [pdf, other]

FOCUS: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models

Authors: Kaixin Lan, Tao Fang, Derek F. Wong, Yabo Xu, Lidia S. Chao, Cecilia G. Zhao

Abstract: Pre-trained Language Models (PLMs) have shown impressive results in various Natural Language Generation (NLG) tasks, such as powering chatbots and generating stories. However, an ethical concern arises due to their potential to produce verbatim copies of paragraphs from their training data. This is problematic as PLMs are trained on corpora constructed by human authors. As such, there is a pressin… ▽ More Pre-trained Language Models (PLMs) have shown impressive results in various Natural Language Generation (NLG) tasks, such as powering chatbots and generating stories. However, an ethical concern arises due to their potential to produce verbatim copies of paragraphs from their training data. This is problematic as PLMs are trained on corpora constructed by human authors. As such, there is a pressing need for research to promote the generation of original content by these models. In this study, we introduce a unique "self-plagiarism" contrastive decoding strategy, aimed at boosting the originality of text produced by PLMs. Our method entails modifying prompts in LLMs to develop an amateur model and a professional model. Specifically, the amateur model is urged to plagiarize using three plagiarism templates we have designed, while the professional model maintains its standard language model status. This strategy employs prompts to stimulate the model's capacity to identify non-original candidate token combinations and subsequently impose penalties. The application of this strategy is integrated prior to the model's final layer, ensuring smooth integration with most existing PLMs (T5, GPT, LLaMA) without necessitating further adjustments. Implementing our strategy, we observe a significant decline in non-original sequences comprised of more than three words in the academic AASC dataset and the story-based ROCStories dataset. △ Less

Submitted 2 June, 2024; originally announced June 2024.

Comments: 16 pages, 8 figures. The paper has been accepted by ACL 2024 (Findings), with Kaixin Lan and Tao Fang contributing equally, and Derek F. Wong serving as the corresponding author

arXiv:2405.14039 [pdf, other]

Trajectory Volatility for Out-of-Distribution Detection in Mathematical Reasoning

Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, Zhuosheng Zhang, Rui Wang

Abstract: Real-world data deviating from the independent and identically distributed (i.i.d.) assumption of in-distribution training data poses security threats to deep networks, thus advancing out-of-distribution (OOD) detection algorithms. Detection methods in generative language models (GLMs) mainly focus on uncertainty estimation and embedding distance measurement, with the latter proven to be most effe… ▽ More Real-world data deviating from the independent and identically distributed (i.i.d.) assumption of in-distribution training data poses security threats to deep networks, thus advancing out-of-distribution (OOD) detection algorithms. Detection methods in generative language models (GLMs) mainly focus on uncertainty estimation and embedding distance measurement, with the latter proven to be most effective in traditional linguistic tasks like summarization and translation. However, another complex generative scenario mathematical reasoning poses significant challenges to embedding-based methods due to its high-density feature of output spaces, but this feature causes larger discrepancies in the embedding shift trajectory between different samples in latent spaces. Hence, we propose a trajectory-based method TV score, which uses trajectory volatility for OOD detection in mathematical reasoning. Experiments show that our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios and can be extended to more applications with high-density features in output spaces, such as multiple-choice questions. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: 27 pages, 6 figures, 12 tables

arXiv:2405.04286 [pdf, other]

Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore

Authors: Junchao Wu, Runzhe Zhan, Derek F. Wong, Shu Yang, Xuebo Liu, Lidia S. Chao, Min Zhang

Abstract: The efficacy of an large language model (LLM) generated text detector depends substantially on the availability of sizable training data. White-box zero-shot detectors, which require no such data, are nonetheless limited by the accessibility of the source model of the LLM-generated text. In this paper, we propose an simple but effective black-box zero-shot detection approach, predicated on the obs… ▽ More The efficacy of an large language model (LLM) generated text detector depends substantially on the availability of sizable training data. White-box zero-shot detectors, which require no such data, are nonetheless limited by the accessibility of the source model of the LLM-generated text. In this paper, we propose an simple but effective black-box zero-shot detection approach, predicated on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. This approach entails computing the Grammar Error Correction Score (GECScore) for the given text to distinguish between human-written and LLM-generated text. Extensive experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods, achieving an average AUROC of 98.7% and showing strong robustness against paraphrase and adversarial perturbation attacks. △ Less

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.02925 [pdf, other]

A Two-Stage Prediction-Aware Contrastive Learning Framework for Multi-Intent NLU

Authors: Guanhua Chen, Yutong Yao, Derek F. Wong, Lidia S. Chao

Abstract: Multi-intent natural language understanding (NLU) presents a formidable challenge due to the model confusion arising from multiple intents within a single utterance. While previous works train the model contrastively to increase the margin between different multi-intent labels, they are less suited to the nuances of multi-intent NLU. They ignore the rich information between the shared intents, whi… ▽ More Multi-intent natural language understanding (NLU) presents a formidable challenge due to the model confusion arising from multiple intents within a single utterance. While previous works train the model contrastively to increase the margin between different multi-intent labels, they are less suited to the nuances of multi-intent NLU. They ignore the rich information between the shared intents, which is beneficial to constructing a better embedding space, especially in low-data scenarios. We introduce a two-stage Prediction-Aware Contrastive Learning (PACL) framework for multi-intent NLU to harness this valuable knowledge. Our approach capitalizes on shared intent information by integrating word-level pre-training and prediction-aware contrastive fine-tuning. We construct a pre-training dataset using a word-level data augmentation strategy. Subsequently, our framework dynamically assigns roles to instances during contrastive fine-tuning while introducing a prediction-aware contrastive loss to maximize the impact of contrastive learning. We present experimental results and empirical analysis conducted on three widely used datasets, demonstrating that our method surpasses the performance of three prominent baselines on both low-data and full-data scenarios. △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: LREC-COLING 2024

arXiv:2404.18413 [pdf, other]

3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset

Authors: Xinyu Ma, Xuebo Liu, Derek F. Wong, Jun Rao, Bei Li, Liang Ding, Lidia S. Chao, Dacheng Tao, Min Zhang

Abstract: Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, causing models to disregard it and overestimate their capabilities. This issue presents a significant obstacle to the development of MMT researc… ▽ More Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, causing models to disregard it and overestimate their capabilities. This issue presents a significant obstacle to the development of MMT research. This paper presents a novel solution to this issue by introducing 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese, each with corresponding images. Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. We utilize a word sense disambiguation model to select ambiguous data from vision-and-language datasets, resulting in a more challenging dataset. We further benchmark several state-of-the-art MMT models on our proposed dataset. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets. Our work provides a valuable resource for researchers in the field of multimodal learning and encourages further exploration in this area. The data, code and scripts are freely available at https://github.com/MaxyLee/3AM. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.16766 [pdf, other]

Prefix Text as a Yarn: Eliciting Non-English Alignment in Foundation Language Model

Authors: Runzhe Zhan, Xinyi Yang, Derek F. Wong, Lidia S. Chao, Yue Zhang

Abstract: While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effective… ▽ More While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across eight languages demonstrate the efficacy of PreTTY in cross-lingual settings. Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can achieve performance comparable to their SFT counterparts. This method presents a cost-effective alternative to SFT and advances the democratization of multilingual LLMs. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2403.15012 [pdf, other]

Empirical investigation of multi-source cross-validation in clinical machine learning

Authors: Tuija Leinonen, David Wong, Ali Wahab, Ramesh Nadarajah, Matti Kaisti, Antti Airola

Abstract: Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to ac… ▽ More Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to accuracy obtained from deploying models to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systemically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, having close to zero bias though larger variability. The evaluation highlights the dangers of obtaining misleading cross-validation results on medical data and demonstrates how these issues can be mitigated when having access to multi-source data. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: 14 pages, 3 figures

arXiv:2403.12381 [pdf, other]

Explainable AutoML (xAutoML) with adaptive modeling for yield enhancement in semiconductor smart manufacturing

Authors: Weihong Zhai, Xiupeng Shi, Yiik Diew Wong, Qing Han, Lisheng Chen

Abstract: Enhancing yield is recognized as a paramount driver to reducing production costs in semiconductor smart manufacturing. However, optimizing and ensuring high yield rates is a highly complex and technical challenge, especially while maintaining reliable yield diagnosis and prognosis, and this shall require understanding all the confounding factors in a complex condition. This study proposes a domain… ▽ More Enhancing yield is recognized as a paramount driver to reducing production costs in semiconductor smart manufacturing. However, optimizing and ensuring high yield rates is a highly complex and technical challenge, especially while maintaining reliable yield diagnosis and prognosis, and this shall require understanding all the confounding factors in a complex condition. This study proposes a domain-specific explainable automated machine learning technique (termed xAutoML), which autonomously self-learns the optimal models for yield prediction, with an extent of explainability, and also provides insights on key diagnosis factors. The xAutoML incorporates tailored problem-solving functionalities in an auto-optimization pipeline to address the intricacies of semiconductor yield enhancement. Firstly, to capture the key diagnosis factors, knowledge-informed feature extraction coupled with model-agnostic key feature selection is designed. Secondly, combined algorithm selection and hyperparameter tuning with adaptive loss are developed to generate optimized classifiers for better defect prediction, and adaptively evolve in response to shifting data patterns. Moreover, a suite of explainability tools is provided throughout the AutoML pipeline, enhancing user understanding and fostering trust in the automated processes. The proposed xAutoML exhibits superior performance, with domain-specific refined countermeasures, adaptive optimization capabilities, and embedded explainability. Findings exhibit that the proposed xAutoML is a compelling solution for semiconductor yield improvement, defect diagnosis, and related applications. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11621 [pdf, other]

Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model

Authors: Haoyun Xu, Runzhe Zhan, Derek F. Wong, Lidia S. Chao

Abstract: Large Language Models (LLMs) are composed of neurons that exhibit various behaviors and roles, which become increasingly diversified as models scale. Recent studies have revealed that not all neurons are active across different datasets, and this sparsity correlates positively with the task-specific ability, leading to advancements in model pruning and training efficiency. Traditional fine-tuning… ▽ More Large Language Models (LLMs) are composed of neurons that exhibit various behaviors and roles, which become increasingly diversified as models scale. Recent studies have revealed that not all neurons are active across different datasets, and this sparsity correlates positively with the task-specific ability, leading to advancements in model pruning and training efficiency. Traditional fine-tuning methods engage all parameters of LLMs, which is computationally expensive and may not be necessary. In contrast, Parameter-Efficient Fine-Tuning (PEFT) approaches aim to minimize the number of trainable parameters, yet they still operate at a relatively macro scale (e.g., layer-level). We introduce Neuron-Level Fine-Tuning (NeFT), a novel approach that refines the granularity of parameter training down to the individual neuron, enabling more precise and computationally efficient model updates. The experimental results show that NeFT not only exceeded the performance of full-parameter fine-tuning and PEFT but also provided insights into the analysis of neurons. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.16705 [pdf, other]

SelectIT: Selective Instruction Tuning for Large Language Models via Uncertainty-Aware Self-Reflection

Authors: Liangxin Liu, Xuebo Liu, Derek F. Wong, Dongfang Li, Ziyi Wang, Baotian Hu, Min Zhang

Abstract: Instruction tuning (IT) is crucial to tailoring large language models (LLMs) towards human-centric interactions. Recent advancements have shown that the careful selection of a small, high-quality subset of IT data can significantly enhance the performance of LLMs. Despite this, common approaches often rely on additional models or data sets, which increases costs and limits widespread adoption. In… ▽ More Instruction tuning (IT) is crucial to tailoring large language models (LLMs) towards human-centric interactions. Recent advancements have shown that the careful selection of a small, high-quality subset of IT data can significantly enhance the performance of LLMs. Despite this, common approaches often rely on additional models or data sets, which increases costs and limits widespread adoption. In this work, we propose a novel approach, termed SelectIT, that capitalizes on the foundational capabilities of the LLM itself. Specifically, we exploit the intrinsic uncertainty present in LLMs to more effectively select high-quality IT data, without the need for extra resources. Furthermore, we introduce a novel IT dataset, the Selective Alpaca, created by applying SelectIT to the Alpaca-GPT4 dataset. Empirical results demonstrate that IT using Selective Alpaca leads to substantial model ability enhancement. The robustness of SelectIT has also been corroborated in various foundation models and domain-specific tasks. Our findings suggest that longer and more computationally intensive IT data may serve as superior sources of IT, offering valuable insights for future research in this area. Data, code, and scripts are freely available at https://github.com/Blue-Raincoat/SelectIT. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.15566 [pdf]

Closing the AI generalization gap by adjusting for dermatology condition distribution differences across clinical settings

Authors: Rajeev V. Rikhye, Aaron Loh, Grace Eunhae Hong, Preeti Singh, Margaret Ann Smith, Vijaytha Muralidharan, Doris Wong, Rory Sayres, Michelle Phung, Nicolas Betancourt, Bradley Fong, Rachna Sahasrabudhe, Khoban Nasim, Alec Eschholz, Basil Mustafa, Jan Freyberg, Terry Spitz, Yossi Matias, Greg S. Corrado, Katherine Chou, Dale R. Webster, Peggy Bui, Yuan Liu, Yun Liu, Justin Ko , et al. (1 additional authors not shown)

Abstract: Recently, there has been great progress in the ability of artificial intelligence (AI) algorithms to classify dermatological conditions from clinical photographs. However, little is known about the robustness of these algorithms in real-world settings where several factors can lead to a loss of generalizability. Understanding and overcoming these limitations will permit the development of generali… ▽ More Recently, there has been great progress in the ability of artificial intelligence (AI) algorithms to classify dermatological conditions from clinical photographs. However, little is known about the robustness of these algorithms in real-world settings where several factors can lead to a loss of generalizability. Understanding and overcoming these limitations will permit the development of generalizable AI that can aid in the diagnosis of skin conditions across a variety of clinical settings. In this retrospective study, we demonstrate that differences in skin condition distribution, rather than in demographics or image capture mode are the main source of errors when an AI algorithm is evaluated on data from a previously unseen source. We demonstrate a series of steps to close this generalization gap, requiring progressively more information about the new source, ranging from the condition distribution to training data enriched for data less frequently seen during training. Our results also suggest comparable performance from end-to-end fine tuning versus fine tuning solely the classification layer on top of a frozen embedding model. Our approach can inform the adaptation of AI algorithms to new settings, based on the information and resources available. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.15293 [pdf, other]

SoK: What don't we know? Understanding Security Vulnerabilities in SNARKs

Authors: Stefanos Chaliasos, Jens Ernstberger, David Theodore, David Wong, Mohammad Jahanara, Benjamin Livshits

Abstract: Zero-knowledge proofs (ZKPs) have evolved from being a theoretical concept providing privacy and verifiability to having practical, real-world implementations, with SNARKs (Succinct Non-Interactive Argument of Knowledge) emerging as one of the most significant innovations. Prior work has mainly focused on designing more efficient SNARK systems and providing security proofs for them. Many think of… ▽ More Zero-knowledge proofs (ZKPs) have evolved from being a theoretical concept providing privacy and verifiability to having practical, real-world implementations, with SNARKs (Succinct Non-Interactive Argument of Knowledge) emerging as one of the most significant innovations. Prior work has mainly focused on designing more efficient SNARK systems and providing security proofs for them. Many think of SNARKs as "just math," implying that what is proven to be correct and secure is correct in practice. In contrast, this paper focuses on assessing end-to-end security properties of real-life SNARK implementations. We start by building foundations with a system model and by establishing threat models and defining adversarial roles for systems that use SNARKs. Our study encompasses an extensive analysis of 141 actual vulnerabilities in SNARK implementations, providing a detailed taxonomy to aid developers and security researchers in understanding the security threats in systems employing SNARKs. Finally, we evaluate existing defense mechanisms and offer recommendations for enhancing the security of SNARK-based systems, paving the way for more robust and reliable implementations in the future. △ Less

Submitted 11 July, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.07616 [pdf, other]

Anchor-based Large Language Models

Authors: Jianhui Pang, Fanghua Ye, Derek Fai Wong, Xin He, Wanshun Chen, Longyue Wang

Abstract: Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading t… ▽ More Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading to an urgent need for more efficient methods of information storage and processing. This study introduces Anchor-based LLMs (AnLLMs), which utilize an innovative anchor-based self-attention network (AnSAN) and also an anchor-based inference strategy. This approach enables LLMs to compress sequence information into an anchor token, reducing the keys/values cache and enhancing inference efficiency. Experiments on question-answering benchmarks reveal that AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference. Despite a minor compromise in accuracy, the substantial enhancements of AnLLMs employing the AnSAN technique in resource utilization and computational efficiency underscore their potential for practical LLM applications. △ Less

Submitted 1 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: The paper has been accepted by the ACL2024 conference. Work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab

arXiv:2401.16906 [pdf]

doi 10.5121/ijcis.2023.13401

Quantum-Secure Hybrid Blockchain System for DID-based Verifiable Random Function with NTRU Linkable Ring Signature

Authors: Bong Gon Kim, Dennis Wong, Yoon Seok Yang

Abstract: In this study, we present a secure smart contract-based Verifiable Random Function (VRF) model, addressing the shortcomings of existing systems. As quantum computing emerges, conventional public key cryptography faces potential vulnerabilities. To enhance our VRF's robustness, we employ post-quantum Ring-LWE encryption for generating pseudo-random sequences. Given the computational intensity of th… ▽ More In this study, we present a secure smart contract-based Verifiable Random Function (VRF) model, addressing the shortcomings of existing systems. As quantum computing emerges, conventional public key cryptography faces potential vulnerabilities. To enhance our VRF's robustness, we employ post-quantum Ring-LWE encryption for generating pseudo-random sequences. Given the computational intensity of this approach and associated on-chain gas costs, we propose a hybrid architecture of VRF system where on-chain and off-chain can communicate in a scalable and secure way. To ensure the validity and integrity of the off-chain computations (e.g., Ring-LWE encryption), we employ a quantum-secure linkable ring signature scheme on NTRU lattice and also delegated key generation (DKG) with a secure key encapsulation mechanism (KEM). Our decentralized VRF employs multi-party computation (MPC) with blockchain-based decentralized identifiers (DID), ensuring the collective efforts of enhanced randomness and security. We show the security and privacy advantages of our proposed VRF model with the approximated estimation of overall temporal and spatial complexities. We also evaluate our VRF MPC model's entropy and outline its Solidity smart contract integration. This research also provides a method to produce and verify the VRF output's proof, optimal for scenarios necessitating randomness and validation. Lastly, using NIST SP800-22 test suite for randomness, we demonstrate the commendable result with a 97.73% overall pass rate on 11 standard tests and 0.5459 of average p-value for the total 176 tests. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: 25 pages, 5 figures, 2023 International Journal on Cryptography and Information Security (IJCIS). arXiv admin note: text overlap with arXiv:2311.11734

Journal ref: Volume 13, Number 4, December 2023

arXiv:2401.16529 [pdf, other]

Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks

Authors: Yidi Wang, Cong Liu, Daniel Wong, Hyoseung Kim

Abstract: Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of preemption, extended blo… ▽ More Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of preemption, extended blocking times, and/or the need for extensive modifications to program code. In this paper, we propose two novel techniques, namely the kernel thread and IOCTL-based approaches, to enable preemptive priority-based scheduling for real-time GPU tasks. Our approaches exert control over GPU context scheduling at the device driver level and enable preemptive GPU scheduling based on task priorities. The kernel thread-based approach achieves this without requiring modifications to user-level programs, while the IOCTL-based approach needs only a single macro at the boundaries of GPU access segments. In addition, we provide a comprehensive response time analysis that takes into account overlaps between different task segments, mitigating pessimism in worst-case estimates. Through empirical evaluations and case studies, we demonstrate the effectiveness of the proposed approaches in improving taskset schedulability and timeliness of real-time tasks. The results highlight significant improvements over prior work, with up to 40\% higher schedulability, while also achieving predictable worst-case behavior on Nvidia Jetson embedded platforms. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.12794 [pdf, other]

Benchmarking LLMs via Uncertainty Quantification

Authors: Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu

Abstract: The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking… ▽ More The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves eight LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs. △ Less

Submitted 25 April, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: 25 pages, preprints

arXiv:2401.08350 [pdf, other]

Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models

Authors: Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F. Wong, Shuming Shi, Zhaopeng Tu

Abstract: The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, t… ▽ More The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. Our empirical findings indicate that LLMs effectively lessen the reliance on parallel data for major languages in the pretraining phase. Additionally, the LLM-based translation system significantly enhances the translation of long sentences that contain approximately 80 words and shows the capability to translate documents of up to 512 words. However, despite these significant improvements, the challenges of domain mismatch and prediction of rare words persist. While the challenges of word alignment and beam search, specifically associated with NMT, may not apply to LLMs, we identify three new challenges for LLMs in translation tasks: inference efficiency, translation of low-resource languages in the pretraining phase, and human-aligned evaluation. The datasets and models are released at https://github.com/pangjh3/LLM4MT. △ Less

Submitted 17 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: 17 pages. Longyue Wang is the Corresponding Author

arXiv:2401.07882 [pdf, other]

On the Importance of Neural Wiener Filter for Resource Efficient Multichannel Speech Enhancement

Authors: Tsun-An Hsieh, Jacob Donley, Daniel Wong, Buye Xu, Ashutosh Pandey

Abstract: We introduce a time-domain framework for efficient multichannel speech enhancement, emphasizing low latency and computational efficiency. This framework incorporates two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF). The first DNN enhances the speech signal to estimate NWF coefficients, while the second DNN refines the output from the NWF. The NWF, while… ▽ More We introduce a time-domain framework for efficient multichannel speech enhancement, emphasizing low latency and computational efficiency. This framework incorporates two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF). The first DNN enhances the speech signal to estimate NWF coefficients, while the second DNN refines the output from the NWF. The NWF, while conceptually similar to the traditional frequency-domain Wiener filter, undergoes a training process optimized for low-latency speech enhancement, involving fine-tuning of both analysis and synthesis transforms. Our research results illustrate that the NWF output, having minimal nonlinear distortions, attains performance levels akin to those of the first DNN, deviating from conventional Wiener filter paradigms. Training all components jointly outperforms sequential training, despite its simplicity. Consequently, this framework achieves superior performance with fewer parameters and reduced computational demands, making it a compelling solution for resource-efficient multichannel speech enhancement. △ Less

Submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted for publication at ICASSP

arXiv:2311.11734 [pdf]

doi 10.5121/csit.2023.132104

Private and Secure Post-Quantum Verifiable Random Function with NIZK Proof and Ring-LWE Encryption in Blockchain

Authors: Bong Gon Kim, Dennis Wong, Yoon Seok Yang

Abstract: We present a secure and private blockchain-based Verifiable Random Function (VRF) scheme addressing some limitations of classical VRF constructions. Given the imminent quantum computing adversarial scenario, conventional cryptographic methods face vulnerabilities. To enhance our VRF's secure randomness, we adopt post-quantum Ring-LWE encryption for synthesizing pseudo-random sequences. Considering… ▽ More We present a secure and private blockchain-based Verifiable Random Function (VRF) scheme addressing some limitations of classical VRF constructions. Given the imminent quantum computing adversarial scenario, conventional cryptographic methods face vulnerabilities. To enhance our VRF's secure randomness, we adopt post-quantum Ring-LWE encryption for synthesizing pseudo-random sequences. Considering computational costs and resultant on-chain gas costs, we suggest a bifurcated architecture for VRF design, optimizing interactions between on-chain and off-chain. Our approach employs a secure ring signature supported by NIZK proof and a delegated key generation method, inspired by the Chaum-Pedersen equality proof and the Fiat-Shamir Heuristic. Our VRF scheme integrates multi-party computation (MPC) with blockchain-based decentralized identifiers (DID), ensuring both security and randomness. We elucidate the security and privacy aspects of our VRF scheme, analyzing temporal and spatial complexities. We also approximate the entropy of the VRF scheme and detail its implementation in a Solidity contract. Also, we delineate a method for validating the VRF's proof, matching for the contexts requiring both randomness and verification. Conclusively, using the NIST SP800-22 of the statistical randomness test suite, our results exhibit a 98.86% pass rate over 11 test cases, with an average p-value of 0.5459 from 176 total tests. △ Less

Submitted 7 February, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: 21 pages, 5 figures, In the 2023 Proceedings of International Conference on Cryptography and Blockchain

Journal ref: Proceedings of International Conference on Cryptography and Blockchain, 13(21), 47-67 (2023)

arXiv:2311.03032 [pdf, other]

Reconfigurable, Transformable Soft Pneumatic Actuator with Tunable 3D Deformations for Dexterous Soft Robotics Applications

Authors: Dickson Chiu Yu Wong, Mingtan Li, Shijie Kang, Lifan Luo, Hongyu Yu

Abstract: Numerous soft actuators based on PneuNet design have already been proposed and extensively employed across various soft robotics applications in recent years. Despite their widespread use, a common limitation of most existing designs is that their action is pre-determined during the fabrication process, thereby restricting the ability to modify or alter their function during operation. To address… ▽ More Numerous soft actuators based on PneuNet design have already been proposed and extensively employed across various soft robotics applications in recent years. Despite their widespread use, a common limitation of most existing designs is that their action is pre-determined during the fabrication process, thereby restricting the ability to modify or alter their function during operation. To address this shortcoming, in this article the design of a Reconfigurable, Transformable Soft Pneumatic Actuator (RT-SPA) is proposed. The working principle of the RT-SPA is analogous to the conventional PneuNet. The key distinction between the two lies in the ability of the RT-SPA to undergo controlled transformations, allowing for more versatile bending and twisting motions in various directions. Furthermore, the unique reconfigurable design of the RT-SPA enables the selection of actuation units with different sizes to achieve a diverse range of three-dimensional deformations. This versatility enhances the potential of the RT-SPA for adaptation to a multitude of tasks and environments, setting it apart from traditional PneuNet. The paper begins with a detailed description of the design and fabrication of the RT-SPA. Following this, a series of experiments are conducted to evaluate the performance of the RT-SPA. Finally, the abilities of the RT-SPA for locomotion, gripping, and object manipulation are demonstrated to illustrate the versatility of the RT-SPA across different aspects. △ Less

Submitted 6 November, 2023; originally announced November 2023.

Comments: Submitted to Soft Robotics Journal. 12 pages, 10 figures

arXiv:2310.14724 [pdf, other]

A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions

Authors: Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F. Wong, Lidia S. Chao

Abstract: The powerful ability to understand, follow, and generate complex language emerging from large language models (LLMs) makes LLM-generated text flood many areas of our daily lives at an incredible speed and is widely accepted by humans. As LLMs continue to expand, there is an imperative need to develop detectors that can detect LLM-generated text. This is crucial to mitigate potential misuse of LLMs… ▽ More The powerful ability to understand, follow, and generate complex language emerging from large language models (LLMs) makes LLM-generated text flood many areas of our daily lives at an incredible speed and is widely accepted by humans. As LLMs continue to expand, there is an imperative need to develop detectors that can detect LLM-generated text. This is crucial to mitigate potential misuse of LLMs and safeguard realms like artistic expression and social networks from harmful influence of LLM-generated content. The LLM-generated text detection aims to discern if a piece of text was produced by an LLM, which is essentially a binary classification task. The detector techniques have witnessed notable advancements recently, propelled by innovations in watermarking techniques, statistics-based detectors, neural-base detectors, and human-assisted methods. In this survey, we collate recent research breakthroughs in this area and underscore the pressing need to bolster detector research. We also delve into prevalent datasets, elucidating their limitations and developmental requirements. Furthermore, we analyze various LLM-generated text detection paradigms, shedding light on challenges like out-of-distribution problems, potential attacks, real-world data issues and the lack of effective evaluation framework. Conclusively, we highlight interesting directions for future research in LLM-generated text detection to advance the implementation of responsible artificial intelligence (AI). Our aim with this survey is to provide a clear and comprehensive introduction for newcomers while also offering seasoned researchers a valuable update in the field of LLM-generated text detection. The useful resources are publicly available at: https://github.com/NLP2CT/LLM-generated-Text-Detection. △ Less

Submitted 19 April, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.08908 [pdf, other]

Human-in-the-loop Machine Translation with Large Language Model

Authors: Xinyi Yang, Runzhe Zhan, Derek F. Wong, Junchao Wu, Lidia S. Chao

Abstract: The large language model (LLM) has garnered significant attention due to its in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies to apply LLMs to machine translation tasks and evaluate their performance from diverse perspectives. However, previous research has primarily focused on the LLM itself and has not explored human interventio… ▽ More The large language model (LLM) has garnered significant attention due to its in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies to apply LLMs to machine translation tasks and evaluate their performance from diverse perspectives. However, previous research has primarily focused on the LLM itself and has not explored human intervention in the inference process of LLM. The characteristics of LLM, such as in-context learning and prompt engineering, closely mirror human cognitive abilities in language tasks, offering an intuitive solution for human-in-the-loop generation. In this study, we propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs with revision instructions. The pipeline initiates by prompting the LLM to produce a draft translation, followed by the utilization of automatic retrieval or human feedback as supervision signals to enhance the LLM's translation through in-context learning. The human-machine interactions generated in this pipeline are also stored in an external database to expand the in-context retrieval database, enabling us to leverage human supervision in an offline setting. We evaluate the proposed pipeline using GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation. The results demonstrate the effectiveness of the pipeline in tailoring in-domain translations and improving translation performance compared to direct translation. Additionally, we discuss the results from the following perspectives: 1) the effectiveness of different in-context retrieval methods; 2) the construction of a retrieval database under low-resource scenarios; 3) the observed domains differences; 4) the quantitative analysis of linguistic statistics; and 5) the qualitative analysis of translation cases. The code and data are available at https://github.com/NLP2CT/HIL-MT/. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted to MT Summit 2023

arXiv:2305.19847 [pdf, other]

How Does Pretraining Improve Discourse-Aware Translation?

Authors: Zhihong Huang, Longyue Wang, Siyou Liu, Derek F. Wong

Abstract: Pretrained language models (PLMs) have produced substantial improvements in discourse-aware neural machine translation (NMT), for example, improved coherence in spoken language translation. However, the underlying reasons for their strong performance have not been well explained. To bridge this gap, we introduce a probing task to interpret the ability of PLMs to capture discourse relation knowledg… ▽ More Pretrained language models (PLMs) have produced substantial improvements in discourse-aware neural machine translation (NMT), for example, improved coherence in spoken language translation. However, the underlying reasons for their strong performance have not been well explained. To bridge this gap, we introduce a probing task to interpret the ability of PLMs to capture discourse relation knowledge. We validate three state-of-the-art PLMs across encoder-, decoder-, and encoder-decoder-based models. The analysis shows that (1) the ability of PLMs on discourse modelling varies from architecture and layer; (2) discourse elements in a text lead to different learning difficulties for PLMs. Besides, we investigate the effects of different PLMs on spoken language translation. Through experiments on IWSLT2017 Chinese-English dataset, we empirically reveal that NMT models initialized from different layers of PLMs exhibit the same trends with the probing task. Our findings are instructive to understand how and when discourse knowledge in PLMs should work for downstream tasks. △ Less

Submitted 31 May, 2023; originally announced May 2023.

Comments: Interspeech 2023

arXiv:2305.01951 [pdf, other]

Can LMs Generalize to Future Data? An Empirical Analysis on Text Summarization

Authors: Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li, Yanming Sun, Shudong Liu, Lidia S. Chao

Abstract: Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets. However, existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets. Hence, the strong performance of PLMs may rely on the parametric knowledge that is memorized during pre-training and fine-tuning. Moreover, the knowledge memoriz… ▽ More Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets. However, existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets. Hence, the strong performance of PLMs may rely on the parametric knowledge that is memorized during pre-training and fine-tuning. Moreover, the knowledge memorized by PLMs may quickly become outdated, which affects the generalization performance of PLMs on future data. In this work, we propose TempoSum, a novel benchmark that contains data samples from 2010 to 2022, to understand the temporal generalization ability of abstractive summarization models. Through extensive human evaluation, we show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data. Moreover, existing faithfulness enhancement methods cannot reliably improve the faithfulness of summarization models on future data. Finally, we discuss several recommendations to the research community on how to evaluate and improve the temporal generalization capability of text summarization models. △ Less

Submitted 2 November, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: Accepted at EMNLP 2023

arXiv:2305.01181 [pdf, other]

A Paradigm Shift: The Future of Machine Translation Lies with Large Language Models

Authors: Chenyang Lyu, Zefeng Du, Jitao Xu, Yitao Duan, Minghao Wu, Teresa Lynn, Alham Fikri Aji, Derek F. Wong, Siyou Liu, Longyue Wang

Abstract: Machine Translation (MT) has greatly advanced over the years due to the developments in deep neural networks. However, the emergence of Large Language Models (LLMs) like GPT-4 and ChatGPT is introducing a new phase in the MT domain. In this context, we believe that the future of MT is intricately tied to the capabilities of LLMs. These models not only offer vast linguistic understandings but also… ▽ More Machine Translation (MT) has greatly advanced over the years due to the developments in deep neural networks. However, the emergence of Large Language Models (LLMs) like GPT-4 and ChatGPT is introducing a new phase in the MT domain. In this context, we believe that the future of MT is intricately tied to the capabilities of LLMs. These models not only offer vast linguistic understandings but also bring innovative methodologies, such as prompt-based techniques, that have the potential to further elevate MT. In this paper, we provide an overview of the significant enhancements in MT that are influenced by LLMs and advocate for their pivotal role in upcoming MT research and implementations. We highlight several new MT directions, emphasizing the benefits of LLMs in scenarios such as Long-Document Translation, Stylized Translation, and Interactive Translation. Additionally, we address the important concern of privacy in LLM-driven MT and suggest essential privacy-preserving strategies. By showcasing practical instances, we aim to demonstrate the advantages that LLMs offer, particularly in tasks like translating extended documents. We conclude by emphasizing the critical role of LLMs in guiding the future evolution of MT and offer a roadmap for future exploration in the sector. △ Less

Submitted 1 April, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

Comments: Accepted to LREC-COLING 2024

arXiv:2304.14937 [pdf, other]

Contactless hand tremor amplitude measurement using smartphones: development and pilot evaluation

Authors: James Bungay, Osasenaga Emokpae, Samuel D. Relton, Jane Alty, Stefan Williams, Hui Fang, David C. Wong

Abstract: Background: Physiological tremor is defined as an involuntary and rhythmic shaking. Tremor of the hand is a key symptom of multiple neurological diseases, and its frequency and amplitude differs according to both disease type and disease progression. In routine clinical practice, tremor frequency and amplitude are assessed by expert rating using a 0 to 4 integer scale. Such ratings are subjective… ▽ More Background: Physiological tremor is defined as an involuntary and rhythmic shaking. Tremor of the hand is a key symptom of multiple neurological diseases, and its frequency and amplitude differs according to both disease type and disease progression. In routine clinical practice, tremor frequency and amplitude are assessed by expert rating using a 0 to 4 integer scale. Such ratings are subjective and have poor inter-rater reliability. There is thus a clinical need for a practical and accurate method for objectively assessing hand tremor. Objective: to develop a proof of principle method to measure hand tremor amplitude from smartphone videos. Methods: We created a computer vision pipeline that automatically extracts salient points on the hand and produces a 1-D time series of movement due to tremor, in pixels. Using the smartphones' depth measurement, we convert this measure into real distance units. We assessed the accuracy of the method using 60 videos of simulated tremor of different amplitudes from two healthy adults. Videos were taken at distances of 50, 75 and 100 cm between hand and camera. The participants had skin tone II and VI on the Fitzpatrick scale. We compared our method to a gold-standard measurement from a slide rule. Bland-Altman methods agreement analysis indicated a bias of 0.04 cm and 95% limits of agreement from -1.27 to 1.20 cm. Furthermore, we qualitatively observed that the method was robust to differences in skin tone and limited occlusion, such as a band-aid affixed to the participant's hand. Clinical relevance: We have demonstrated how tremor amplitude can be measured from smartphone videos. In conjunction with tremor frequency, this approach could be used to help diagnose and monitor neurological diseases △ Less

Submitted 28 April, 2023; originally announced April 2023.

Comments: Accepted to IEEE EMBC 2023, Sydney (pre-refereed version)

arXiv:2304.07943 [pdf]

Detecting Domain-Generation Algorithm (DGA) Based Fully-Qualified Domain Names (FQDNs) with Shannon Entropy

Authors: Adam Dorian Wong

Abstract: Domain Name System (DNS) is the backbone of the Internet. However, threat actors have abused the antiquated protocol to facilitate command-and-control (C2) actions, to tunnel, or to exfiltrate sensitive information in novel ways. The FireEye breach and Solarwinds intrusions of late 2020 demonstrated the sophistication of hacker groups. Researchers were eager to reverse-engineer the malware and eag… ▽ More Domain Name System (DNS) is the backbone of the Internet. However, threat actors have abused the antiquated protocol to facilitate command-and-control (C2) actions, to tunnel, or to exfiltrate sensitive information in novel ways. The FireEye breach and Solarwinds intrusions of late 2020 demonstrated the sophistication of hacker groups. Researchers were eager to reverse-engineer the malware and eager to decode the encrypted traffic. Noticeably, organizations were keen on being first to "solve the puzzle". Dr. Eric Cole of SANS Institute routinely expressed "prevention is ideal, but detection is a must". Detection analytics may not always provide the underlying context in encrypted traffic, but will at least give a fighting chance for defenders to detect the anomaly. SUNBURST is an open-source moniker for the backdoor that affected Solarwinds Orion. While analyzing the malware with security vendor research, there is a possible single-point-of-failure in the C2 phase of the Cyber Kill Chain provides an avenue for defenders to exploit and detect the activity itself. One small chance is better than none. The assumption is that encryption increases entropy in strings. SUNBURST relied on encryption to exfiltrate data through DNS queries of which the adversary prepended to registered Fully-Qualified Domain Names (FQDNs). These FQDNs were typo-squatted to mimic Amazon Web Services (AWS) domains. SUNBURST detection is possible through a simple 1-variable t-test across all DNS logs for a given day. The detection code is located on GitHub (https://github.com/MalwareMorghulis/SUNBURST). △ Less

Submitted 16 April, 2023; originally announced April 2023.

arXiv:2304.01746 [pdf, other]

Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation

Authors: Tao Fang, Shu Yang, Kaixin Lan, Derek F. Wong, Jinpeng Hu, Lidia S. Chao, Yue Zhang

Abstract: ChatGPT, a large-scale language model based on the advanced GPT-3.5 architecture, has shown remarkable potential in various Natural Language Processing (NLP) tasks. However, there is currently a dearth of comprehensive study exploring its potential in the area of Grammatical Error Correction (GEC). To showcase its capabilities in GEC, we design zero-shot chain-of-thought (CoT) and few-shot CoT set… ▽ More ChatGPT, a large-scale language model based on the advanced GPT-3.5 architecture, has shown remarkable potential in various Natural Language Processing (NLP) tasks. However, there is currently a dearth of comprehensive study exploring its potential in the area of Grammatical Error Correction (GEC). To showcase its capabilities in GEC, we design zero-shot chain-of-thought (CoT) and few-shot CoT settings using in-context learning for ChatGPT. Our evaluation involves assessing ChatGPT's performance on five official test sets in three different languages, along with three document-level GEC test sets in English. Our experimental results and human evaluations demonstrate that ChatGPT has excellent error detection capabilities and can freely correct errors to make the corrected sentences very fluent, possibly due to its over-correction tendencies and not adhering to the principle of minimal edits. Additionally, its performance in non-English and low-resource settings highlights its potential in multilingual GEC tasks. However, further analysis of various types of errors at the document-level has shown that ChatGPT cannot effectively correct agreement, coreference, tense errors across sentences, and cross-sentence boundary errors. △ Less

Submitted 4 April, 2023; originally announced April 2023.

arXiv:2303.12723 [pdf, other]

AdaOPC: A Self-Adaptive Mask Optimization Framework For Real Design Patterns

Authors: Wenqian Zhao, Xufeng Yao, Ziyang Yu, Guojin Chen, Yuzhe Ma, Bei Yu, Martin D. F. Wong

Abstract: Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning are the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that differe… ▽ More Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning are the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that different sub-regions have different pattern complexity. Besides, we also find that many patterns repetitively appear in the design layout, and these patterns may possibly share optimized masks. We exploit these properties and propose a self-adaptive OPC framework to improve efficiency. Firstly we choose different OPC solvers adaptively for patterns of different complexity from an extensible solver pool to reach a speed/accuracy co-optimization. Apart from that, we prove the feasibility of reusing optimized masks for repeated patterns and hence, build a graph-based dynamic pattern library reusing stored masks to further speed up the OPC flow. Experimental results show that our framework achieves substantial improvement in both performance and efficiency. △ Less

Submitted 15 March, 2023; originally announced March 2023.

arXiv:2303.08999 [pdf, other]

doi 10.1109/ICCAD51958.2021.9643472

A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU

Authors: Wenqian Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong

Abstract: Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models was analyzed in detail and acc… ▽ More Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models was analyzed in detail and accelerated via a novel dictionary selective strategy. Besides, the hardware programming architecture together with the model structure is analyzed to guide the optimal design of computation kernels to minimize the inference latency under the resource constraints. With these novel techniques, the communication and computation bottlenecks in the deep dictionary learning-based SR models are tackled perfectly. The experiments on the edge embedded NVIDIA NX and 2080Ti show that our method outperforms the state-of-the-art NVIDIA TensorRT significantly, and can achieve real-time performance. △ Less

Submitted 15 March, 2023; originally announced March 2023.

arXiv:2303.08435 [pdf, other]

Physics-Informed Optical Kernel Regression Using Complex-valued Neural Fields

Authors: Guojin Chen, Zehua Pei, Haoyu Yang, Yuzhe Ma, Bei Yu, Martin D. F. Wong

Abstract: Lithography is fundamental to integrated circuit fabrication, necessitating large computation overhead. The advancement of machine learning (ML)-based lithography models alleviates the trade-offs between manufacturing process expense and capability. However, all previous methods regard the lithography system as an image-to-image black box mapping, utilizing network parameters to learn by rote mapp… ▽ More Lithography is fundamental to integrated circuit fabrication, necessitating large computation overhead. The advancement of machine learning (ML)-based lithography models alleviates the trade-offs between manufacturing process expense and capability. However, all previous methods regard the lithography system as an image-to-image black box mapping, utilizing network parameters to learn by rote mappings from massive mask-to-aerial or mask-to-resist image pairs, resulting in poor generalization capability. In this paper, we propose a new ML-based paradigm disassembling the rigorous lithographic model into non-parametric mask operations and learned optical kernels containing determinant source, pupil, and lithography information. By optimizing complex-valued neural fields to perform optical kernel regression from coordinates, our method can accurately restore lithography system using a small-scale training dataset with fewer parameters, demonstrating superior generalization capability as well. Experiments show that our framework can use 31% of parameters while achieving 69$\times$ smaller mean squared error with 1.3$\times$ higher throughput than the state-of-the-art. △ Less

Submitted 9 April, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted by DAC23

arXiv:2302.08975 [pdf, other]

Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors

Authors: Keqin Bao, Yu Wan, Dayiheng Liu, Baosong Yang, Wenqiang Lei, Xiangnan He, Derek F. Wong, Jun Xie

Abstract: Fine-grained information on translation errors is helpful for the translation evaluation community. Existing approaches can not synchronously consider error position and type, failing to integrate the error information of both. In this paper, we propose Fine-Grained Translation Error Detection (FG-TED) task, aiming at identifying both the position and the type of translation errors on given source… ▽ More Fine-grained information on translation errors is helpful for the translation evaluation community. Existing approaches can not synchronously consider error position and type, failing to integrate the error information of both. In this paper, we propose Fine-Grained Translation Error Detection (FG-TED) task, aiming at identifying both the position and the type of translation errors on given source-hypothesis sentence pairs. Besides, we build an FG-TED model to predict the \textbf{addition} and \textbf{omission} errors -- two typical translation accuracy errors. First, we use a word-level classification paradigm to form our model and use the shortcut learning reduction to relieve the influence of monolingual features. Besides, we construct synthetic datasets for model training, and relieve the disagreement of data labeling in authoritative datasets, making the experimental benchmark concordant. Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results on the restored dataset. Our model also delivers more reliable predictions on low-resource and transfer scenarios than existing baselines. The related datasets and the source code will be released in the future. △ Less

Submitted 17 February, 2023; originally announced February 2023.

arXiv:2301.04320 [pdf, other]

Rethinking complex-valued deep neural networks for monaural speech enhancement

Authors: Haibin Wu, Ke Tan, Buye Xu, Anurag Kumar, Daniel Wong

Abstract: Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investi… ▽ More Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs. △ Less

Submitted 11 January, 2023; originally announced January 2023.

arXiv:2212.10179 [pdf, other]

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Authors: Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong, Dacheng Tao

Abstract: The state-of-the-art language model-based automatic metrics, e.g. BARTScore, benefiting from large-scale contextualized pre-training, have been successfully used in a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text. Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors (e.g. imp… ▽ More The state-of-the-art language model-based automatic metrics, e.g. BARTScore, benefiting from large-scale contextualized pre-training, have been successfully used in a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text. Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors (e.g. imperfections in fluency) can produce high-quality human judgments. This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis. To this end, we augment BARTScore by incorporating the human-like error analysis strategies, namely BARTScore++, where the final score consists of both the evaluations of major errors and minor errors. Experimental results show that BARTScore++ can consistently improve the performance of vanilla BARTScore and outperform existing top-scoring metrics in 20 out of 25 test settings. We hope our technique can also be extended to other pre-trained model-based metrics. We will release our code and scripts to facilitate the community. △ Less

Submitted 20 December, 2022; originally announced December 2022.

Comments: work in progress

arXiv:2212.04262 [pdf, other]

ConsistTL: Modeling Consistency in Transfer Learning for Low-Resource Neural Machine Translation

Authors: Zhaocong Li, Xuebo Liu, Derek F. Wong, Lidia S. Chao, Min Zhang

Abstract: Transfer learning is a simple and powerful method that can be used to boost model performance of low-resource neural machine translation (NMT). Existing transfer learning methods for NMT are static, which simply transfer knowledge from a parent model to a child model once via parameter initialization. In this paper, we propose a novel transfer learning method for NMT, namely ConsistTL, which can c… ▽ More Transfer learning is a simple and powerful method that can be used to boost model performance of low-resource neural machine translation (NMT). Existing transfer learning methods for NMT are static, which simply transfer knowledge from a parent model to a child model once via parameter initialization. In this paper, we propose a novel transfer learning method for NMT, namely ConsistTL, which can continuously transfer knowledge from the parent model during the training of the child model. Specifically, for each training instance of the child model, ConsistTL constructs the semantically-equivalent instance for the parent model and encourages prediction consistency between the parent and child for this instance, which is equivalent to the child model learning each instance under the guidance of the parent model. Experimental results on five low-resource NMT tasks demonstrate that ConsistTL results in significant improvements over strong transfer learning baselines, with a gain up to 1.7 BLEU over the existing back-translation model on the widely-used WMT17 Turkish-English benchmark. Further analysis reveals that ConsistTL can improve the inference calibration of the child model. Code and scripts are freely available at https://github.com/NLP2CT/ConsistTL. △ Less

Submitted 8 December, 2022; originally announced December 2022.

Comments: Accepted to EMNLP 2022

arXiv:2211.08624 [pdf, ps, other]

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

Authors: Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu

Abstract: Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a tempora… ▽ More Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR). △ Less

Submitted 8 March, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: 5 pages. Accepted at ICASSP 2023

arXiv:2210.10049 [pdf, other]

Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task

Authors: Keqin Bao, Yu Wan, Dayiheng Liu, Baosong Yang, Wenqiang Lei, Xiangnan He, Derek F. Wong, Jun Xie

Abstract: In this paper, we present our submission to the sentence-level MQM benchmark at Quality Estimation Shared Task, named UniTE (Unified Translation Evaluation). Specifically, our systems employ the framework of UniTE, which combined three types of input formats during training with a pre-trained language model. First, we apply the pseudo-labeled data examples for the continuously pre-training phase.… ▽ More In this paper, we present our submission to the sentence-level MQM benchmark at Quality Estimation Shared Task, named UniTE (Unified Translation Evaluation). Specifically, our systems employ the framework of UniTE, which combined three types of input formats during training with a pre-trained language model. First, we apply the pseudo-labeled data examples for the continuously pre-training phase. Notably, to reduce the gap between pre-training and fine-tuning, we use data pruning and a ranking-based score normalization strategy. For the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Finally, we collect the source-only evaluation results, and ensemble the predictions generated by two UniTE models, whose backbones are XLM-R and InfoXLM, respectively. Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings, showing relatively strong performances in this year's quality estimation competition. △ Less

Submitted 17 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: WMT 2022 QE Shared Task. arXiv admin note: text overlap with arXiv:2210.09683

arXiv:2210.09683 [pdf, other]

Alibaba-Translate China's Submission for WMT 2022 Metrics Shared Task

Authors: Yu Wan, Keqin Bao, Dayiheng Liu, Baosong Yang, Derek F. Wong, Lidia S. Chao, Wenqiang Lei, Jun Xie

Abstract: In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply the pseudo-labeled data examples to continuously pre… ▽ More In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply the pseudo-labeled data examples to continuously pre-train UNITE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Specially, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for involved translation directions. △ Less

Submitted 17 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: WMT 2022 Metrics Shared Task

arXiv:2210.06587 [pdf]

doi 10.48550/arXiv.2210.06587

BLADERUNNER: Rapid Countermeasure for Synthetic (AI-Generated) StyleGAN Faces

Authors: Adam Dorian Wong

Abstract: StyleGAN is the open-sourced TensorFlow implementation made by NVIDIA. It has revolutionized high quality facial image generation. However, this democratization of Artificial Intelligence / Machine Learning (AI/ML) algorithms has enabled hostile threat actors to establish cyber personas or sock-puppet accounts in social media platforms. These ultra-realistic synthetic faces. This report surveys th… ▽ More StyleGAN is the open-sourced TensorFlow implementation made by NVIDIA. It has revolutionized high quality facial image generation. However, this democratization of Artificial Intelligence / Machine Learning (AI/ML) algorithms has enabled hostile threat actors to establish cyber personas or sock-puppet accounts in social media platforms. These ultra-realistic synthetic faces. This report surveys the relevance of AI/ML with respect to Cyber & Information Operations. The proliferation of AI/ML algorithms has led to a rise in DeepFakes and inauthentic social media accounts. Threats are analyzed within the Strategic and Operational Environments. Existing methods of identifying synthetic faces exists, but they rely on human beings to visually scrutinize each photo for inconsistencies. However, through use of the DLIB 68-landmark pre-trained file, it is possible to analyze and detect synthetic faces by exploiting repetitive behaviors in StyleGAN images. Project Blade Runner encompasses two scripts necessary to counter StyleGAN images. Through PapersPlease acting as the analyzer, it is possible to derive indicators-of-attack (IOA) from scraped image samples. These IOAs can be fed back into AmongUs acting as the detector to identify synthetic faces from live operational samples. The opensource copy of Blade Runner may lack additional unit tests and some functionality, but the open-source copy is a redacted version, far leaner, better optimized, and a proof-of-concept for the information security community. The desired end-state will be to incrementally add automation to stay on-par with its closed-source predecessor. △ Less

Submitted 28 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: 29 pages

arXiv:2209.01766 [pdf, ps, other]

Exploring the Verifiability of Code Generated by GitHub Copilot

Authors: Dakota Wong, Austin Kothig, Patrick Lam

Abstract: GitHub's Copilot generates code quickly. We investigate whether it generates good code. Our approach is to identify a set of problems, ask Copilot to generate solutions, and attempt to formally verify these solutions with Dafny. Our formal verification is with respect to hand-crafted specifications. We have carried out this process on 6 problems and succeeded in formally verifying 4 of the created… ▽ More GitHub's Copilot generates code quickly. We investigate whether it generates good code. Our approach is to identify a set of problems, ask Copilot to generate solutions, and attempt to formally verify these solutions with Dafny. Our formal verification is with respect to hand-crafted specifications. We have carried out this process on 6 problems and succeeded in formally verifying 4 of the created solutions. We found evidence which corroborates the current consensus in the literature: Copilot is a powerful tool; however, it should not be "flying the plane" by itself. △ Less

Submitted 27 October, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

Comments: HATRA workshop at SPLASH 2022

arXiv:2207.07572 [pdf, other]

Outlier detection of vital sign trajectories from COVID-19 patients

Authors: Sara Summerton, Ann Tivey, Rohan Shotton, Gavin Brown, Oliver C. Redfern, Rachel Oakley, John Radford, David C. Wong

Abstract: In this work, we present a novel trajectory comparison algorithm to identify abnormal vital sign trends, with the aim of improving recognition of deteriorating health. There is growing interest in continuous wearable vital sign sensors for monitoring patients remotely at home. These monitors are usually coupled to an alerting system, which is triggered when vital sign measurements fall outside a… ▽ More In this work, we present a novel trajectory comparison algorithm to identify abnormal vital sign trends, with the aim of improving recognition of deteriorating health. There is growing interest in continuous wearable vital sign sensors for monitoring patients remotely at home. These monitors are usually coupled to an alerting system, which is triggered when vital sign measurements fall outside a predefined normal range. Trends in vital signs, such as increasing heart rate, are often indicative of deteriorating health, but are rarely incorporated into alerting systems. We introduce a dynamic time warp distance-based measure to compare time series trajectories. We split each multi-variable sign time series into 180 minute, non-overlapping epochs. We then calculate the distance between all pairs of epochs. Each epoch is characterized by its mean pairwise distance (average link distance) to all other epochs, with clusters forming with nearby epochs. We demonstrate in synthetically generated data that this method can identify abnormal epochs and cluster epochs with similar trajectories. We then apply this method to a real-world data set of vital signs from 8 patients who had recently been discharged from hospital after contracting COVID-19. We show how outlier epochs correspond well with the abnormal vital signs and identify patients who were subsequently readmitted to hospital. △ Less

Submitted 20 April, 2023; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: 4 pages, 4 figures, 1 table. Accepted to EMBC 2023, to be indexed in IEEE Xplore and PubMed Medline

arXiv:2207.07195 [pdf]

doi 10.1016/j.trc.2022.103933

COOR-PLT: A hierarchical control model for coordinating adaptive platoons of connected and autonomous vehicles at signal-free intersections based on deep reinforcement learning

Authors: Duowei Li, Jianping Wu, Feng Zhu, Tianyi Chen, Yiik Diew Wong

Abstract: Platooning and coordination are two implementation strategies that are frequently proposed for traffic control of connected and autonomous vehicles (CAVs) at signal-free intersections instead of using conventional traffic signals. However, few studies have attempted to integrate both strategies to better facilitate the CAV control at signal-free intersections. To this end, this study proposes a hi… ▽ More Platooning and coordination are two implementation strategies that are frequently proposed for traffic control of connected and autonomous vehicles (CAVs) at signal-free intersections instead of using conventional traffic signals. However, few studies have attempted to integrate both strategies to better facilitate the CAV control at signal-free intersections. To this end, this study proposes a hierarchical control model, named COOR-PLT, to coordinate adaptive CAV platoons at a signal-free intersection based on deep reinforcement learning (DRL). COOR-PLT has a two-layer framework. The first layer uses a centralized control strategy to form adaptive platoons. The optimal size of each platoon is determined by considering multiple objectives (i.e., efficiency, fairness and energy saving). The second layer employs a decentralized control strategy to coordinate multiple platoons passing through the intersection. Each platoon is labeled with coordinated status or independent status, upon which its passing priority is determined. As an efficient DRL algorithm, Deep Q-network (DQN) is adopted to determine platoon sizes and passing priorities respectively in the two layers. The model is validated and examined on the simulator Simulation of Urban Mobility (SUMO). The simulation results demonstrate that the model is able to: (1) achieve satisfactory convergence performances; (2) adaptively determine platoon size in response to varying traffic conditions; and (3) completely avoid deadlocks at the intersection. By comparison with other control methods, the model manifests its superiority of adopting adaptive platooning and DRL-based coordination strategies. Also, the model outperforms several state-of-the-art methods on reducing travel time and fuel consumption in different traffic conditions. △ Less

Submitted 30 June, 2022; originally announced July 2022.

Comments: This paper has been submitted to Transportation Research Part C: Emerging Technologies and is currently under review

Journal ref: Transportation Research Part C: Emerging Technologies 146 (2023): 103933

arXiv:2207.03522 [pdf, other]

TF-GNN: Graph Neural Networks in TensorFlow

Authors: Oleksandr Ferludin, Arno Eigenwillig, Martin Blais, Dustin Zelle, Jan Pfeifer, Alvaro Sanchez-Gonzalez, Wai Lok Sibon Li, Sami Abu-El-Haija, Peter Battaglia, Neslihan Bulut, Jonathan Halcrow, Filipe Miguel Gonçalves de Almeida, Pedro Gonnet, Liangze Jiang, Parth Kothari, Silvio Lattanzi, André Linhares, Brandon Mayer, Vahab Mirrokni, John Palowitch, Mihir Paradkar, Jennifer She, Anton Tsitsulin, Kevin Villela, Lisa Wang , et al. (2 additional authors not shown)

Abstract: TensorFlow-GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occurs in today's information ecosystems. In addition to enabling machine learning researchers and advanced developers, TF-GNN offers low-code solutions to empower the broader developer community in graph learning. Many… ▽ More TensorFlow-GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occurs in today's information ecosystems. In addition to enabling machine learning researchers and advanced developers, TF-GNN offers low-code solutions to empower the broader developer community in graph learning. Many production models at Google use TF-GNN, and it has been recently released as an open source project. In this paper we describe the TF-GNN data model, its Keras message passing API, and relevant capabilities such as graph sampling and distributed training. △ Less

Submitted 23 July, 2023; v1 submitted 7 July, 2022; originally announced July 2022.

Showing 1–50 of 104 results for author: Wong, D