subscribe to arXiv mailings

Unlocking Continual Learning Abilities in Language Models

Authors: Wenyu Du, Shuang Cheng, Tongxu Luo, Zihan Qiu, Zeyu Huang, Ka Chun Cheung, Reynold Cheng, Jie Fu

Abstract: Language models (LMs) exhibit impressive performance and generalization capabilities. However, LMs struggle with the persistent challenge of catastrophic forgetting, which undermines their long-term sustainability in continual learning (CL). Existing approaches usually address the issue by incorporating old task data or task-wise inductive bias into LMs. However, old data and accurate task informa… ▽ More Language models (LMs) exhibit impressive performance and generalization capabilities. However, LMs struggle with the persistent challenge of catastrophic forgetting, which undermines their long-term sustainability in continual learning (CL). Existing approaches usually address the issue by incorporating old task data or task-wise inductive bias into LMs. However, old data and accurate task information are often unavailable or costly to collect, hindering the availability of current CL approaches for LMs. To address this limitation, we introduce $\textbf{MIGU}$ ($\textbf{M}$agn$\textbf{I}$tude-based $\textbf{G}$radient $\textbf{U}$pdating for continual learning), a rehearsal-free and task-label-free method that only updates the model parameters with large magnitudes of output in LMs' linear layers. MIGU is based on our observation that the L1-normalized magnitude distribution of the output in LMs' linear layers is different when the LM models deal with different task data. By imposing this simple constraint on the gradient update process, we can leverage the inherent behaviors of LMs, thereby unlocking their innate CL abilities. Our experiments demonstrate that MIGU is universally applicable to all three LM architectures (T5, RoBERTa, and Llama2), delivering state-of-the-art or on-par performance across continual finetuning and continual pre-training settings on four CL benchmarks. For example, MIGU brings a 15.2% average accuracy improvement over conventional parameter-efficient finetuning baselines in a 15-task CL benchmark. MIGU can also seamlessly integrate with all three existing CL types to further enhance performance. Code is available at \href{https://github.com/wenyudu/MIGU}{this https URL}. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: preprint, 19 pages

arXiv:2406.10802 [pdf, other]

KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs

Authors: Aihua Pei, Zehua Yang, Shunan Zhu, Ruoxi Cheng, Ju Jia, Lina Wang

Abstract: Existing frameworks for assessing robustness of large language models (LLMs) overly depend on specific benchmarks, increasing costs and failing to evaluate performance of LLMs in professional domains due to dataset limitations. This paper proposes a framework that systematically evaluates the robustness of LLMs under adversarial attack scenarios by leveraging knowledge graphs (KGs). Our framework… ▽ More Existing frameworks for assessing robustness of large language models (LLMs) overly depend on specific benchmarks, increasing costs and failing to evaluate performance of LLMs in professional domains due to dataset limitations. This paper proposes a framework that systematically evaluates the robustness of LLMs under adversarial attack scenarios by leveraging knowledge graphs (KGs). Our framework generates original prompts from the triplets of knowledge graphs and creates adversarial prompts by poisoning, assessing the robustness of LLMs through the results of these adversarial attacks. We systematically evaluate the effectiveness of this framework and its modules. Experiments show that adversarial robustness of the ChatGPT family ranks as GPT-4-turbo > GPT-4o > GPT-3.5-turbo, and the robustness of large language models is influenced by the professional domains in which they operate. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.07365 [pdf, other]

BvSP: Broad-view Soft Prompting for Few-Shot Aspect Sentiment Quad Prediction

Authors: Yinhao Bai, Yalan Xie, Xiaoyi Liu, Yuhua Zhao, Zhixin Han, Mengting Hu, Hang Gao, Renhong Cheng

Abstract: Aspect sentiment quad prediction (ASQP) aims to predict four aspect-based elements, including aspect term, opinion term, aspect category, and sentiment polarity. In practice, unseen aspects, due to distinct data distribution, impose many challenges for a trained neural model. Motivated by this, this work formulates ASQP into the few-shot scenario, which aims for fast adaptation in real application… ▽ More Aspect sentiment quad prediction (ASQP) aims to predict four aspect-based elements, including aspect term, opinion term, aspect category, and sentiment polarity. In practice, unseen aspects, due to distinct data distribution, impose many challenges for a trained neural model. Motivated by this, this work formulates ASQP into the few-shot scenario, which aims for fast adaptation in real applications. Therefore, we first construct a few-shot ASQP dataset (FSQP) that contains richer categories and is more balanced for the few-shot study. Moreover, recent methods extract quads through a generation paradigm, which involves converting the input sentence into a templated target sequence. However, they primarily focus on the utilization of a single template or the consideration of different template orders, thereby overlooking the correlations among various templates. To tackle this issue, we further propose a Broadview Soft Prompting (BvSP) method that aggregates multiple templates with a broader view by taking into account the correlation between the different templates. Specifically, BvSP uses the pre-trained language model to select the most relevant k templates with Jensen-Shannon divergence. BvSP further introduces soft prompts to guide the pre-trained language model using the selected templates. Then, we aggregate the results of multi-templates by voting mechanism. Empirical results demonstrate that BvSP significantly outperforms the stateof-the-art methods under four few-shot settings and other public datasets. Our code and dataset are available at https://github.com/byinhao/BvSP. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 Main Conference

arXiv:2406.06626 [pdf, other]

Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications

Authors: Zhou Zhou, Guohang He, Zheng Zhang, Luziwei Leng, Qinghai Guo, Jianxing Liao, Xuan Song, Ran Cheng

Abstract: Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as the wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks… ▽ More Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as the wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that boasts robust performance and swift inference capabilities suitable for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four prospective models, Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba comply with the scaling law, demonstrating improved performance with larger data sets and increased model sizes, whereas GRU shows less pronounced scalability, and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results are pivotal in pinpointing an optimal backbone that can handle increasing data volumes and is viable for edge implementation. This analysis provides essential insights for ongoing research and practical applications in the field. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.15319 [pdf, other]

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu

Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehen… ▽ More LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at $\href{https://llm-stacking.github.io/}{https://llm-stacking.github.io/}$. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: Preprint; The project link: $\href{https://llm-stacking.github.io/}{https://llm-stacking.github.io/}$

arXiv:2405.15307 [pdf, other]

Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation

Authors: Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, Reynold Cheng

Abstract: Large Language Models (LLMs) driven by In-Context Learning (ICL) have significantly improved the performance of text-to-SQL. Previous methods generally employ a two-stage reasoning framework, namely 1) schema linking and 2) logical synthesis, making the framework not only effective but also interpretable. Despite these advancements, the inherent bad nature of the generalization of LLMs often resul… ▽ More Large Language Models (LLMs) driven by In-Context Learning (ICL) have significantly improved the performance of text-to-SQL. Previous methods generally employ a two-stage reasoning framework, namely 1) schema linking and 2) logical synthesis, making the framework not only effective but also interpretable. Despite these advancements, the inherent bad nature of the generalization of LLMs often results in hallucinations, which limits the full potential of LLMs. In this work, we first identify and categorize the common types of hallucinations at each stage in text-to-SQL. We then introduce a novel strategy, Task Alignment (TA), designed to mitigate hallucinations at each stage. TA encourages LLMs to take advantage of experiences from similar tasks rather than starting the tasks from scratch. This can help LLMs reduce the burden of generalization, thereby mitigating hallucinations effectively. We further propose TA-SQL, a text-to-SQL framework based on this strategy. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Specifically, it enhances the performance of the GPT-4 baseline by 21.23% relatively on BIRD dev and it yields significant improvements across six models and four mainstream, complex text-to-SQL benchmarks. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: Accepted to ACL Findings 2024

arXiv:2405.14517 [pdf, other]

Identity Inference from CLIP Models using Only Textual Data

Authors: Songze Li, Ruoxi Cheng, Xiaojun Jia

Abstract: The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of personally identifiable information (PII). Existing methods for identity inference in CLIP models, i.e., to detect the presence of a person's PII used for training a CLIP model, require querying the model with full PII, including textual descriptions of the person and corresponding images (… ▽ More The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of personally identifiable information (PII). Existing methods for identity inference in CLIP models, i.e., to detect the presence of a person's PII used for training a CLIP model, require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the name and the face photo of the person). However, this may lead to potential privacy breach of the image, as it may have not been seen by the target model yet. Additionally, traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose a textual unimodal detector (TUNI) in CLIP models, a novel method for ID inference that 1) queries the target model with only text data; and 2) does not require training shadow models. Firstly, we develop a feature extraction algorithm, guided by the CLIP model, to extract features from a text description. TUNI starts with randomly generating textual gibberish that were clearly not utilized for training, and leverages their feature vectors to train a system of anomaly detectors. During inference, the feature vector of each test text is fed into the anomaly detectors to determine if the person's PII is in the training set (abnormal) or not (normal). Moreover, TUNI can be further strengthened integrating real images associated with the tested individuals, if available at the detector. Extensive experiments of TUNI across various CLIP model architectures and datasets demonstrate its superior performance over baselines, albeit with only text data. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.12183 [pdf, other]

Multi-order Graph Clustering with Adaptive Node-level Weight Learning

Authors: Ye Liu, Xuelei Lin, Yejia Chen, Reynold Cheng

Abstract: Current graph clustering methods emphasize individual node and edge con nections, while ignoring higher-order organization at the level of motif. Re cently, higher-order graph clustering approaches have been designed by motif based hypergraphs. However, these approaches often suffer from hypergraph fragmentation issue seriously, which degrades the clustering performance greatly. Moreover, real-wor… ▽ More Current graph clustering methods emphasize individual node and edge con nections, while ignoring higher-order organization at the level of motif. Re cently, higher-order graph clustering approaches have been designed by motif based hypergraphs. However, these approaches often suffer from hypergraph fragmentation issue seriously, which degrades the clustering performance greatly. Moreover, real-world graphs usually contain diverse motifs, with nodes participating in multiple motifs. A key challenge is how to achieve precise clustering results by integrating information from multiple motifs at the node level. In this paper, we propose a multi-order graph clustering model (MOGC) to integrate multiple higher-order structures and edge connections at node level. MOGC employs an adaptive weight learning mechanism to au tomatically adjust the contributions of different motifs for each node. This not only tackles hypergraph fragmentation issue but enhances clustering accuracy. MOGC is efficiently solved by an alternating minimization algo rithm. Experiments on seven real-world datasets illustrate the effectiveness of MOGC. △ Less

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.10422 [pdf, other]

A First Look at Immersive Telepresence on Apple Vision Pro

Authors: Ruizhi Cheng, Nan Wu, Matteo Varvello, Eugene Chai, Songqing Chen, Bo Han

Abstract: Due to the widespread adoption of "work-from-home" policies, videoconferencing applications (e.g., Zoom) have become indispensable for remote communication. However, these systems lack immersiveness, leading to the so-called "Zoom fatigue" and degrading communication efficiency. The recent debut of Apple Vision Pro, a mixed reality headset that supports "spatial persona", aims to offer an immersiv… ▽ More Due to the widespread adoption of "work-from-home" policies, videoconferencing applications (e.g., Zoom) have become indispensable for remote communication. However, these systems lack immersiveness, leading to the so-called "Zoom fatigue" and degrading communication efficiency. The recent debut of Apple Vision Pro, a mixed reality headset that supports "spatial persona", aims to offer an immersive telepresence experience with these applications. In this paper, we conduct a first-of-its-kind in-depth and empirical study to analyze the performance of immersive telepresence with four applications, Apple FaceTime, Cisco Webex, Microsoft Teams, and Zoom, on Vision Pro. We find that only FaceTime provides a truly immersive experience with spatial personas, whereas other applications still operate 2D personas. Our measurement results reveal that (1) FaceTime delivers semantic information to optimize bandwidth consumption, which is even lower than that of 2D persona for other applications, and (2) it employs visibility-aware optimizations to reduce rendering overhead. However, the scalability of FaceTime remains limited, with a simple server allocation strategy that potentially leads to high network delay among users. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.03267 [pdf, other]

Characterizing the Dilemma of Performance and Index Size in Billion-Scale Vector Search and Breaking It with Second-Tier Memory

Authors: Rongxin Cheng, Yifan Peng, Xingda Wei, Hongrui Xie, Rong Chen, Sijie Shen, Haibo Chen

Abstract: Vector searches on large-scale datasets are critical to modern online services like web search and RAG, which necessity storing the datasets and their index on the secondary storage like SSD. In this paper, we are the first to characterize the trade-off of performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$\times$ and 1.7$\times$, these indexes… ▽ More Vector searches on large-scale datasets are critical to modern online services like web search and RAG, which necessity storing the datasets and their index on the secondary storage like SSD. In this paper, we are the first to characterize the trade-off of performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$\times$ and 1.7$\times$, these indexes have to pay a 5.8$\times$ storage amplification and 7.7$\times$ with respect to the dataset size, respectively. The root cause is that the coarse-grained access of SSD mismatches the fine-grained random read required by vector indexes with small amplification. This paper argues that second-tier memory, such as remote DRAM/NVM connected via RDMA or CXL, is a powerful storage for addressing the problem from a system's perspective, thanks to its fine-grained access granularity. However, putting existing indexes -- primarily designed for SSD -- directly on second-tier memory cannot fully utilize its power. Meanwhile, second-tier memory still behaves more like storage, so using it as DRAM is also inefficient. To this end, we build a graph and cluster index that centers around the performance features of second-tier memory. With careful execution engine and index layout designs, we show that vector indexes can achieve optimal performance with orders of magnitude smaller index amplification, on a variety of second-tier memory devices. Based on our improved graph and vector indexes on second-tier memory, we further conduct a systematic study between them to facilitate developers choosing the right index for their workloads. Interestingly, the findings on the second-tier memory contradict the ones on SSDs. △ Less

Submitted 7 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

arXiv:2404.16266 [pdf, other]

doi 10.1145/3638530.3654389

A Multi-objective Optimization Benchmark Test Suite for Real-time Semantic Segmentation

Authors: Yifan Zhao, Zhenyu Liang, Zhichao Lu, Ran Cheng

Abstract: As one of the emerging challenges in Automated Machine Learning, the Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. The HW-NAS for real-time semantic segmentation inherently needs to ba… ▽ More As one of the emerging challenges in Automated Machine Learning, the Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. The HW-NAS for real-time semantic segmentation inherently needs to balance multiple optimization objectives, including model accuracy, inference speed, and hardware-specific considerations. Despite its importance, benchmarks have yet to be developed to frame such a challenging task as multi-objective optimization. To bridge the gap, we introduce a tailored streamline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. Building upon the streamline, we present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset. The CitySeg/MOP test suite is integrated into the EvoXBench platform to provide seamless interfaces with various programming languages (e.g., Python and MATLAB) for instant fitness evaluations. We comprehensively assessed the CitySeg/MOP test suite on various multi-objective evolutionary algorithms, showcasing its versatility and practicality. Source codes are available at https://github.com/EMI-Group/evoxbench. △ Less

Submitted 28 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

Comments: GECCO 2024

arXiv:2404.15622 [pdf, other]

FR-NAS: Forward-and-Reverse Graph Predictor for Efficient Neural Architecture Search

Authors: Haoming Zhang, Ran Cheng

Abstract: Neural Architecture Search (NAS) has emerged as a key tool in identifying optimal configurations of deep neural networks tailored to specific tasks. However, training and assessing numerous architectures introduces considerable computational overhead. One method to mitigating this is through performance predictors, which offer a means to estimate the potential of an architecture without exhaustive… ▽ More Neural Architecture Search (NAS) has emerged as a key tool in identifying optimal configurations of deep neural networks tailored to specific tasks. However, training and assessing numerous architectures introduces considerable computational overhead. One method to mitigating this is through performance predictors, which offer a means to estimate the potential of an architecture without exhaustive training. Given that neural architectures fundamentally resemble Directed Acyclic Graphs (DAGs), Graph Neural Networks (GNNs) become an apparent choice for such predictive tasks. Nevertheless, the scarcity of training data can impact the precision of GNN-based predictors. To address this, we introduce a novel GNN predictor for NAS. This predictor renders neural architectures into vector representations by combining both the conventional and inverse graph views. Additionally, we incorporate a customized training loss within the GNN predictor to ensure efficient utilization of both types of representations. We subsequently assessed our method through experiments on benchmark datasets including NAS-Bench-101, NAS-Bench-201, and the DARTS search space, with a training dataset ranging from 50 to 400 samples. Benchmarked against leading GNN predictors, the experimental results showcase a significant improvement in prediction accuracy, with a 3%--16% increase in Kendall-tau correlation. Source codes are available at https://github.com/EMI-Group/fr-nas. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: IJCNN'24

arXiv:2404.10160 [pdf, other]

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

Authors: Ruoxi Cheng, Haoxuan Ma, Shuirong Cao, Jiaqi Li, Aihua Pei, Zhiqiang Wang, Pengliang Ji, Haoyu Wang, Jiaqi Huo

Abstract: Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics or yield overconfident and random outputs. We find that involving LLMs in role-playing scenario boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debat… ▽ More Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics or yield overconfident and random outputs. We find that involving LLMs in role-playing scenario boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM like GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs demonstrate the effectiveness of our approach in bias mitigation. △ Less

Submitted 18 June, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: The first three authors contributed equally to this work

arXiv:2404.08233 [pdf, other]

Generalized Population-Based Training for Hyperparameter Optimization in Reinforcement Learning

Authors: Hui Bai, Ran Cheng

Abstract: Hyperparameter optimization plays a key role in the machine learning domain. Its significance is especially pronounced in reinforcement learning (RL), where agents continuously interact with and adapt to their environments, requiring dynamic adjustments in their learning trajectories. To cater to this dynamicity, the Population-Based Training (PBT) was introduced, leveraging the collective intelli… ▽ More Hyperparameter optimization plays a key role in the machine learning domain. Its significance is especially pronounced in reinforcement learning (RL), where agents continuously interact with and adapt to their environments, requiring dynamic adjustments in their learning trajectories. To cater to this dynamicity, the Population-Based Training (PBT) was introduced, leveraging the collective intelligence of a population of agents learning simultaneously. However, PBT tends to favor high-performing agents, potentially neglecting the explorative potential of agents on the brink of significant advancements. To mitigate the limitations of PBT, we present the Generalized Population-Based Training (GPBT), a refined framework designed for enhanced granularity and flexibility in hyperparameter adaptation. Complementing GPBT, we further introduce Pairwise Learning (PL). Instead of merely focusing on elite agents, PL employs a comprehensive pairwise strategy to identify performance differentials and provide holistic guidance to underperforming agents. By integrating the capabilities of GPBT and PL, our approach significantly improves upon traditional PBT in terms of adaptability and computational efficiency. Rigorous empirical evaluations across a range of RL benchmarks confirm that our approach consistently outperforms not only the conventional PBT but also its Bayesian-optimized variant. △ Less

Submitted 22 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: IEEE Transactions on Emerging Topics in Computational Intelligence

arXiv:2404.07387 [pdf, other]

BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks

Authors: Ruijia Cheng, Titus Barik, Alan Leung, Fred Hohman, Jeffrey Nichols

Abstract: Novices frequently engage with machine learning tutorials in computational notebooks and have been adopting code generation technologies based on large language models (LLMs). However, they encounter difficulties in understanding and working with code produced by LLMs. To mitigate these challenges, we introduce a novel workflow into computational notebooks that augments LLM-based code generation w… ▽ More Novices frequently engage with machine learning tutorials in computational notebooks and have been adopting code generation technologies based on large language models (LLMs). However, they encounter difficulties in understanding and working with code produced by LLMs. To mitigate these challenges, we introduce a novel workflow into computational notebooks that augments LLM-based code generation with an additional ephemeral UI step, offering users UI-based scaffolds as an intermediate stage between user prompts and code generation. We present this workflow in BISCUIT, an extension for JupyterLab that provides users with ephemeral UIs generated by LLMs based on the context of their code and intentions, scaffolding users to understand, guide, and explore with LLM-generated code. Through a user study where 10 novices used BISCUIT for machine learning tutorials, we discover that BISCUIT offers user semantic representation of code to aid their understanding, reduces the complexity of prompt engineering, and creates a playground for users to explore different variables and iterate on their ideas. We discuss the implications of our findings for UI-centric interactive paradigm in code generation LLMs. △ Less

Submitted 12 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.06290 [pdf, other]

Exploring the True Potential: Evaluating the Black-box Optimization Capability of Large Language Models

Authors: Beichen Huang, Xingyu Wu, Yu Zhou, Jibin Wu, Liang Feng, Ran Cheng, Kay Chen Tan

Abstract: Large language models (LLMs) have gained widespread popularity and demonstrated exceptional performance not only in natural language processing (NLP) tasks but also in non-linguistic domains. Their potential as artificial general intelligence extends beyond NLP, showcasing promising capabilities in diverse optimization scenarios. Despite this rising trend, whether the integration of LLMs into thes… ▽ More Large language models (LLMs) have gained widespread popularity and demonstrated exceptional performance not only in natural language processing (NLP) tasks but also in non-linguistic domains. Their potential as artificial general intelligence extends beyond NLP, showcasing promising capabilities in diverse optimization scenarios. Despite this rising trend, whether the integration of LLMs into these black-box optimization problems is genuinely beneficial remains unexplored. This paper endeavors to tackle this issue by offering deeper insights into the potential of LLMs in optimization tasks through a comprehensive investigation. Our approach involves a comprehensive evaluation, covering both discrete and continuous optimization problems, aiming to assess the efficacy and distinctive characteristics that LLMs bring to the realm of optimization. Our findings reveal both the limitations and advantages of LLMs in optimization. On one hand, despite consuming the significant power required to run the model, LLMs exhibit subpar performance and lack desirable properties in pure numerical tasks, primarily due to a mismatch between the problem domain and their processing capabilities. On the other hand, although LLMs may not be ideal for traditional numerical optimization, their potential in broader optimization contexts remains promising. LLMs exhibit the ability to solve problems in non-numerical domains and can leverage heuristics from the prompt to enhance their performance. To the best of our knowledge, this work presents the first systematic evaluation of LLMs for numerical optimization, offering a progressive, wide-coverage, and behavioral analysis. Our findings pave the way for a deeper understanding of LLMs' role in optimization and guide future application in diverse scenarios for LLMs. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.04895 [pdf, other]

Tensorized Ant Colony Optimization for GPU Acceleration

Authors: Luming Yang, Tao Jiang, Ran Cheng

Abstract: Ant Colony Optimization (ACO) is renowned for its effectiveness in solving Traveling Salesman Problems, yet it faces computational challenges in CPU-based environments, particularly with large-scale instances. In response, we introduce a Tensorized Ant Colony Optimization (TensorACO) to utilize the advancements of GPU acceleration. As the core, TensorACO fully transforms ant system and ant path in… ▽ More Ant Colony Optimization (ACO) is renowned for its effectiveness in solving Traveling Salesman Problems, yet it faces computational challenges in CPU-based environments, particularly with large-scale instances. In response, we introduce a Tensorized Ant Colony Optimization (TensorACO) to utilize the advancements of GPU acceleration. As the core, TensorACO fully transforms ant system and ant path into tensor forms, a process we refer to as tensorization. For the tensorization of ant system, we propose a preprocessing method to reduce the computational overhead by calculating the probability transition matrix. In the tensorization of ant path, we propose an index mapping method to accelerate the update of pheromone matrix by replacing the mechanism of sequential path update with parallel matrix operations. Additionally, we introduce an Adaptive Independent Roulette (AdaIR) method to overcome the challenges of parallelizing ACO's selection mechanism on GPUs. Comprehensive experiments demonstrate the superior performance of TensorACO achieving up to 1921$\times$ speedup over standard ACO. Moreover, the AdaIR method further improves TensorACO's convergence speed by 80% and solution quality by 2%. Source codes are available at https://github.com/EMI-Group/tensoraco. △ Less

Submitted 12 April, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

Comments: Genetic and Evolutionary Computation Conference (GECCO '24)

arXiv:2404.01817 [pdf, other]

Tensorized NeuroEvolution of Augmenting Topologies for GPU Acceleration

Authors: Lishuang Wang, Mengfei Zhao, Enyu Liu, Kebin Sun, Ran Cheng

Abstract: The NeuroEvolution of Augmenting Topologies (NEAT) algorithm has received considerable recognition in the field of neuroevolution. Its effectiveness is derived from initiating with simple networks and incrementally evolving both their topologies and weights. Although its capability across various challenges is evident, the algorithm's computational efficiency remains an impediment, limiting its sc… ▽ More The NeuroEvolution of Augmenting Topologies (NEAT) algorithm has received considerable recognition in the field of neuroevolution. Its effectiveness is derived from initiating with simple networks and incrementally evolving both their topologies and weights. Although its capability across various challenges is evident, the algorithm's computational efficiency remains an impediment, limiting its scalability potential. In response, this paper introduces a tensorization method for the NEAT algorithm, enabling the transformation of its diverse network topologies and associated operations into uniformly shaped tensors for computation. This advancement facilitates the execution of the NEAT algorithm in a parallelized manner across the entire population. Furthermore, we develop TensorNEAT, a library that implements the tensorized NEAT algorithm and its variants, such as CPPN and HyperNEAT. Building upon JAX, TensorNEAT promotes efficient parallel computations via automated function vectorization and hardware acceleration. Moreover, the TensorNEAT library supports various benchmark environments including Gym, Brax, and gymnax. Through evaluations across a spectrum of robotics control environments in Brax, TensorNEAT achieves up to 500x speedups compared to the existing implementations such as NEAT-Python. Source codes are available at: https://github.com/EMI-Group/tensorneat. △ Less

Submitted 11 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: Genetic and Evolutionary Computation Conference (GECCO '24)

arXiv:2404.01159 [pdf, other]

doi 10.1145/3638529.3654223

GPU-accelerated Evolutionary Multiobjective Optimization Using Tensorized RVEA

Authors: Zhenyu Liang, Tao Jiang, Kebin Sun, Ran Cheng

Abstract: Evolutionary multiobjective optimization has witnessed remarkable progress during the past decades. However, existing algorithms often encounter computational challenges in large-scale scenarios, primarily attributed to the absence of hardware acceleration. In response, we introduce a Tensorized Reference Vector Guided Evolutionary Algorithm (TensorRVEA) for harnessing the advancements of GPU acce… ▽ More Evolutionary multiobjective optimization has witnessed remarkable progress during the past decades. However, existing algorithms often encounter computational challenges in large-scale scenarios, primarily attributed to the absence of hardware acceleration. In response, we introduce a Tensorized Reference Vector Guided Evolutionary Algorithm (TensorRVEA) for harnessing the advancements of GPU acceleration. In TensorRVEA, the key data structures and operators are fully transformed into tensor forms for leveraging GPU-based parallel computing. In numerical benchmark tests involving large-scale populations and problem dimensions, TensorRVEA consistently demonstrates high computational performance, achieving up to over 1000$\times$ speedups. Then, we applied TensorRVEA to the domain of multiobjective neuroevolution for addressing complex challenges in robotic control tasks. Furthermore, we assessed TensorRVEA's extensibility by altering several tensorized reproduction operators. Experimental results demonstrate promising scalability and robustness of TensorRVEA. Source codes are available at \url{https://github.com/EMI-Group/tensorrvea}. △ Less

Submitted 11 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: Genetic and Evolutionary Computation Conference (GECCO '24)

arXiv:2403.13286 [pdf, other]

A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs

Authors: Yun Wang, Chrysanthi Kosyfaki, Sihem Amer-Yahia, Reynold Cheng

Abstract: Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework,… ▽ More Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m- dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose PHASEopt. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.11073 [pdf]

Tokensome: Towards a Genetic Vision-Language GPT for Explainable and Cognitive Karyotyping

Authors: Haoxi Zhang, Xinxu Zhang, Yuanxin Lin, Maiqi Wang, Yi Lai, Yu Wang, Linfeng Yu, Yufeng Xu, Ran Cheng, Edward Szczerbicki

Abstract: Automatic karyotype analysis is often defined as a visual perception task focused solely on chromosomal object-level modeling. This definition has led most existing methods to overlook componential and holistic information, significantly constraining model performance. Moreover, the lack of interpretability in current technologies hinders clinical adoption. In this paper, we introduce Tokensome, a… ▽ More Automatic karyotype analysis is often defined as a visual perception task focused solely on chromosomal object-level modeling. This definition has led most existing methods to overlook componential and holistic information, significantly constraining model performance. Moreover, the lack of interpretability in current technologies hinders clinical adoption. In this paper, we introduce Tokensome, a novel vision-language model based on chromosome tokenization for explainable and cognitive karyotyping. Tokensome elevates the method from the conventional visual perception layer to the cognitive decision-making layer. This elevation enables the integration of domain knowledge and cognitive reasoning via knowledge graphs and LLMs, markedly enhancing model's explainability and facilitating abnormality detection. △ Less

Submitted 16 March, 2024; originally announced March 2024.

Comments: Preprint. Work in progress

arXiv:2403.08600 [pdf]

Evaluation of Control/User-Plane Denial-of-Service (DoS) Attack on O-RAN Fronthaul Interface

Authors: Ferlinda Feliana, Ting-Wei Hung, Binbin Chen, Ray-Guang Cheng

Abstract: The open fronthaul interface defined by O-RAN ALLIANCE aims to support the interoperability between multi-vendor open radio access network (O-RAN) radio units (O-RU) and O-RAN distributed units (O-DU). This paper introduces a new tool that could be used to evaluate Denial-of-Service (DoS) attacks against the open fronthaul interface. We launched an array of control/user planes (C/U-Planes) attacks… ▽ More The open fronthaul interface defined by O-RAN ALLIANCE aims to support the interoperability between multi-vendor open radio access network (O-RAN) radio units (O-RU) and O-RAN distributed units (O-DU). This paper introduces a new tool that could be used to evaluate Denial-of-Service (DoS) attacks against the open fronthaul interface. We launched an array of control/user planes (C/U-Planes) attacks with the tool under different traffic types and data rates, and we evaluated their impacts on the throughput and block error rate (BLER) of real-world O-RAN systems with commercial hardware. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE INFOCOM Workshop: Next-generation Open and Programmable Radio Access Networks (NG-OPERA)

arXiv:2403.05680 [pdf, other]

How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses

Authors: Qingqing Zhu, Benjamin Hou, Tejas S. Mathai, Pritam Mukherjee, Qiao Jin, Xiuying Chen, Zhizheng Wang, Ruida Cheng, Ronald M. Summers, Zhiyong Lu

Abstract: Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named ``GPTRadScore''. This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini… ▽ More Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named ``GPTRadScore''. This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its potential over traditional metrics, such as BLEU, METEOR, and ROUGE. Furthermore, to contribute to future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fare better, their performance revealed significant areas for improvement, primarily due to limitations in the dataset used for training these models. To demonstrate this potential, RadFM was fine-tuned and it resulted in significant accuracy improvements: location accuracy rose from 3.41\% to 12.8\%, body part accuracy from 29.12\% to 53\%, and type accuracy from 9.24\% to 30\%, thereby validating our hypothesis. △ Less

Submitted 18 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.05307 [pdf, other]

Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents

Authors: Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Yurong Wu, Chenhao Ma, Jian-Guang Lou, Reynold Cheng

Abstract: Interactive Data Analysis, the collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of Large Language Model (LLM) agents in this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark to evaluate LLM ag… ▽ More Interactive Data Analysis, the collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of Large Language Model (LLM) agents in this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark to evaluate LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed by an economical multi-agent environment, Decision Company, with few human efforts. We evaluate popular and advanced LLM agents in Tapilot-Crossing, which underscores the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that Air can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.5%. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: 30 pages, 7 figures

arXiv:2403.04796 [pdf, other]

Blockchain-Enhanced UAV Networks for Post-Disaster Communication: A Decentralized Flocking Approach

Authors: Sana Hafeez, Runze Cheng, Lina Mohjazi, Yao Sun, Muhammad Ali Imran

Abstract: Unmanned Aerial Vehicles (UAVs) have significant potential for agile communication and relief coordination in post-disaster scenarios, particularly when ground infrastructure is compromised. However, efficiently coordinating and securing flocks of heterogeneous UAVs from different service providers poses significant challenges related to privacy, scalability, lightweight consensus protocols, and c… ▽ More Unmanned Aerial Vehicles (UAVs) have significant potential for agile communication and relief coordination in post-disaster scenarios, particularly when ground infrastructure is compromised. However, efficiently coordinating and securing flocks of heterogeneous UAVs from different service providers poses significant challenges related to privacy, scalability, lightweight consensus protocols, and comprehensive cybersecurity mechanisms. This study introduces a robust blockchain-enabled framework designed to tackle these technical challenges through a combination of consensus protocols, smart contracts, and cryptographic techniques. First, we propose a consortium blockchain architecture that ensures secure and private multi-agency coordination by controlling access and safeguarding the privacy of sensitive data. Second, we develop an optimized hybrid consensus protocol that merges Delegated Proof of Stake and Practical Byzantine Fault Tolerance (DPOS-PBFT), aiming to achieve an effective balance between efficiency, security, and resilience against node failures. Finally, we introduce decentralized flocking algorithms that facilitate adaptable and autonomous operations among specialized UAV clusters, ensuring critical disaster relief functions under conditions of uncertain connectivity. Comprehensive simulations demonstrate the system achieved linear scaling of throughput up to 500 UAV nodes, with only a 50ms increase in latency from 10 to 500 nodes. The framework maintained high throughput and low latency despite spoofing, denial-of-service (DoS), and tampering attacks, showing strong cyber resilience. Communication latencies were kept under 10ms for diverse UAV operations through self-optimizing network intelligence, with median values around 2-3ms. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: 11 pages, 9 figures, Digital Communications and Networks Open access

arXiv:2402.17237 [pdf, other]

Image-Text Matching with Multi-View Attention

Authors: Rui Cheng, Wanqing Cui

Abstract: Existing two-stream models for image-text matching show good performance while ensuring retrieval speed and have received extensive attention from industry and academia. These methods use a single representation to encode image and text separately and get a matching score with cosine similarity or the inner product of vectors. However, the performance of the two-stream model is often sub-optimal.… ▽ More Existing two-stream models for image-text matching show good performance while ensuring retrieval speed and have received extensive attention from industry and academia. These methods use a single representation to encode image and text separately and get a matching score with cosine similarity or the inner product of vectors. However, the performance of the two-stream model is often sub-optimal. On the one hand, a single representation is challenging to cover complex content comprehensively. On the other hand, in this framework of lack of interaction, it is challenging to match multiple meanings which leads to information being ignored. To address the problems mentioned above and facilitate the performance of the two-stream model, we propose a multi-view attention approach for two-stream image-text matching MVAM (\textbf{M}ulti-\textbf{V}iew \textbf{A}ttention \textbf{M}odel). It first learns multiple image and text representations by diverse attention heads with different view codes. And then concatenate these representations into one for matching. A diversity objective is also used to promote diversity between attention heads. With this method, models are able to encode images and text from different views and attend to more key points. So we can get representations that contain more information. When doing retrieval tasks, the matching scores between images and texts can be calculated from different aspects, leading to better matching performance. Experiment results on MSCOCO and Flickr30K show that our proposed model brings improvements over existing models. Further case studies show that different attention heads can focus on different contents and finally obtain a more comprehensive representation. △ Less

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.15331 [pdf, other]

A Blockchain-Enabled Framework of UAV Coordination for Post-Disaster Networks

Authors: Sana Hafeez, Runze Cheng, Lina Mohjazi, Muhammad Ali Imran, Yao Sun

Abstract: Emergency communication is critical but challenging after natural disasters when ground infrastructure is devastated. Unmanned aerial vehicles (UAVs) offer enormous potential for agile relief coordination in these scenarios. However, effectively leveraging UAV fleets poses additional challenges around security, privacy, and efficient collaboration across response agencies. This paper presents a ro… ▽ More Emergency communication is critical but challenging after natural disasters when ground infrastructure is devastated. Unmanned aerial vehicles (UAVs) offer enormous potential for agile relief coordination in these scenarios. However, effectively leveraging UAV fleets poses additional challenges around security, privacy, and efficient collaboration across response agencies. This paper presents a robust blockchain-enabled framework to address these challenges by integrating a consortium blockchain model, smart contracts, and cryptographic techniques to securely coordinate UAV fleets for disaster response. Specifically, we make two key contributions: a consortium blockchain architecture for secure and private multi-agency coordination; and an optimized consensus protocol balancing efficiency and fault tolerance using a delegated proof of stake practical byzantine fault tolerance (DPoS-PBFT). Comprehensive simulations showcase the framework's ability to enhance transparency, automation, scalability, and cyber-attack resilience for UAV coordination in post-disaster networks. △ Less

Submitted 23 February, 2024; originally announced February 2024.

Comments: 6 pages, 4 figures,IEEE 99th Vehicular Technology Conference: VTC2024-Spring, Singapore

arXiv:2402.13116 [pdf, other]

A Survey on Knowledge Distillation of Large Language Models

Authors: Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou

Abstract: In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models, and facilitating their self-improvement by employi… ▽ More In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models, and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLM, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: \textit{algorithm}, \textit{skill}, and \textit{verticalization} -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in KD and proposing future research directions. Importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. △ Less

Submitted 8 March, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 44 pages

arXiv:2402.07425 [pdf, other]

doi 10.1145/3589334.3645421

Debiasing Recommendation with Personal Popularity

Authors: Wentao Ning, Reynold Cheng, Xiao Yan, Ben Kao, Nan Huo, Nur AI Hasan Haldar, Bo Tang

Abstract: Global popularity (GP) bias is the phenomenon that popular items are recommended much more frequently than they should be, which goes against the goal of providing personalized recommendations and harms user experience and recommendation accuracy. Many methods have been proposed to reduce GP bias but they fail to notice the fundamental problem of GP, i.e., it considers popularity from a \textit{gl… ▽ More Global popularity (GP) bias is the phenomenon that popular items are recommended much more frequently than they should be, which goes against the goal of providing personalized recommendations and harms user experience and recommendation accuracy. Many methods have been proposed to reduce GP bias but they fail to notice the fundamental problem of GP, i.e., it considers popularity from a \textit{global} perspective of \textit{all users} and uses a single set of popular items, and thus cannot capture the interests of individual users. As such, we propose a user-aware version of item popularity named \textit{personal popularity} (PP), which identifies different popular items for each user by considering the users that share similar interests. As PP models the preferences of individual users, it naturally helps to produce personalized recommendations and mitigate GP bias. To integrate PP into recommendation, we design a general \textit{personal popularity aware counterfactual} (PPAC) framework, which adapts easily to existing recommendation models. In particular, PPAC recognizes that PP and GP have both direct and indirect effects on recommendations and controls direct effects with counterfactual inference techniques for unbiased recommendations. All codes and datasets are available at \url{https://github.com/Stevenn9981/PPAC}. △ Less

Submitted 21 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: Accepted by WWW'24 as a research full paper

arXiv:2402.06071 [pdf, other]

Keyframer: Empowering Animation Design using Large Language Models

Authors: Tiffany Tseng, Ruijia Cheng, Jeffrey Nichols

Abstract: Large language models (LLMs) have the potential to impact a wide range of creative domains, but the application of LLMs to animation is underexplored and presents novel challenges such as how users might effectively describe motion in natural language. In this paper, we present Keyframer, a design tool for animating static images (SVGs) with natural language. Informed by interviews with profession… ▽ More Large language models (LLMs) have the potential to impact a wide range of creative domains, but the application of LLMs to animation is underexplored and presents novel challenges such as how users might effectively describe motion in natural language. In this paper, we present Keyframer, a design tool for animating static images (SVGs) with natural language. Informed by interviews with professional animation designers and engineers, Keyframer supports exploration and refinement of animations through the combination of prompting and direct editing of generated output. The system also enables users to request design variants, supporting comparison and ideation. Through a user study with 13 participants, we contribute a characterization of user prompting strategies, including a taxonomy of semantic prompt types for describing motion and a 'decomposed' prompting style where users continually adapt their goals in response to generated output.We share how direct editing along with prompting enables iteration beyond one-shot prompting interfaces common in generative tools today. Through this work, we propose how LLMs might empower a range of audiences to engage with animation creation. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.03362 [pdf, other]

NanoNER: Named Entity Recognition for nanobiology using experts' knowledge and distant supervision

Authors: Martin Lentschat, Cyril Labbé, Ran Cheng

Abstract: Here we present the training and evaluation of NanoNER, a Named Entity Recognition (NER) model for Nanobiology. NER consists in the identification of specific entities in spans of unstructured texts and is often a primary task in Natural Language Processing (NLP) and Information Extraction. The aim of our model is to recognise entities previously identified by domain experts as constituting the es… ▽ More Here we present the training and evaluation of NanoNER, a Named Entity Recognition (NER) model for Nanobiology. NER consists in the identification of specific entities in spans of unstructured texts and is often a primary task in Natural Language Processing (NLP) and Information Extraction. The aim of our model is to recognise entities previously identified by domain experts as constituting the essential knowledge of the domain. Relying on ontologies, which provide us with a domain vocabulary and taxonomy, we implemented an iterative process enabling experts to determine the entities relevant to the domain at hand. We then delve into the potential of distant supervision learning in NER, supporting how this method can increase the quantity of annotated data with minimal additional manpower. On our full corpus of 728 full-text nanobiology articles, containing more than 120k entity occurrences, NanoNER obtained a F1-score of 0.98 on the recognition of previously known entities. Our model also demonstrated its ability to discover new entities in the text, with precision scores ranging from 0.77 to 0.81. Ablation experiments further confirmed this and allowed us to assess the dependency of our approach on the external resources. It highlighted the dependency of the approach to the resource, while also confirming its ability to rediscover up to 30% of the ablated terms. This paper details the methodology employed, experimental design, and key findings, providing valuable insights and directions for future related researches on NER in specialized domain. Furthermore, since our approach require minimal manpower , we believe that it can be generalized to other specialized fields. △ Less

Submitted 30 January, 2024; originally announced February 2024.

arXiv:2401.01474 [pdf, other]

doi 10.15607/RSS.2023.XIX.055

Demonstrating Mobile Manipulation in the Wild: A Metrics-Driven Approach

Authors: Max Bajracharya, James Borders, Richard Cheng, Dan Helmick, Lukas Kaul, Dan Kruse, John Leichty, Jeremy Ma, Carolyn Matl, Frank Michel, Chavdar Papazov, Josh Petersen, Krishna Shankar, Mark Tjersland

Abstract: We present our general-purpose mobile manipulation system consisting of a custom robot platform and key algorithms spanning perception and planning. To extensively test the system in the wild and benchmark its performance, we choose a grocery shopping scenario in an actual, unmodified grocery store. We derive key performance metrics from detailed robot log data collected during six week-long field… ▽ More We present our general-purpose mobile manipulation system consisting of a custom robot platform and key algorithms spanning perception and planning. To extensively test the system in the wild and benchmark its performance, we choose a grocery shopping scenario in an actual, unmodified grocery store. We derive key performance metrics from detailed robot log data collected during six week-long field tests, spread across 18 months. These objective metrics, gained from complex yet repeatable tests, drive the direction of our research efforts and let us continuously improve our system's performance. We find that thorough end-to-end system-level testing of a complex mobile manipulation system can serve as a reality-check for state-of-the-art methods in robotics. This effectively grounds robotics research efforts in real world needs and challenges, which we deem highly useful for the advancement of the field. To this end, we share our key insights and takeaways to inspire and accelerate similar system-level research projects. △ Less

Submitted 2 January, 2024; originally announced January 2024.

Comments: Presented at RSS 2023 [Best Demo Paper Award]

arXiv:2312.10890 [pdf, other]

Low-latency Space-time Supersampling for Real-time Rendering

Authors: Ruian He, Shili Zhou, Yuqi Sun, Ri Cheng, Weimin Tan, Bo Yan

Abstract: With the rise of real-time rendering and the evolution of display devices, there is a growing demand for post-processing methods that offer high-resolution content in a high frame rate. Existing techniques often suffer from quality and latency issues due to the disjointed treatment of frame supersampling and extrapolation. In this paper, we recognize the shared context and mechanisms between frame… ▽ More With the rise of real-time rendering and the evolution of display devices, there is a growing demand for post-processing methods that offer high-resolution content in a high frame rate. Existing techniques often suffer from quality and latency issues due to the disjointed treatment of frame supersampling and extrapolation. In this paper, we recognize the shared context and mechanisms between frame supersampling and extrapolation, and present a novel framework, Space-time Supersampling (STSS). By integrating them into a unified framework, STSS can improve the overall quality with lower latency. To implement an efficient architecture, we treat the aliasing and warping holes unified as reshading regions and put forth two key components to compensate the regions, namely Random Reshading Masking (RRM) and Efficient Reshading Module (ERM). Extensive experiments demonstrate that our approach achieves superior visual fidelity compared to state-of-the-art (SOTA) methods. Notably, the performance is achieved within only 4ms, saving up to 75\% of time against the conventional two-stage pipeline that necessitates 17ms. △ Less

Submitted 17 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI 2024

arXiv:2312.07180 [pdf, other]

Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation

Authors: Ri Cheng, Ruian He, Xuhao Jiang, Shili Zhou, Weimin Tan, Bo Yan

Abstract: Existing recurrent optical flow estimation networks are computationally expensive since they use a fixed large number of iterations to update the flow field for each sample. An efficient network should skip iterations when the flow improvement is limited. In this paper, we develop a Context-Aware Iteration Policy Network for efficient optical flow estimation, which determines the optimal number of… ▽ More Existing recurrent optical flow estimation networks are computationally expensive since they use a fixed large number of iterations to update the flow field for each sample. An efficient network should skip iterations when the flow improvement is limited. In this paper, we develop a Context-Aware Iteration Policy Network for efficient optical flow estimation, which determines the optimal number of iterations per sample. The policy network achieves this by learning contextual information to realize whether flow improvement is bottlenecked or minimal. On the one hand, we use iteration embedding and historical hidden cell, which include previous iterations information, to convey how flow has changed from previous iterations. On the other hand, we use the incremental loss to make the policy network implicitly perceive the magnitude of optical flow improvement in the subsequent iteration. Furthermore, the computational complexity in our dynamic network is controllable, allowing us to satisfy various resource preferences with a single trained model. Our policy network can be easily integrated into state-of-the-art optical flow networks. Extensive experiments show that our method maintains performance while reducing FLOPs by about 40%/20% for the Sintel/KITTI datasets. △ Less

Submitted 5 January, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: 2024, Association for the Advancement of Artificial Intelligence

arXiv:2310.17705 [pdf, other]

A Wireless AI-Generated Content (AIGC) Provisioning Framework Empowered by Semantic Communication

Authors: Runze Cheng, Yao Sun, Dusit Niyato, Lan Zhang, Lei Zhang, Muhammad Ali Imran

Abstract: Generative AI applications have been recently catering to a vast user base by creating diverse and high-quality AI-generated content (AIGC). With the proliferation of mobile devices and rapid growth of mobile traffic, providing ubiquitous access to high-quality AIGC services via wireless communication networks is becoming the future direction. However, it is challenging to provide qualified AIGC s… ▽ More Generative AI applications have been recently catering to a vast user base by creating diverse and high-quality AI-generated content (AIGC). With the proliferation of mobile devices and rapid growth of mobile traffic, providing ubiquitous access to high-quality AIGC services via wireless communication networks is becoming the future direction. However, it is challenging to provide qualified AIGC services in wireless networks with unstable channels, limited bandwidth resources, and unevenly distributed computational resources. To tackle these challenges, we propose a semantic communication (SemCom)-empowered AIGC (SemAIGC) generation and transmission framework, where only semantic information of the content rather than all the binary bits should be generated and transmitted by using SemCom. Specifically, SemAIGC integrates diffusion models within the semantic encoder and decoder to design a workload-adjustable transceiver thereby allowing adjustment of computational resource utilization in edge and local. In addition, a Resource-aware wOrk lOad Trade-off (ROOT) scheme is devised to intelligently make workload adaptation decisions for the transceiver, thus efficiently generating, transmitting, and fine-tuning content as per dynamic wireless channel conditions and service requirements. Simulations verify the superiority of our proposed SemAIGC framework in terms of latency and content quality compared to conventional approaches. △ Less

Submitted 29 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.16951 [pdf, other]

The Teenager's Problem: Efficient Garment Decluttering With Grasp Optimization

Authors: Aviv Adler, Ayah Ahmad, Shengyin Wang, Wisdom C. Agboh, Edith Llontop, Tianshuang Qiu, Jeffrey Ichnowski, Mehmet Dogar, Thomas Kollar, Richard Cheng, Ken Goldberg

Abstract: This paper addresses the ''Teenager's Problem'': efficiently removing scattered garments from a planar surface. As grasping and transporting individual garments is highly inefficient, we propose analytical policies to select grasp locations for multiple garments using an overhead camera. Two classes of methods are considered: depth-based, which use overhead depth data to find efficient grasps, and… ▽ More This paper addresses the ''Teenager's Problem'': efficiently removing scattered garments from a planar surface. As grasping and transporting individual garments is highly inefficient, we propose analytical policies to select grasp locations for multiple garments using an overhead camera. Two classes of methods are considered: depth-based, which use overhead depth data to find efficient grasps, and segment-based, which use segmentation on the RGB overhead image (without requiring any depth data); grasp efficiency is measured by Objects per Transport, which denotes the average number of objects removed per trip to the laundry basket. Experiments suggest that both depth- and segment-based methods easily reduce Objects per Transport (OpT) by $20\%$; furthermore, these approaches complement each other, with combined hybrid methods yielding improvements of $34\%$. Finally, a method employing consolidation (with segmentation) is considered, which manipulates the garments on the work surface to increase OpT; this yields an improvement of $67\%$ over the baseline, though at a cost of additional physical actions. △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.09690 [pdf, other]

Configuration Validation with Large Language Models

Authors: Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu

Abstract: Misconfigurations are major causes of software failures. Existing practices rely on developer-written rules or test cases to validate configurations, which are expensive. Machine learning (ML) for configuration validation is considered a promising direction, but has been facing challenges such as the need of large-scale field data and system-specific models. Recent advances in Large Language Model… ▽ More Misconfigurations are major causes of software failures. Existing practices rely on developer-written rules or test cases to validate configurations, which are expensive. Machine learning (ML) for configuration validation is considered a promising direction, but has been facing challenges such as the need of large-scale field data and system-specific models. Recent advances in Large Language Models (LLMs) show promise in addressing some of the long-lasting limitations of ML-based configuration validation. We present a first analysis on the feasibility and effectiveness of using LLMs for configuration validation. We empirically evaluate LLMs as configuration validators by developing a generic LLM-based configuration validation framework, named Ciri. Ciri employs effective prompt engineering with few-shot learning based on both valid configuration and misconfiguration data. Ciri checks outputs from LLMs when producing results, addressing hallucination and nondeterminism of LLMs. We evaluate Ciri's validation effectiveness on eight popular LLMs using configuration data of ten widely deployed open-source systems. Our analysis (1) confirms the potential of using LLMs for configuration validation, (2) explores design space of LLMbased validators like Ciri, and (3) reveals open challenges such as ineffectiveness in detecting certain types of misconfigurations and biases towards popular configuration parameters. △ Less

Submitted 2 April, 2024; v1 submitted 14 October, 2023; originally announced October 2023.

arXiv:2310.04069 [pdf, other]

Spatio-temporal flow patterns

Authors: Chrysanthi Kosyfaki, Nikos Mamoulis, Reynold Cheng, Ben Kao

Abstract: Transportation companies and organizations routinely collect huge volumes of passenger transportation data. By aggregating these data (e.g., counting the number of passengers going from a place to another in every 30 minute interval), it becomes possible to analyze the movement behavior of passengers in a metropolitan area. In this paper, we study the problem of finding important trends in passeng… ▽ More Transportation companies and organizations routinely collect huge volumes of passenger transportation data. By aggregating these data (e.g., counting the number of passengers going from a place to another in every 30 minute interval), it becomes possible to analyze the movement behavior of passengers in a metropolitan area. In this paper, we study the problem of finding important trends in passenger movements at varying granularities, which is useful in a wide range of applications such as target marketing, scheduling, and travel intent prediction. Specifically, we study the extraction of movement patterns between regions that have significant flow. The huge number of possible patterns render their detection computationally hard. We propose algorithms that greatly reduce the search space and the computational cost of pattern detection. We study variants of patterns that could be useful to different problem instances, such as constrained patterns and top-k ranked patterns. △ Less

Submitted 12 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.02424 [pdf, other]

doi 10.1145/3613904.3642777

AXNav: Replaying Accessibility Tests from Natural Language

Authors: Maryam Taeb, Amanda Swearngin, Eldon Schoop, Ruijia Cheng, Yue Jiang, Jeffrey Nichols

Abstract: Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs, however t… ▽ More Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs, however to our knowledge no one has yet explored their use in controlling assistive technologies for the purposes of supporting accessibility testing. In this paper, we explore the requirements of a natural language based accessibility testing workflow, starting with a formative study. From this we build a system that takes as input a manual accessibility test (e.g., ``Search for a show in VoiceOver'') and uses an LLM combined with pixel-based UI Understanding models to execute the test and produce a chaptered, navigable video. In each video, to help QA testers we apply heuristics to detect and flag accessibility issues (e.g., Text size not increasing with Large Text enabled, VoiceOver navigation loops). We evaluate this system through a 10 participant user study with accessibility QA professionals who indicated that the tool would be very useful in their current work and performed tests similarly to how they would manually test the features. The study also reveals insights for future work on using LLMs for accessibility testing. △ Less

Submitted 4 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: Accepted into Conference on Human Factors in Computing Systems (CHI) 2024, 22 pages, 7 figures

ACM Class: I.2

arXiv:2308.08856 [pdf, other]

MV-ROPE: Multi-view Constraints for Robust Category-level Object Pose and Size Estimation

Authors: Jiaqi Yang, Yucong Chen, Xiangting Meng, Chenxin Yan, Min Li, Ran Cheng, Lige Liu, Tao Sun, Laurent Kneip

Abstract: Recently there has been a growing interest in category-level object pose and size estimation, and prevailing methods commonly rely on single view RGB-D images. However, one disadvantage of such methods is that they require accurate depth maps which cannot be produced by consumer-grade sensors. Furthermore, many practical real-world situations involve a moving camera that continuously observes its… ▽ More Recently there has been a growing interest in category-level object pose and size estimation, and prevailing methods commonly rely on single view RGB-D images. However, one disadvantage of such methods is that they require accurate depth maps which cannot be produced by consumer-grade sensors. Furthermore, many practical real-world situations involve a moving camera that continuously observes its surroundings, and the temporal information of the input video streams is simply overlooked by single-view methods. We propose a novel solution that makes use of RGB video streams. Our framework consists of three modules: a scale-aware monocular dense SLAM solution, a lightweight object pose predictor, and an object-level pose graph optimizer. The SLAM module utilizes a video stream and additional scale-sensitive readings to estimate camera poses and metric depth. The object pose predictor then generates canonical object representations from RGB images. The object pose is estimated through geometric registration of these canonical object representations with estimated object depth points. All per-view estimates finally undergo optimization within a pose graph, culminating in the output of robust and accurate canonical object poses. Our experimental results demonstrate that when utilizing public dataset sequences with high-quality depth information, the proposed method exhibits comparable performance to state-of-the-art RGB-D methods. We also collect and evaluate on new datasets containing depth maps of varying quality to further quantitatively benchmark the proposed method alongside previous RGB-D based methods. We demonstrate a significant advantage in scenarios where depth input is absent or the quality of depth sensing is limited. △ Less

Submitted 22 March, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

arXiv:2308.06515 [pdf, other]

Seed Feature Maps-based CNN Models for LEO Satellite Remote Sensing Services

Authors: Zhichao Lu, Chuntao Ding, Shangguang Wang, Ran Cheng, Felix Juefei-Xu, Vishnu Naresh Boddeti

Abstract: Deploying high-performance convolutional neural network (CNN) models on low-earth orbit (LEO) satellites for rapid remote sensing image processing has attracted significant interest from industry and academia. However, the limited resources available on LEO satellites contrast with the demands of resource-intensive CNN models, necessitating the adoption of ground-station server assistance for trai… ▽ More Deploying high-performance convolutional neural network (CNN) models on low-earth orbit (LEO) satellites for rapid remote sensing image processing has attracted significant interest from industry and academia. However, the limited resources available on LEO satellites contrast with the demands of resource-intensive CNN models, necessitating the adoption of ground-station server assistance for training and updating these models. Existing approaches often require large floating-point operations (FLOPs) and substantial model parameter transmissions, presenting considerable challenges. To address these issues, this paper introduces a ground-station server-assisted framework. With the proposed framework, each layer of the CNN model contains only one learnable feature map (called the seed feature map) from which other feature maps are generated based on specific rules. The hyperparameters of these rules are randomly generated instead of being trained, thus enabling the generation of multiple feature maps from the seed feature map and significantly reducing FLOPs. Furthermore, since the random hyperparameters can be saved using a few random seeds, the ground station server assistance can be facilitated in updating the CNN model deployed on the LEO satellite. Experimental results on the ISPRS Vaihingen, ISPRS Potsdam, UAVid, and LoveDA datasets for semantic segmentation services demonstrate that the proposed framework outperforms existing state-of-the-art approaches. In particular, the SineFM-based model achieves a higher mIoU than the UNetFormer on the UAVid dataset, with 3.3x fewer parameters and 2.2x fewer FLOPs. △ Less

Submitted 12 August, 2023; originally announced August 2023.

Comments: 11 pages

arXiv:2308.05640 [pdf, other]

A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization

Authors: Yansong Huang, Zherui Zhang, Ao Jiao, Yuxin Ma, Ran Cheng

Abstract: Evolutionary multi-objective optimization (EMO) algorithms have been demonstrated to be effective in solving multi-criteria decision-making problems. In real-world applications, analysts often employ several algorithms concurrently and compare their solution sets to gain insight into the characteristics of different algorithms and explore a broader range of feasible solutions. However, EMO algorit… ▽ More Evolutionary multi-objective optimization (EMO) algorithms have been demonstrated to be effective in solving multi-criteria decision-making problems. In real-world applications, analysts often employ several algorithms concurrently and compare their solution sets to gain insight into the characteristics of different algorithms and explore a broader range of feasible solutions. However, EMO algorithms are typically treated as black boxes, leading to difficulties in performing detailed analysis and comparisons between the internal evolutionary processes. Inspired by the successful application of visual analytics tools in explainable AI, we argue that interactive visualization can significantly enhance the comparative analysis between multiple EMO algorithms. In this paper, we present a visual analytics framework that enables the exploration and comparison of evolutionary processes in EMO algorithms. Guided by a literature review and expert interviews, the proposed framework addresses various analytical tasks and establishes a multi-faceted visualization design to support the comparative analysis of intermediate generations in the evolution as well as solution sets. We demonstrate the effectiveness of our framework through case studies on benchmarking and real-world multi-objective optimization problems to elucidate how analysts can leverage our framework to inspect and compare diverse algorithms. △ Less

Submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted by IEEE VIS 2023 (will appear in IEEE TVCG)

arXiv:2308.05508 [pdf, other]

doi 10.1145/3583780.3614977

Multi-domain Recommendation with Embedding Disentangling and Domain Alignment

Authors: Wentao Ning, Xiao Yan, Weiwen Liu, Reynold Cheng, Rui Zhang, Bo Tang

Abstract: Multi-domain recommendation (MDR) aims to provide recommendations for different domains (e.g., types of products) with overlapping users/items and is common for platforms such as Amazon, Facebook, and LinkedIn that host multiple services. Existing MDR models face two challenges: First, it is difficult to disentangle knowledge that generalizes across domains (e.g., a user likes cheap items) and kno… ▽ More Multi-domain recommendation (MDR) aims to provide recommendations for different domains (e.g., types of products) with overlapping users/items and is common for platforms such as Amazon, Facebook, and LinkedIn that host multiple services. Existing MDR models face two challenges: First, it is difficult to disentangle knowledge that generalizes across domains (e.g., a user likes cheap items) and knowledge specific to a single domain (e.g., a user likes blue clothing but not blue cars). Second, they have limited ability to transfer knowledge across domains with small overlaps. We propose a new MDR method named EDDA with two key components, i.e., embedding disentangling recommender and domain alignment, to tackle the two challenges respectively. In particular, the embedding disentangling recommender separates both the model and embedding for the inter-domain part and the intra-domain part, while most existing MDR methods only focus on model-level disentangling. The domain alignment leverages random walks from graph processing to identify similar user/item pairs from different domains and encourages similar user/item pairs to have similar embeddings, enhancing knowledge transfer. We compare EDDA with 12 state-of-the-art baselines on 3 real datasets. The results show that EDDA consistently outperforms the baselines on all datasets and domains. All datasets and codes are available at https://github.com/Stevenn9981/EDDA. △ Less

Submitted 13 August, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted by CIKM'23 as a Long paper

arXiv:2308.02066 [pdf, other]

Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing with Non-Learnable Primitives

Authors: Chuntao Ding, Zhichao Lu, Shangguang Wang, Ran Cheng, Vishnu Naresh Boddeti

Abstract: Multi-task learning (MTL) seeks to learn a single model to accomplish multiple tasks by leveraging shared information among the tasks. Existing MTL models, however, have been known to suffer from negative interference among tasks. Efforts to mitigate task interference have focused on either loss/gradient balancing or implicit parameter partitioning with partial overlaps among the tasks. In this pa… ▽ More Multi-task learning (MTL) seeks to learn a single model to accomplish multiple tasks by leveraging shared information among the tasks. Existing MTL models, however, have been known to suffer from negative interference among tasks. Efforts to mitigate task interference have focused on either loss/gradient balancing or implicit parameter partitioning with partial overlaps among the tasks. In this paper, we propose ETR-NLP to mitigate task interference through a synergistic combination of non-learnable primitives (NLPs) and explicit task routing (ETR). Our key idea is to employ non-learnable primitives to extract a diverse set of task-agnostic features and recombine them into a shared branch common to all tasks and explicit task-specific branches reserved for each task. The non-learnable primitives and the explicit decoupling of learnable parameters into shared and task-specific ones afford the flexibility needed for minimizing task interference. We evaluate the efficacy of ETR-NLP networks for both image-level classification and pixel-level dense prediction MTL problems. Experimental results indicate that ETR-NLP significantly outperforms state-of-the-art baselines with fewer learnable parameters and similar FLOPs across all datasets. Code is available at this \href{https://github.com/zhichao-lu/etr-nlp-mtl}. △ Less

Submitted 3 August, 2023; originally announced August 2023.

Comments: CVPR 2023

arXiv:2307.16555 [pdf, other]

doi 10.1145/3581783.3611752

Uncertainty-Guided Spatial Pruning Architecture for Efficient Frame Interpolation

Authors: Ri Cheng, Xuhao Jiang, Ruian He, Shili Zhou, Weimin Tan, Bo Yan

Abstract: The video frame interpolation (VFI) model applies the convolution operation to all locations, leading to redundant computations in regions with easy motion. We can use dynamic spatial pruning method to skip redundant computation, but this method cannot properly identify easy regions in VFI tasks without supervision. In this paper, we develop an Uncertainty-Guided Spatial Pruning (UGSP) architectur… ▽ More The video frame interpolation (VFI) model applies the convolution operation to all locations, leading to redundant computations in regions with easy motion. We can use dynamic spatial pruning method to skip redundant computation, but this method cannot properly identify easy regions in VFI tasks without supervision. In this paper, we develop an Uncertainty-Guided Spatial Pruning (UGSP) architecture to skip redundant computation for efficient frame interpolation dynamically. Specifically, pixels with low uncertainty indicate easy regions, where the calculation can be reduced without bringing undesirable visual results. Therefore, we utilize uncertainty-generated mask labels to guide our UGSP in properly locating the easy region. Furthermore, we propose a self-contrast training strategy that leverages an auxiliary non-pruning branch to improve the performance of our UGSP. Extensive experiments show that UGSP maintains performance but reduces FLOPs by 34%/52%/30% compared to baseline without pruning on Vimeo90K/UCF101/MiddleBury datasets. In addition, our method achieves state-of-the-art performance with lower FLOPs on multiple benchmarks. △ Less

Submitted 27 October, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: ACM Multimedia 2023

arXiv:2307.12900 [pdf, other]

doi 10.1109/TCDS.2024.3410371

Automotive Object Detection via Learning Sparse Events by Spiking Neurons

Authors: Hu Zhang, Yanchen Li, Luziwei Leng, Kaiwei Che, Qian Liu, Qinghai Guo, Jianxing Liao, Ran Cheng

Abstract: Event-based sensors, distinguished by their high temporal resolution of 1 $\mathrmμ\text{s}$ and a dynamic range of 120 $\text{dB}$, stand out as ideal tools for deployment in fast-paced settings like vehicles and drones. Traditional object detection techniques that utilize Artificial Neural Networks (ANNs) face challenges due to the sparse and asynchronous nature of the events these sensors captu… ▽ More Event-based sensors, distinguished by their high temporal resolution of 1 $\mathrmμ\text{s}$ and a dynamic range of 120 $\text{dB}$, stand out as ideal tools for deployment in fast-paced settings like vehicles and drones. Traditional object detection techniques that utilize Artificial Neural Networks (ANNs) face challenges due to the sparse and asynchronous nature of the events these sensors capture. In contrast, Spiking Neural Networks (SNNs) offer a promising alternative, providing a temporal representation that is inherently aligned with event-based data. This paper explores the unique membrane potential dynamics of SNNs and their ability to modulate sparse events. We introduce an innovative spike-triggered adaptive threshold mechanism designed for stable training. Building on these insights, we present a specialized spiking feature pyramid network (SpikeFPN) optimized for automotive event-based object detection. Comprehensive evaluations demonstrate that SpikeFPN surpasses both traditional SNNs and advanced ANNs enhanced with attention mechanisms. Evidently, SpikeFPN achieves a mean Average Precision (mAP) of 0.477 on the GEN1 Automotive Detection (GAD) benchmark dataset, marking significant increases over the selected SNN baselines. Moreover, the efficient design of SpikeFPN ensures robust performance while optimizing computational resources, attributed to its innate sparse computation capabilities. Source codes are publicly accessible at https://github.com/EMI-Group/spikefpn. △ Less

Submitted 10 June, 2024; v1 submitted 24 July, 2023; originally announced July 2023.

Comments: IEEE Transactions on Cognitive and Developmental Systems

arXiv:2307.12326 [pdf, other]

Scale jump-aware pose graph relaxation for monocular SLAM with re-initializations

Authors: Runze Yuan, Ran Cheng, Lige Liu, Tao Sun, Laurent Kneip

Abstract: Pose graph relaxation has become an indispensable addition to SLAM enabling efficient global registration of sensor reference frames under the objective of satisfying pair-wise relative transformation constraints. The latter may be given by incremental motion estimation or global place recognition. While the latter case enables loop closures and drift compensation, care has to be taken in the mono… ▽ More Pose graph relaxation has become an indispensable addition to SLAM enabling efficient global registration of sensor reference frames under the objective of satisfying pair-wise relative transformation constraints. The latter may be given by incremental motion estimation or global place recognition. While the latter case enables loop closures and drift compensation, care has to be taken in the monocular case in which local estimates of structure and displacements can differ from reality not just in terms of noise, but also in terms of a scale factor. Owing to the accumulation of scale propagation errors, this scale factor is drifting over time, hence scale-drift aware pose graph relaxation has been introduced. We extend this idea to cases in which the relative scale between subsequent sensor frames is unknown, a situation that can easily occur if monocular SLAM enters re-initialization and no reliable overlap between successive local maps can be identified. The approach is realized by a hybrid pose graph formulation that combines the regular similarity consistency terms with novel, scale-blind constraints. We apply the technique to the practically relevant case of small indoor service robots capable of effectuating purely rotational displacements, a condition that can easily cause tracking failures. We demonstrate that globally consistent trajectories can be recovered even if multiple re-initializations occur along the loop, and present an in-depth study of success and failure cases. △ Less

Submitted 23 July, 2023; originally announced July 2023.

Comments: 8 pages, 23 figures, International Conference on Intelligent Robots and Systems 2023

arXiv:2307.05717 [pdf, other]

Towards Mobility Data Science (Vision Paper)

Authors: Mohamed Mokbel, Mahmoud Sakr, Li Xiong, Andreas Züfle, Jussara Almeida, Taylor Anderson, Walid Aref, Gennady Andrienko, Natalia Andrienko, Yang Cao, Sanjay Chawla, Reynold Cheng, Panos Chrysanthis, Xiqi Fei, Gabriel Ghinita, Anita Graser, Dimitrios Gunopulos, Christian Jensen, Joon-Seok Kim, Kyoung-Sook Kim, Peer Kröger, John Krumm, Johannes Lauer, Amr Magdy, Mario Nascimento , et al. (23 additional authors not shown)

Abstract: Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences… ▽ More Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences. In this paper, we present the emerging domain of mobility data science. Towards a unified approach to mobility data science, we envision a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state of the art and describe open challenges for the research community in the coming years. △ Less

Submitted 7 March, 2024; v1 submitted 21 June, 2023; originally announced July 2023.

Comments: Updated to reflect the major revision for ACM Transactions on Spatial Algorithms and Systems (TSAS). This version reflects the final version accepted by ACM TSAS

arXiv:2306.12465 [pdf, other]

doi 10.1109/TNNLS.2024.3394837

Efficient Deep Spiking Multi-Layer Perceptrons with Multiplication-Free Inference

Authors: Boyan Li, Luziwei Leng, Shuaijie Shen, Kaixuan Zhang, Jianguo Zhang, Jianxing Liao, Ran Cheng

Abstract: Advancements in adapting deep convolution architectures for Spiking Neural Networks (SNNs) have significantly enhanced image classification performance and reduced computational burdens. However, the inability of Multiplication-Free Inference (MFI) to align with attention and transformer mechanisms, which are critical to superior performance on high-resolution vision tasks, imposing limitations on… ▽ More Advancements in adapting deep convolution architectures for Spiking Neural Networks (SNNs) have significantly enhanced image classification performance and reduced computational burdens. However, the inability of Multiplication-Free Inference (MFI) to align with attention and transformer mechanisms, which are critical to superior performance on high-resolution vision tasks, imposing limitations on these gains. To address this, our research explores a new pathway, drawing inspiration from the progress made in Multi-Layer Perceptrons (MLPs). We propose an innovative spiking MLP architecture that uses batch normalization to retain MFI compatibility and introducing a spiking patch encoding layer to enhance local feature extraction capabilities. As a result, we establish an efficient multi-stage spiking MLP network that blends effectively global receptive fields with local feature extraction for comprehensive spike-based computation. Without relying on pre-training or sophisticated SNN training techniques, our network secures a top-1 accuracy of 66.39% on the ImageNet-1K dataset, surpassing the directly trained spiking ResNet-34 by 2.67%. Furthermore, we curtail computational costs, model parameters, and simulation steps. An expanded version of our network compares with the performance of the spiking VGG-16 network with a 71.64% top-1 accuracy, all while operating with a model capacity 2.1 times smaller. Our findings highlight the potential of our deep SNN architecture in effectively integrating global and local learning abilities. Interestingly, the trained receptive field in our network mirrors the activity patterns of cortical cells. Source codes are publicly accessible at https://github.com/EMI-Group/mixer-snn. △ Less

Submitted 26 April, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: IEEE TNNLS

arXiv:2306.12098 [pdf, other]

MSW-Transformer: Multi-Scale Shifted Windows Transformer Networks for 12-Lead ECG Classification

Authors: Renjie Cheng, Zhemin Zhuang, Shuxin Zhuang, Lei Xie, Jingfeng Guo

Abstract: Automatic classification of electrocardiogram (ECG) signals plays a crucial role in the early prevention and diagnosis of cardiovascular diseases. While ECG signals can be used for the diagnosis of various diseases, their pathological characteristics exhibit minimal variations, posing a challenge to automatic classification models. Existing methods primarily utilize convolutional neural networks t… ▽ More Automatic classification of electrocardiogram (ECG) signals plays a crucial role in the early prevention and diagnosis of cardiovascular diseases. While ECG signals can be used for the diagnosis of various diseases, their pathological characteristics exhibit minimal variations, posing a challenge to automatic classification models. Existing methods primarily utilize convolutional neural networks to extract ECG signal features for classification, which may not fully capture the pathological feature differences of different diseases. Transformer networks have advantages in feature extraction for sequence data, but the complete network is complex and relies on large-scale datasets. To address these challenges, we propose a single-layer Transformer network called Multi-Scale Shifted Windows Transformer Networks (MSW-Transformer), which uses a multi-window sliding attention mechanism at different scales to capture features in different dimensions. The self-attention is restricted to non-overlapping local windows via shifted windows, and different window scales have different receptive fields. A learnable feature fusion method is then proposed to integrate features from different windows to further enhance model performance. Furthermore, we visualize the attention mechanism of the multi-window shifted mechanism to achieve better clinical interpretation in the ECG classification task. The proposed model achieves state-of-the-art performance on five classification tasks of the PTBXL-2020 12-lead ECG dataset, which includes 5 diagnostic superclasses, 23 diagnostic subclasses, 12 rhythm classes, 17 morphology classes, and 44 diagnosis classes, with average macro-F1 scores of 77.85%, 47.57%, 66.13%, 34.60%, and 34.29%, and average sample-F1 scores of 81.26%, 68.27%, 91.32%, 50.07%, and 63.19%, respectively. △ Less

Submitted 21 June, 2023; originally announced June 2023.

Showing 1–50 of 160 results for author: Cheng, R