-
AirSketch: Generative Motion to Sketch
Authors:
Hui Xian Grace Lim,
Xuanming Cui,
Yogesh S Rawat,
Ser-Nam Lim
Abstract:
Illustration is a fundamental mode of human expression and communication. Certain types of motion that accompany speech can provide this illustrative mode of communication. While Augmented and Virtual Reality technologies (AR/VR) have introduced tools for producing drawings with hand motions (air drawing), they typically require costly hardware and additional digital markers, thereby limiting their accessibility and portability. Furthermore, air drawing demands considerable skill to achieve aesthetic results. To address these challenges, we introduce the concept of AirSketch, aimed at generating faithful and visually coherent sketches directly from hand motions, eliminating the need for complicated headsets or markers. We devise a simple augmentation-based self-supervised training procedure, enabling a controllable image diffusion model to learn to translate from highly noisy hand tracking images to clean, aesthetically pleasing sketches, while preserving the essential visual cues from the original tracking data. We present two air drawing datasets to study this problem. Our findings demonstrate that beyond producing photo-realistic images from precise spatial inputs, controllable image diffusion can effectively produce a refined, clear sketch from a noisy input. Our work serves as an initial step towards marker-less air drawing and reveals distinct applications of controllable diffusion models to AirSketch and AR/VR in general.
Submitted 11 July, 2024;
originally announced July 2024.
-
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
Authors:
Xuannan Liu,
Zekun Li,
Peipei Li,
Shuhan Xia,
Xing Cui,
Linzhi Huang,
Huaibo Huang,
Weihong Deng,
Zhaofeng He
Abstract:
Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field. To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench includes 3 critical sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, along with 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose an innovative unified framework, which integrates rationales, actions, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization. We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.
Submitted 12 June, 2024;
originally announced June 2024.
-
Annotating FrameNet via Structure-Conditioned Language Generation
Authors:
Xinyue Cui,
Swabha Swayamdipta
Abstract:
Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remains understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective at training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits under higher resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges with automating linguistic annotation tasks.
Submitted 24 June, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation
Authors:
Zijie Zhong,
Hanwen Liu,
Xiaoya Cui,
Xiaofan Zhang,
Zengchang Qin
Abstract:
Integrating information from different reference data sources is a major challenge for Retrieval-Augmented Generation (RAG) systems because each knowledge source adopts a unique data structure and follows different conventions. Retrieving from multiple knowledge sources with one fixed strategy usually leads to under-exploitation of information. To mitigate this drawback, inspired by Mix-of-Expert, we introduce Mix-of-Granularity (MoG), a method that dynamically determines the optimal granularity of a knowledge database based on input queries using a router. The router is efficiently trained with a newly proposed loss function employing soft labels. We further extend MoG to Mix-of-Granularity-Graph (MoGG), where reference documents are pre-processed into graphs, enabling the retrieval of relevant information from distantly situated chunks. Extensive experiments demonstrate that both MoG and MoGG effectively predict optimal granularity levels, significantly enhancing the performance of the RAG system in downstream tasks. The code of both MoG and MoGG will be made public.
Submitted 1 June, 2024;
originally announced June 2024.
-
Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner
Authors:
Xing Cui,
Peipei Li,
Zekun Li,
Xuannan Liu,
Yueying Zou,
Zhaofeng He
Abstract:
Flexible and accurate drag-based editing is a challenging task that has recently garnered significant attention. Current methods typically model this problem as automatically learning "how to drag" through point dragging and often produce one deterministic estimation, which presents two key limitations: 1) Overlooking the inherently ill-posed nature of drag-based editing, where multiple results may correspond to a given input, as illustrated in Fig. 1; 2) Ignoring the constraint of image quality, which may lead to unexpected distortion. To address these limitations, we propose LucidDrag, which shifts the focus from "how to drag" to a paradigm of "what-then-how". LucidDrag comprises an intention reasoner and a collaborative guidance sampling mechanism. The former infers several optimal editing strategies, identifying what content and what semantic direction to edit. Based on the former, the latter addresses "how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance. Specifically, semantic guidance is derived by establishing a semantic editing direction based on reasoned intentions, while quality guidance is achieved through classifier guidance using an image fidelity discriminator. Both qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods. The code will be released.
Submitted 1 June, 2024;
originally announced June 2024.
-
Node Injection Attack Based on Label Propagation Against Graph Neural Network
Authors:
Peican Zhu,
Zechen Pan,
Keke Tang,
Xiaodong Cui,
Jinhuan Wang,
Qi Xuan
Abstract:
Graph Neural Network (GNN) has achieved remarkable success in various graph learning tasks, such as node classification, link prediction and graph classification. The key to the success of GNN lies in its effective structure information representation through neighboring aggregation. However, the attacker can easily perturb the aggregation process through injecting fake nodes, which reveals that GNN is vulnerable to the graph injection attack. Existing graph injection attack methods primarily focus on damaging the classical feature aggregation process while overlooking the neighborhood aggregation process via label propagation. To bridge this gap, we propose the label-propagation-based global injection attack (LPGIA) which conducts the graph injection attack on the node classification task. Specifically, we analyze the aggregation process from the perspective of label propagation and transform the graph injection attack problem into a global injection label specificity attack problem. To solve this problem, LPGIA utilizes a label propagation-based strategy to optimize the combinations of the nodes connected to the injected node. Then, LPGIA leverages the feature mapping to generate malicious features for injected nodes. In extensive experiments against representative GNNs, LPGIA outperforms the previous best-performing injection attack method in various datasets, demonstrating its superiority and transferability.
Submitted 29 May, 2024;
originally announced May 2024.
-
Optimizing Search Advertising Strategies: Integrating Reinforcement Learning with Generalized Second-Price Auctions for Enhanced Ad Ranking and Bidding
Authors:
Chang Zhou,
Yang Zhao,
Jin Cao,
Yi Shen,
Xiaoling Cui,
Chiyu Cheng
Abstract:
This paper explores the integration of strategic optimization methods in search advertising, focusing on ad ranking and bidding mechanisms within E-commerce platforms. By employing a combination of reinforcement learning and evolutionary strategies, we propose a dynamic model that adjusts to varying user interactions and optimizes the balance between advertiser cost, user relevance, and platform revenue. Our results suggest significant improvements in ad placement accuracy and cost efficiency, demonstrating the model's applicability in real-world scenarios.
Submitted 29 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Swarm Learning: A Survey of Concepts, Applications, and Trends
Authors:
Elham Shammar,
Xiaohui Cui,
Mohammed A. A. Al-qaness
Abstract:
Deep learning models have raised privacy and security concerns due to their reliance on large datasets on central servers. As the number of Internet of Things (IoT) devices increases, artificial intelligence (AI) will be crucial for resource management, data processing, and knowledge acquisition. To address those issues, federated learning (FL) has introduced a novel approach to building a versatile, large-scale machine learning framework that operates in a decentralized and hardware-agnostic manner. However, FL faces network bandwidth limitations and data breaches. To reduce the central dependency in FL and increase scalability, swarm learning (SL) has been proposed in collaboration with Hewlett Packard Enterprise (HPE). SL represents a decentralized machine learning framework that leverages blockchain technology for secure, scalable, and private data management. A blockchain-based network enables the exchange and aggregation of model parameters among participants, thus mitigating the risk of a single point of failure and eliminating communication bottlenecks. To the best of our knowledge, this survey is the first to introduce the principles of Swarm Learning, its architectural design, and its fields of application. In addition, it highlights numerous research avenues that require further exploration by academic and industry communities to unlock the full potential and applications of SL.
Submitted 1 May, 2024;
originally announced May 2024.
-
Confidence-Aware RGB-D Face Recognition via Virtual Depth Synthesis
Authors:
Zijian Chen,
Mei Wang,
Weihong Deng,
Hongzhi Shi,
Dongchao Wen,
Yingjie Zhang,
Xingchen Cui,
Jian Zhao
Abstract:
2D face recognition encounters challenges in unconstrained environments due to varying illumination, occlusion, and pose. Recent studies focus on RGB-D face recognition to improve robustness by incorporating depth information. However, collecting sufficient paired RGB-D training data is expensive and time-consuming, hindering wide deployment. In this work, we first construct a diverse depth dataset generated by 3D Morphable Models for depth model pre-training. Then, we propose a domain-independent pre-training framework that utilizes readily available pre-trained RGB and depth models to separately perform face recognition without needing additional paired data for retraining. To seamlessly integrate the two distinct networks and harness the complementary benefits of RGB and depth information for improved accuracy, we propose an innovative Adaptive Confidence Weighting (ACW). This mechanism is designed to learn confidence estimates for each modality to achieve modality fusion at the score level. Our method is simple and lightweight, only requiring ACW training beyond the backbone models. Experiments on multiple public RGB-D face recognition benchmarks demonstrate state-of-the-art performance surpassing previous methods based on depth estimation and feature fusion, validating the efficacy of our approach.
Submitted 16 March, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation
Authors:
Guangyang Wu,
Xiaohong Liu,
Jun Jia,
Xuehao Cui,
Guangtao Zhai
Abstract:
In the digital era, QR codes serve as a linchpin connecting virtual and physical realms. Their pervasive integration across various applications highlights the demand for aesthetically pleasing codes without compromised scannability. However, prevailing methods grapple with the intrinsic challenge of balancing customization and scannability. Notably, stable-diffusion models have ushered in an epoch of high-quality, customizable content generation. This paper introduces Text2QR, a pioneering approach leveraging these advancements to address a fundamental challenge: concurrently achieving user-defined aesthetics and scanning robustness. To ensure stable generation of aesthetic QR codes, we introduce the QR Aesthetic Blueprint (QAB) module, generating a blueprint image exerting control over the entire generation process. Subsequently, the Scannability Enhancing Latent Refinement (SELR) process refines the output iteratively in the latent space, enhancing scanning robustness. This approach harnesses the potent generation capabilities of stable-diffusion models, navigating the trade-off between image aesthetics and QR code scannability. Our experiments demonstrate the seamless fusion of visual appeal with the practical utility of aesthetic QR codes, markedly outperforming prior methods. Code is available at https://github.com/mulns/Text2QR
Submitted 12 March, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
FakeNewsGPT4: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs
Authors:
Xuannan Liu,
Peipei Li,
Huaibo Huang,
Zekun Li,
Xing Cui,
Jiahao Liang,
Lixiong Qin,
Weihong Deng,
Zhaofeng He
Abstract:
The massive generation of multimodal fake news exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training within specific domains restricts the capability of classical detectors to obtain open-world facts. In this paper, we propose FakeNewsGPT4, a novel framework that augments Large Vision-Language Models (LVLMs) with forgery-specific knowledge for manipulation reasoning while inheriting extensive world knowledge as complementary. Knowledge augmentation in FakeNewsGPT4 involves acquiring two types of forgery-specific knowledge, i.e., semantic correlation and artifact trace, and merging them into LVLMs. Specifically, we design a multi-level cross-modal reasoning module that establishes interactions across modalities for extracting semantic correlations. Concurrently, a dual-branch fine-grained verification module is presented to comprehend localized details to encode artifact traces. The generated knowledge is translated into refined embeddings compatible with LVLMs. We also incorporate candidate answer heuristics and soft prompts to enhance input informativeness. Extensive experiments on the public benchmark demonstrate that FakeNewsGPT4 achieves superior cross-domain performance compared to previous methods. Code will be available.
Submitted 4 March, 2024;
originally announced March 2024.
-
Taking Second-life Batteries from Exhausted to Empowered using Experiments, Data Analysis, and Health Estimation
Authors:
Xiaofan Cui,
Muhammad Aadil Khan,
Gabriele Pozzato,
Surinder Singh,
Ratnesh Sharma,
Simona Onori
Abstract:
The reuse of retired electric vehicle batteries in grid energy storage offers environmental and economic benefits. This study concentrates on health monitoring algorithms for retired batteries deployed in grid storage. Over 15 months of testing, we collect, analyze, and publicize a dataset of second-life batteries, implementing a cycling protocol simulating grid energy storage load profiles within a 3-4 V voltage window. Four machine-learning-based health estimation models, relying on online-accessible features and initial capacity, are compared, with the selected model achieving a mean absolute percentage error below 2.3% on test data. Additionally, an adaptive online health estimation algorithm is proposed by integrating a clustering-based method, thus limiting estimation errors during online deployment. These results showcase the feasibility of repurposing retired batteries for second-life applications. Based on obtained data and power demand, these second-life batteries exhibit potential for over a decade of grid energy storage use.
Submitted 8 June, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
Sinkhorn Distance Minimization for Knowledge Distillation
Authors:
Xiao Cui,
Yulei Qin,
Yuting Gao,
Enwei Zhang,
Zihan Xu,
Tong Wu,
Ke Li,
Xing Sun,
Wengang Zhou,
Houqiang Li
Abstract:
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which degrade logits-based KD for diverse NLP tasks. We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, thanks to the properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights the superiority of our method over state-of-the-art methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
Submitted 26 February, 2024;
originally announced February 2024.
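For readers unfamiliar with the Sinkhorn distance named in this abstract, the following is a minimal, generic sketch of entropy-regularized optimal transport between two small discrete distributions. It is not the authors' SinKD implementation or their batch-wise formulation; the function name `sinkhorn_distance`, the cost matrix, and the values of `eps` and `n_iters` are illustrative assumptions.

```python
import math

def sinkhorn_distance(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between discrete distributions a and b.

    Assumes a and b have strictly positive entries that each sum to 1,
    and that cost[i][j] is the cost of moving mass from bin i to bin j.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v) and its transport cost.
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan

# Hypothetical "teacher" and "student" distributions over two classes.
teacher = [0.5, 0.5]
student = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
dist, plan = sinkhorn_distance(teacher, student, cost, eps=0.5)
print(dist)
```

Unlike KL-type divergences, the returned value depends on the cost of moving probability mass between bins, so it stays informative even when the two distributions barely overlap.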
-
How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?
Authors:
Hongkang Li,
Meng Wang,
Songtao Lu,
Xiaodong Cui,
Pin-Yu Chen
Abstract:
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity are mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments.
Submitted 16 June, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
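As a rough illustration of the magnitude-based pruning mentioned in this abstract, the sketch below zeroes out the smallest-magnitude entries of a flat weight list. It is a generic textbook form of the technique, not the pruning procedure analyzed in the paper; the function name `magnitude_prune` and the `sparsity` parameter are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

# Prune half of a toy weight vector.
print(magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5))
```

Weights set to zero can be skipped at inference time, which is the source of the cost reduction the abstract refers to.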
-
Knowledge Graph-based Session Recommendation with Adaptive Propagation
Authors:
Yu Wang,
Amin Javari,
Janani Balaji,
Walid Shalaby,
Tyler Derr,
Xiquan Cui
Abstract:
Session-based recommender systems (SBRSs) predict users' next interacted items based on their historical activities. While most SBRSs capture purchasing intentions locally within each session, capturing items' global information across different sessions is crucial in characterizing their general properties. Previous works capture this cross-session information by constructing graphs and incorporating neighbor information. However, this incorporation cannot vary adaptively according to the unique intention of each session, and the constructed graphs consist of only one type of user-item interaction. To address these limitations, we propose knowledge graph-based session recommendation with session-adaptive propagation. Specifically, we build a knowledge graph by connecting items with multi-typed edges to characterize various user-item interactions. Then, we adaptively aggregate items' neighbor information considering user intention within the learned session. Experimental results demonstrate that equipping session recommendation backbones with our constructed knowledge graph and session-adaptive propagation enhances them by 10%-20%. Moreover, we provide an industrial case study showing our proposed framework achieves a 2% performance boost over an existing well-deployed model at The Home Depot e-platform.
Submitted 17 February, 2024;
originally announced February 2024.
-
Quantum-Inspired Machine Learning for Molecular Docking
Authors:
Runqiu Shu,
Bowen Liu,
Zhaoping Xiong,
Xiaopeng Cui,
Yunting Li,
Wei Cui,
Man-Hong Yung,
Nan Qiao
Abstract:
Molecular docking is an important tool for structure-based drug design, accelerating the efficiency of drug development. Complex and dynamic binding processes between proteins and small molecules require searching and sampling over a wide spatial range. Traditional docking by searching for possible binding sites and conformations is computationally complex and performs poorly under blind docking. Quantum-inspired algorithms combining quantum properties and annealing show great advantages in solving combinatorial optimization problems. Inspired by this, we achieve improved blind docking by combining quantum-inspired algorithms with gradients learned by deep learning in the encoded molecular space. Numerical simulation shows that our method outperforms traditional docking algorithms and deep learning-based algorithms by more than 10%. Compared to the current state-of-the-art deep learning-based docking algorithm DiffDock, the Top-1 success rate (RMSD<2) improves from 33% to 35% under the same setup. In particular, a 6% improvement is realized in the high-precision region (RMSD<1) on molecule data unseen in DiffDock, which demonstrates that our method generalizes well.
Submitted 21 February, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding
Authors:
Qi Sun,
Xiao Cui,
Wengang Zhou,
Houqiang Li
Abstract:
In this study, we tackle the challenge of classifying the object category in point clouds, which previous works like PointCLIP struggle to address due to the inherent limitations of the CLIP architecture. Our approach leverages GPT-4 Vision (GPT-4V) to overcome these challenges by employing its advanced generative abilities, enabling a more adaptive and robust classification process. We adapt the application of GPT-4V to process complex 3D data, enabling it to achieve zero-shot recognition capabilities without altering the underlying model architecture. Our methodology also includes a systematic strategy for point cloud image visualization, mitigating domain gap and enhancing GPT-4V's efficiency. Experimental validation demonstrates our approach's superiority in diverse scenarios, setting a new benchmark in zero-shot point cloud classification.
Submitted 15 January, 2024;
originally announced January 2024.
-
Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization
Authors:
A F M Saif,
Xiaodong Cui,
Han Shen,
Songtao Lu,
Brian Kingsbury,
Tianyi Chen
Abstract:
In this paper, we present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks, which we term bi-level joint unsupervised and supervised training (BL-JUST). BL-JUST employs a lower and upper level optimization with an unsupervised loss and a supervised loss respectively, leveraging recent advances in penalty-based bilevel optimization to solve this challenging ASR problem with affordable complexity and rigorous convergence guarantees. To evaluate BL-JUST, extensive experiments on the LibriSpeech and TED-LIUM v2 datasets have been conducted. BL-JUST achieves superior performance over the commonly used pre-training followed by fine-tuning strategy.
Submitted 13 January, 2024;
originally announced January 2024.
-
Benchmarking PathCLIP for Pathology Image Analysis
Authors:
Sunyi Zheng,
Xiaonan Cui,
Yuxuan Sun,
Jingxiong Li,
Honglin Li,
Yunlong Zhang,
Pingyi Chen,
Xueping Jing,
Zhaoxiang Ye,
Lin Yang
Abstract:
Accurate image classification and retrieval are important for clinical diagnosis and treatment decision-making. The recent contrastive language-image pretraining (CLIP) model has shown remarkable proficiency in understanding natural images. Drawing inspiration from CLIP, PathCLIP is specifically designed for pathology image analysis, utilizing over 200,000 image and text pairs in training. While the performance of PathCLIP is impressive, its robustness under a wide range of image corruptions remains unknown. Therefore, we conduct an extensive evaluation to analyze the performance of PathCLIP on various corrupted images from the datasets of Osteosarcoma and WSSS4LUAD. In our experiments, we introduce seven corruption types including brightness, contrast, Gaussian blur, resolution, saturation, hue, and markup at four severity levels. Through experiments, we find that PathCLIP is relatively robust to image corruptions and surpasses OpenAI-CLIP and PLIP in zero-shot classification. Among the seven corruptions, blur and resolution can cause severe performance degradation of PathCLIP. This indicates that ensuring the quality of images is crucial before conducting a clinical test. Additionally, we assess the robustness of PathCLIP in the task of image-image retrieval, revealing that PathCLIP performs less effectively than PLIP on Osteosarcoma but performs better on WSSS4LUAD under diverse corruptions. Overall, PathCLIP presents impressive zero-shot classification and retrieval performance for pathology images, but appropriate care needs to be taken when using it. We hope this study provides a qualitative impression of PathCLIP and helps understand its differences from other CLIP models.
Submitted 12 June, 2024; v1 submitted 5 January, 2024;
originally announced January 2024.
-
Enhancing Generalization of Invisible Facial Privacy Cloak via Gradient Accumulation
Authors:
Xuannan Liu,
Yaoyao Zhong,
Weihong Deng,
Hongzhi Shi,
Xingchen Cui,
Yunfeng Yin,
Dongchao Wen
Abstract:
The blooming of social media and face recognition (FR) systems has increased people's concern about privacy and security. A new type of adversarial privacy cloak (class-universal) can be applied to all the images of regular users, to prevent malicious FR systems from acquiring their identity information. In this work, we discover the optimization dilemma in the existing methods -- the local optima problem in large-batch optimization and the gradient information elimination problem in small-batch optimization. To solve these problems, we propose Gradient Accumulation (GA) to aggregate multiple small-batch gradients into a one-step iterative gradient to enhance the gradient stability and reduce the usage of quantization operations. Experiments show that our proposed method achieves high performance on the Privacy-Commons dataset against black-box face recognition models.
Submitted 3 January, 2024;
originally announced January 2024.
-
Exploring 3D-aware Lifespan Face Aging via Disentangled Shape-Texture Representations
Authors:
Qianrui Teng,
Rui Wang,
Xing Cui,
Peipei Li,
Zhaofeng He
Abstract:
Existing face aging methods often focus on modeling either texture aging or using an entangled shape-texture representation to achieve face aging. However, shape and texture are two distinct factors that mutually affect the human face aging process. In this paper, we propose 3D-STD, a novel 3D-aware Shape-Texture Disentangled face aging network that explicitly disentangles the facial image into shape and texture representations using 3D face reconstruction. Additionally, to facilitate high-fidelity texture synthesis, we propose a novel texture generation method based on Empirical Mode Decomposition (EMD). Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance in terms of shape and texture transformation. Moreover, our method supports producing plausible 3D face aging results, which is rarely accomplished by current methods.
Submitted 28 December, 2023;
originally announced December 2023.
-
Discrete-Time Mean-Variance Strategy Based on Reinforcement Learning
Authors:
Xiangyu Cui,
Xun Li,
Yun Shi,
Si Zhao
Abstract:
This paper studies a discrete-time mean-variance model based on reinforcement learning. Compared with its continuous-time counterpart in \cite{zhou2020mv}, the discrete-time model makes more general assumptions about the asset's return distribution. Using entropy to measure the cost of exploration, we derive the optimal investment strategy, whose density function is also Gaussian type. Additionally, we design the corresponding reinforcement learning algorithm. Both simulation experiments and empirical analysis indicate that our discrete-time model exhibits better applicability when analyzing real-world data than the continuous-time model.
Submitted 23 December, 2023;
originally announced December 2023.
-
AdvCloak: Customized Adversarial Cloak for Privacy Protection
Authors:
Xuannan Liu,
Yaoyao Zhong,
Xing Cui,
Yuhang Zhang,
Peipei Li,
Weihong Deng
Abstract:
With extensive face images being shared on social media, there has been a notable escalation in privacy concerns. In this paper, we propose AdvCloak, an innovative framework for privacy protection using generative models. AdvCloak is designed to automatically customize class-wise adversarial masks that can maintain superior image-level naturalness while providing enhanced feature-level generalization ability. Specifically, AdvCloak sequentially optimizes the generative adversarial networks by employing a two-stage training strategy. This strategy initially focuses on adapting the masks to the unique individual faces via image-specific training and then enhances their feature-level generalization ability to diverse facial variations of individuals via person-specific training. To fully utilize the limited training data, we combine AdvCloak with several general geometric modeling methods, to better describe the feature subspace of source identities. Extensive quantitative and qualitative evaluations on both common and celebrity datasets demonstrate that AdvCloak outperforms existing state-of-the-art methods in terms of efficiency and effectiveness.
Submitted 21 December, 2023;
originally announced December 2023.
-
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Authors:
Xuanming Cui,
Alejandro Aparcedo,
Young Kyun Jang,
Ser-Nam Lim
Abstract:
Recent advances in instruction tuning have led to the development of State-of-the-Art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answering (VQA). We find that in general LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair, helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task with only an 8.10% drop in performance compared to their visual counterparts which dropped 99.73%. We also propose a new approach to real-world image classification which we term query decomposition. By incorporating existence queries into our input prompt we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under-explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
Submitted 8 December, 2023; v1 submitted 5 December, 2023;
originally announced December 2023.
-
InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser
Authors:
Xing Cui,
Zekun Li,
Pei Pei Li,
Huaibo Huang,
Xuannan Liu,
Zhaofeng He
Abstract:
Stylized text-to-image generation focuses on creating images from textual descriptions while adhering to a style specified by a few reference images. However, subtle style variations within different reference images can hinder the model from accurately learning the target style. In this paper, we propose InstaStyle, a novel approach that excels in generating high-fidelity stylized images with only a single reference image. Our approach is based on the finding that the inversion noise from a stylized reference image inherently carries the style signal, as evidenced by its non-zero signal-to-noise ratio. We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the "style" noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveying of style. To address this, we introduce a learnable style token via prompt refinement, which enhances the accuracy of the style description for the reference image. Qualitative and quantitative experimental results demonstrate that InstaStyle achieves superior performance compared to current benchmarks. Furthermore, our approach also showcases its capability in the creative task of style combination with mixed inversion noise.
Submitted 12 July, 2024; v1 submitted 25 November, 2023;
originally announced November 2023.
-
Soft Random Sampling: A Theoretical and Empirical Analysis
Authors:
Xiaodong Cui,
Ashish Mittal,
Songtao Lu,
Wei Zhang,
George Saon,
Brian Kingsbury
Abstract:
Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in each epoch. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and give the convergence rate. Finally, we provide its generalization performance. We empirically evaluate SRS for image recognition on CIFAR10 and automatic speech recognition on Librispeech and an in-house payload dataset to demonstrate its effectiveness. Compared to existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. Especially on real-world industrial scale data sets, it is shown to be a powerful training strategy with significant speedup and competitive performance with almost no additional computing cost.
Submitted 23 November, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
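The sampling rule described in the abstract above (a subset drawn uniformly at random with replacement in each epoch) can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `soft_random_sample`, `selection_ratio`, and `expected_coverage` are hypothetical. The coverage formula is the standard expectation for drawing with replacement and is included only to illustrate the "data coverage" notion the abstract mentions.

```python
import random

def soft_random_sample(num_examples, selection_ratio, rng):
    """Draw example indices uniformly at random WITH replacement for one epoch."""
    subset_size = max(1, round(num_examples * selection_ratio))
    return [rng.randrange(num_examples) for _ in range(subset_size)]

def expected_coverage(num_examples, num_draws):
    """Expected fraction of distinct examples seen after num_draws draws."""
    return 1.0 - (1.0 - 1.0 / num_examples) ** num_draws

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for epoch in range(3):
    # A fresh subset is drawn every epoch; duplicates within it are allowed.
    indices = soft_random_sample(num_examples=1000, selection_ratio=0.3, rng=rng)
    distinct_fraction = len(set(indices)) / 1000
    print(epoch, len(indices), round(distinct_fraction, 3))
```

Because a new subset is drawn each epoch, the cumulative share of the data set seen grows over epochs even though each epoch touches only a fraction of it.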
-
Dual-channel Prototype Network for few-shot Classification of Pathological Images
Authors:
Hao Quan,
Xinjia Li,
Dayu Hu,
Tianhang Nan,
Xiaoyu Cui
Abstract:
In pathology, the rarity of certain diseases and the complexity in annotating pathological images significantly hinder the creation of extensive, high-quality datasets. This limitation impedes the progress of deep learning-assisted diagnostic systems in pathology. Consequently, it becomes imperative to devise a technology that can discern new disease categories from a minimal number of annotated examples. Such a technology would substantially advance deep learning models for rare diseases. Addressing this need, we introduce the Dual-channel Prototype Network (DCPN), rooted in the few-shot learning paradigm, to tackle the challenge of classifying pathological images with limited samples. DCPN augments the Pyramid Vision Transformer (PVT) framework for few-shot classification via self-supervised learning and integrates it with convolutional neural networks. This combination forms a dual-channel architecture that extracts multi-scale, highly precise pathological features. The approach enhances the versatility of prototype representations and elevates the efficacy of prototype networks in few-shot pathological image classification tasks. We evaluated DCPN using three publicly available pathological datasets, configuring small-sample classification tasks that mirror varying degrees of clinical scenario domain shifts. Our experimental findings robustly affirm DCPN's superiority in few-shot pathological image classification, particularly in tasks within the same domain, where it achieves the benchmarks of supervised learning.
Submitted 13 November, 2023;
originally announced November 2023.
-
Ask more, know better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models
Authors:
Xue Yan,
Yan Song,
Xinyu Cui,
Filippos Christianos,
Haifeng Zhang,
David Henry Mguni,
Jun Wang
Abstract:
Large language models (LLMs) demonstrate their promise in tackling complicated practical challenges by combining action-based policies with chain of thought (CoT) reasoning. Having high-quality prompts on hand, however, is vital to the framework's effectiveness. Currently, these prompts are handcrafted utilising extensive human labor, resulting in CoT policies that frequently fail to generalise. Human intervention is also required to develop grounding functions that ensure low-level controllers appropriately process CoT reasoning. In this paper, we propose a comprehensive training framework for complex task-solving, incorporating human prior knowledge into the learning of action policies. To this end, we offer a new leader-follower bilevel framework that is capable of learning to ask relevant questions (prompts) and subsequently undertaking reasoning to guide the learning of actions. The prompt policy is employed to make introspective revisions based on historical findings, leading the CoT process to consider the anticipated goals and generate outputs that lead to decisive, high-performing actions. The action policy subsequently learns to comprehend and integrate the CoT outputs to take actions. Our empirical data reveal that our framework outperforms leading methods in 5 decision-making tasks such as Overcooked and FourRoom.
Submitted 28 February, 2024; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Bidirectional Knowledge Reconfiguration for Lightweight Point Cloud Analysis
Authors:
Peipei Li,
Xing Cui,
Yibo Hu,
Man Zhang,
Ting Yao,
Tao Mei
Abstract:
Point cloud analysis faces computational system overhead, limiting its application on mobile or edge devices. Directly employing small models may result in a significant drop in performance since it is difficult for a small model to adequately capture local structure and global shape information simultaneously, which are essential clues for point cloud analysis. This paper explores feature distillation for lightweight point cloud models. To mitigate the semantic gap between the lightweight student and the cumbersome teacher, we propose bidirectional knowledge reconfiguration (BKR) to distill informative contextual knowledge from the teacher to the student. Specifically, a top-down knowledge reconfiguration and a bottom-up knowledge reconfiguration are developed to inherit diverse local structure information and consistent global shape knowledge from the teacher, respectively. However, due to the farthest point sampling in most point cloud models, the intermediate features between teacher and student are misaligned, deteriorating the feature distillation performance. To eliminate it, we propose a feature mover's distance (FMD) loss based on optimal transportation, which can measure the distance between unordered point cloud features effectively. Extensive experiments conducted on shape classification, part segmentation, and semantic segmentation benchmarks demonstrate the universality and superiority of our method.
Submitted 8 October, 2023;
originally announced October 2023.
-
Hierarchical Multi-Task Learning Framework for Session-based Recommendations
Authors:
Sejoon Oh,
Walid Shalaby,
Amir Afsharinejad,
Xiquan Cui
Abstract:
While session-based recommender systems (SBRSs) have shown superior recommendation performance, multi-task learning (MTL) has been adopted by SBRSs to enhance their prediction accuracy and generalizability further. Hierarchical MTL (H-MTL) sets a hierarchical structure between prediction tasks and feeds outputs from auxiliary tasks to main tasks. This hierarchy leads to richer input features for main tasks and higher interpretability of predictions, compared to existing MTL frameworks. However, the H-MTL framework has not been investigated in SBRSs yet. In this paper, we propose HierSRec which incorporates the H-MTL architecture into SBRSs. HierSRec encodes a given session with a metadata-aware Transformer and performs next-category prediction (i.e., auxiliary task) with the session encoding. Next, HierSRec conducts next-item prediction (i.e., main task) with the category prediction result and session encoding. For scalable inference, HierSRec creates a compact set of candidate items (e.g., 4% of total items) per test example using the category prediction. Experiments show that HierSRec outperforms existing SBRSs as per next-item prediction accuracy on two session-based recommendation datasets. The accuracy of HierSRec measured with the carefully-curated candidate items aligns with the accuracy of HierSRec calculated with all items, which validates the usefulness of our candidate generation scheme via H-MTL.
Submitted 12 September, 2023;
originally announced September 2023.
-
Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation
Authors:
Xiaohan Cui,
Long Ma,
Tengyu Ma,
Jinyuan Liu,
Xin Fan,
Risheng Liu
Abstract:
Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as pre-processing for regular detectors. However, because the task objectives of the enhancer and the detector differ, this paradigm cannot reach its full potential. In this work, we try to unlock the potential of the enhancer + detector combination. Unlike existing works, we extend illumination-based enhancers (newly designed or existing) into a scene decomposition module, whose removed illumination is exploited as auxiliary information in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. In effect, our scheme transforms the "trash" (i.e., the illumination ignored by the detector) into "treasure" for the detector. Extensive experiments demonstrate the superiority of our method over other state-of-the-art methods. The code will be made public if the paper is accepted.
Submitted 7 September, 2023;
originally announced September 2023.
-
A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER
Authors:
Guanting Dong,
Zechen Wang,
Jinxu Zhao,
Gang Zhao,
Daichi Guo,
Dayuan Fu,
Tingfeng Hui,
Chen Zeng,
Keqing He,
Xuefeng Li,
Liwen Wang,
Xinyue Cui,
Weiran Xu
Abstract:
The objective of few-shot named entity recognition is to identify named entities with limited labeled instances. Previous works have primarily focused on optimizing the traditional token-wise classification framework, while neglecting the exploration of information based on NER data characteristics. To address this issue, we propose a Multi-Task Semantic Decomposition Framework via Joint Task-specific Pre-training (MSDP) for few-shot NER. Drawing inspiration from demonstration-based and contrastive learning, we introduce two novel pre-training tasks: Demonstration-based Masked Language Modeling (MLM) and Class Contrastive Discrimination. These tasks effectively incorporate entity boundary information and enhance entity representation in Pre-trained Language Models (PLMs). In the downstream main task, we introduce a multi-task joint optimization framework with the semantic decomposing method, which helps the model integrate two different kinds of semantic information for entity classification. Experimental results on two few-shot NER benchmarks demonstrate that MSDP consistently outperforms strong baselines by a large margin. Extensive analyses validate the effectiveness and generalization of MSDP.
Submitted 28 August, 2023;
originally announced August 2023.
-
Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids
Authors:
Xinyi Wang,
Xuan Cui,
Danxu Li,
Fang Liu,
Licheng Jiao
Abstract:
Drones have been widely used in many areas of our daily lives. It relieves people of the burden of holding a controller all the time and makes drone control easier to use for people with disabilities or occupied hands. However, the control of aerial robots is more complicated compared to normal robots due to factors such as uncontrollable height. Therefore, it is crucial to develop an intelligent UAV that has the ability to talk to humans and follow natural language commands. In this report, we present an aerial navigation task for the 2023 ICCV Conversation History. Based on the AVDN dataset containing more than 3k recorded navigation trajectories and asynchronous human-robot conversations, we propose an effective method of fusion training of Human Attention Aided Transformer model (HAA-Transformer) and Human Attention Aided LSTM (HAA-LSTM) model, which achieves the prediction of the navigation routing points and human attention. The method not only achieves high SR and SPL metrics, but also shows a 7% improvement in GP metrics compared to the baseline model.
Submitted 27 August, 2023;
originally announced August 2023.
-
How Can Context Help? Exploring Joint Retrieval of Passage and Personalized Context
Authors:
Hui Wan,
Hongkang Li,
Songtao Lu,
Xiaodong Cui,
Marina Danilevsky
Abstract:
The integration of external personalized context information into document-grounded conversational systems has significant potential business value, but has not been well-studied. Motivated by the concept of personalized context-aware document-grounded conversational systems, we introduce the task of context-aware passage retrieval. We also construct a dataset specifically curated for this purpose. We describe multiple baseline systems to address this task, and propose a novel approach, Personalized Context-Aware Search (PCAS), that effectively harnesses contextual information during passage retrieval. Experimental evaluations conducted on multiple popular dense retrieval systems demonstrate that our proposed approach not only outperforms the baselines in retrieving the most relevant passage but also excels at identifying the pertinent context among all the available contexts. We envision that our contributions will serve as a catalyst for inspiring future research endeavors in this promising direction.
Submitted 26 August, 2023;
originally announced August 2023.
-
CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction
Authors:
Xiang Wei,
Yufeng Chen,
Ning Cheng,
Xingyu Cui,
Jinan Xu,
Wenjuan Han
Abstract:
In order to construct or extend entity-centric and event-centric knowledge graphs (KG and EKG), the information extraction (IE) annotation toolkit is essential. However, existing IE toolkits have several non-trivial problems, such as not supporting multi-tasks, not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE toolkit for KG and EKG construction. Specifically, for the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, combining advanced prompting-based IE technology, the human-machine-cooperation mechanism with LLMs as the assistant machine is presented which can provide a lower cost as well as a higher performance. Lastly, owing to the two-way interaction between the human and machine, CollabKG with learning ability allows self-renewal. Besides, CollabKG has several appealing features (e.g., customization, training-free, propagation, etc.) that make the system powerful, easy-to-use, and high-productivity. We holistically compare our toolkit with other existing tools on these features. Human evaluation quantitatively illustrates that CollabKG significantly improves annotation quality, efficiency, and stability simultaneously.
Submitted 3 July, 2023;
originally announced July 2023.
-
Intelligence of Astronomical Optical Telescope: Present Status and Future Perspectives
Authors:
Kang Huang,
Tianzhu Hu,
Jingyi Cai,
Xiushan Pang,
Yonghui Hou,
Yong Zhang,
Huaiqing Wang,
Xiangqun Cui
Abstract:
Artificial intelligence technology has been widely used in astronomy, and new artificial intelligence technologies and application scenarios are constantly emerging. There have been a large number of papers reviewing the application of artificial intelligence technology in astronomy. However, relevant articles seldom mention telescope intelligence separately, and it is difficult to understand the current development status and research hotspots of telescope intelligence from these papers. This paper combines the development history of artificial intelligence technology and the difficulties of critical technologies of telescopes, comprehensively introduces the development and research hotspots of telescope intelligence, then conducts statistical analysis on various research directions of telescope intelligence and defines the research directions' merits. All kinds of research directions are evaluated, and the research trend of each telescope's intelligence is pointed out. Finally, according to the advantages of artificial intelligence technology and the development trend of telescopes, future research hotspots of telescope intelligence are given.
Submitted 16 January, 2024; v1 submitted 29 June, 2023;
originally announced June 2023.
-
Local Boosting for Weakly-Supervised Learning
Authors:
Rongzhi Zhang,
Yue Yu,
Jiaming Shen,
Xiquan Cui,
Chao Zhang
Abstract:
Boosting is a commonly used technique to enhance the performance of a set of base models by combining them into a strong ensemble model. Though widely adopted, boosting is typically used in supervised learning where the data is labeled accurately. However, in weakly supervised learning, where most of the data is labeled through weak and noisy sources, it remains nontrivial to design effective boosting approaches. In this work, we show that the standard implementation of the convex combination of base learners can hardly work due to the presence of noisy labels. Instead, we propose $\textit{LocalBoost}$, a novel framework for weakly-supervised boosting. LocalBoost iteratively boosts the ensemble model from two dimensions, i.e., intra-source and inter-source. The intra-source boosting introduces locality to the base learners and enables each base learner to focus on a particular feature regime by training new base learners on granularity-varying error regions. For the inter-source boosting, we leverage a conditional function to indicate the weak source where the sample is more likely to appear. To account for the weak labels, we further design an estimate-then-modify approach to compute the model weights. Experiments on seven datasets show that our method significantly outperforms vanilla boosting methods and other weakly-supervised methods.
Submitted 5 June, 2023;
originally announced June 2023.
-
NashFormer: Leveraging Local Nash Equilibria for Semantically Diverse Trajectory Prediction
Authors:
Justin Lidard,
Oswin So,
Yanxia Zhang,
Jonathan DeCastro,
Xiongyi Cui,
Xin Huang,
Yen-Ling Kuo,
John Leonard,
Avinash Balachandran,
Naomi Leonard,
Guy Rosman
Abstract:
Interactions between road agents present a significant challenge in trajectory prediction, especially in cases involving multiple agents. Because existing diversity-aware predictors do not account for the interactive nature of multi-agent predictions, they may miss these important interaction outcomes. In this paper, we propose NashFormer, a framework for trajectory prediction that leverages game-theoretic inverse reinforcement learning to improve coverage of multi-modal predictions. We use a training-time game-theoretic analysis as an auxiliary loss resulting in improved coverage and accuracy without presuming a taxonomy of actions for the agents. We demonstrate our approach on the interactive split of the Waymo Open Motion Dataset, including four subsets involving scenarios with high interaction complexity. Experiment results show that our predictor produces accurate predictions while covering $33\%$ more potential interactions versus a baseline model.
Submitted 11 November, 2023; v1 submitted 27 May, 2023;
originally announced May 2023.
-
Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information
Authors:
Kun Zhao,
Bohao Yang,
Chenghua Lin,
Wenge Rong,
Aline Villavicencio,
Xiaohui Cui
Abstract:
The long-standing one-to-many issue of the open-domain dialogues poses significant challenges for automatic evaluation methods, i.e., there may be multiple suitable responses which differ in semantics for a given conversational context. To tackle this challenge, we propose a novel learning-based automatic evaluation metric (CMN), which can robustly evaluate open-domain dialogues by augmenting Conditional Variational Autoencoders (CVAEs) with a Next Sentence Prediction (NSP) objective and employing Mutual Information (MI) to model the semantic similarity of text in the latent space. Experimental results on two open-domain dialogue datasets demonstrate the superiority of our method compared with a wide range of baselines, especially in handling responses which are distant to the golden reference responses in semantics.
Submitted 10 June, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
CHATEDIT: Towards Multi-turn Interactive Facial Image Editing via Dialogue
Authors:
Xing Cui,
Zekun Li,
Peipei Li,
Yibo Hu,
Hailin Shi,
Zhaofeng He
Abstract:
This paper explores interactive facial image editing via dialogue and introduces the ChatEdit benchmark dataset for evaluating image editing and conversation abilities in this context. ChatEdit is constructed from the CelebA-HQ dataset, incorporating annotated multi-turn dialogues corresponding to user edit requests on the images. The dataset is challenging, as it requires the system to dynamically track user requests, edit images, and generate appropriate responses. Accordingly, we propose three benchmark tasks: (i) user edit request tracking, (ii) image editing, and (iii) response generation. We present a novel baseline framework that integrates a dialogue module for both tracking user requests and generating responses and an image editing module for image editing. Unlike previous approaches, our framework directly tracks user edit requests from the entire dialogue history up to the current turn and modifies the original image rather than adjusting the previous turn's output, thereby reducing error accumulation and preventing attribute forgetfulness. Extensive experiments on the ChatEdit dataset underline our framework's superior performance against prior models, while also highlighting potential room for further research. We will release the code and data publicly to facilitate advancements in complex interactive facial image editing.
Submitted 16 October, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
Diagonal State Space Augmented Transformers for Speech Recognition
Authors:
George Saon,
Ankit Gupta,
Xiaodong Cui
Abstract:
We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant of linear RNNs obtained by discretizing a linear dynamical system with a diagonal state transition matrix. DSS layers project the input sequence onto a space of orthogonal polynomials where the choice of basis functions, metric and support is controlled by the eigenvalues of the transition matrix. We compare neural transducers with either conformer or our proposed DSS-augmented transformer (DSSformer) encoders on three public corpora: Switchboard English conversational telephone speech 300 hours, Switchboard+Fisher 2000 hours, and a spoken archive of holocaust survivor testimonials called MALACH 176 hours. On Switchboard 300/2000 hours, we reach a single model performance of 8.9%/6.7% WER on the combined test set of the Hub5 2000 evaluation, respectively, and on MALACH we improve the WER by 7% relative over the previous best published result. In addition, we present empirical evidence suggesting that DSS layers learn damped Fourier basis functions where the attenuation coefficients are layer specific whereas the frequency coefficients converge to almost identical linearly-spaced values across all layers.
Submitted 27 February, 2023;
originally announced February 2023.
-
A Prototypical Semantic Decoupling Method via Joint Contrastive Learning for Few-Shot Name Entity Recognition
Authors:
Guanting Dong,
Zechen Wang,
Liwen Wang,
Daichi Guo,
Dayuan Fu,
Yuxiang Wu,
Chen Zeng,
Xuefeng Li,
Tingfeng Hui,
Keqing He,
Xinyue Cui,
Qixiang Gao,
Weiran Xu
Abstract:
Few-shot named entity recognition (NER) aims at identifying named entities based on only a few labeled instances. Most existing prototype-based sequence labeling models tend to memorize entity mentions which would be easily confused by close prototypes. In this paper, we propose a Prototypical Semantic Decoupling method via joint Contrastive learning (PSDC) for few-shot NER. Specifically, we decouple class-specific prototypes and contextual semantic prototypes by two masking strategies to lead the model to focus on two different kinds of semantic information for inference. Besides, we further introduce joint contrastive learning objectives to better integrate the two kinds of decoupling information and prevent semantic collapse. Experimental results on two few-shot NER benchmarks demonstrate that PSDC consistently outperforms the previous SOTA methods in terms of overall performance. Extensive analysis further validates the effectiveness and generalization of PSDC.
Submitted 12 April, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Revisit Out-Of-Vocabulary Problem for Slot Filling: A Unified Contrastive Frameword with Multi-level Data Augmentations
Authors:
Daichi Guo,
Guanting Dong,
Dayuan Fu,
Yuxiang Wu,
Chen Zeng,
Tingfeng Hui,
Liwen Wang,
Xuefeng Li,
Zechen Wang,
Keqing He,
Xinyue Cui,
Weiran Xu
Abstract:
In real dialogue scenarios, the existing slot filling model, which tends to memorize entity patterns, generalizes significantly worse when facing Out-of-Vocabulary (OOV) problems. To address this issue, we propose an OOV robust slot filling model based on multi-level data augmentations to solve the OOV problem from both word and slot perspectives. We present a unified contrastive learning framework, which pulls representations of the original sample and augmentation samples together, to make the model resistant to OOV problems. We evaluate the performance of the model on some specific slots and carefully design test data with OOV word perturbation to further demonstrate its effectiveness on OOV words. Experiments on two datasets show that our approach outperforms the previous SOTA methods in terms of both OOV slots and words.
Submitted 27 February, 2023;
originally announced February 2023.
-
ChatIE: Zero-Shot Information Extraction via Chatting with ChatGPT
Authors:
Xiang Wei,
Xingyu Cui,
Ning Cheng,
Xiaobin Wang,
Xin Zhang,
Shen Huang,
Pengjun Xie,
Jinan Xu,
Yufeng Chen,
Meishan Zhang,
Yong Jiang,
Wenjuan Han
Abstract:
Zero-shot information extraction (IE) aims to build IE systems from unannotated text. It is challenging because it involves little human intervention. Challenging but worthwhile, zero-shot IE reduces the time and effort that data labeling takes. Recent efforts on large language models (LLMs, e.g., GPT-3, ChatGPT) show promising performance on zero-shot settings, thus inspiring us to explore prompt-based methods. In this work, we ask whether strong IE models can be constructed by directly prompting LLMs. Specifically, we transform the zero-shot IE task into a multi-turn question-answering problem with a two-stage framework (ChatIE). With the power of ChatGPT, we extensively evaluate our framework on three IE tasks: entity-relation triple extraction, named entity recognition, and event extraction. Empirical results on six datasets across two languages show that ChatIE achieves impressive performance and even surpasses some full-shot models on several datasets (e.g., NYT11-HRL). We believe that our work could shed light on building IE models with limited resources.
Submitted 27 May, 2024; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Dive into the Resolution Augmentations and Metrics in Low Resolution Face Recognition: A Plain yet Effective New Baseline
Authors:
Xu Ling,
Yichen Lu,
Wenqi Xu,
Weihong Deng,
Yingjie Zhang,
Xingchen Cui,
Hongzhi Shi,
Dongchao Wen
Abstract:
Although deep learning has significantly improved Face Recognition (FR), dramatic performance deterioration may occur when processing Low Resolution (LR) faces. To alleviate this, approaches based on unified feature space are proposed with the sacrifice under High Resolution (HR) circumstances. To deal with the huge domain gap between HR and LR domains and achieve the best on both domains, we first took a closer look at the impacts of several resolution augmentations and then analyzed the difficulty of LR samples from the perspective of the model gradient produced by different resolution samples. Besides, we also find that the introduction of some resolutions could help the learning of lower resolutions. Based on these, we divide the LR samples into three difficulties according to the resolution and propose a more effective Multi-Resolution Augmentation. Then, due to the rapidly increasing domain gap as the resolution decreases, we carefully design a novel and effective metric loss based on a LogExp distance function that provides decent gradients to prevent oscillation near the convergence point or tolerance to small distance errors; it could also dynamically adjust the penalty for errors in different dimensions, allowing for more optimization of dimensions with large errors. Combining these two insights, our model could learn more general knowledge in a wide resolution range of images and balanced results can be achieved by our extremely simple framework. Moreover, the augmentations and metrics are the cornerstones of LRFR, so our method could be considered a new baseline for the LRFR task. Experiments on the LRFR datasets: SCface, XQLFW, and large-scale LRFR dataset: TinyFace demonstrate the effectiveness of our methods, while the degradation on HRFR datasets is significantly reduced.
Submitted 11 February, 2023;
originally announced February 2023.
-
Cluster-CAM: Cluster-Weighted Visual Interpretation of CNNs' Decision in Image Classification
Authors:
Zhenpeng Feng,
Hongbing Ji,
Milos Dakovic,
Xiyang Cui,
Mingzhe Zhu,
Ljubisa Stankovic
Abstract:
Despite the tremendous success of convolutional neural networks (CNNs) in computer vision, the mechanism of CNNs still lacks clear interpretation. Currently, class activation mapping (CAM), a famous visualization technique to interpret a CNN's decision, has drawn increasing attention. Gradient-based CAMs are efficient, but their performance is heavily affected by gradient vanishing and exploding. In contrast, gradient-free CAMs can avoid computing gradients to produce more understandable results. However, existing gradient-free CAMs are quite time-consuming because hundreds of forward inferences per image are required. In this paper, we propose Cluster-CAM, an effective and efficient gradient-free CNN interpretation algorithm. Cluster-CAM can significantly reduce the number of forward propagations by splitting the feature maps into clusters in an unsupervised manner. Furthermore, we propose an artful strategy to forge a cognition-base map and cognition-scissors from clustered feature maps. The final salience heatmap will be computed by merging the above cognition maps. Qualitative results conspicuously show that Cluster-CAM can produce heatmaps where the highlighted regions match the human's cognition more precisely than existing CAMs. The quantitative evaluation further demonstrates the superiority of Cluster-CAM in both effectiveness and efficiency.
Submitted 3 February, 2023;
originally announced February 2023.
-
ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search
Authors:
Xuange Cui,
Wei Xiong,
Songlin Wang
Abstract:
In this paper, we propose a robust multilingual model to improve the quality of search results. Our model not only leverages the processed class-balanced dataset, but also benefits from multitask pre-training that leads to more general representations. In the pre-training stage, we adopt an MLM task, a classification task and a contrastive learning task to achieve considerable performance. In the fine-tuning stage, we use confident learning, exponential moving average method (EMA), adversarial training (FGM) and regularized dropout strategy (R-Drop) to improve the model's generalization and robustness. Moreover, we use a multi-granular semantic unit to discover the textual metadata of queries and products for enhancing the representation of the model. Our approach obtained competitive results and ranked top-8 in three tasks. We release the source code and pre-trained models associated with this work.
Submitted 31 January, 2023;
originally announced January 2023.
-
Twitter's Agenda-Setting Role: A Study of Twitter Strategy for Political Diversion
Authors:
Yuyang Chen,
Xiaoyu Cui,
Yunjie Song,
Manli Wu
Abstract:
This study verified the effectiveness of Donald Trump's Twitter campaign in guiding agenda-setting and deflecting political risk, examined Trump's Twitter communication strategy, and explored the communication effects of his tweet content during the Covid-19 pandemic. We collected all tweets posted by Trump on the Twitter platform from January 1, 2020 to December 31, 2020. We used Ordinary Least Squares (OLS) regression analysis with a fixed effects model to analyze the existence of the Twitter strategy. The correlation between the number of confirmed daily Covid-19 diagnoses and the number of particular thematic tweets was investigated using time series analysis. Empirical analysis revealed that the Twitter strategy was used to divert public attention from negative Covid-19 reports during the epidemic, and that it had a powerful political communication effect on Twitter. However, findings suggest that Trump did not use false claims to divert political risk and shape public opinion.
Submitted 16 December, 2022;
originally announced December 2022.
-
Model and Data Agreement for Learning with Noisy Labels
Authors:
Yuhang Zhang,
Weihong Deng,
Xingchen Cui,
Yunfeng Yin,
Hongzhi Shi,
Dongchao Wen
Abstract:
Learning with noisy labels is a vital topic for practical deep learning as models should be robust to noisy open-world datasets in the wild. The state-of-the-art noisy label learning approach JoCoR fails when faced with a large ratio of noisy labels. Moreover, selecting small-loss samples can also cause error accumulation as once the noisy samples are mistakenly selected as small-loss samples, they are more likely to be selected again. In this paper, we try to deal with error accumulation in noisy label learning from both model and data perspectives. We introduce mean point ensemble to utilize a more robust loss function and more information from unselected samples to reduce error accumulation from the model perspective. Furthermore, as the flip images have the same semantic meaning as the original images, we select small-loss samples according to the loss values of flip images instead of the original ones to reduce error accumulation from the data perspective. Extensive experiments on CIFAR-10, CIFAR-100, and large-scale Clothing1M show that our method outperforms state-of-the-art noisy label learning methods with different levels of label noise. Our method can also be seamlessly combined with other noisy label learning methods to further improve their performance and generalize well to other tasks. The code is available at https://github.com/zyh-uaiaaaa/MDA-noisy-label-learning.
Submitted 24 December, 2022; v1 submitted 2 December, 2022;
originally announced December 2022.
-
Hybrid MBlur: A Systematic Approach to Augment Rasterization with Ray Tracing for Rendering Motion Blur in Games
Authors:
Yu Wei Tan,
Xiaohan Cui,
Anand Bhojan
Abstract:
Motion blur is commonly used in game cinematics to achieve photorealism by modelling the behaviour of the camera shutter and simulating its effect associated with the relative motion of scene objects. A common real-time post-process approach is spatial sampling, where the directional blur of a moving object is rendered by integrating its colour based on velocity information within a single frame. However, such screen space approaches typically cannot produce accurate partial occlusion semi-transparencies. Our real-time hybrid rendering technique leverages hardware-accelerated ray tracing to correct post-process partial occlusion artifacts by advancing rays recursively into the scene to retrieve background information for motion-blurred regions, with reasonable additional performance cost for rendering game contents. We extend our previous work with details on the design, implementation, and future work of the technique as well as performance comparisons with post-processing.
Submitted 11 October, 2022;
originally announced October 2022.