Collision Avoidance for Multiple UAVs in Unknown Scenarios with Causal Representation Disentanglement

Authors: Jiafan Zhuang, Zihao Xia, Gaofei Han, Boxi Wang, Wenji Li, Dongliang Wang, Zhifeng Hao, Ruichu Cai, Zhun Fan

Abstract: Deep reinforcement learning (DRL) has achieved remarkable progress in online path planning tasks for multi-UAV systems. However, existing DRL-based methods often suffer from performance degradation when tackling unseen scenarios, since the non-causal factors in visual representations adversely affect policy learning. To address this issue, we propose a novel representation learning approach, i.e., causal representation disentanglement, which can identify the causal and non-causal factors in representations. After that, we only pass causal factors for subsequent policy learning and thus explicitly eliminate the influence of non-causal factors, which effectively improves the generalization ability of DRL models. Experimental results show that our proposed method can achieve robust navigation performance and effective collision avoidance especially in unseen scenarios, which significantly outperforms existing SOTA algorithms.

Submitted 15 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.04056 [pdf, other]

Robust Policy Learning for Multi-UAV Collision Avoidance with Causal Feature Selection

Authors: Jiafan Zhuang, Gaofei Han, Zihao Xia, Boxi Wang, Wenji Li, Dongliang Wang, Zhifeng Hao, Ruichu Cai, Zhun Fan

Abstract: In unseen and complex outdoor environments, collision avoidance navigation for unmanned aerial vehicle (UAV) swarms presents a challenging problem. It requires UAVs to navigate through various obstacles and complex backgrounds. Existing collision avoidance navigation methods based on deep reinforcement learning show promising performance but suffer from poor generalization abilities, resulting in performance degradation in unseen environments. To address this issue, we investigate the cause of weak generalization ability in DRL and propose a novel causal feature selection module. This module can be integrated into the policy network and effectively filters out non-causal factors in representations, thereby reducing the influence of spurious correlations between non-causal factors and action predictions. Experimental results demonstrate that our proposed method can achieve robust navigation performance and effective collision avoidance especially in scenarios with unseen backgrounds and obstacles, which significantly outperforms existing state-of-the-art algorithms.

Submitted 15 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.02013 [pdf, other]

Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning

Authors: Jiahang Cao, Qiang Zhang, Ziqing Wang, Jiaxu Wang, Hao Cheng, Yecheng Shao, Wen Zhao, Gang Han, Yijie Guo, Renjing Xu

Abstract: Sequential modeling has demonstrated remarkable capabilities in offline reinforcement learning (RL), with Decision Transformer (DT) being one of the most notable representatives, achieving significant success. However, RL trajectories possess unique properties to be distinguished from the conventional sequence (e.g., text or audio): (1) local correlation, where the next states in RL are theoretically determined solely by current states and actions based on the Markov Decision Process (MDP), and (2) global correlation, where each step's features are related to long-term historical information due to the time-continuous nature of trajectories. In this paper, we propose a novel action sequence predictor, named Mamba Decision Maker (MambaDM), where Mamba is expected to be a promising alternative for sequence modeling paradigms, owing to its efficient modeling of multi-scale dependencies. In particular, we introduce a novel mixer module that proficiently extracts and integrates both global and local features of the input sequence, effectively capturing interrelationships in RL datasets. Extensive experiments demonstrate that MambaDM achieves state-of-the-art performance in Atari and OpenAI Gym datasets. Furthermore, we empirically investigate the scaling laws of MambaDM, finding that increasing model size does not bring performance improvement, but scaling the dataset amount by 2x for MambaDM can obtain up to 33.7% score improvement on Atari dataset.
This paper delves into the sequence modeling capabilities of MambaDM in the RL domain, paving the way for future advancements in robust and efficient decision-making systems. Our code will be available at https://github.com/AndyCao1125/MambaDM.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 16 pages, 5 figures

arXiv:2405.18405 [pdf, other]

WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Authors: Jiawei Ma, Yulei Niu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang

Abstract: Language has been useful in extending the vision encoder to data from diverse distributions without empirical discovery in training domains. However, as the image description is mostly at a coarse-grained level and ignores visual details, the resulting embeddings are still ineffective in overcoming the complexity of domains at inference time. We present a self-supervision framework WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation, by only leveraging data in a single domain and without any test prior. Specifically, for each image, we first estimate the language embedding with fine-grained alignment, which can be consequently used to adaptively identify and then remove the domain-specific counterpart from the raw visual embedding. WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT. Experimental studies on three domain generalization datasets demonstrate the effectiveness of our approach.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.07052 [pdf, other]

Length-Aware Multi-Kernel Transformer for Long Document Classification

Authors: Guangzeng Han, Jack Tsao, Xiaolei Huang

Abstract: Lengthy documents pose a unique challenge to neural language models due to substantial memory consumption. While existing state-of-the-art (SOTA) models segment long texts into equal-length snippets (e.g., 128 tokens per snippet) or deploy sparse attention networks, these methods face new challenges of context fragmentation and generalizability due to sentence boundaries and varying text lengths. For example, our empirical analysis has shown that SOTA models consistently overfit one set of lengthy documents (e.g., 2000 tokens) while performing worse on texts with other lengths (e.g., 1000 or 4000). In this study, we propose a Length-Aware Multi-Kernel Transformer (LAMKIT) to address these new challenges for long document classification. LAMKIT encodes lengthy documents by diverse transformer-based kernels for bridging context boundaries and vectorizes text length by the kernels to promote model robustness over varying document lengths. Experiments on five standard benchmarks from health and law domains show LAMKIT outperforms SOTA models by up to an absolute 10.9%. We conduct extensive ablation analyses to examine model robustness and effectiveness over varying document lengths.

Submitted 11 May, 2024; originally announced May 2024.

Comments: Accepted to SEM 2024

arXiv:2405.06983 [pdf, other]

ISAC-Assisted Wireless Rechargeable Sensor Networks with Multiple Mobile Charging Vehicles

Authors: Muhammad Umar Farooq Qaisar, Weijie Yuan, Paolo Bellavista, Guangjie Han, Adeel Ahmed

Abstract: As IoT-based wireless sensor networks (WSNs) become more prevalent, the issue of energy shortages becomes more pressing. One potential solution is the use of wireless power transfer (WPT) technology, which is the key to building a new shape of wireless rechargeable sensor networks (WRSNs). However, efficient charging and scheduling are critical for WRSNs to function properly. Motivated by the fact that probabilistic techniques can help enhance the effectiveness of charging scheduling for WRSNs, this article addresses the aforementioned issue and proposes a novel ISAC-assisted WRSN protocol. In particular, our proposed protocol considers several factors to balance the charging load on each mobile charging vehicle (MCV), uses an efficient charging factor strategy to partially charge network devices, and employs the ISAC concept to reduce the traveling cost of each MCV and prevent charging conflicts. Simulation results demonstrate that this protocol outperforms other classic, cutting-edge protocols in multiple areas.

Submitted 11 May, 2024; originally announced May 2024.

Comments: Accepted for publication in the Special Issue Q1'2024, "Integrating Sensing and Communication for Ubiquitous Internet of Things," IEEE Internet of Things Magazine

arXiv:2404.13654 [pdf, other]

Multi-AUV Cooperative Underwater Multi-Target Tracking Based on Dynamic-Switching-enabled Multi-Agent Reinforcement Learning

Authors: Shengbo Wang, Chuan Lin, Guangjie Han, Shengchao Zhu, Zhixian Li, Zhenyu Wang

Abstract: With the rapid development of underwater communication, sensing, automation, and robot technologies, autonomous underwater vehicle (AUV) swarms are gradually becoming popular and have been widely promoted in ocean exploration and underwater tracking or surveillance, etc. However, the complex underwater environment poses significant challenges for AUV swarm-based accurate tracking of underwater moving targets. In this paper, we aim to propose a multi-AUV cooperative underwater multi-target tracking algorithm, especially when real underwater factors are taken into account. We first model the underwater sonar-based detection and the ocean current interference on the target tracking process. Then, we regard the AUV swarm as an underwater ad-hoc network and propose a novel Multi-Agent Reinforcement Learning (MARL) architecture for the AUV swarm based on Software-Defined Networking (SDN). It enhances the flexibility and scalability of the AUV swarm through centralized management and distributed operations. Based on the proposed MARL architecture, we propose the "dynamic-attention switching" and "dynamic-resampling switching" mechanisms to enhance the efficiency and accuracy of AUV swarm cooperation during task execution. Finally, based on a proposed AUV classification method, we propose an efficient cooperative tracking algorithm called ASMA. Evaluation results demonstrate that our proposed tracking algorithm can perform precise underwater multi-target tracking and compares favorably with many recent research works in terms of convergence speed and tracking accuracy.

Submitted 22 April, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.04656 [pdf, other]

Binary Classifier Optimization for Large Language Model Alignment

Authors: Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, Kyoung-Woon On

Abstract: Aligning Large Language Models (LLMs) to human preferences through preference optimization has been crucial but labor-intensive, necessitating for each prompt a comparison of both a chosen and a rejected text completion by evaluators. Recently, Kahneman-Tversky Optimization (KTO) has demonstrated that LLMs can be aligned using merely binary "thumbs-up" or "thumbs-down" signals on each prompt-completion pair. In this paper, we present theoretical foundations to explain the successful alignment achieved through these binary signals. Our analysis uncovers a new perspective: optimizing a binary classifier, whose logit is a reward, implicitly induces minimizing the Direct Preference Optimization (DPO) loss. In the process of this discovery, we identified two techniques for effective alignment: reward shift and underlying distribution matching. Consequently, we propose a new algorithm, Binary Classifier Optimization, that integrates the techniques. We validate our methodology in two settings: first, on a paired preference dataset, where our method performs on par with DPO and KTO; and second, on binary signal datasets simulating real-world conditions with divergent underlying distributions between thumbs-up and thumbs-down data. Our model consistently demonstrates effective and robust alignment across two base LLMs and three different binary signal datasets, showcasing the strength of our approach to learning from binary feedback.

Submitted 6 April, 2024; originally announced April 2024.

Comments: 18 pages, 9 figures

arXiv:2404.02838 [pdf, other]

I-Design: Personalized LLM Interior Designer

Authors: Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, Xi Wang

Abstract: Interior design allows us to be who we are and live how we want - each design is as unique as our distinct personality. However, it is not trivial for non-professionals to express and materialize this since it requires aligning functional and visual expectations with the constraints of physical space; this renders interior design a luxury. To make it more accessible, we present I-Design, a personalized interior designer that allows users to generate and visualize their design goals through natural language communication. I-Design starts with a team of large language model agents that engage in dialogues and logical reasoning with one another, transforming textual user input into feasible scene graph designs with relative object relationships. Subsequently, an effective placement algorithm determines optimal locations for each object within the scene. The final design is then constructed in 3D by retrieving and integrating assets from an existing object database. Additionally, we propose a new evaluation protocol that utilizes a vision-language model and complements the design pipeline. Extensive quantitative and qualitative experiments show that I-Design outperforms existing methods in delivering high-quality 3D design solutions and aligning with abstract concepts that match user input, showcasing its advantages across detailed 3D arrangement and conceptual fidelity.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2403.13786 [pdf, other]

Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts

Authors: Guangzeng Han, Weisi Liu, Xiaolei Huang, Brian Borsari

Abstract: Automatic coding of patient behaviors is essential to support decision making for psychotherapists during motivational interviewing (MI), a collaborative communication intervention approach to address psychiatric issues, such as alcohol and drug addiction. While the behavior coding task has rapidly adopted machine learning to predict patient states during MI sessions, a lack of domain-specific knowledge and the overlooking of patient-therapist interactions are major challenges in developing and deploying those models in real practice. To counter those challenges, we introduce the Chain-of-Interaction (CoI) prompting method, which aims to contextualize large language models (LLMs) for psychiatric decision support through dyadic interactions. The CoI prompting approach systematically breaks down the coding task into three key reasoning steps: extracting patient engagement, learning therapist question strategies, and integrating dyadic interactions between patients and therapists. This approach enables large language models to leverage the coding scheme, patient state, and domain knowledge for patient behavioral coding. Experiments on real-world datasets demonstrate the effectiveness and flexibility of our prompting method with multiple state-of-the-art LLMs over existing prompting baselines. We have conducted extensive ablation analysis and demonstrate the critical role of dyadic interactions in applying LLMs for psychotherapy behavior understanding.

Submitted 23 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: Accepted to IEEE ICHI 2024

arXiv:2403.10492 [pdf, other]

Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning

Authors: Dongmin Park, Zhaofang Qian, Guangxing Han, Ser-Nam Lim

Abstract: Mitigating hallucinations of Large Vision Language Models (LVLMs) is crucial to enhance their reliability for general-purpose assistants. This paper shows that such hallucinations of LVLMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues powered by our novel Adversarial Question Generator (AQG), which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LVLMs. On our benchmark, the zero-shot performance of state-of-the-art LVLMs drops significantly for both the VQA and Captioning tasks. Next, we further reveal this hallucination is mainly due to the prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning (AIT) that robustly fine-tunes LVLMs against hallucinatory dialogues. Extensive experiments show our proposed approach successfully reduces dialogue hallucination while maintaining performance.

Submitted 25 May, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.10873 [pdf, ps, other]

Probabilistic On-Demand Charging Scheduling for ISAC-Assisted WRSNs with Multiple Mobile Charging Vehicles

Authors: Muhammad Umar Farooq Qaisar, Weijie Yuan, Paolo Bellavista, Guangjie Han, Rabiu Sale Zakariyya, Adeel Ahmed

Abstract: The internet of things (IoT) based wireless sensor networks (WSNs) face an energy shortage challenge that could be overcome by the novel wireless power transfer (WPT) technology. The combination of WSNs and WPT is known as wireless rechargeable sensor networks (WRSNs), with the charging efficiency and charging scheduling being the primary concerns. Therefore, this paper proposes a probabilistic on-demand charging scheduling for integrated sensing and communication (ISAC)-assisted WRSNs with multiple mobile charging vehicles (MCVs) that addresses three parts. First, it considers four attributes with their probability distributions to balance the charging load on each MCV. The attributes are the residual energy of the charging node, the distance from the MCV to the charging node, the degree of the charging node, and the charging node's betweenness centrality. Second, it considers the efficient charging factor strategy to partially charge network nodes. Finally, it employs the ISAC concept to efficiently utilize the wireless resources to reduce the traveling cost of each MCV and to avoid charging conflicts between them. The simulation results show that the proposed protocol outperforms cutting-edge protocols in terms of energy usage efficiency, charging delay, survival rate, and travel distance.

Submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted for publication at the IEEE Global Communications Conference (GLOBECOM) 2023

arXiv:2401.08121 [pdf, other]

CycLight: learning traffic signal cooperation with a cycle-level strategy

Authors: Gengyue Han, Xiaohan Liu, Xianyue Peng, Hao Wang, Yu Han

Abstract: This study introduces CycLight, a novel cycle-level deep reinforcement learning (RL) approach for network-level adaptive traffic signal control (NATSC) systems. Unlike most traditional RL-based traffic controllers that focus on step-by-step decision making, CycLight adopts a cycle-level strategy, optimizing cycle length and splits simultaneously using the Parameterized Deep Q-Networks (PDQN) algorithm. This cycle-level approach effectively reduces the computational burden associated with frequent data communication, while enhancing the practicality and safety of real-world applications. A decentralized framework is formulated for multi-agent cooperation, while an attention mechanism is integrated to accurately assess the impact of the surroundings on the current intersection. CycLight is tested in a large synthetic traffic grid using the microscopic traffic simulation tool, SUMO. Experimental results not only demonstrate the superiority of CycLight over other state-of-the-art approaches but also showcase its robustness against information transmission delays.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2312.12423 [pdf, other]

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

Abstract: The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks.
Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.

Submitted 19 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: CVPR 2024 Highlight

arXiv:2312.12227 [pdf, other]

HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback

Authors: Gaoge Han, Shaoli Huang, Mingming Gong, Jinglei Tang

Abstract: We introduce HuTuMotion, an innovative approach for generating natural human motions that navigates latent motion diffusion models by leveraging few-shot human feedback. Unlike existing approaches that sample latent variables from a standard normal prior distribution, our method adapts the prior distribution to better suit the characteristics of the data, as indicated by human feedback, thus enhancing the quality of motion generation. Furthermore, our findings reveal that utilizing few-shot feedback can yield performance levels on par with those attained through extensive human feedback. This discovery emphasizes the potential and efficiency of incorporating few-shot human-guided optimization within latent diffusion models for personalized and style-aware human motion generation applications. The experimental results show the significantly superior performance of our method over existing state-of-the-art approaches.

Submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024 Main Track

arXiv:2311.01018 [pdf, other]

Expanding Expressiveness of Diffusion Models with Limited Data via Self-Distillation based Fine-Tuning

Authors: Jiwan Hur, Jaehyun Choi, Gyojin Han, Dong-Jae Lee, Junmo Kim

Abstract: Training diffusion models on limited datasets poses challenges in terms of limited generation capacity and expressiveness, leading to unsatisfactory results in various downstream tasks utilizing pretrained diffusion models, such as domain translation and text-guided image manipulation. In this paper, we propose Self-Distillation for Fine-Tuning diffusion models (SDFT), a methodology to address these challenges by leveraging diverse features from diffusion models pretrained on large source datasets. SDFT distills more general features (shape, colors, etc.) and less domain-specific features (texture, fine details, etc.) from the source model, allowing successful knowledge transfer without disturbing the training process on target datasets. The proposed method is not constrained by the specific architecture of the model and thus can be generally adopted in existing frameworks. Experimental results demonstrate that SDFT enhances the expressiveness of the diffusion model with limited datasets, resulting in improved generation capabilities across various downstream tasks.

Submitted 2 November, 2023; originally announced November 2023.

Comments: WACV 2024

arXiv:2310.10856 [pdf]

Joint Optimization of Traffic Signal Control and Vehicle Routing in Signalized Road Networks using Multi-Agent Deep Reinforcement Learning

Authors: Xianyue Peng, Hang Gao, Gengyue Han, Hao Wang, Michael Zhang

Abstract: Urban traffic congestion is a critical predicament that plagues modern road networks. To alleviate this issue and enhance traffic efficiency, traffic signal control and vehicle routing have proven to be effective measures. In this paper, we propose a joint optimization approach for traffic signal control and vehicle routing in signalized road networks. The objective is to enhance network performance by simultaneously controlling signal timings and route choices using Multi-Agent Deep Reinforcement Learning (MADRL). Signal control agents (SAs) are employed to establish signal timings at intersections, whereas vehicle routing agents (RAs) are responsible for selecting vehicle routes. By establishing relevance between agents and enabling them to share observations and rewards, interaction and cooperation among agents are fostered, which enhances individual training. The Multi-Agent Advantage Actor-Critic algorithm is used to handle multi-agent environments, and Deep Neural Network (DNN) structures are designed to facilitate the algorithm's convergence. Notably, our work is the first to utilize MADRL in determining the optimal joint policy for signal control and vehicle routing. Numerical experiments conducted on the modified Sioux network demonstrate that our integration of signal control and vehicle routing outperforms controlling signal timings or vehicles' routes alone in enhancing traffic efficiency.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.06404 [pdf, other]

Hexa: Self-Improving for Knowledge-Grounded Dialogue System

Authors: Daejin Jo, Daniel Wontae Nam, Gunsoo Han, Kyoung-Woon On, Taehwan Kwon, Seungeun Rho, Sungwoong Kim

Abstract: A common practice in knowledge-grounded dialogue generation is to explicitly utilize intermediate steps (e.g., web-search, memory retrieval) with modular approaches. However, data for such steps are often inaccessible compared to those of dialogue responses as they are unobservable in an ordinary dialogue. To fill in the absence of these data, we develop a self-improving method to improve the generative performances of intermediate steps without the ground truth data. In particular, we propose a novel bootstrapping scheme with a guided prompt and a modified loss function to enhance the diversity of appropriate self-generated responses. Through experiments on various benchmark datasets, we empirically demonstrate that our method successfully leverages a self-improving mechanism in generating intermediate and final responses and improves the performances on the task of knowledge-grounded dialogue generation.

Submitted 2 April, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2309.03509 [pdf, other]

BroadCAM: Outcome-agnostic Class Activation Mapping for Small-scale Weakly Supervised Applications

Authors: Jiatai Lin, Guoqiang Han, Xuemiao Xu, Changhong Liang, Tien-Tsin Wong, C. L. Philip Chen, Zaiyi Liu, Chu Han

Abstract: Class activation mapping (CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation (WSSS) and object localization (WSOL). It is the weighted aggregation of the feature maps by activating the high class-relevance ones. Current CAM methods achieve it relying on the training outcomes, such as predicted scores (forward information), gradients (backward information), etc. However, with small-scale data, unstable training may lead to less effective model outcomes and generate unreliable weights, finally resulting in incorrect activation and noisy CAM seeds. In this paper, we propose an outcome-agnostic CAM approach, called BroadCAM, for small-scale weakly supervised applications. Since the broad learning system (BLS) is independent of the model learning, BroadCAM can avoid the weights being affected by the unreliable model outcomes with small-scale data. By evaluating BroadCAM on VOC2012 (natural images) and BCSS-WSSS (medical images) for WSSS and OpenImages30k for WSOL, BroadCAM demonstrates better performance than existing CAM methods with small-scale data (less than 5%) in different CNN architectures. It also achieves SOTA performance with large-scale training data. Extensive qualitative comparisons are conducted to demonstrate how BroadCAM activates the high class-relevance feature maps and generates reliable CAMs with small-scale training data.

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2308.00783 [pdf, other]

Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking

Authors: Mingzhan Yang, Guangxin Han, Bin Yan, Wenhua Zhang, Jinqing Qi, Huchuan Lu, Dong Wang

Abstract: Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, spatial and appearance information will become ambiguous simultaneously due to the high overlap among objects. In this paper, we demonstrate this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Also, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, with both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack where interaction and severe occlusion frequently happen with complex motions. The code and models are available at https://github.com/ymzis69/HybridSORT.

Submitted 20 January, 2024; v1 submitted 1 August, 2023; originally announced August 2023.

Comments: Accepted to AAAI 2024

arXiv:2307.08671 [pdf, other]

Deep Cross-Modal Steganography Using Neural Representations

Authors: Gyojin Han, Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Junmo Kim

Abstract: Steganography is the process of embedding secret data into another message or data, in such a way that it is not easily noticeable. With the advancement of deep learning, Deep Neural Networks (DNNs) have recently been utilized in steganography. However, existing deep steganography techniques are limited in scope, as they focus on specific data types and are not effective for cross-modal steganography. Therefore, we propose a deep cross-modal steganography framework using Implicit Neural Representations (INRs) to hide secret data of various formats in cover images. The proposed framework employs INRs to represent the secret data, which can handle data of various modalities and resolutions. Experiments on various secret datasets of diverse types demonstrate that the proposed approach is expandable and capable of accommodating different modalities.

Submitted 7 October, 2023; v1 submitted 2 July, 2023; originally announced July 2023.

Comments: ICIP 2023 Oral

arXiv:2307.05889 [pdf, other]

Rethinking Mitosis Detection: Towards Diverse Data and Feature Representation

Authors: Hao Wang, Jiatai Lin, Danyi Li, Jing Wang, Bingchao Zhao, Zhenwei Shi, Xipeng Pan, Huadeng Wang, Bingbing Li, Changhong Liang, Guoqiang Han, Li Liang, Chu Han, Zaiyi Liu

Abstract: Mitosis detection is one of the fundamental tasks in computational pathology, which is extremely challenging due to the heterogeneity of mitotic cells. Most of the current studies solve the heterogeneity in the technical aspect by increasing the model complexity. However, lacking consideration of the biological knowledge and the complex model design may lead to the overfitting problem while limiting the generalizability of the detection model. In this paper, we systematically study the morphological appearances in different mitotic phases as well as the ambiguous non-mitotic cells and identify that balancing the data and feature diversity can achieve better generalizability. Based on this observation, we propose a novel generalizable framework (MitDet) for mitosis detection. The data diversity is considered by the proposed diversity-guided sample balancing (DGSB), and the feature diversity is preserved by the inter- and intra-class feature diversity-preserved module (InCDP). A stain enhancement (SE) module is introduced to enhance the domain-relevant diversity of both data and features simultaneously. Extensive experiments have demonstrated that our proposed model outperforms all the SOTA approaches in several popular mitosis detection datasets in both internal and external test sets using minimal annotation efforts with point annotations only. Comprehensive ablation studies have also proven the effectiveness of the rethinking of data and feature diversity balancing. By analyzing the results quantitatively and qualitatively, we believe that our proposed model not only achieves SOTA performance but also might inspire future studies from new perspectives. Source code is at https://github.com/Onehour0108/MitDet.

Submitted 11 July, 2023; originally announced July 2023.

arXiv:2306.10079 [pdf, other]

doi 10.1145/3580305.3599862

M3PT: A Multi-Modal Model for POI Tagging

Authors: Jingsong Yang, Guanzhou Han, Deqing Yang, Jingping Liu, Yanghua Xiao, Xiang Xu, Baohua Wu, Shenghua Ni

Abstract: POI tagging aims to annotate a point of interest (POI) with some informative tags, which facilitates many services related to POIs, including search, recommendation, and so on. Most of the existing solutions neglect the significance of POI images and seldom fuse the textual and visual features of POIs, resulting in suboptimal tagging performance. In this paper, we propose a novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced POI tagging through fusing the target POI's textual and visual features, and the precise matching between the multi-modal representations. Specifically, we first devise a domain-adaptive image encoder (DIE) to obtain the image embeddings aligned to their gold tags' semantics. Then, in M3PT's text-image fusion module (TIF), the textual and visual representations are fully fused into the POIs' content embeddings for the subsequent matching. In addition, we adopt a contrastive learning strategy to further bridge the gap between the representations of different modalities. To evaluate the tagging models' performance, we have constructed two high-quality POI tagging datasets from the real-world business scenario of Ali Fliggy. On these datasets, we conducted extensive experiments to demonstrate our model's advantage over uni-modal and multi-modal baselines, and to verify the effectiveness of important components in M3PT, including DIE, TIF and the contrastive learning strategy.

Submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted by KDD 2023

ACM Class: H.3.0

arXiv:2306.07289 [pdf, other]

Multi-Interactive-Modality based Modeling for Myopia Pro-Gression of Adolescent Student

Authors: Xiangyu Yan, Gongen Han, Can Fang, Xuan Jing

Abstract: Myopia is a common visual disorder that affects millions of people worldwide and its prevalence has been increasing in recent years. Environmental factors, such as reading time, viewing distance, and ambient lighting, have been identified as potential factors in the development of myopia. In this study, we investigated the relationship between three major factors and myopia in 120 adolescents. By collecting environmental images of the adolescents in the learning state as well as retinal fundus images, we proposed an environmental visual load (EVL) model to extract the potential information in these images. Through experimental data analysis, we found that these three major factors are closely related to the severity of myopia, and that the simultaneous exacerbation of these factors sharply increases the myopia of the eye. Our results suggest that interventions targeting these environmental factors may help prevent and manage myopia.

Submitted 8 June, 2023; originally announced June 2023.

Comments: 9 pages, 5 figures

arXiv:2306.02393 [pdf, other]

Accessible Robot Control in Mixed Reality

Authors: Ganlin Zhang, Deheng Zhang, Longteng Duan, Guo Han

Abstract: A novel method to control the Spot robot of Boston Dynamics with HoloLens 2 is proposed. This method is mainly designed for people with physical disabilities; users can control the robot's movement and robot arm without using their hands. The eye gaze tracking and head motion tracking technologies of HoloLens 2 are utilized for sending control commands. The movement of the robot follows the eye gaze, and the robot arm mimics the pose of the user's head. Our experiment shows that our method is comparable with the traditional joystick control method in both time efficiency and user experience. A demo can be found on our project webpage: https://zhangganlin.github.io/Holo-Spot-Page/index.html

Submitted 4 June, 2023; originally announced June 2023.

Comments: Course Project of Mixed Reality at ETH Zurich

arXiv:2305.13973 [pdf, other]

Effortless Integration of Memory Management into Open-Domain Conversation Systems

Authors: Eunbi Choi, Kyoung-Woon On, Gunsoo Han, Sungwoong Kim, Daniel Wontae Nam, Daejin Jo, Seung Eun Rho, Taehwan Kwon, Minjoon Seo

Abstract: Open-domain conversation systems integrate multiple conversation skills into a single system through a modular approach. One of the limitations of the system, however, is the absence of management capability for external memory. In this paper, we propose a simple method to improve BlenderBot3 by integrating memory management ability into it. Since no training data exists for this purpose, we propose an automated dataset creation method for memory management. Our method 1) requires little cost for data construction, 2) does not affect performance in other tasks, and 3) reduces external memory. We show that our proposed model BlenderBot3-M^3, which is multi-task trained with memory management, outperforms BlenderBot3 with a relative 4% performance gain in terms of F1 score.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2304.04625 [pdf, other]

Reinforcement Learning-Based Black-Box Model Inversion Attacks

Authors: Gyojin Han, Jaehyun Choi, Haeil Lee, Junmo Kim

Abstract: Model inversion attacks are a type of privacy attack that reconstructs private data used to train a machine learning model, solely by accessing the model. Recently, white-box model inversion attacks leveraging Generative Adversarial Networks (GANs) to distill knowledge from public datasets have been receiving great attention because of their excellent attack performance. On the other hand, current black-box model inversion attacks that utilize GANs suffer from issues such as being unable to guarantee the completion of the attack process within a predetermined number of query accesses or achieve the same level of performance as white-box attacks. To overcome these limitations, we propose a reinforcement learning-based black-box model inversion attack. We formulate the latent space search as a Markov Decision Process (MDP) problem and solve it with reinforcement learning. Our method utilizes the confidence scores of the generated images to provide rewards to an agent. Finally, the private data can be reconstructed using the latent vectors found by the agent trained in the MDP. The experiment results on various datasets and models demonstrate that our attack successfully recovers the private information of the target model by achieving state-of-the-art attack performance. We emphasize the importance of studies on privacy-preserving machine learning by proposing a more advanced black-box model inversion attack.

Submitted 10 April, 2023; originally announced April 2023.

Comments: CVPR 2023, Accepted

arXiv:2303.15466 [pdf, other]

Supervised Masked Knowledge Distillation for Few-Shot Transformers

Authors: Han Lin, Guangxing Han, Jiawei Ma, Shiyuan Huang, Xudong Lin, Shih-Fu Chang

Abstract: Vision Transformers (ViTs) achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled data, ViT tends to overfit and suffers from severe performance degradation due to its absence of CNN-like inductive bias. Previous works in FSL avoid this problem either through the help of self-supervised auxiliary losses, or through the dexterous use of label information under supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch tokens reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method with simple design outperforms previous methods by a large margin and achieves a new state-of-the-art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: https://github.com/HL-hanlin/SMKD.

Submitted 28 March, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

Comments: To appear in CVPR 2023

arXiv:2303.09674 [pdf, other]

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

Authors: Jiawei Ma, Yulei Niu, Jincheng Xu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang

Abstract: Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data. Existing approaches enhance few-shot generalization with the sacrifice of base-class performance, or maintain high precision in base-class detection with limited improvement in novel-class adaptation. In this paper, we point out the reason is insufficient Discriminative feature learning for all of the classes. As such, we propose a new training framework, DiGeo, to learn Geometry-aware features of inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins into the classification loss and encourage the features close to the class centers. Experimental studies on two few-shot benchmark datasets (VOC, COCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting the detection of base classes.

Submitted 16 March, 2023; originally announced March 2023.

Comments: CVPR 2023 Camera Ready (Supp Attached). Code Link: https://github.com/Phoenix-V/DiGeo
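
The "maximally and equally separated" class centers mentioned in the DiGeo abstract can be illustrated with the textbook simplex ETF construction. This is a minimal sketch under stated assumptions, not the authors' code: it builds K unit vectors in K dimensions via sqrt(K/(K-1)) * (I - (1/K) * ones), whereas the paper's classifier may project into a different feature dimension. The names `simplex_etf` and `dot` are hypothetical helpers.

```python
import math


def simplex_etf(num_classes):
    """Return K unit-norm class centers with pairwise cosine -1/(K-1).

    Textbook construction (an assumption, not taken from the paper):
    rows of sqrt(K/(K-1)) * (I - (1/K) * ones((K, K))).
    """
    if num_classes < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


centers = simplex_etf(4)
# Every center has unit norm; every pair shares the same negative cosine.
norms = [dot(c, c) for c in centers]
cosines = [dot(centers[i], centers[j]) for i in range(4) for j in range(i + 1, 4)]
```

For K = 4 every pairwise cosine equals -1/3, which is the most negative value that K unit vectors can share equally.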

arXiv:2302.14139 [pdf, other]

Scalable End-to-End ML Platforms: from AutoML to Self-serve

Authors: Igor L. Markov, Pavlos A. Apostolopoulos, Mia R. Garrard, Tanya Qie, Yin Huang, Tanvi Gupta, Anika Li, Cesar Cardoso, George Han, Ryan Maghsoudian, Norm Zhou

Abstract: ML platforms help enable intelligent data-driven applications and maintain them with limited engineering effort. Upon sufficiently broad adoption, such platforms reach economies of scale that bring greater component reuse while improving efficiency of system development and maintenance. For an end-to-end ML platform with broad adoption, scaling relies on pervasive ML automation and system integration to reach the quality we term self-serve that we define with ten requirements and six optional capabilities. With this in mind, we identify long-term goals for platform development, discuss related tradeoffs and future work. Our reasoning is illustrated on two commercially-deployed end-to-end ML platforms that host hundreds of real-time use cases -- one general-purpose and one specialized.

Submitted 3 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: 10 pages, 1 figure, 2 tables

arXiv:2302.13073 [pdf, other]

Feedback Capacity of the Continuous-Time ARMA(1,1) Gaussian Channel

Authors: Jun Su, Guangyue Han, Shlomo Shamai

Abstract: We consider the continuous-time ARMA(1,1) Gaussian channel and derive its feedback capacity in closed form. More specifically, the channel is given by $\boldsymbol{y}(t) = \boldsymbol{x}(t) + \boldsymbol{z}(t)$, where the channel input $\{\boldsymbol{x}(t)\}$ satisfies average power constraint $P$ and the noise $\{\boldsymbol{z}(t)\}$ is a first-order autoregressive moving average (ARMA(1,1)) Gaussian process satisfying $$ \boldsymbol{z}^\prime(t) + \kappa\boldsymbol{z}(t) = (\kappa+\lambda)\boldsymbol{w}(t) + \boldsymbol{w}^\prime(t), $$ where $\kappa > 0$, $\lambda \in \mathbb{R}$ and $\{\boldsymbol{w}(t)\}$ is a white Gaussian process with unit double-sided spectral density. We show that the feedback capacity of this channel is equal to the unique positive root of the equation $$ P(x+\kappa)^2 = 2x(x+\vert\kappa+\lambda\vert)^2 $$ when $-2\kappa < \lambda < 0$ and is equal to $P/2$ otherwise. Among many others, this result shows that, as opposed to a discrete-time additive Gaussian channel, feedback may not increase the capacity of a continuous-time additive Gaussian channel even if the noise process is colored. The formula enables us to conduct a thorough analysis of the effect of feedback on the capacity for such a channel. We characterize when the feedback capacity equals or doubles the non-feedback capacity; moreover, we disprove continuous-time analogues of the half-bit bound and Cover's $2P$ conjecture for discrete-time additive Gaussian channels.

Submitted 10 April, 2024; v1 submitted 25 February, 2023; originally announced February 2023.
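
The closed-form characterization in the abstract above can be evaluated numerically. The following is a minimal sketch, assuming only what the abstract states: the capacity is the unique positive root of P(x+κ)^2 = 2x(x+|κ+λ|)^2 when -2κ < λ < 0, and P/2 otherwise. The function name `feedback_capacity`, the bisection solver, and the tolerance are illustrative choices, not part of the paper.

```python
def feedback_capacity(P, kappa, lam, tol=1e-12):
    """Evaluate the capacity expression quoted in the abstract.

    Finds the positive root of P*(x+kappa)**2 == 2*x*(x+|kappa+lam|)**2
    by bisection when -2*kappa < lam < 0; returns P/2 otherwise.
    """
    if P <= 0 or kappa <= 0:
        raise ValueError("P and kappa must be positive")
    if not (-2 * kappa < lam < 0):
        return P / 2

    a = abs(kappa + lam)

    def f(x):
        # f(0) = -P*kappa**2 < 0 and f grows like 2*x**3, so a sign change exists.
        return 2 * x * (x + a) ** 2 - P * (x + kappa) ** 2

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:  # expand the bracket until f changes sign
        hi *= 2
    while hi - lo > tol * max(1.0, hi):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


c_inside = feedback_capacity(P=1.0, kappa=1.0, lam=-1.0)   # inside (-2*kappa, 0)
c_outside = feedback_capacity(P=1.0, kappa=1.0, lam=0.5)   # outside: P/2
```

Inside the interval, |κ+λ| < κ makes the left-hand side exceed the right-hand side at x = P/2, so the root lies above P/2; at the endpoints λ = 0 and λ = -2κ the equation reduces to x = P/2, matching the "otherwise" branch.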

arXiv:2302.12662 [pdf, other]

FedDBL: Communication and Data Efficient Federated Deep-Broad Learning for Histopathological Tissue Classification

Authors: Tianpeng Deng, Yanqi Huang, Guoqiang Han, Zhenwei Shi, Jiatai Lin, Qi Dou, Zaiyi Liu, Xiao-jing Guo, C. L. Philip Chen, Chu Han

Abstract: Histopathological tissue classification is a fundamental task in computational pathology. Deep learning-based models have achieved superior performance but centralized training with data centralization suffers from the privacy leakage problem. Federated learning (FL) can safeguard privacy by keeping training samples locally, but existing FL-based frameworks require a large number of well-annotated training samples and numerous rounds of communication which hinder their practicability in the real-world clinical scenario. In this paper, we propose a universal and lightweight federated learning framework, named Federated Deep-Broad Learning (FedDBL), to achieve superior classification performance with limited training samples and only one-round communication. By simply associating a pre-trained deep learning feature extractor, a fast and lightweight broad learning inference system and a classical federated aggregation approach, FedDBL can dramatically reduce data dependency and improve communication efficiency. Five-fold cross-validation demonstrates that FedDBL greatly outperforms the competitors with only one-round communication and limited training samples, while it even achieves comparable performance with the ones under multiple-round communications. Furthermore, due to the lightweight design and one-round communication, FedDBL reduces the communication burden from 4.6GB to only 276.5KB per client using the ResNet-50 backbone at 50-round training. Since no data or deep models are shared across different clients, the privacy issue is well-solved and the model security is guaranteed with no model inversion attack risk. Code is available at https://github.com/tianpeng-deng/FedDBL.

Submitted 17 December, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

arXiv:2212.13738 [pdf, other]

TempCLR: Temporal Alignment Representation with Contrastive Learning

Authors: Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang

Abstract: Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. Then, we obtain the representations for clips/sentences, which perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on the video and paragraph, our approach can also generalize on the matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gain over all three tasks. Detailed ablation studies are provided to justify the approach design.

Submitted 29 March, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

Comments: ICLR 2023 Camera Ready. Code Link: https://github.com/yyuncong/TempCLR
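
The TempCLR abstract above describes using dynamic time warping (DTW) to compute a minimum cumulative cost over sentence-clip pairs. The following is a minimal sketch of plain DTW over cosine distances, not the authors' implementation; the function names `cosine_cost` and `dtw_distance` and the toy one-hot embeddings are assumptions made for illustration.

```python
import math


def cosine_cost(a, b):
    """Cosine distance (1 - cosine similarity) between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dtw_distance(sentences, clips):
    """Minimum cumulative cost of a monotonic alignment between two sequences.

    acc[i][j] holds the cheapest cost of aligning the first i sentences
    with the first j clips while respecting temporal order.
    """
    n, m = len(sentences), len(clips)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cosine_cost(sentences[i - 1], clips[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Hypothetical toy embeddings: three sentences and three clips.
sentences = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
aligned_clips = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shuffled_clips = aligned_clips[::-1]  # temporal order reversed

print(dtw_distance(sentences, aligned_clips))
print(dtw_distance(sentences, shuffled_clips))
```

In this toy case the temporally ordered clips align at zero cost, while the reversed clips incur a strictly larger cost, which is the kind of signal a sequence-level contrastive objective could exploit.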

arXiv:2211.15875 [pdf, other]

Data Poisoning Attack Aiming the Vulnerability of Continual Learning

Authors: Gyojin Han, Jaehyun Choi, Hyeong Gwon Hong, Junmo Kim

Abstract: Generally, regularization-based continual learning models limit access to the previous task data to imitate the real-world constraints related to memory and privacy. However, this introduces a problem in these models by not being able to track the performance on each task. In essence, current continual learning methods are susceptible to attacks on previous tasks. We demonstrate the vulnerability of regularization-based continual learning methods by presenting a simple task-specific data poisoning attack that can be used in the learning process of a new task. Training data generated by the proposed attack causes performance degradation on a specific task targeted by the attacker. We experiment with the attack on two representative regularization-based continual learning methods, Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), trained with variants of the MNIST dataset. The experimental results justify the vulnerability proposed in this paper and demonstrate the importance of developing continual learning models that are robust to adversarial attacks.

Submitted 3 July, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: ICIP 2023 (NeurIPS 2022 ML Safety Workshop accepted paper)

arXiv:2210.12444 [pdf, other]

Weakly-Supervised Temporal Article Grounding

Authors: Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han, Christopher Thomas, Hammad Ayyubi, Heng Ji, Shih-Fu Chang

Abstract: Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work holds two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. 2) All query sentences for the same video are always at the same semantic scale. Unfortunately, both assumptions make today's VG models fail to work in practice. For example, in real-world multimodal assets (e.g., news articles), most of the sentences in the article cannot be grounded in their affiliated videos, and they typically have rich hierarchical relations (i.e., at different semantic scales). To this end, we propose a new challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences to the video, and these sentences are possibly at different semantic scales. Accordingly, we collect the first WSAG dataset to facilitate this task: YouwikiHow, which borrows the inherent multi-scale descriptions in wikiHow articles and plentiful YouTube videos. In addition, we propose a simple but effective method DualMIL for WSAG, which consists of a two-level MIL loss and a single-/cross- sentence constraint loss. These training objectives are carefully designed for these relaxed assumptions. Extensive ablations have verified the effectiveness of DualMIL.

Submitted 23 February, 2023; v1 submitted 22 October, 2022; originally announced October 2022.

Comments: EMNLP 2022, https://github.com/zjuchenlong/WSAG

arXiv:2210.09198 [pdf, other]

Pixel-Aligned Non-parametric Hand Mesh Reconstruction

Authors: Shijian Jiang, Guwen Han, Danhang Tang, Yang Zhou, Xiang Li, Jiming Chen, Qi Ye

Abstract: Non-parametric mesh reconstruction has recently shown significant progress in 3D hand and body applications. In these methods, mesh vertices and edges are visible to neural networks, enabling the possibility to establish a direct mapping between 2D image pixels and 3D mesh vertices. In this paper, we seek to establish and exploit this mapping with a simple and compact architecture. The network is designed with these considerations: 1) aggregating both local 2D image features from the encoder and 3D geometric features captured in the mesh decoder; 2) decoding coarse-to-fine meshes along the decoding layers to make the best use of the hierarchical multi-scale information. Specifically, we propose an end-to-end pipeline for hand mesh recovery tasks which consists of three phases: a 2D feature extractor constructing multi-scale feature maps, a feature mapping module transforming local 2D image features to 3D vertex features via 3D-to-2D projection, and a mesh decoder combining graph convolution and self-attention to reconstruct the mesh. The decoder aggregates both local image features in pixels and geometric features in vertices. It also regresses the mesh vertices in a coarse-to-fine manner, which can leverage multi-scale information. By exploiting the local connection and designing the mesh decoder, our approach achieves state-of-the-art results for hand mesh reconstruction on the public FreiHAND dataset.

Submitted 17 October, 2022; originally announced October 2022.

arXiv:2208.06132 [pdf, ps, other]

On the Physical Layer Security of Visible Light Communications Empowered by Gold Nanoparticles

Authors: Geonho Han, Hyuckjin Choi, Ryeong Myeong Kim, Ki Tae Nam, Junil Choi, Theodoros A. Tsiftsis

Abstract: Visible light is a proper spectrum for secure wireless communications because of its high directivity and impermeability in indoor scenarios. However, if an eavesdropper is located very close to a legitimate receiver, secure communications become highly risky. In this paper, to further increase the level of security of visible light communication (VLC) and increase its resilience against malicious attacks, we propose to capitalize on the recently synthesized gold nanoparticles (GNPs) with chiroptical properties for circularly polarized light, resulting in phase retardation that interacts with the linear polarizer angle. GNP plates made by judiciously stacking many GNPs perform as physical secret keys. Transmitters send both the intended symbol and artificial noise to exploit the channel variation effect by the GNP plates, which is highly effective when an eavesdropper is located close to the legitimate receiver. A new VLC channel model is first developed by representing the effect of GNP plates and linear polarizers in the circular polarization domain. Based on the new channel model, the angles of linear polarizers at the transmitters and legitimate receiver are optimized considering the effect of GNP plates to increase the secrecy rate in wiretapping scenarios. Simulations verify that when the transmitters are equipped with GNP plates, even if the eavesdropper is located right next to the legitimate receiver, insightful results on the physical layer security metrics are gained as follows: 1) the secrecy rate is significantly improved and 2) the symbol error rate gap between the legitimate receiver and eavesdropper becomes much larger due to the chiroptical properties of GNP plates.

Submitted 7 June, 2024; v1 submitted 12 August, 2022; originally announced August 2022.

arXiv:2208.02473 [pdf, ps, other]

doi 10.1109/TSP.2022.3213488

Radar Imaging Based on IEEE 802.11ad Waveform in V2I Communications

Authors: Geonho Han, Junil Choi, Robert W. Heath Jr

Abstract: Since most vehicular radar systems already exploit millimeter-wave (mmWave) spectra, it would become much more feasible to implement a joint radar and communication system by extending communication frequencies into the mmWave band. In this paper, an IEEE 802.11ad waveform-based radar imaging technique is proposed for vehicular settings. A roadside unit (RSU) transmits the IEEE 802.11ad waveform to a vehicle for communications while the RSU also listens to the echoes of the transmitted waveform to perform inverse synthetic aperture radar (ISAR) imaging. To obtain high-resolution images of the vehicle, the RSU needs to accurately estimate the round-trip delays, Doppler shifts, and velocity of the vehicle. The proposed ISAR imaging first estimates the round-trip delays using the good correlation property of Golay complementary sequences in the IEEE 802.11ad preamble. The Doppler shifts are then obtained using least squares estimation from the echo signals and refined to compensate for phase wrapping caused by phase rotation. The velocity of the vehicle is determined using an equation of motion and the estimated Doppler shifts. Simulation results verify that the proposed technique is able to form high-resolution ISAR images from point scatterer models of realistic vehicular settings with different viewpoints. The proposed ISAR imaging technique can be used for various vehicular applications, e.g., traffic condition analyses or advanced collision warning systems.

Submitted 4 August, 2022; originally announced August 2022.
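
The radar imaging abstract above relies on the correlation property of Golay complementary sequences for delay estimation. The sketch below illustrates that property with a generic recursive construction; it does not reproduce the specific sequences defined in the IEEE 802.11ad preamble, and the names `golay_pair` and `autocorr` are assumptions for illustration.

```python
def golay_pair(doublings):
    """Build a binary Golay complementary pair of length 2**doublings.

    Uses the standard concatenation rule: (a, b) -> (a|b, a|-b).
    """
    a, b = [1], [1]
    for _ in range(doublings):
        a, b = a + b, a + [-x for x in b]
    return a, b


def autocorr(seq, lag):
    """Aperiodic autocorrelation of seq at a non-negative lag."""
    return sum(seq[i] * seq[i + lag] for i in range(len(seq) - lag))


a, b = golay_pair(3)  # two sequences of length 8
summed = [autocorr(a, k) + autocorr(b, k) for k in range(len(a))]
print(summed)  # sidelobes of the two sequences cancel at every non-zero lag
```

Because the summed autocorrelation is a single peak with no sidelobes, correlating a received echo against both sequences and adding the results gives a clean peak at the round-trip delay, which is presumably why such sequences suit delay estimation.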

arXiv:2207.09625 [pdf, other]

Explicit Image Caption Editing

Authors: Zhen Wang, Long Chen, Wenbo Ma, Guangxing Han, Yulei Niu, Jian Shao, Jun Xiao

Abstract: Given an image and a reference caption, the image caption editing task aims to correct the misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, i.e., they directly produce the refined captions without explicit connections to the reference captions. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence can translate the reference caption into a refined one. Compared to implicit editing, ECE has multiple advantages: 1) Explainable: it can trace the whole editing path. 2) Editing Efficient: it only needs to modify a few words. 3) Human-like: it resembles the way that humans perform caption editing, and tries to keep original sentence structures. To solve this new task, we propose the first ECE model: TIger. TIger is a non-autoregressive transformer-based model, consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved or not, Tagger_add decides where to add new words, and Inserter predicts the specific word for adding. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both benchmarks have demonstrated the effectiveness of TIger.

Submitted 19 July, 2022; originally announced July 2022.

Comments: ECCV 2022, dataset and code are available at https://github.com/baaaad/ECE
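
To make the idea of an explicit edit-operation sequence in the ECE abstract above concrete, here is a small sketch of applying keep/delete decisions and insertions to a reference caption. It only illustrates the data flow; the function `apply_edits`, its argument layout, and the example caption are hypothetical and are not taken from the TIger code.

```python
def apply_edits(reference, keep_flags, insertions):
    """Apply explicit edit operations to a tokenized reference caption.

    reference  : list of words in the reference caption
    keep_flags : keep_flags[i] is False when word i should be deleted
    insertions : dict mapping a word index to the words inserted right
                 after that position (use -1 to insert at the front)
    """
    refined = list(insertions.get(-1, []))
    for i, word in enumerate(reference):
        if keep_flags[i]:
            refined.append(word)
        refined.extend(insertions.get(i, []))
    return refined


reference = "a dog sits on the bench".split()
keep_flags = [True, False, True, True, True, False]  # delete "dog" and "bench"
insertions = {1: ["cat"], 5: ["sofa"]}               # add replacement words

print(" ".join(apply_edits(reference, keep_flags, insertions)))
```

The three arguments loosely mirror the three decisions named in the abstract (what to delete, where to add, and which word to add), and the whole editing path stays inspectable because every change is an explicit operation on the reference.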

arXiv:2207.07554 [pdf, ps, other]

Renyi Entropy Rate of Stationary Ergodic Processes

Authors: Chengyu Wu, Yonglong Li, Li Xu, Guangyue Han

Abstract: In this paper, we examine the Renyi entropy rate of stationary ergodic processes. For a special class of stationary ergodic processes, we prove that the Renyi entropy rate always exists and can be polynomially approximated by its defining sequence; moreover, using the Markov approximation method, we show that the Renyi entropy rate can be exponentially approximated by that of the Markov approximating sequence, as the Markov order goes to infinity. For the general case, by constructing a counterexample, we disprove the conjecture that the Renyi entropy rate of a general stationary ergodic process always converges to its Shannon entropy rate as α goes to 1.

Submitted 15 July, 2022; originally announced July 2022.
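
For readers unfamiliar with the quantities in the Renyi entropy rate abstract above, the standard textbook definitions can be written as follows. The notation ($X_1^n$, $p$, $H_\alpha$) is an assumption made here and is not taken from the paper.

```latex
% Renyi entropy of order \alpha (\alpha > 0, \alpha \neq 1) of a discrete random variable X
H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{x} p(x)^{\alpha}

% Renyi entropy rate of a process (X_n), when the limit exists;
% the sequence on the right is the "defining sequence"
\bar{H}_\alpha = \lim_{n \to \infty} \frac{1}{n} H_\alpha(X_1^n)

% For a single distribution on a finite alphabet, the Shannon entropy is recovered as \alpha \to 1
\lim_{\alpha \to 1} H_\alpha(X) = -\sum_{x} p(x) \log p(x)
```

The conjecture mentioned in the abstract asks whether the last limit carries over from single random variables to entropy rates of general stationary ergodic processes.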

arXiv:2207.07370 [pdf, other]

CKD-TransBTS: Clinical Knowledge-Driven Hybrid Transformer with Modality-Correlated Cross-Attention for Brain Tumor Segmentation

Authors: Jianwei Lin, Jiatai Lin, Cheng Lu, Hao Chen, Huan Lin, Bingchao Zhao, Zhenwei Shi, Bingjiang Qiu, Xipeng Pan, Zeyan Xu, Biao Huang, Changhong Liang, Guoqiang Han, Zaiyi Liu, Chu Han

Abstract: Brain tumor segmentation (BTS) in magnetic resonance imaging (MRI) is crucial for brain tumor diagnosis, cancer management and research purposes. With the great success of the ten-year BraTS challenges as well as the advances of CNN and Transformer algorithms, a lot of outstanding BTS models have been proposed to tackle the difficulties of BTS in different technical aspects. However, existing studies hardly consider how to fuse the multi-modality images in a reasonable manner. In this paper, we leverage the clinical knowledge of how radiologists diagnose brain tumors from multiple MRI modalities and propose a clinical knowledge-driven brain tumor segmentation model, called CKD-TransBTS. Instead of directly concatenating all the modalities, we re-organize the input modalities by separating them into two groups according to the imaging principle of MRI. A dual-branch hybrid encoder with the proposed modality-correlated cross-attention block (MCCA) is designed to extract the multi-modality image features. The proposed model inherits the strengths from both Transformer and CNN with the local feature representation ability for precise lesion boundaries and long-range feature extraction for 3D volumetric images. To bridge the gap between Transformer and CNN features, we propose a Trans&CNN Feature Calibration block (TCFC) in the decoder. We compare the proposed model with five CNN-based models and six transformer-based models on the BraTS 2021 challenge dataset. Extensive experiments demonstrate that the proposed model achieves state-of-the-art brain tumor segmentation performance compared with all the competitors.

Submitted 15 July, 2022; originally announced July 2022.

arXiv:2204.13065 [pdf]

Treating Crowdsourcing as Examination: How to Score Tasks and Online Workers?

Authors: Guangyang Han, Sufang Li, Runmin Wang, Chunming Wu

Abstract: Crowdsourcing is an online outsourcing mode which can address the urgent need of current machine learning algorithms for massive labeled data. A requester posts tasks on crowdsourcing platforms, which employ online workers over the Internet to complete the tasks, then aggregate and return the results to the requester. How to model the interaction between different types of workers and tasks is a hot research topic. In this paper, we model workers as four types based on their ability: expert, normal worker, sloppy worker and spammer, and divide tasks into hard, medium and easy tasks according to their difficulty. We believe that even experts struggle with difficult tasks while sloppy workers can get easy tasks right, and spammers always give out wrong answers deliberately. So, good examination tasks should have a moderate degree of difficulty and discriminability to score workers more objectively. Thus, we first score workers' ability mainly on the medium-difficulty tasks, then reduce the weight of answers from sloppy workers and modify the answers from spammers when inferring the tasks' ground truth. A probabilistic graphical model is adopted to simulate the task execution process, and an iterative method is adopted to calculate and update the ground truth, the ability of workers and the difficulty of the task successively. We verify the correctness and effectiveness of our algorithm in both simulated and real crowdsourcing scenes.

Submitted 26 April, 2022; originally announced April 2022.

arXiv:2204.07841 [pdf, other]

Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting

Authors: Guangxing Han, Long Chen, Jiawei Ma, Shiyuan Huang, Rama Chellappa, Shih-Fu Chang

Abstract: We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection, which are complementary to each other by definition. Most of the previous works on multi-modal FSOD are fine-tuning-based, which is inefficient for online applications. Moreover, these methods usually require expertise like class names to extract class semantic embeddings, which are hard to get for rare classes. Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning to learn generalizable few-shot and zero-shot object detection models respectively without fine-tuning. Specifically, we combine the few-shot visual classifier and text classifier learned via meta-learning and prompt-based learning respectively to build the multi-modal classifier and detection models. In addition, to fully exploit the pre-trained language models, we propose meta-learning-based cross-modal prompting to generate soft prompts for novel classes present in few-shot visual examples, which are then used to learn the text classifier. Knowledge distillation is introduced to learn the soft prompt generator without using human prior knowledge of class names, which may not be available for rare classes. Our insight is that the few-shot support images naturally include related context information and semantics of the class. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.

Submitted 27 March, 2023; v1 submitted 16 April, 2022; originally announced April 2022.

Comments: 17 pages

arXiv:2204.06455 [pdf, other]

WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung Adenocarcinoma

Authors: Chu Han, Xipeng Pan, Lixu Yan, Huan Lin, Bingbing Li, Su Yao, Shanshan Lv, Zhenwei Shi, Jinhai Mai, Jiatai Lin, Bingchao Zhao, Zeyan Xu, Zhizhen Wang, Yumeng Wang, Yuan Zhang, Huihui Wang, Chao Zhu, Chunhui Lin, Lijian Mao, Min Wu, Luwen Duan, Jingsong Zhu, Dong Hu, Zijie Fang, Yang Chen, et al. (18 additional authors not shown)

Abstract: Lung cancer is the leading cause of cancer death worldwide, and adenocarcinoma (LUAD) is the most common subtype. Exploiting the potential value of the histopathology images can promote precision medicine in oncology. Tissue segmentation is the basic upstream task of histopathology image analysis. Existing deep learning models have achieved superior segmentation performance but require sufficient pixel-level annotations, which are time-consuming and expensive to produce. To enrich the label resources of LUAD and to alleviate the annotation efforts, we organize this challenge WSSS4LUAD to call for the outstanding weakly-supervised semantic segmentation (WSSS) techniques for histopathology images of LUAD. Participants have to design the algorithm to segment tumor epithelial, tumor-associated stroma and normal tissue with only patch-level labels. This challenge includes 10,091 patch-level annotations (the training set) and over 130 million labeled pixels (the validation and test sets), from 87 WSIs (67 from GDPH, 20 from TCGA). All the labels were generated by a pathologist-in-the-loop pipeline with the help of AI models and checked by the label review board. Among 532 registrations, 28 teams submitted the results in the test phase with over 1,000 submissions. Finally, the first place team achieved mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919). According to the technical reports of the top-tier teams, CAM is still the most popular approach in WSSS. Cutmix data augmentation has been widely adopted to generate more reliable samples. With the success of this challenge, we believe that WSSS approaches with patch-level annotations can be a complement to the traditional pixel annotations while reducing the annotation efforts. The entire dataset has been released to encourage more research on computational pathology in LUAD and more novel WSSS techniques.

Submitted 13 April, 2022; v1 submitted 13 April, 2022; originally announced April 2022.
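
The WSSS4LUAD abstract above reports a mean intersection-over-union (mIoU) together with per-class IoUs. The sketch below shows how such a score is typically computed from flat label lists and checks that the reported mIoU matches the average of the three reported class scores; the helper names `class_iou` and `mean_iou` and the toy labels are assumptions, and the challenge's exact evaluation protocol may differ.

```python
def class_iou(pred, target, cls):
    """Intersection over union for one class over flat label sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return intersection / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Unweighted mean of the per-class IoU values."""
    scores = [class_iou(pred, target, c) for c in classes]
    return sum(scores) / len(scores)


# Hypothetical toy labels: 0 = tumor, 1 = stroma, 2 = normal.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, classes=[0, 1, 2]))

# Averaging the per-class IoUs quoted in the abstract reproduces the reported mIoU.
reported = [0.8389, 0.7931, 0.8919]
print(round(sum(reported) / len(reported), 4))
```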

arXiv:2204.03873 [pdf, other]

Spatial Transformer Network on Skeleton-based Gait Recognition

Authors: Cun Zhang, Xing-Peng Chen, Guo-Qiang Han, Xiang-Jie Liu

Abstract: Skeleton-based gait recognition models usually suffer from the robustness problem, as the Rank-1 accuracy varies from 90% in normal walking cases to 70% in walking with coats cases. In this work, we propose a state-of-the-art robust skeleton-based gait recognition model called Gait-TR, which is based on the combination of spatial transformer frameworks and temporal convolutional networks. Gait-TR achieves substantial improvements over other skeleton-based gait models with higher accuracy and better robustness on the well-known gait dataset CASIA-B. Particularly in walking with coats cases, Gait-TR achieves a 90% Rank-1 gait recognition accuracy rate, which is higher than the best result of silhouette-based models, which usually have higher accuracy than skeleton-based gait recognition models. Moreover, our experiment on CASIA-B shows that the spatial transformer can extract gait features from the human skeleton better than the widely used graph convolutional network.

Submitted 8 April, 2022; originally announced April 2022.

arXiv:2203.15021 [pdf, other]

Few-Shot Object Detection with Fully Cross-Transformer

Authors: Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Shih-Fu Chang

Abstract: Few-shot object detection (FSOD), with the aim to detect novel objects using very few training examples, has recently attracted great research interest in the community. Metric-learning based methods have been demonstrated to be effective for this task; they use a two-branch siamese network and calculate the similarity between image regions and few-shot examples for detection. However, in previous works, the interaction between the two branches is restricted to the detection head, while leaving the remaining hundreds of layers for separate feature extraction. Inspired by the recent work on vision transformers and vision-language transformers, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD by incorporating cross-transformer into both the feature backbone and detection head. The asymmetric-batched cross-attention is proposed to aggregate the key information from the two branches with different batch sizes. Our model can improve the few-shot similarity learning between the two branches by introducing the multi-level interactions. Comprehensive experiments on both PASCAL VOC and MSCOCO FSOD benchmarks demonstrate the effectiveness of our model.

Submitted 29 September, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: CVPR 2022 (Oral). Code is available at https://github.com/GuangxingHan/FCT

arXiv:2202.01747 [pdf, other]

The Met Dataset: Instance-level Recognition for Artworks

Authors: Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi, Nanne Van Noord, Giorgos Tolias

Abstract: This work introduces a dataset for large-scale instance-level recognition in the domain of artworks. The proposed benchmark exhibits a number of different challenges such as large inter-class similarity, long tail distribution, and many classes. We rely on the open access collection of The Met museum to form a large training set of about 224k classes, where each class corresponds to a museum exhibit with photos taken under studio conditions. Testing is primarily performed on photos taken by museum guests depicting exhibits, which introduces a distribution shift between training and testing. Testing is additionally performed on a set of images not related to Met exhibits making the task resemble an out-of-distribution detection problem. The proposed benchmark follows the paradigm of other recent datasets for instance-level recognition on different domains to encourage research on domain independent approaches. A number of suitable approaches are evaluated to offer a testbed for future comparisons. Self-supervised and supervised contrastive learning are effectively combined to train the backbone which is used for non-parametric classification that is shown as a promising direction. Dataset webpage: http://cmp.felk.cvut.cz/met/

Submitted 3 February, 2022; originally announced February 2022.

arXiv:2202.01551 [pdf, ps, other]

Isometries and MacWilliams Extension Property for Weighted Poset Metric

Authors: Yang Xu, Haibin Kan, Guangyue Han

Abstract: Let $\mathbf{H}$ be the cartesian product of a family of left modules over a ring $S$, indexed by a finite set $Ω$. We are concerned with the $(\mathbf{P},ω)$-weight on $\mathbf{H}$, where $\mathbf{P}=(Ω,\preccurlyeq_{\mathbf{P}})$ is a poset and $ω:Ω\longrightarrow\mathbb{R}^{+}$ is a weight function. We characterize the group of $(\mathbf{P},ω)$-weight isometries of $\mathbf{H}$, and give a canonical decomposition for semi-simple subcodes of $\mathbf{H}$ when $\mathbf{P}$ is hierarchical. We then study the MacWilliams extension property (MEP) for $(\mathbf{P},ω)$-weight. We show that the MEP implies the unique decomposition property (UDP) of $(\mathbf{P},ω)$, which further implies that $\mathbf{P}$ is hierarchical if $ω$ is identically $1$. For the case that either $\mathbf{P}$ is hierarchical or $ω$ is identically $1$, we show that the MEP for $(\mathbf{P},ω)$-weight can be characterized in terms of the MEP for Hamming weight, and give necessary and sufficient conditions for $\mathbf{H}$ to satisfy the MEP for $(\mathbf{P},ω)$-weight when $S$ is an Artinian simple ring (either finite or infinite). When $S$ is a finite field, in the context of $(\mathbf{P},ω)$-weight, we compare the MEP with other coding theoretic properties including the MacWilliams identity, Fourier-reflexivity of partitions and the UDP, and show that the MEP is strictly stronger than all the rest among them.

Submitted 20 July, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:2201.10828

arXiv:2201.10828 [pdf, ps, other]

Reflexivity of Partitions Induced by Weighted Poset Metric and Combinatorial Metric

Authors: Yang Xu, Haibin Kan, Guangyue Han

Abstract: Let $\mathbf{H}$ be the Cartesian product of a family of finite abelian groups. Via a polynomial approach, we give sufficient conditions for a partition of $\mathbf{H}$ induced by weighted poset metric to be reflexive, which also become necessary for some special cases. Moreover, by examining the roots of the Krawtchouk polynomials, we establish non-reflexive partitions of $\mathbf{H}$ induced by combinatorial metric. When $\mathbf{H}$ is a vector space over a finite field $\mathbb{F}$, we consider the property of admitting MacWilliams identity (PAMI) and the MacWilliams extension property (MEP) for partitions of $\mathbf{H}$. With some invariance assumptions, we show that two partitions of $\mathbf{H}$ admit MacWilliams identity if and only if they are mutually dual and reflexive, and any partition of $\mathbf{H}$ satisfying the MEP is in fact an orbit partition induced by some subgroup of $\operatorname{Aut}_{\mathbb{F}}(\mathbf{H})$, which is necessarily reflexive. As an application of the aforementioned results, we establish partitions of $\mathbf{H}$ induced by combinatorial metric that do not satisfy the MEP, which further enable us to provide counter-examples to a conjecture proposed by Pinheiro, Machado and Firer in \cite{39}.

Submitted 20 July, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

arXiv:2201.05277 [pdf, other]

Boundary-aware Self-supervised Learning for Video Scene Segmentation

Authors: Jonghwan Mun, Minchul Shin, Gunsoo Han, Sangho Lee, Seongsu Ha, Joonseok Lee, Eun-Sol Kim

Abstract: Self-supervised learning has drawn attention through its effectiveness in learning in-domain representations with no ground-truth annotations; in particular, it is shown that properly designed pretext tasks (e.g., contrastive prediction task) bring significant performance gains for downstream tasks (e.g., classification task). Inspired from this, we tackle video scene segmentation, which is a task of temporally localizing scene boundaries in a video, with a self-supervised learning framework where we mainly focus on designing effective pretext tasks. In our framework, we discover a pseudo-boundary from a sequence of shots by splitting it into two continuous, non-overlapping sub-sequences and leverage the pseudo-boundary to facilitate the pre-training. Based on this, we introduce three novel boundary-aware pretext tasks: 1) Shot-Scene Matching (SSM), 2) Contextual Group Matching (CGM) and 3) Pseudo-boundary Prediction (PP); SSM and CGM guide the model to maximize intra-scene similarity and inter-scene discrimination while PP encourages the model to identify transitional moments. Through comprehensive analysis, we empirically show that pre-training and transferring contextual representation are both critical to improving the video scene segmentation performance. Lastly, we achieve the new state-of-the-art on the MovieNet-SSeg benchmark. The code is available at https://github.com/kakaobrain/bassl.

Submitted 13 January, 2022; originally announced January 2022.

Comments: The code is available at https://github.com/kakaobrain/bassl

Showing 1–50 of 123 results for author: Han, G