Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Authors: Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen, Xuguang Lan

Abstract: This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating position relation prior as attention bias to augment object detection, following the verification of its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement, which further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1x and 52.1% AP for 2x settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-in-and-play component, bringing clear improvements for theoretically any DETR-like methods. Furthermore, we introduce a class-agnostic detection dataset, SA-Det-100k. The experimental results on the dataset illustrate that the proposed explicit position relation achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection. The code and dataset are available at https://github.com/xiuqhou/Relation-DETR.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024
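
Illustrative note: the abstract above describes adding a position relation prior to self-attention as a bias. The sketch below shows only the general idea of an additive attention bias, in plain Python. It is not the authors' implementation: the relation encoder in the paper builds its embeddings from the predictions themselves, whereas the `bias` matrix here is a hand-written, hypothetical stand-in, and the function names are made up for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, bias):
    """Add a pairwise bias to raw attention scores, then normalize each row."""
    return [
        softmax([s + b for s, b in zip(s_row, b_row)])
        for s_row, b_row in zip(scores, bias)
    ]


# Toy example: 3 queries whose raw (content-only) scores are uniform.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]

# Hypothetical relation bias: positive for pairs assumed to be spatially
# related, negative for pairs assumed to be unrelated.
bias = [
    [0.0, 2.0, -2.0],
    [2.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0],
]

weights = biased_attention_weights(scores, bias)
for row in weights:
    print([round(w, 3) for w in row])
```

With uniform raw scores, any difference between the resulting attention weights comes from the bias alone, which is the effect a structural prior is meant to have.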

arXiv:2407.11497 [pdf, other]

"I Came Across a Junk": Understanding Design Flaws of Data Visualization from the Public's Perspective

Authors: Xingyu Lan, Yu Liu

Abstract: The visualization community has a rich history of reflecting upon flaws of visualization design, and research in this direction has remained lively until now. However, three main gaps still exist. First, most existing work characterizes design flaws from the perspective of researchers rather than the perspective of general users. Second, little work has been done to infer why these design flaws occur. Third, due to problems such as unclear terminology and ambiguous research scope, a better framework that systematically outlines various design flaws and helps distinguish different types of flaws is desired. To address the above gaps, this work investigated visualization design flaws through the lens of the public, constructed a framework to summarize and categorize the identified flaws, and explored why these flaws occur. Specifically, we analyzed 2227 flawed data visualizations collected from an online gallery and derived a design task-associated taxonomy containing 76 specific design flaws. These flaws were further classified into three high-level categories (i.e., misinformation, uninformativeness, unsociableness) and ten subcategories (e.g., inaccuracy, unfairness, ambiguity). Next, we organized five focus groups to explore why these design flaws occur and identified seven causes of the flaws. Finally, we proposed a set of reflections and implications arising from the research.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.07844 [pdf, other]

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Authors: Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

Abstract: Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage the language-aware capability for region-level cross-modality understanding. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enable the language-aware ability of the model through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmark datasets, achieving state-of-the-art results with an AP of 50.6% on the COCO dataset and 40.0% on the LVIS dataset in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at https://github.com/wanghao9610/OV-DINO.

Submitted 10 July, 2024; originally announced July 2024.

Comments: Technical Report

arXiv:2406.18838 [pdf]

Electric-field control of the perpendicular magnetization switching in ferroelectric/ferrimagnet heterostructures

Authors: Pengfei Liu, Tao Xu, Qi Liu, Juncai Dong, Ting Lin, Qinhua Zhang, Xiukai Lan, Yu Sheng, Chunyu Wang, Jiajing Pei, Hongxin Yang, Lin Gu, Kaiyou Wang

Abstract: Electric field control of the magnetic state in ferrimagnets holds great promise for developing spintronic devices due to low power consumption. Here, we demonstrate a non-volatile reversal of perpendicular net magnetization in a ferrimagnet by manipulating the electric-field driven polarization within the Pb(Zr0.2Ti0.8)O3 (PZT)/CoGd heterostructure. Electron energy loss spectra and X-ray absorption spectrum directly verify that the oxygen ion migration at the PZT/CoGd interface associated with reversing the polarization causes the enhanced/reduced oxidation in CoGd. Ab initio calculations further substantiate that the migrated oxygen ions can modulate the relative magnetization of Co/Gd sublattices, facilitating perpendicular net magnetization switching. Our findings offer an approach to effectively control ferrimagnetic net magnetization, holding significant implications for ferrimagnetic spintronic applications.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 21 pages, 4 figures

arXiv:2404.13405 [pdf]

Field-free switching of perpendicular magnetization by cooperation of planar Hall and orbital Hall effects

Authors: Zelalem Abebe Bekele, Yuan-Yuan Jiang, Kun Lei, Xiukai Lan, Xiangyu Liu, Hui Wen, Ding-Fu Shao, Kaiyou Wang

Abstract: Spin-orbit torques (SOTs) generated through the conventional spin Hall effect and/or Rashba-Edelstein effect are promising for manipulating magnetization. However, this approach typically exhibits non-deterministic and inefficient behaviour when it comes to switching perpendicular ferromagnets. This limitation posed a challenge for write-in operations in high-density magnetic memory devices. Here, we determine an effective solution to overcome this challenge by simultaneously leveraging both a planar Hall effect (PHE) and an orbital Hall effect (OHE). Using a representative Co/PtGd/Mo trilayer SOT device, we demonstrate that the PHE of Co is enhanced by the interfacial coupling of Co/PtGd, giving rise to a finite out-of-plane damping-like torque within the Co layer. Simultaneously, the OHE in Mo layer induces a strong out-of-plane orbital current, significantly amplifying the in-plane damping-like torque through orbital-to-spin conversion. While either the PHE or OHE alone proves insufficient for reversing the perpendicular magnetization of Co, their collaborative action enables high-efficiency field-free deterministic switching. Our work provides a straightforward strategy to realize high-speed and low-power spintronics.

Submitted 20 April, 2024; originally announced April 2024.

Comments: 13 pages, 3 figures, submitted to Nat. Commun

arXiv:2404.01622 [pdf, ps, other]

Gen4DS: Workshop on Data Storytelling in an Era of Generative AI

Authors: Xingyu Lan, Leni Yang, Zezhong Wang, Yun Wang, Danqing Shi, Sheelagh Carpendale

Abstract: Storytelling is an ancient and precious human ability that has been rejuvenated in the digital age. Over the last decade, there has been a notable surge in the recognition and application of data storytelling, both in academia and industry. Recently, the rapid development of generative AI has brought new opportunities and challenges to this field, sparking numerous new questions. These questions may not necessarily be quickly transformed into papers, but we believe it is necessary to promptly discuss them to help the community better clarify important issues and research agendas for the future. We thus invite you to join our workshop (Gen4DS) to discuss questions such as: How can generative AI facilitate the creation of data stories? How might generative AI alter the workflow of data storytellers? What are the pitfalls and risks of incorporating AI in storytelling? We have designed both paper presentations and interactive activities (including hands-on creation, group discussion pods, and debates on controversial issues) for the workshop. We hope that participants will learn about the latest advances and pioneering work in data storytelling, engage in critical conversations with each other, and have an enjoyable, unforgettable, and meaningful experience at the event.

Submitted 5 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.10750 [pdf, other]

Depression Detection on Social Media with Large Language Models

Authors: Xiaochong Lan, Yiming Cheng, Li Sheng, Chen Gao, Yong Li

Abstract: Depression harms. However, due to a lack of mental health awareness and fear of stigma, many patients do not actively seek diagnosis and treatment, leading to detrimental outcomes. Depression detection aims to determine whether an individual suffers from depression by analyzing their history of posts on social media, which can significantly aid in early detection and intervention. It mainly faces two key challenges: 1) it requires professional medical knowledge, and 2) it necessitates both high accuracy and explainability. To address it, we propose a novel depression detection system called DORIS, combining medical knowledge and the recent advances in large language models (LLMs). Specifically, to tackle the first challenge, we proposed an LLM-based solution to first annotate whether high-risk texts meet medical diagnostic criteria. Further, we retrieve texts with high emotional intensity and summarize critical information from the historical mood records of users, so-called mood courses. To tackle the second challenge, we combine LLM and traditional classifiers to integrate medical knowledge-guided features, for which the model can also explain its prediction results, achieving both high accuracy and explainability. Extensive experimental results on benchmarking datasets show that, compared to the current best baseline, our approach improves by 0.036 in AUPRC, which can be considered significant, demonstrating the effectiveness of our approach and its high value as an NLP application.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.19231 [pdf, other]

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Authors: Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, Yaowei Wang, Chun Yuan

Abstract: Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints, or even captured from different places. Therefore, our method can utilize the cross-image variations as a cue to guide the representation learning, which ensures more robust features are produced. To further facilitate the robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces the multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR.

Submitted 1 April, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted by CVPR2024

arXiv:2402.17978 [pdf, other]

Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning

Authors: Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, Xuguang Lan

Abstract: Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.

Submitted 1 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence

arXiv:2402.16086 [pdf, other]

doi 10.1609/aaai.v38i9.28901

Deep Homography Estimation for Visual Place Recognition

Authors: Feng Lu, Shuting Dong, Lijun Zhang, Bingxi Liu, Xiangyuan Lan, Dongmei Jiang, Chun Yuan

Abstract: Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. And it is more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.

Submitted 18 March, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI2024

Journal ref: AAAI 2024

arXiv:2402.14505 [pdf, other]

Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

Authors: Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, Chun Yuan

Abstract: Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR is still a key issue to address. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method to achieve both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. Besides, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses about only 3% retrieval runtime of the two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.

Submitted 3 April, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: ICLR2024
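
Illustrative note: the abstract above relies on mutual nearest neighbor matching of local features. As a rough sketch of that matching step only (not the paper's loss and not the authors' code), the plain-Python example below pairs two small sets of feature vectors and keeps a pair only when each vector is the other's best match under dot-product similarity. The function names and the toy vectors are assumptions made for this example.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mutual_nearest_neighbors(feats_a, feats_b):
    """Return index pairs (i, j) where a_i and b_j are each other's best match."""
    sim = [[dot(a, b) for b in feats_b] for a in feats_a]
    # Best match in B for every feature in A.
    best_b_for_a = [
        max(range(len(feats_b)), key=lambda j: sim[i][j])
        for i in range(len(feats_a))
    ]
    # Best match in A for every feature in B.
    best_a_for_b = [
        max(range(len(feats_a)), key=lambda i: sim[i][j])
        for j in range(len(feats_b))
    ]
    # Keep only the pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]


# Toy local features (hypothetical values) for two images.
feats_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
feats_b = [[0.9, 0.1], [0.1, 0.9]]

matches = mutual_nearest_neighbors(feats_a, feats_b)
print(matches)
```

In this toy input the third feature of the first image prefers the second feature of the other image, but that feature prefers a different partner, so the pair is dropped. Counting such mutual matches is one simple way to score image pairs without a separate geometric verification step.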

arXiv:2402.11816 [pdf, other]

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Authors: Jihai Zhang, Xiang Lan, Xiaoye Qu, Yu Cheng, Mengling Feng, Bryan Hooi

Abstract: Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information from the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream task performance, particularly those requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning which inherently captures one single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are exclusively selected from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves the previously well-learned features by cross-stage representation integration, integrating features across all stages to form final representations. Our comprehensive evaluation demonstrates MCL's effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning a range of model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements up to threefold on specific attributes in the recently proposed MMVP benchmark.

Submitted 15 July, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

Comments: ECCV 2024 Camera-Ready

arXiv:2402.11792 [pdf, other]

SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction

Authors: Jie Xu, Hanbo Zhang, Xinghang Li, Huaping Liu, Xuguang Lan, Tao Kong

Abstract: Linguistic ambiguity is ubiquitous in our daily lives. Previous works adopted interaction between robots and humans for language disambiguation. Nevertheless, when interactive robots are deployed in daily environments, there are significant challenges for natural human-robot interaction, stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands. In this paper, we present SInViG, which is a self-evolving interactive visual agent for human-robot interaction based on natural languages, aiming to resolve language ambiguity, if any, through multi-turn visual-language dialogues. It continuously and automatically learns from unlabeled images and large language models, without human intervention, to be more robust against visual and linguistic complexity. Benefiting from self-evolving, it sets new state-of-the-art on several interactive visual grounding benchmarks. Moreover, our human-robot interaction experiments show that the evolved models consistently acquire more and more preferences from human users. Besides, we also deployed our model on a Franka robot for interactive manipulation tasks. Results demonstrate that our model can follow diverse user instructions and interact naturally with humans in natural language, despite the complexity and disturbance of the environment.

Submitted 19 February, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.03699 [pdf]

Automatic Robotic Development through Collaborative Framework by Large Language Models

Authors: Zhirong Luan, Yujun Lai, Rundong Huang, Xiaruiqi Lan, Liangjun Chen, Badong Chen

Abstract: Despite the remarkable code generation abilities of large language models (LLMs), they still face challenges in complex task handling. Robot development, a highly intricate field, inherently demands human involvement in task allocation and collaborative teamwork. To enhance robot development, we propose an innovative automated collaboration framework inspired by real-world robot developers. This framework employs multiple LLMs in distinct roles: analysts, programmers, and testers. Analysts delve deep into user requirements, enabling programmers to produce precise code, while testers fine-tune the parameters based on user feedback for practical robot application. Each LLM tackles diverse, critical tasks within the development process. Clear collaboration rules emulate real-world teamwork among LLMs. Analysts, programmers, and testers form a cohesive team overseeing strategy, code, and parameter adjustments. Through this framework, we achieve complex robot development without requiring specialized knowledge, relying solely on non-experts' participation.

Submitted 16 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.16699 [pdf, other]

Towards Unified Interactive Visual Grounding in The Wild

Authors: Jie Xu, Hanbo Zhang, Qingyi Si, Yifeng Li, Xuguang Lan, Tao Kong

Abstract: Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural languages. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained on a joint of extensive public data, and show superior generality to diversified and challenging open-world scenarios. In the experiments, we validate TiO on GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on the carefully selected 150 challenging scenes as well as real-robot platforms. Results show that our method demonstrates superior generality to diversified visual and language inputs with a high success rate. Codes and demos are available at https://github.com/jxu124/TiO.

Submitted 18 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted to ICRA 2024

arXiv:2401.16355 [pdf, other]

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

Authors: Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, Jingxiong Li, Xinheng Lyu, Tao Lin, Lin Yang

Abstract: The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of a specialized, high-quality benchmark has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. Significantly, to maximize PathMMU's authority, we invite seven pathologists to scrutinize each question under strict standards in PathMMU's validation and test sets, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and 4 closed-sourced LMMs and their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only a 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-sourced LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.

Submitted 20 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 27 pages, 12 figures

arXiv:2401.05671 [pdf]

Deciphering Interphase Instability of Lithium Metal Batteries with Localized High-Concentration Electrolytes at Elevated Temperatures

Authors: Tao Meng, Shanshan Yang, Yitong Peng, Xiwei Lan, Pingan Li, Kangjia Hu, Xianluo Hu

Abstract: Lithium metal batteries (LMBs), when coupled with a localized high-concentration electrolyte and a high-voltage nickel-rich cathode, offer a solution to the increasing demand for high energy density and long cycle life. However, the aggressive electrode chemistry poses safety risks to LMBs at higher temperatures and cutoff voltages. Here, we decipher the interphase instability in LHCE-based LMBs with a Ni0.8Co0.1Mn0.1O2 cathode at elevated temperatures. Our findings reveal that the generation of fluorine radicals in the electrolyte induces the solvent decomposition and consequent chain reactions, thereby reconstructing the cathode electrolyte interphase (CEI) and degrading battery cyclability. As further evidenced, introducing an acid scavenger of dimethoxydimethylsilane (DODSi) significantly boosts CEI stability with suppressed microcracking. A Ni0.8Co0.1Mn0.1O2||Li cell with this DODSi-functionalized LHCE achieves an unprecedented capacity retention of 93.0% after 100 cycles at 80 °C. This research provides insights into electrolyte engineering for practical LMBs with high safety under extreme temperatures.

Submitted 11 January, 2024; originally announced January 2024.

Comments: 10 pages, 8 figures

arXiv:2312.11970 [pdf, other]

Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives

Authors: Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, Yong Li

Abstract: Agent-based modeling and simulation has evolved as a powerful tool for modeling complex systems, offering insights into emergent behaviors and interactions among diverse agents. Integrating large language models into agent-based modeling and simulation presents a promising avenue for enhancing simulation capabilities. This paper surveys the landscape of utilizing large language models in agent-based modeling and simulation, examining their challenges and promising future directions. In this survey, since this is an interdisciplinary field, we first introduce the background of agent-based modeling and simulation and large language model-empowered agents. We then discuss the motivation for applying large language models to agent-based simulation and systematically analyze the challenges in environment perception, human alignment, action generation, and evaluation. Most importantly, we provide a comprehensive overview of the recent works of large language model-empowered agent-based modeling and simulation in multiple scenarios, which can be divided into four domains: cyber, physical, social, and hybrid, covering simulation of both real-world and virtual environments. Finally, since this area is new and quickly evolving, we discuss the open problems and promising future directions.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 37 pages

arXiv:2310.10467 [pdf, other]

Stance Detection with Collaborative Role-Infused LLM-Based Agents

Authors: Xiaochong Lan, Chen Gao, Depeng Jin, Yong Li

Abstract: Stance detection automatically detects the stance in a text towards a target, vital for content analysis in web and social media research. Despite their promising capabilities, LLMs encounter challenges when directly applied to stance detection. First, stance detection demands multi-aspect knowledge, from deciphering event-related terminologies to understanding the expression styles in social media platforms. Second, stance detection requires advanced reasoning to infer authors' implicit viewpoints, as stances are often subtly embedded rather than overtly stated in the text. To address these challenges, we design a three-stage framework COLA (short for Collaborative rOle-infused LLM-based Agents) in which LLMs are designated distinct roles, creating a collaborative system where each role contributes uniquely. Initially, in the multidimensional text analysis stage, we configure the LLMs to act as a linguistic expert, a domain specialist, and a social media veteran to get a multifaceted analysis of texts, thus overcoming the first challenge. Next, in the reasoning-enhanced debating stage, for each potential stance, we designate a specific LLM-based agent to advocate for it, guiding the LLM to detect logical connections between text features and stance, tackling the second challenge. Finally, in the stance conclusion stage, a final decision maker agent consolidates prior insights to determine the stance. Our approach avoids extra annotated data and model training and is highly usable. We achieve state-of-the-art performance across multiple datasets. Ablation studies validate the effectiveness of each design role in handling stance detection. Further experiments have demonstrated the explainability and the versatility of our approach. Our approach excels in usability, accuracy, effectiveness, explainability and versatility, highlighting its value.

Submitted 16 April, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.05694 [pdf, other]

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Authors: Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

Abstract: The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to free-text queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications, highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provides a comprehensive investigation from the perspectives of both computer science and the Healthcare specialty. Besides the discussion about Healthcare concerns, we support the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks, on GitHub. In summary, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacles to using LLMs in Healthcare are fairness, accountability, transparency and ethics.

Submitted 11 June, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2309.15983 [pdf, other]

What To Do (and Not to Do) with Causal Panel Analysis under Parallel Trends: Lessons from A Large Reanalysis Study

Authors: Albert Chiu, Xingchen Lan, Ziyi Liu, Yiqing Xu

Abstract: Two-way fixed effects (TWFE) models are ubiquitous in causal panel analysis in political science. However, recent methodological discussions challenge their validity in the presence of heterogeneous treatment effects (HTE) and violations of the parallel trends assumption (PTA). This burgeoning literature has introduced multiple estimators and diagnostics, leading to confusion among empirical researchers on two fronts: the reliability of existing results based on TWFE models and the current best practices. To address these concerns, we examined, replicated, and reanalyzed 37 articles from three leading political science journals that employed observational panel data with binary treatments. Using six newly introduced HTE-robust estimators, along with diagnostic tests and uncertainty measures that are robust to PTA violations, we find that only a small minority of studies are highly robust. Although HTE-robust estimates tend to be broadly consistent with TWFE estimates, discrepancies in point estimates, increased measures of uncertainty, and potential PTA violations call into question many results that were already on the margins of statistical significance. We offer recommendations for improving practice in empirical research based on these findings.

Submitted 14 June, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

arXiv:2308.05938 [pdf, other]

doi 10.1109/TMM.2023.3330047

FoodSAM: Any Food Segmentation

Authors: Xing Lan, Jiayi Lyu, Hanyu Jiang, Kun Dong, Zehai Niu, Yi Zhang, Jian Xue

Abstract: In this paper, we explore the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation. To address the lack of class-specific information in SAM-generated masks, we propose a novel framework, called FoodSAM. This innovative approach integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. Besides, we recognize that the ingredients in food can be regarded as independent individuals, which motivated us to perform instance segmentation on food images. Furthermore, FoodSAM extends its zero-shot capability to encompass panoptic segmentation by incorporating an object detector, which enables FoodSAM to effectively capture non-food object information. Drawing inspiration from the recent success of promptable segmentation, we also extend FoodSAM to promptable segmentation, supporting various prompt variants. Consequently, FoodSAM emerges as an all-encompassing solution capable of segmenting food items at multiple levels of granularity. Remarkably, this pioneering framework stands as the first-ever work to achieve instance, panoptic, and promptable segmentation on food images. Extensive experiments demonstrate the feasibility and impressive performance of FoodSAM, validating SAM's potential as a prominent and influential tool within the domain of food image segmentation. We release our code at https://github.com/jamesjg/FoodSAM.

Submitted 11 August, 2023; originally announced August 2023.

Comments: Code is available at https://github.com/jamesjg/FoodSAM

arXiv:2308.02831 [pdf, other]

Affective Visualization Design: Leveraging the Emotional Impact of Data

Authors: Xingyu Lan, Yanqiu Wu, Nan Cao

Abstract: In recent years, more and more researchers have reflected on the undervaluation of emotion in data visualization and highlighted the importance of considering human emotion in visualization design. Meanwhile, an increasing number of studies have been conducted to explore emotion-related factors. However, so far, this research area is still in its early stages and faces a set of challenges, such as the unclear definition of key concepts, the insufficient justification of why emotion is important in visualization design, and the lack of characterization of the design space of affective visualization design. To address these challenges, first, we conducted a literature review and identified three research lines that examined both emotion and data visualization. We clarified the differences between these research lines and kept 109 papers that studied or discussed how data visualization communicates and influences emotion. Then, we coded the 109 papers in terms of how they justified the legitimacy of considering emotion in visualization design (i.e., why emotion is important) and identified five argumentative perspectives. Based on these papers, we also identified 61 projects that practiced affective visualization design. We coded these design projects in three dimensions, including design fields (where), design tasks (what), and design methods (how), to explore the design space of affective visualization design.

Submitted 5 August, 2023; originally announced August 2023.

Comments: to appear at IEEE VIS 2023

arXiv:2307.16644 [pdf, other]

doi 10.1145/3580305.3599874

NEON: Living Needs Prediction System in Meituan

Authors: Xiaochong Lan, Chen Gao, Shiqi Wen, Xiuqi Chen, Yingge Che, Han Zhang, Huazhou Wei, Hengliang Luo, Yong Li

Abstract: Living needs refer to the various needs in people's daily lives for survival and well-being, including food, housing, entertainment, etc. On life service platforms that connect users to service providers, such as Meituan, the problem of living needs prediction is fundamental as it helps understand users and boost various downstream applications such as personalized recommendation. However, the problem has not been well explored and is faced with two critical challenges. First, the needs are naturally connected to specific locations and times, suffering from complex impacts from the spatiotemporal context. Second, there is a significant gap between users' actual living needs and their historical records on the platform. To address these two challenges, we design a system of living NEeds predictiON named NEON, consisting of three phases: feature mining, feature fusion, and multi-task prediction. In the feature mining phase, we carefully extract individual-level user features for spatiotemporal modeling, and aggregated-level behavioral features for enriching data, which serve as the basis for addressing the two challenges, respectively. Further, in the feature fusion phase, we propose a neural network that effectively fuses the two parts of features into the user representation. Moreover, we design a multi-task prediction phase, where the auxiliary task of needs-meeting way prediction can enhance the modeling of spatiotemporal context. Extensive offline evaluations verify that our NEON system can effectively predict users' living needs. Furthermore, we deploy NEON into Meituan's algorithm engine and evaluate how it enhances the three downstream prediction applications, via large-scale online A/B testing.

Submitted 31 July, 2023; originally announced July 2023.

arXiv:2307.16107 [pdf, other]

doi 10.3847/1538-4357/ace166

Fermi-LAT detection of A new starburst galaxy candidate: IRAS 13052-5711

Authors: Yunchuan Xiang, Qingquan Jiang, Xiaofei Lan

Abstract: A likely starburst galaxy (SBG), IRAS 13052-5711, which is the most distant SBG candidate discovered to date, was found by analyzing 14.4 years of data from the Fermi large-area telescope (Fermi-LAT). This SBG's significance level is approximately 6.55$σ$ in the 0.1-500 GeV band. Its spatial position is close to that of 4FGL J1308.9-5730, determined from the Fermi large telescope fourth-source Catalog (4FGL). Its power-law spectral index is approximately 2.1, and its light curve (LC) for 14.4 years has no significant variability. These characteristics are highly similar to those of SBGs found in the past. We calculate the SBG's star formation rate (SFR) to be 29.38 $\rm M_{\odot}\ yr^{-1}$, which is within the SFR range of SBGs found to date. Therefore, IRAS 13052-5711 is considered to be a likely SBG. In addition, its 0.1-500 GeV luminosity is (3.28 $\pm$ 0.67) $\times 10^{42}\ \rm erg\ s^{-1}$, which deviates from the empirical relationship of the $γ$-ray luminosity and the total infrared luminosity. We considered a hadronic model to explain the GeV spectrum of IRAS 13052-5711.

Submitted 29 July, 2023; originally announced July 2023.

arXiv:2307.14984 [pdf, other]

S3: Social-network Simulation System with Large Language Model-Empowered Agents

Authors: Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, Yong Li

Abstract: Social network simulation plays a crucial role in addressing various challenges within social science. It offers extensive applications such as state prediction, phenomena explanation, and policy-making support, among others. In this work, we harness the formidable human-like capabilities exhibited by large language models (LLMs) in sensing, reasoning, and behaving, and utilize these qualities to construct the S$^3$ system (short for Social network Simulation System). Adhering to the widely employed agent-based simulation paradigm, we employ prompt engineering and prompt tuning techniques to ensure that the agent's behavior closely emulates that of a genuine human within the social network. Specifically, we simulate three pivotal aspects: emotion, attitude, and interaction behaviors. By endowing the agent in the system with the ability to perceive the informational environment and emulate human actions, we observe the emergence of population-level phenomena, including the propagation of information, attitudes, and emotions. We conduct an evaluation encompassing two levels of simulation, employing real-world social network data. Encouragingly, the results demonstrate promising accuracy. This work represents an initial step in the realm of social network simulation empowered by LLM-based agents. We anticipate that our endeavors will serve as a source of inspiration for the development of simulation systems within, but not limited to, social science.

Submitted 19 October, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

arXiv:2307.11458 [pdf, other]

Strip-MLP: Efficient Token Interaction for Vision MLP

Authors: Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang

Abstract: Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called Strip-MLP to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that allows the token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a Cascade Group Strip Mixing Module (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in both within-patch and cross-patch manners, independent of the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100. The source code will be available at https://github.com/Med-Process/Strip_MLP.

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2307.09193 [pdf, other]

ESMC: Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint

Authors: Zhenhao Jiang, Biao Zeng, Hao Feng, Jin Liu, Jicong Fan, Jie Zhang, Jia Jia, Ning Hu, Xingyu Chen, Xuguang Lan

Abstract: Large-scale online recommender systems are spread all over the Internet and are in charge of two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimations. However, traditional CVR estimators suffer from the well-known Sample Selection Bias and Data Sparsity issues. Entire space models were proposed to address the two issues via tracing the decision-making path of "exposure_click_purchase". Further, some researchers observed that there are purchase-related behaviors between click and purchase, which can better capture the user's decision-making intention and improve the recommendation performance. Thus, the decision-making path has been extended to "exposure_click_in-shop action_purchase" and can be modeled with a conditional probability approach. Nevertheless, we observe that the chain rule of conditional probability does not always hold. We report the Probability Space Confusion (PSC) issue and give a mathematical derivation of the difference between the ground truth and the estimation. We propose a novel Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint (ESMC) and two alternatives: Entire Space Multi-Task Model with Siamese Network (ESMS) and Entire Space Multi-Task Model in Global Domain (ESMG) to address the PSC issue. Specifically, we handle "exposure_click_in-shop action" and "in-shop action_purchase" separately in light of the characteristics of in-shop action. The first path is still treated with conditional probability while the second one is treated with a parameter constraint strategy. Experiments on both offline and online environments in a large-scale recommendation system illustrate the superiority of our proposed methods over state-of-the-art models. The real-world datasets will be released.

Submitted 29 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

arXiv:2305.12624 [pdf, other]

Scalable regression calibration approaches to correcting measurement error in multi-level generalized functional linear regression models with heteroscedastic measurement errors

Authors: Yuanyuan Luan, Roger S. Zoh, Erjia Cui, Xue Lan, Sneha Jadhav, Carmen D. Tekwe

Abstract: Wearable devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behavior, such as sleep quality and physical activity. The continuous monitoring often occurs in epochs of 60 seconds over multiple days, resulting in high dimensional longitudinal curves that are best described and analyzed as functional data. From this perspective, the functional data are smooth, latent functions obtained at discrete time intervals and prone to homoscedastic white noise. However, the assumption of homoscedastic errors might not be appropriate in this setting because the devices collect the data serially. While researchers have previously addressed measurement error in scalar covariates prone to errors, less work has been done on correcting measurement error in high dimensional longitudinal curves prone to heteroscedastic errors. We present two new methods for correcting measurement error in longitudinal functional curves prone to complex measurement error structures in multi-level generalized functional linear regression models. These methods are based on two-stage scalable regression calibration. We assume that the distribution of the scalar responses and the surrogate measures prone to heteroscedastic errors both belong in the exponential family and that the measurement errors follow Gaussian processes. In simulations and sensitivity analyses, we established some finite sample properties of these methods. In our simulations, both regression calibration methods for correcting measurement error performed better than estimators based on averaging the longitudinal functional data and using observations from a single day. We also applied the methods to assess the relationship between physical activity and type 2 diabetes in community dwelling adults in the United States who participated in the National Health and Nutrition Examination Survey.

Submitted 20 April, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

arXiv:2304.12592 [pdf, other]

MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes

Authors: Han Wang, Jiayuan Zhang, Lipeng Wan, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract: Manipulation relationship detection (MRD) aims to guide the robot to grasp objects in the right order, which is important to ensure the safety and reliability of grasping in object-stacked scenes. Previous works infer manipulation relationships with deep neural networks trained on data collected from a predefined view, which limits them under visual dislocation in unstructured environments. Multi-view data provide more comprehensive spatial information, but a challenge of multi-view MRD is domain shift. In this paper, we propose a novel multi-view fusion framework, namely multi-view MRD network (MMRDN), which is trained on 2D and 3D multi-view data. We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions to learn consistent representations. Besides, taking advantage of position information within the 3D data, we select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects. Finally, the features of multi-view 2D and 3D data are concatenated to predict the pairwise relationship of objects. Experimental results on the challenging REGRAD dataset show that MMRDN outperforms the state-of-the-art methods in multi-view MRD tasks. The results also demonstrate that our model trained on synthetic data is capable of transferring to real-world scenarios.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.01171 [pdf, other]

Revisiting Context Aggregation for Image Matting

Authors: Qinglin Liu, Xiaoqian Lv, Quanling Meng, Zonglin Li, Xiangyuan Lan, Shuo Yang, Shengping Zhang, Liqiang Nie

Abstract: Traditional studies emphasize the significance of context information in improving matting performance. Consequently, deep learning-based matting methods delve into designing pooling or affinity-based context aggregation modules to achieve superior results. However, these modules cannot well handle the context scale shift caused by the difference in image size during training and inference, resulting in matting performance degradation. In this paper, we revisit the context aggregation mechanisms of matting networks and find that a basic encoder-decoder network without any context aggregation modules can actually learn more universal context aggregation, thereby achieving higher matting performance compared to existing methods. Building on this insight, we present AEMatter, a matting network that is straightforward yet very effective. AEMatter adopts a Hybrid-Transformer backbone with appearance-enhanced axis-wise learning (AEAL) blocks to build a basic network with strong context aggregation learning capability. Furthermore, AEMatter leverages a large image training strategy to assist the network in learning context aggregation from data. Extensive experiments on five popular matting datasets demonstrate that the proposed AEMatter outperforms state-of-the-art matting methods by a large margin.

Submitted 14 May, 2024; v1 submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.17408 [pdf, other]

P-Transformer: A Prompt-based Multimodal Transformer Architecture For Medical Tabular Data

Authors: Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng

Abstract: Medical tabular data, abundant in Electronic Health Records (EHRs), is a valuable resource for diverse medical tasks such as risk prediction. While deep learning approaches, particularly transformer-based models, have shown remarkable performance in tabular data prediction, problems remain in adapting existing work effectively to the medical domain, such as under-utilization of unstructured free-texts, limited exploration of textual information in structured data, and data corruption. To address these issues, we propose P-Transformer, a Prompt-based multimodal Transformer architecture designed specifically for medical tabular data. This framework consists of two critical components: a tabular cell embedding generator and a tabular transformer. The former efficiently encodes diverse modalities from both structured and unstructured tabular data into a harmonized language semantic space with the help of a pre-trained sentence encoder and medical prompts. The latter integrates cell representations to generate patient embeddings for various medical tasks. In comprehensive experiments on two real-world datasets for three medical tasks, P-Transformer demonstrated improvements of 10.9%/11.0% on RMSE/MAE, 0.5%/2.2% on RMSE/MAE, and 1.6%/0.8% on BACC/AUROC compared to state-of-the-art (SOTA) baselines in predictability. Notably, the model exhibited strong resilience to data corruption in the structured data, particularly when the corruption rates are high.

Submitted 9 January, 2024; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2303.07828 [pdf, other]

Prioritized Planning for Target-Oriented Manipulation via Hierarchical Stacking Relationship Prediction

Authors: Zewen Wu, Jian Tang, Xingyu Chen, Chengzhong Ma, Xuguang Lan, Nanning Zheng

Abstract: In scenarios involving the grasping of multiple targets, the learning of stacking relationships between objects is fundamental for robots to execute safely and efficiently. However, current methods lack subdivision for the hierarchy of stacking relationship types. In scenes where objects are mostly stacked in an orderly manner, they are incapable of making human-like and highly efficient grasping decisions. This paper proposes a perception-planning method to distinguish different stacking types between objects and generate prioritized manipulation order decisions based on given target designations. We utilize a Hierarchical Stacking Relationship Network (HSRN) to discriminate the hierarchy of stacking and generate a refined Stacking Relationship Tree (SRT) for relationship description. Considering that objects with high stacking stability can be grasped together if necessary, we introduce an elaborate decision-making planner based on the Partially Observable Markov Decision Process (POMDP), which leverages observations and generates the least grasp-consuming decision chain with robustness and is suitable for simultaneously specifying multiple targets. To verify our work, we set the scene to the dining table and augment the REGRAD dataset with a set of common tableware models for network training. Experiments show that our method effectively generates grasping decisions that conform to human requirements, and improves the implementation efficiency compared with existing methods on the basis of guaranteeing the success rate.

Submitted 25 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: 8 pages, 8 figures. Accepted by 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

arXiv:2302.03357 [pdf, other]

Towards Enhancing Time Series Contrastive Learning: A Dynamic Bad Pair Mining Approach

Authors: Xiang Lan, Hanshu Yan, Shenda Hong, Mengling Feng

Abstract: Not all positive pairs are beneficial to time series contrastive learning. In this paper, we study two types of bad positive pairs that can impair the quality of time series representation learned through contrastive learning: the noisy positive pair and the faulty positive pair. We observe that, in the presence of noisy positive pairs, the model tends to simply learn the pattern of noise (Noisy Alignment). Meanwhile, when faulty positive pairs arise, the model wastes a considerable amount of effort aligning non-representative patterns (Faulty Alignment). To address this problem, we propose a Dynamic Bad Pair Mining (DBPM) algorithm, which reliably identifies and suppresses bad positive pairs in time series contrastive learning. Specifically, DBPM utilizes a memory module to dynamically track the training behavior of each positive pair throughout the training process. This allows us to identify potential bad positive pairs at each epoch based on their historical training behaviors. The identified bad pairs are subsequently down-weighted through a transformation module, thereby mitigating their negative impact on the representation learning process. DBPM is a simple algorithm designed as a lightweight plug-in without learnable parameters to enhance the performance of existing state-of-the-art methods. Through extensive experiments conducted on four large-scale, real-world time series datasets, we demonstrate DBPM's efficacy in mitigating the adverse effects of bad positive pairs.

Submitted 28 March, 2024; v1 submitted 7 February, 2023; originally announced February 2023.

Comments: ICLR 2024 Camera Ready (https://openreview.net/pdf?id=K2c04ulKXn)

arXiv:2211.12075 [pdf, other]

Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Authors: Lipeng Wan, Zeyang Liu, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract: Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.

Submitted 22 November, 2022; originally announced November 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2112.04454

arXiv:2211.03296 [pdf, other]

The Chart Excites Me! Exploring How Data Visualization Design Influences Affective Arousal

Authors: Xingyu Lan, Yanqiu Wu, Qing Chen, Nan Cao

Abstract: As data visualizations have been increasingly applied in mass communication, designers often seek to grasp viewers immediately and motivate them to read more. Such goals, as suggested by previous research, are closely associated with the activation of emotion, namely affective arousal. Given this motivation, this work takes initial steps toward understanding the arousal-related factors in data visualization design. We collected a corpus of 265 data visualizations and conducted a crowdsourcing study with 184 participants during which the participants were asked to rate the affective arousal elicited by data visualization design (all texts were blurred to exclude the influence of semantics) and provide their reasons. Based on the collected data, first, we identified a set of arousal-related design features by analyzing user comments qualitatively. Then, we mapped these features to computable variables and constructed regression models to infer which features are significant contributors to affective arousal quantitatively. Through this exploratory study, we finally identified four design features (e.g., colorfulness, the number of different visual channels) cross-validated as important features correlated with affective arousal.

Submitted 6 November, 2022; originally announced November 2022.

arXiv:2209.03642 [pdf, other]

VizBelle: A Design Space of Embellishments for Data Visualization

Authors: Qing Chen, Ziyan Liu, Chengwei Wang, Xingyu Lan, Ying Chen, Siming Chen, Nan Cao

Abstract: Visual embellishments, as a form of non-linguistic rhetorical figures, are used to help convey abstract concepts or attract readers' attention. Creating data visualizations with appropriate and visually pleasing embellishments is challenging since this process largely depends on the experience and the aesthetic taste of designers. To help facilitate designers in the ideation and creation process, we propose a design space, VizBelle, based on the analysis of 361 classified visualizations from online sources. VizBelle consists of four dimensions, namely, communication goal to fit user intention, object to select the target area, strategy and technique to offer potential approaches. We further provide a website to present detailed explanations and examples of various techniques. We conducted a within-subject study with 20 professional and amateur design enthusiasts to evaluate the effectiveness of our design space. Results show that our design space is illuminating and useful for designers to create data visualizations with embellishments.

Submitted 8 September, 2022; originally announced September 2022.

arXiv:2208.07156 [pdf, other]

Cooperative guidance of multiple missiles: a hybrid co-evolutionary approach

Authors: Xuejing Lan, Junda Chen, Zhijia Zhao, Tao Zou

Abstract: Cooperative guidance of multiple missiles is a challenging task with rigorous constraints of time and space consensus, especially when attacking dynamic targets. In this paper, the cooperative guidance task is described as a distributed multi-objective cooperative optimization problem. To address the issues of non-stationarity and continuous control faced by cooperative guidance, the natural evolutionary strategy (NES) is improved along with an elitist adaptive learning technique to develop a novel natural co-evolutionary strategy (NCES). The gradients of original evolutionary strategy are rescaled to reduce the estimation bias caused by the interaction between the multiple missiles. Then, a hybrid co-evolutionary cooperative guidance law (HCCGL) is proposed by integrating the highly scalable co-evolutionary mechanism and the traditional guidance strategy. Finally, three simulations under different conditions demonstrate the effectiveness and superiority of this guidance law in solving cooperative guidance tasks with high accuracy. The proposed co-evolutionary approach has great prospects not only in cooperative guidance, but also in other application scenarios of multi-objective optimization, dynamic optimization and distributed control.

Submitted 14 April, 2023; v1 submitted 15 August, 2022; originally announced August 2022.

ACM Class: F.2.2; J.2

arXiv:2208.04518 [pdf]

doi 10.1145/3555585

A Mixed-Methods Analysis of the Algorithm-Mediated Labor of Online Food Deliverers in China

Authors: Zhilong Chen, Xiaochong Lan, Jinghua Piao, Yunke Zhang, Yong Li

Abstract: In recent years, China has witnessed the proliferation and success of the online food delivery industry, an emerging type of the gig economy. Online food deliverers who deliver the food from restaurants to customers play a critical role in enabling this industry. Mediated by algorithms and coupled with interactions with multiple stakeholders, this emerging kind of labor has been taken by millions of people. In this paper, we present a mixed-methods analysis to investigate this labor of online food deliverers and uncover how the mediation of algorithms shapes it. Combining large-scale quantitative data-driven investigations of 100,000 deliverers' behavioral data with in-depth qualitative interviews with 15 online food deliverers, we demonstrate their working activities, identify how algorithms mediate their delivery procedures, and reveal how they perceive their relationships with different stakeholders as a result of their algorithm-mediated labor. Our findings provide important implications for enabling better experiences and more humanized labor of deliverers as well as workers in gig economies of similar kinds.

Submitted 26 August, 2022; v1 submitted 8 August, 2022; originally announced August 2022.

Comments: Accepted to CSCW 2022

arXiv:2208.04122 [pdf]

doi 10.1145/3555646

Practitioners Versus Users: A Value-Sensitive Evaluation of Current Industrial Recommender System Design

Authors: Zhilong Chen, Jinghua Piao, Xiaochong Lan, Hancheng Cao, Chen Gao, Zhicong Lu, Yong Li

Abstract: Recommender systems are playing an increasingly important role in alleviating information overload and supporting users' various needs, e.g., consumption, socialization, and entertainment. However, limited research focuses on how values should be extensively considered in industrial deployments of recommender systems, the ignorance of which can be problematic. To fill this gap, in this paper, we adopt Value Sensitive Design to comprehensively explore how practitioners and users recognize different values of current industrial recommender systems. Based on conceptual and empirical investigations, we focus on five values: recommendation quality, privacy, transparency, fairness, and trustworthiness. We further conduct in-depth qualitative interviews with 20 users and 10 practitioners to delve into their opinions about these values. Our results reveal the existence and sources of tensions between practitioners and users in terms of value interpretation, evaluation, and practice, which provide novel implications for designing more human-centric and value-sensitive recommender systems.

Submitted 26 August, 2022; v1 submitted 8 August, 2022; originally announced August 2022.

Comments: Zhilong Chen and Jinghua Piao contribute equally to this work; Accepted to CSCW 2022

arXiv:2208.03668 [pdf, ps, other]

doi 10.1051/0004-6361/202243805

Interpreting time-integrated polarization data of gamma-ray burst prompt emission

Authors: R. Y. Guan, M. X. Lan

Abstract: Aims. With the accumulation of polarization data in the gamma-ray burst (GRB) prompt phase, polarization models can be tested. Methods. We predicted the time-integrated polarizations of 37 GRBs with polarization observation. We used their observed spectral parameters to do this. In the model, the emission mechanism is synchrotron radiation, and the magnetic field configuration in the emission region was assumed to be large-scale ordered. Therefore, the predicted polarization degrees (PDs) are upper limits. Results. For most GRBs detected by the Gamma-ray Burst Polarimeter (GAP), POLAR, and AstroSat, the predicted PD can match the corresponding observed PD. Hence the synchrotron-emission model in a large-scale ordered magnetic field can interpret both the moderately low PDs ($\sim10\%$) detected by POLAR and relatively high PDs ($\sim45\%$) observed by GAP and AstroSat well. Therefore, the magnetic fields in these GRB prompt phases or at least during the peak times are dominated by the ordered component. However, the predicted PDs of GRB 110721A observed by GAP and GRB 180427A observed by AstroSat are both lower than the observed values. Because the synchrotron emission in an ordered magnetic field predicts the upper-limit of the PD for the synchrotron-emission models, PD observations of the two bursts challenge the synchrotron-emission model. Then we predict the PDs of the High-energy Polarimetry Detector (HPD) and Low-energy Polarimetry Detector (LPD) on board the upcoming POLAR-2. In the synchrotron-emission models, the concentrated PD values of the GRBs detected by HPD will be higher than the LPD, which might be different from the predictions of the dissipative photosphere model. Therefore, more accurate multiband polarization observations are highly desired to test models of the GRB prompt phase.

Submitted 7 October, 2022; v1 submitted 7 August, 2022; originally announced August 2022.

Comments: 6 pages, 5 figures, with updated AstroSat data, accepted by AA

Journal ref: A&A 670, A160 (2023)

arXiv:2207.08794 [pdf, other]

DeFlowSLAM: Self-Supervised Scene Motion Decomposition for Dynamic Dense SLAM

Authors: Weicai Ye, Xingyuan Yu, Xinyue Lan, Yuhang Ming, Jinyu Li, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Abstract: We present a novel dual-flow representation of scene motion that decomposes the optical flow into a static flow field caused by the camera motion and another dynamic flow field caused by the objects' movements in the scene. Based on this representation, we present a dynamic SLAM, dubbed DeFlowSLAM, that exploits both static and dynamic pixels in the images to solve the camera poses, rather than simply using static background pixels as other dynamic SLAM systems do. We propose a dynamic update module to train our DeFlowSLAM in a self-supervised manner, where a dense bundle adjustment layer takes in estimated static flow fields and the weights controlled by the dynamic mask and outputs the residual of the optimized static flow fields, camera poses, and inverse depths. The static and dynamic flow fields are estimated by warping the current image to the neighboring images, and the optical flow can be obtained by summing the two fields. Extensive experiments demonstrate that DeFlowSLAM generalizes well to both static and dynamic scenes as it exhibits comparable performance to the state-of-the-art DROID-SLAM in static and less dynamic scenes while significantly outperforming DROID-SLAM in highly dynamic environments. The code and pre-trained model will be available on the project webpage: https://zju3dv.github.io/deflowslam/.

Submitted 13 January, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

Comments: Homepage: https://zju3dv.github.io/deflowslam
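
The DeFlowSLAM abstract above states that the optical flow is the sum of a static (camera-induced) field and a dynamic (object-induced) field. A minimal sketch of that composition follows. It is not taken from the project's code: the function names `compose_optical_flow` and `dynamic_residual` and the `(H, W, 2)` array layout are assumptions made for illustration only.

```python
import numpy as np


def compose_optical_flow(static_flow, dynamic_flow):
    """Return the full optical flow as static flow plus dynamic flow.

    Both inputs are assumed (hypothetically) to be arrays of shape
    (H, W, 2) holding per-pixel (dx, dy) displacements.
    """
    static_flow = np.asarray(static_flow, dtype=float)
    dynamic_flow = np.asarray(dynamic_flow, dtype=float)
    if static_flow.shape != dynamic_flow.shape:
        raise ValueError("flow fields must share the same shape")
    return static_flow + dynamic_flow


def dynamic_residual(optical_flow, static_flow):
    """Recover the dynamic field as optical flow minus static flow."""
    optical_flow = np.asarray(optical_flow, dtype=float)
    static_flow = np.asarray(static_flow, dtype=float)
    return optical_flow - static_flow


# Toy 2x2 image: the camera shifts every pixel by 1 px in x,
# and one pixel additionally moves 2 px in y because of an object.
static = np.zeros((2, 2, 2))
static[..., 0] = 1.0
dynamic = np.zeros((2, 2, 2))
dynamic[0, 0, 1] = 2.0
flow = compose_optical_flow(static, dynamic)
```

In this toy case the moving pixel ends up with displacement (1, 2) while the remaining pixels keep the camera-only displacement (1, 0).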

arXiv:2207.02705 [pdf, other]

Incentivizing Proof-of-Stake Blockchain for Secured Data Collection in UAV-Assisted IoT: A Multi-Agent Reinforcement Learning Approach

Authors: Xiao Tang, Xunqiang Lan, Lixin Li, Yan Zhang, Zhu Han

Abstract: The Internet of Things (IoT) can be conveniently deployed while empowering various applications, where the IoT nodes can form clusters to finish certain missions collectively. In this paper, we propose to employ unmanned aerial vehicles (UAVs) to assist the clustered IoT data collection with blockchain-based security provisioning. In particular, the UAVs generate candidate blocks based on the collected data, which are then audited through a lightweight proof-of-stake consensus mechanism within the UAV-based blockchain network. To motivate efficient blockchain while reducing the operational cost, a stake pool is constructed at the active UAV while encouraging stake investment from other UAVs with profit sharing. The problem is formulated to maximize the overall profit through the blockchain system in unit time by jointly investigating the IoT transmission, incentives through investment and profit sharing, and UAV deployment strategies. Then, the problem is solved in a distributed manner while being decoupled into two layers. The inner layer incorporates IoT transmission and incentive design, which are tackled with large-system approximation and one-leader-multi-follower Stackelberg game analysis, respectively. The outer layer for UAV deployment is undertaken with a multi-agent deep deterministic policy gradient approach. Results show the convergence of the proposed learning process and the UAV deployment, and demonstrate the superior performance of our proposal compared with the baselines.

Submitted 6 July, 2022; originally announced July 2022.

Comments: 14 pages, 10 figures, submitted to IEEE Journal

arXiv:2207.01610 [pdf, other]

PVO: Panoptic Visual Odometry

Authors: Weicai Ye, Xinyue Lan, Shuo Chen, Yuhang Ming, Xingyuan Yu, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Abstract: We present PVO, a novel panoptic visual odometry framework to achieve more comprehensive modeling of the scene motion, geometry, and panoptic segmentation information. Our PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which makes the two tasks mutually beneficial. Specifically, we introduce a panoptic update module into the VO Module with the guidance of image panoptic segmentation. This Panoptic-Enhanced VO Module can alleviate the impact of dynamic objects in the camera pose estimation with a panoptic-aware dynamic mask. On the other hand, the VO-Enhanced VPS Module also improves the segmentation accuracy by fusing the panoptic segmentation result of the current frame on the fly to the adjacent frames, using geometric information such as camera pose, depth, and optical flow obtained from the VO Module. These two modules contribute to each other through recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks.

Submitted 26 March, 2023; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: CVPR2023 Project page: https://zju3dv.github.io/pvo/ code: https://github.com/zju3dv/PVO

arXiv:2203.05243 [pdf, other]

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach

Authors: Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu

Abstract: Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve SOTA performance. In this paper, we take a closer look at existing evaluation protocols, and find both the prevailing dataset and evaluation metrics are the devils that lead to untrustworthy benchmarking. Therefore, we propose to re-organize the two widely-used datasets, making the ground-truth moment distributions different in the training and test splits, i.e., out-of-distribution (OOD) test. Meanwhile, we introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets. New benchmarking results indicate that our proposed evaluation protocols can better monitor the research progress. Furthermore, we propose a novel causality-based Multi-branch Deconfounding Debiasing (MDD) framework for unbiased moment prediction. Specifically, we design a multi-branch deconfounder to eliminate the effects caused by multiple confounders with causal intervention. In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding. Specifically, for textual information, the query is parsed into several verb-centered phrases to obtain a more fine-grained textual feature. For visual information, the positional information has been decomposed from moment features to enhance representations of moments with diverse locations. Extensive experiments demonstrate that our proposed approach can achieve competitive results among existing SOTA approaches and outperform the base model with great gains.

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2203.01217 [pdf, other]

Hybrid Tracker with Pixel and Instance for Video Panoptic Segmentation

Authors: Weicai Ye, Xinyue Lan, Ge Su, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Abstract: Video Panoptic Segmentation (VPS) aims to generate coherent panoptic segmentation and track the identities of all pixels across video frames. Existing methods predominantly utilize the trained instance embedding to keep the consistency of panoptic segmentation. However, they inevitably struggle to cope with the challenges of small objects, similar appearance but inconsistent identities, occlusion, and strong instance contour deformations. To address these problems, we present HybridTracker, a lightweight and joint tracking model attempting to eliminate the limitations of the single tracker. HybridTracker performs pixel tracker and instance tracker in parallel to obtain the association matrices, which are fused into a matching matrix. In the instance tracker, we design a differentiable matching layer, ensuring the stability of inter-frame matching. In the pixel tracker, we compute the dice coefficient of the same instance of different frames given the estimated optical flow, forming the Intersection Over Union (IoU) matrix. We additionally propose mutual check and temporal consistency constraints during inference to settle the occlusion and contour deformation challenges. Comprehensive experiments show that HybridTracker achieves superior performance than state-of-the-art methods on Cityscapes-VPS and VIPER datasets.

Submitted 11 December, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2203.00865 [pdf]

Beyond Virtual Bazaar: How Social Commerce Promotes Inclusivity for the Traditionally Underserved Community in Chinese Developing Regions

Authors: Zhilong Chen, Hancheng Cao, Xiaochong Lan, Zhicong Lu, Yong Li

Abstract: The disadvantaged population is often underserved and marginalized in technology engagement: prior works show they are generally more reluctant and experience more barriers in adopting and engaging with mainstream technology. Here, we contribute to the HCI4D and ICTD literature through a novel "counter" case study on Chinese social commerce (e.g., Pinduoduo), which 1) first prospers among the traditionally underserved community from developing regions ahead of the more technologically advantaged communities, and 2) has been heavily engaged by this community. Through 12 in-depth interviews with social commerce users from the traditionally underserved community in Chinese developing regions, we demonstrate how social commerce, acting as a "counter", brings online the traditional offline socioeconomic lives the community has lived for ages, fits into the community's social, cultural, and economic context, and thus effectively promotes technology inclusivity. Our work provides novel insights and implications for building inclusive technology for the "next billion" population.

Submitted 1 March, 2022; originally announced March 2022.

Comments: Zhilong Chen and Hancheng Cao contribute equally to this work; Accepted to CHI 2022

arXiv:2202.03631 [pdf, ps, other]

Robotic Grasping from Classical to Modern: A Survey

Authors: Hanbo Zhang, Jian Tang, Shiguang Sun, Xuguang Lan

Abstract: Robotic Grasping has always been an active topic in robotics since grasping is one of the fundamental but most challenging skills of robots. It demands the coordination of robotic perception, planning, and control for robustness and intelligence. However, current solutions are still far behind humans, especially when confronting unstructured scenarios. In this paper, we survey the advances of robotic grasping, starting from the classical formulations and solutions to the modern ones. By reviewing the history of robotic grasping, we want to provide a complete view of this community, and perhaps inspire the combination and fusion of different ideas, which we think would be helpful to touch and explore the essence of robotic grasping problems. In detail, we firstly give an overview of the analytic methods for robotic grasping. After that, we provide a discussion on the recent state-of-the-art data-driven grasping approaches rising in recent years. With the development of computer vision, semantic grasping is being widely investigated and can be the basis of intelligent manipulation and skill learning for autonomous robotic systems in the future. Therefore, in our survey, we also briefly review the recent progress in this topic. Finally, we discuss the open problems and the future research directions that may be important for the human-level robustness, autonomy, and intelligence of robots.

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2112.05101 [pdf]

The In-Flight Realtime Trigger and Localization Software of GECAM

Authors: Xiao-Yun Zhao, Shao-Lin Xiong, Xiang-Yang Wen, Xin-Qiao Li, Ce Cai, Shuo Xiao, Qi Luo, Wen-Xi Peng, Dong-Ya Guo, Zheng-Hua An, Ke Gong, Jin-Yuan Liao, Yan-Qiu Zhang, Yue Huang, Lu Li, Xing Wen, Fei Zhang, Jing Duan, Chen-Wei Wang, Dong-Li Shi, Peng Zhang, Qi-Bin Yi, Chao-Yang Li, Yan-Bing Xu, Xiao-Hua Liang, et al. (64 additional authors not shown)

Abstract: Realtime trigger and localization of bursts are the key functions of GECAM, which is an all-sky gamma-ray monitor launched on Dec 10, 2020. We developed a multifunctional trigger and localization software operating on the CPU of the GECAM electronic box (EBOX). This onboard software has the following features: high trigger efficiency for real celestial bursts with a suppression of false triggers caused by charged particle bursts and background fluctuation, dedicated localization algorithm optimized for short and long bursts respectively, short time latency of the trigger information which is downlinked through the BeiDou satellite navigation System (BDS). This paper presents the detailed design and development of this trigger and localization software system of GECAM, including the main functions, general design, workflow and algorithms, as well as the verification and demonstration of this software, including the on-ground trigger tests with simulated gamma-ray bursts made by a dedicated X-ray tube and the in-flight performance to real gamma-ray bursts and magnetar bursts.

Submitted 9 December, 2021; originally announced December 2021.

Comments: Draft, comments welcome

arXiv:2112.04454 [pdf, other]

Greedy-based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Authors: Lipeng Wan, Zeyang Liu, Xingyu Chen, Han Wang, Xuguang Lan

Abstract: Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they can not ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.

Submitted 3 July, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Showing 1–50 of 127 results for author: Lan, X