Mitigating Interpretation Bias in Rock Records with Large Language Models: Insights from Paleoenvironmental Analysis

Authors: Luoqi Wang, Haipeng Li, Linshu Hu, Jiarui Cai, Zhenhong Du

Abstract: The reconstruction of Earth's history faces significant challenges due to the nonunique interpretations often derived from rock records. The problem has long been recognized but there are no systematic solutions in practice. This study introduces an innovative approach that leverages Large Language Models (LLMs) along with retrieval-augmented generation and real-time search capabilities to counteract interpretation biases, thereby enhancing the accuracy and reliability of geological analyses. By applying this framework to sedimentology and paleogeography, we demonstrate its effectiveness in mitigating interpretation biases through the generation and evaluation of multiple hypotheses for the same data, which can effectively reduce human bias. Our research illuminates the transformative potential of LLMs in refining paleoenvironmental studies and extends their applicability across various sub-disciplines of Earth sciences, enabling a deeper and more accurate depiction of Earth's evolution.

Submitted 17 May, 2024; originally announced July 2024.

arXiv:2407.08366 [pdf, other]

An Economic Framework for 6-DoF Grasp Detection

Authors: Xiao-Ming Wu, Jia-Feng Cai, Jian-Jian Jiang, Dian Zheng, Yi-Lin Wei, Wei-Shi Zheng

Abstract: Robotic grasping in clutter is a fundamental task in robotic manipulation. In this work, we propose an economic framework for 6-DoF grasp detection, aiming to economize the resource cost in training while maintaining effective grasp performance. To begin with, we discover that dense supervision is the bottleneck of current SOTA methods: it severely encumbers the entire training process and makes the training difficult to converge. To solve this problem, we first propose an economic supervision paradigm for efficient and effective grasping. This paradigm includes a well-designed supervision selection strategy, which selects key labels that are basically without ambiguity, and an economic pipeline to enable training after selection. Furthermore, benefiting from the economic supervision, we can focus on a specific grasp, and thus we devise a focal representation module, which comprises an interactive grasp head and a composite score estimation to generate the specific grasp more accurately. Combining all of these, the EconomicGrasp framework is proposed. Our extensive experiments show that EconomicGrasp surpasses the SOTA grasp method by about 3 AP on average at extremely low resource cost: about 1/4 of the training time, 1/8 of the memory and 1/30 of the storage. Our code is available at https://github.com/iSEE-Laboratory/EconomicGrasp.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 19 pages, 7 figures. Accepted in ECCV 2024!

arXiv:2407.06662 [pdf, other]

Experimental Demonstration of 16D Voronoi Constellation with Two-Level Coding over 50km Four-Core Fiber

Authors: Can Zhao, Bin Chen, Jiaqi Cai, Zhiwei Liang, Yi Lei, Junjie Xiong, Lin Ma, Daohui Hu, Lin Sun, Gangxiang Shen

Abstract: A 16-dimensional Voronoi constellation concatenated with multilevel coding is experimentally demonstrated over a 50km four-core fiber transmission system. The proposed scheme reduces the required launch power by 6dB and provides a 17dB larger operating range than 16QAM with BICM at the outer HD-FEC BER threshold.

Submitted 9 July, 2024; originally announced July 2024.

Comments: 4 pages, 4 figures, accepted by 2024 European Conference on Optical Communication (ECOC)

arXiv:2407.06612 [pdf]

AI-based Automatic Segmentation of Prostate on Multi-modality Images: A Review

Authors: Rui Jin, Derun Li, Dehui Xiang, Lei Zhang, Hailing Zhou, Fei Shi, Weifang Zhu, Jing Cai, Tao Peng, Xinjian Chen

Abstract: Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The advent of precision medicine and a significant increase in clinical capacity have spurred the need for various data-driven tasks in the field of medical imaging. Recently, numerous machine learning and data mining tools have been integrated into various medical areas, including image segmentation. This article proposes a new classification method that differentiates supervision types, either in number or kind, during the training phase. Subsequently, we conducted a survey on artificial intelligence (AI)-based automatic prostate segmentation methods, examining the advantages and limitations of each. Additionally, we introduce variants of evaluation metrics for the verification and performance assessment of the segmentation method and summarize the current challenges. Finally, future research directions and development trends are discussed, reflecting the outcomes of our literature survey, suggesting high-precision detection and treatment of prostate cancer as a promising avenue.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.04938 [pdf, other]

SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation

Authors: Guoan Wang, Jin Ye, Junlong Cheng, Tianbin Li, Zhaolin Chen, Jianfei Cai, Junjun He, Bohan Zhuang

Abstract: Volumetric medical image segmentation is pivotal in enhancing disease diagnosis, treatment planning, and advancing medical research. While existing volumetric foundation models for medical image segmentation, such as SAM-Med3D and SegVol, have shown remarkable performance on general organs and tumors, their ability to segment certain categories in clinical downstream tasks remains limited. Supervised Finetuning (SFT) serves as an effective way to adapt such foundation models for task-specific downstream tasks but at the cost of degrading the general knowledge previously stored in the original foundation model. To address this, we propose SAM-Med3D-MoE, a novel framework that seamlessly integrates task-specific finetuned models with the foundational model, creating a unified model at minimal additional training expense for an extra gating network. This gating network, in conjunction with a selection strategy, allows the unified model to achieve performance comparable to that of the original models in their respective tasks, both general and specialized, without updating any of their parameters. Our comprehensive experiments demonstrate the efficacy of SAM-Med3D-MoE, with an average Dice performance increase from 53 to 56.4 on 15 specific classes. It achieves especially remarkable gains of 29.6, 8.5, and 11.2 on the spinal cord, esophagus, and right hip, respectively. Additionally, it achieves 48.9 Dice on the challenging SPPIN2023 Challenge, significantly surpassing the general expert's performance of 32.3. We anticipate that SAM-Med3D-MoE can serve as a new framework for adapting the foundation model to specific areas in medical image analysis. Codes and datasets will be publicly available.

Submitted 5 July, 2024; originally announced July 2024.

Journal ref: MICCAI 2024

arXiv:2407.02203 [pdf, other]

Automatic Adaptation Rule Optimization via Large Language Models

Authors: Yusei Ishimizu, Jialong Li, Jinglue Xu, Jinyu Cai, Hitoshi Iba, Kenji Tei

Abstract: Rule-based adaptation is a foundational approach to self-adaptation, characterized by its human readability and rapid response. However, building high-performance and robust adaptation rules is often a challenge because it essentially involves searching for the optimal design in a complex (variables) space. In response, this paper attempts to employ large language models (LLMs) as an optimizer to construct and optimize adaptation rules, leveraging the common sense and reasoning capabilities inherent in LLMs. Preliminary experiments conducted in SWIM have validated the effectiveness and limitations of our method.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01469 [pdf, other]

Unrolling Plug-and-Play Gradient Graph Laplacian Regularizer for Image Restoration

Authors: Jianghe Cai, Gene Cheung, Fei Chen

Abstract: Generic deep learning (DL) networks for image restoration like denoising and interpolation lack mathematical interpretability, require voluminous training data to tune a large parameter set, and are fragile during covariate shift. To address these shortcomings, for a general linear image formation model, we first formulate a convex optimization problem with a new graph smoothness prior called gradient graph Laplacian regularizer (GGLR) that promotes piecewise planar (PWP) signal reconstruction. To solve the posed problem, we introduce a variable number of auxiliary variables to create a family of Plug-and-Play (PnP) ADMM algorithms and unroll them into variable-complexity feed-forward networks, amenable to parameter tuning via back-propagation. More complex unrolled networks require more labeled data to train more parameters, but have better potential performance. Experimental results show that our unrolled networks perform competitively to generic DL networks in image restoration quality while using a small fraction of parameters, and demonstrate improved robustness to covariate shift.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00908 [pdf, other]

Fine-grained, Multi-dimensional Summarization Evaluation with LLMs

Authors: Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour

Abstract: Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.

Submitted 9 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

Comments: Accepted at ACL 2024 (main, long)

arXiv:2406.19435 [pdf, other]

A Sanity Check for AI-generated Image Detection

Authors: Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, Weidi Xie

Abstract: With the rapid development of generative models, discerning AI-generated content has evoked increasing attention from both industry and academia. In this paper, we conduct a sanity check on "whether the task of AI-generated image detection has been solved". To start with, we present the Chameleon dataset, consisting of AI-generated images that are genuinely challenging for human perception. To quantify the generalization of existing methods, we evaluate 9 off-the-shelf AI-generated image detectors on the Chameleon dataset. Upon analysis, almost all models classify AI-generated images as real ones. Later, we propose AIDE (AI-generated Image DEtector with Hybrid Features), which leverages multiple experts to simultaneously extract visual artifacts and noise patterns. Specifically, to capture the high-level semantics, we utilize CLIP to compute the visual embedding. This effectively enables the model to discern AI-generated images based on semantics or contextual information. Secondly, we select the highest frequency patches and the lowest frequency patches in the image, and compute the low-level patchwise features, aiming to detect AI-generated images by low-level artifacts, for example, noise patterns, anti-aliasing, etc. When evaluated on existing benchmarks, for example, AIGCDetectBenchmark and GenImage, AIDE achieves +3.5% and +4.6% improvements over state-of-the-art methods, and on our proposed challenging Chameleon benchmark it also achieves promising results, although the problem of detecting AI-generated images is far from being solved. The dataset, codes, and pre-trained models will be published at https://github.com/shilinyan99/AIDE.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Project page: https://shilinyan99.github.io/AIDE Code: https://github.com/shilinyan99/AIDE

arXiv:2406.14927 [pdf, other]

Gaussian-Informed Continuum for Physical Property Identification and Simulation

Authors: Junhao Cai, Yuji Yang, Weihao Yuan, Yisheng He, Zilong Dong, Liefeng Bo, Hui Cheng, Qifeng Chen

Abstract: This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 19 pages, 8 figures

arXiv:2406.12846 [pdf, other]

DrVideo: Document Retrieval Based Long Video Understanding

Authors: Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai

Abstract: Existing methods for long video understanding primarily focus on videos only lasting tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo transforms a long video into a text-based long document to initially retrieve key frames and augment the information of these frames, which is used as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions in a chain-of-thought manner once sufficient question-related information is gathered. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods with +3.8 accuracy on EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).

Submitted 18 June, 2024; originally announced June 2024.

Comments: 11 pages

arXiv:2406.09680 [pdf, other]

Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks

Authors: Yingchao Yu, Yuping Yan, Jisong Cai, Yaochu Jin

Abstract: Federated learning (FL) has emerged as a promising paradigm for training models on decentralized data while safeguarding data privacy. Most existing FL systems, however, assume that all machine learning models are of the same type, although it becomes more likely that different edge devices adopt different types of AI models, including both conventional analogue artificial neural networks (ANNs) and biologically more plausible spiking neural networks (SNNs). This diversity empowers the efficient handling of specific tasks and requirements, showcasing the adaptability and versatility of edge computing platforms. One main challenge of such a heterogeneous FL system lies in effectively aggregating models from the local devices in a privacy-preserving manner. To address the above issue, this work benchmarks FL systems containing both convolutional neural networks (CNNs) and SNNs by comparing various aggregation approaches, including federated CNNs, federated SNNs, federated CNNs for SNNs, federated SNNs for CNNs, and federated CNNs with SNN fusion. Experimental results demonstrate that the CNN-SNN fusion framework exhibits the best performance among the above settings on the MNIST dataset. Additionally, intriguing phenomena of competitive suppression are noted during the convergence process of multi-model FL.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 8 pages, 5 figures, FL@FM-IJCAI'24

arXiv:2406.09591 [pdf]

Ferromagnetism and Topology of the Higher Flat Band in a Fractional Chern Insulator

Authors: Heonjoon Park, Jiaqi Cai, Eric Anderson, Xiao-Wei Zhang, Xiaoyu Liu, William Holtzmann, Weijie Li, Chong Wang, Chaowei Hu, Yuzhou Zhao, Takashi Taniguchi, Kenji Watanabe, Jihui Yang, David Cobden, Jiun-Haw Chu, Nicolas Regnault, B. Andrei Bernevig, Liang Fu, Ting Cao, Di Xiao, Xiaodong Xu

Abstract: The recent observation of the fractional quantum anomalous Hall effect in moiré fractional Chern insulators (FCI) provides opportunities for investigating zero magnetic field anyons. So far, both experimental and theoretical results suggest that filling > 1/3 FCI states in the first Chern band share features with those of the lowest Landau level (LL). To create the possibility of realizing non-Abelian anyons, one route is to engineer higher flat Chern bands that mimic higher LLs. Here, we investigate the interaction, topology, and ferromagnetism of the second moiré miniband in twisted MoTe2 bilayer (tMoTe2). Around filling factor v = -3, i.e., half-filling of the second miniband, we uncover spontaneous ferromagnetism and an incipient Chern insulator state. By measuring the anomalous Hall effect as a function of twist angle, we find that the Chern numbers (C) of the top two moiré flat bands have opposite sign (C = ∓1) at twist angles above 3.1° but the same sign (C = -1) around 2.6°. This observation is consistent with the recently predicted twist-angle dependent band topology, resulting from the competition between moiré ferroelectricity and piezoelectricity. As we increase the magnetic field, only the small twist-angle device (2.6°) experiences a topological phase transition with an emergent C = -2 state. This is attributed to a Zeeman field-induced band crossing between opposite valleys, with the determined C = -1 for the top two bands. Our results lay a firm foundation for understanding the higher flat Chern bands, which is essential for the prediction or discovery of non-Abelian FCIs.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 24 pages, 4 figures

arXiv:2406.09041 [pdf, other]

ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

Authors: Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang

Abstract: The typical process for developing LLMs involves pre-training a general foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts poses challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests incurs substantial I/O costs, increasing latency and expenses. Previous approaches decompose expert weights into pre-trained model weights and residual delta weights, then quantize the delta weights to reduce model size. However, these methods often lead to significant quantization errors at extremely low bitwidths and assume the appropriate model for a user request is known in advance, which is not practical. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing non-salient input channels of delta weights to extremely low bits while keeping salient ones intact, significantly reducing storage demands while maintaining performance. Additionally, we develop a routing method that efficiently directs user queries to the most suitable expert by transforming the model selection problem into a domain classification problem. Extensive experiments show ME-Switch's promising memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces model size by 1.74x while maintaining nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Furthermore, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Tech report

arXiv:2406.08698 [pdf, other]

Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (255 additional authors not shown)

Abstract: In this work we search for signals generated by ultra-heavy dark matter in the Large High Altitude Air Shower Observatory (LHAASO) data. We look for possible gamma rays from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter, as they have low fluxes of astrophysical $γ$-ray background and large amounts of dark matter. By analyzing more than 700 days of observational data at LHAASO, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly, we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 17 pages, 12 figures, accepted by PRL

arXiv:2406.05641 [pdf, other]

PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction

Authors: Shangyu Chen, Zizheng Pan, Jianfei Cai, Dinh Phung

Abstract: Personalizing a large-scale pretrained Text-to-Image (T2I) diffusion model is challenging as it typically struggles to make an appropriate trade-off between its training data distribution and the target distribution, i.e., learning a novel concept with only a few target images to achieve personalization (aligning with the personalized target) while preserving text editability (aligning with diverse text prompts). In this paper, we propose PaRa, an effective and efficient Parameter Rank Reduction approach for T2I model personalization by explicitly controlling the rank of the diffusion model parameters to restrict its initial diverse generation space into a small and well-balanced target space. Our design is motivated by the fact that taming a T2I model toward a novel concept such as a specific art style implies a small generation space. To this end, by reducing the rank of model parameters during finetuning, we can effectively constrain the space of the denoising sampling trajectories towards the target. With comprehensive experiments, we show that PaRa achieves great advantages over existing finetuning approaches on single/multi-subject generation as well as single-image editing. Notably, compared to the prevailing fine-tuning technique LoRA, PaRa achieves better parameter efficiency (2x fewer learnable parameters) and much better target image alignment.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05588 [pdf, other]

CERET: Cost-Effective Extrinsic Refinement for Text Generation

Authors: Jason Cai, Hang Su, Monica Sunkara, Igor Shalyminov, Saab Mansour

Abstract: Large Language Models (LLMs) are powerful models for generation tasks, but they may not generate good quality outputs in their first attempt. Apart from model fine-tuning, existing approaches to improve prediction accuracy and quality typically involve LLM self-improvement / self-reflection that incorporate feedback from models themselves. Despite their effectiveness, these methods are hindered by their high computational cost and lack of scalability. In this work, we propose CERET, a method for refining text generations by considering semantic stability, entailment and inter-sample uncertainty measures. Experimental results show that CERET outperforms Self-consistency and Self-rerank baselines consistently under various task setups, by ~1.6% in Rouge-1 for abstractive summarization and ~3.5% in hit rate for question answering. Compared to LLM Self-rerank method, our approach only requires 9.4% of its latency and is more cost-effective.

Submitted 8 June, 2024; originally announced June 2024.

Comments: The source code and data samples are released at https://github.com/amazon-science/CERET-LLM-refine

arXiv:2406.04101 [pdf, other]

How Far Can We Compress Instant-NGP-Based NeRF?

Authors: Yihang Chen, Qianyi Wu, Mehrtash Harandi, Jianfei Cai

Abstract: In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with implicit NeRF representation, which however results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100$\times$ and 70$\times$ with improved fidelity against the baseline Instant-NGP on Synthetic-NeRF and Tanks and Temples datasets, respectively. Additionally, we attain 86.7% and 82.3% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: https://github.com/YihangChen-ee/CNC.

Submitted 6 June, 2024; originally announced June 2024.

Comments: Project Page: https://yihangchen-ee.github.io/project_cnc/ Code: https://github.com/yihangchen-ee/cnc/. We further propose a 3DGS compression method HAC, which is based on CNC: https://yihangchen-ee.github.io/project_hac/

Journal ref: CVPR 2024

arXiv:2406.00985 [pdf, other]

MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models

Authors: Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, Siwei Lyu

Abstract: Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying these methods sequentially for multi-aspect edits increases computational demands and efficiency losses. In this paper, we address these challenges with significant contributions. Our main contribution is the development of MultiEdits, a method that seamlessly manages simultaneous edits across multiple attributes. In contrast to previous approaches, MultiEdits not only preserves the quality of single attribute edits but also significantly improves the performance of multitasking edits. This is achieved through an innovative attention distribution mechanism and a multi-branch design that operates across several processing heads. Additionally, we introduce the PIE-Bench++ dataset, an expansion of the original PIE-Bench dataset, to better support evaluating image-editing tasks involving multiple objects and attributes simultaneously. This dataset is a benchmark for evaluating text-driven image editing methods in multifaceted scenarios. Dataset and code are available at https://mingzhenhuang.com/projects/MultiEdits.html.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.19308 [pdf, other]

Visualizing the microscopic origins of topology in twisted molybdenum ditelluride

Authors: Ellis Thompson, Keng Tou Chu, Florie Mesple, Xiao-Wei Zhang, Chaowei Hu, Yuzhou Zhao, Heonjoon Park, Jiaqi Cai, Eric Anderson, Kenji Watanabe, Takashi Taniguchi, Jihui Yang, Jiun-Haw Chu, Xiaodong Xu, Ting Cao, Di Xiao, Matthew Yankowitz

Abstract: In moiré materials with flat electronic bands and suitable quantum geometry, strong correlations can give rise to novel topological states of matter. The nontrivial band topology of twisted molybdenum ditelluride (tMoTe$_2$) -- responsible for its fractional quantum anomalous Hall (FQAH) states -- is predicted to arise from a layer-pseudospin skyrmion lattice. Tracing the layer polarization of wavefunctions within the moiré unit cell can thus offer crucial insights into the band topology. Here, we use scanning tunneling microscopy and spectroscopy (STM/S) to probe the layer-pseudospin skyrmion textures of tMoTe$_2$. We do this by simultaneously visualizing the moiré lattice structure and the spatial localization of its electronic states. We find that the wavefunctions associated with the topological flat bands exhibit a spatially-dependent layer polarization within the moiré unit cell. This is in excellent agreement with our theoretical modeling, thereby revealing a direct microscopic connection between the structural properties of tMoTe$_2$ and its band topology. Our work enables new pathways for engineering FQAH states with strain, as well as future STM studies of the intertwined correlated and topological states arising in gate-tunable devices.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 7 pages, 4 figures, Extended Data, 9 figures, Supplementary Information, 8 pages, 5 figures

arXiv:2405.10269 [pdf, other]

Direct magnetic imaging of fractional Chern insulators in twisted MoTe$_2$ with a superconducting sensor

Authors: Evgeny Redekop, Canxun Zhang, Heonjoon Park, Jiaqi Cai, Eric Anderson, Owen Sheekey, Trevor Arp, Grigory Babikyan, Samuel Salters, Kenji Watanabe, Takashi Taniguchi, Xiaodong Xu, Andrea F. Young

Abstract: In the absence of time reversal symmetry, orbital magnetization provides a sensitive probe of topology and interactions, with particularly rich phenomenology in Chern insulators where topological edge states carry large equilibrium currents. Here, we use a nanoscale superconducting sensor to map the magnetic fringe fields in twisted bilayers of MoTe$_2$, where transport and optical sensing experiments have revealed the formation of fractional Chern insulator (FCI) states at zero magnetic field. At a temperature of 1.6K, we observe oscillations in the local magnetic field associated with fillings $ν=-1,-2/3,-3/5,-4/7$ and $-5/9$ of the first moiré hole band, consistent with the formation of FCIs at these fillings. By quantitatively reconstructing the magnetization, we determine the local thermodynamic gaps of the most robust FCI state at $ν=-2/3$, finding $^{-2/3}Δ$ as large as 7 meV. Spatial mapping of the charge density- and displacement field-tuned magnetic phase diagram further allows us to characterize sample disorder, which we find to be dominated by both inhomogeneity in the effective unit cell area as well as inhomogeneity in the band edge offset and bound dipole moment. Our results highlight both the challenges posed by structural disorder in the study of twisted homobilayer moiré systems and the opportunities afforded by the remarkably robust nature of the underlying correlated topological states.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09463 [pdf, other]

Gaze-DETR: Using Expert Gaze to Reduce False Positives in Vulvovaginal Candidiasis Screening

Authors: Yan Kong, Sheng Wang, Jiangdong Cai, Zihao Zhao, Zhenrong Shen, Yonghao Li, Manman Fei, Qian Wang

Abstract: Accurate detection of vulvovaginal candidiasis is critical for women's health, yet its sparse distribution and visually ambiguous characteristics pose significant challenges for accurate identification by pathologists and neural networks alike. Our eye-tracking data reveals that areas garnering sustained attention - yet not marked by experts after deliberation - are often aligned with false positives of neural networks. Leveraging this finding, we introduce Gaze-DETR, a pioneering method that integrates gaze data to enhance neural network precision by diminishing false positives. Gaze-DETR incorporates a universal gaze-guided warm-up protocol applicable across various detection methods and a gaze-guided rectification strategy specifically designed for DETR-based models. Our comprehensive tests confirm that Gaze-DETR surpasses existing leading methods, showcasing remarkable improvements in detection accuracy and generalizability.

Submitted 15 May, 2024; originally announced May 2024.

Comments: MICCAI-2024 early accept. Our code is available at https://github.com/YanKong0408/Gaze-DETR

arXiv:2405.09153 [pdf, other]

Adapting Abstract Meaning Representation Parsing to the Clinical Narrative -- the SPRING THYME parser

Authors: Jon Z. Cai, Kristin Wright-Bettner, Martha Palmer, Guergana K. Savova, James H. Martin

Abstract: This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale. Leveraging the colon cancer dataset from the Temporal Histories of Your Medical Events (THYME) corpus, we adapted a state-of-the-art AMR parser utilizing continuous training. Our approach incorporates data augmentation techniques to enhance the accuracy of AMR structure predictions. Notably, through this learning strategy, our parser achieved an impressive F1 score of 88% on the THYME corpus's colon cancer dataset. Moreover, our research delved into the efficacy of data required for domain adaptation within the realm of clinical notes, presenting domain adaptation data requirements for AMR parsing. This exploration not only underscores the parser's robust performance but also highlights its potential in facilitating a deeper understanding of clinical narratives through structured semantic representations.

Submitted 15 May, 2024; originally announced May 2024.

Comments: Accepted to the 6th Clinical NLP Workshop at NAACL, 2024

arXiv:2405.07691 [pdf, other]

Discovery of Very-high-energy Gamma-ray Emissions from the Low Luminosity AGN NGC 4278 by LHAASO

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen , et al. (255 additional authors not shown)

Abstract: The first source catalog of the Large High Altitude Air Shower Observatory reported the detection of a very-high-energy gamma-ray source, 1LHAASO J1219+2915. In this paper, a further detailed study of the spectral and temporal behavior of this point-like source has been carried out. The best-fit position of the TeV source ($\rm{RA}=185.05^{\circ}\pm0.04^{\circ}$, $\rm{Dec}=29.25^{\circ}\pm0.03^{\circ}$) is compatible with NGC 4278 within $\sim0.03$ degree. Variation analysis shows an indication of variability at the level of a few months in the TeV band, which is consistent with low frequency observations. Based on these observations, we report the detection of TeV $γ$-ray emissions from this low-luminosity AGN NGC 4278. The observations by LHAASO-WCDA during the active period have a significance level of 8.8\,$σ$ with best-fit photon spectral index $\varGamma=2.56\pm0.14$ and a flux $f_{1-10\,\rm{TeV}}=(7.0\pm1.1_{\rm{sta}}\pm0.35_{\rm{syst}})\times10^{-13}\,\rm{photons\,cm^{-2}\,s^{-1}}$, or approximately $5\%$ of the Crab Nebula. The discovery of VHE emission from NGC 4278 indicates that the compact, weak radio jet can efficiently accelerate particles and emit TeV photons.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 11 pages, 5 figures

arXiv:2405.04434 [pdf, other]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding , et al. (132 additional authors not shown)

Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.03806 [pdf, other]

In Situ AI Prototyping: Infusing Multimodal Prompts into Mobile Settings with MobileMaker

Authors: Savvas Petridis, Michael Xieyang Liu, Alexander J. Fiannaca, Vivian Tsai, Michael Terry, Carrie J. Cai

Abstract: Recent advances in multimodal large language models (LLMs) have lowered the barriers to rapidly prototyping AI-powered features via prompting, especially for mobile-intended use cases. Despite the value of situated user feedback, the process of soliciting early, mobile-situated user feedback on AI prototypes remains challenging. The broad scope and flexibility of LLMs means that, for a given use-case-specific prototype, there is a crucial need to understand the wide range of in-the-wild input likely to be provided by the user, as well as their in-context expectations of the AI's behavior. To explore the concept of in situ AI prototyping and testing, we created MobileMaker: an AI prototyping tool that enables designers to rapidly create mobile AI prototypes that can be tested on-device, and enables testers to make on-device, in-the-field revisions of the prototype through natural language. In an exploratory study with 16 users, we explored how user feedback on prototypes created with MobileMaker compares to that of existing prototyping tools (e.g., Figma, prompt editors). We found that MobileMaker prototypes enabled more serendipitous discovery of: model input edge cases, discrepancies between AI's and user's in-context interpretation of the task, and contextual signals missed by the AI. Furthermore, we learned that while the ability to make in-the-wild revisions led users to feel more fulfilled as active participants in the design process, it might also constrain their feedback to the subset of changes perceived as more actionable or implementable by the prototyping tool.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.03229 [pdf, ps, other]

Spectral conditions for the existence of (doubly) chorded cycles in graphs with fixed size

Authors: Jin Cai, Leyou Xu, Bo Zhou

Abstract: A chorded cycle is a cycle with at least one chord, and a doubly chorded cycle is a cycle with at least two chords. Gould asked in [Graphs Comb. 38 (2022) 189] the question: What spectral conditions imply a graph contains a chorded cycle? For a graph with fixed size, extremal spectral conditions are given to ensure that a graph contains a chorded cycle and a doubly chorded cycle, respectively, via spectral radius.

Submitted 6 May, 2024; originally announced May 2024.
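
Illustration (not taken from the paper): the two definitions in the abstract above can be checked mechanically for a given cycle. The sketch below assumes a simple undirected graph given as an edge list and a cycle given as an ordered vertex list; the function names `count_chords`, `is_chorded`, and `is_doubly_chorded` are hypothetical and chosen only for this example. It does not touch the spectral conditions the paper studies.

```python
from itertools import combinations


def count_chords(edges, cycle):
    """Count the chords of `cycle` in the simple undirected graph `edges`.

    `cycle` lists the cycle's vertices in order (the first vertex is not
    repeated at the end). A chord is a graph edge that joins two cycle
    vertices which are not consecutive on the cycle.
    """
    edge_set = {frozenset(e) for e in edges}
    n = len(cycle)
    # Edges that make up the cycle itself.
    cycle_edges = {frozenset((cycle[i], cycle[(i + 1) % n])) for i in range(n)}
    if not cycle_edges <= edge_set:
        raise ValueError("cycle is not contained in the graph")
    return sum(
        1
        for u, v in combinations(cycle, 2)
        if frozenset((u, v)) in edge_set and frozenset((u, v)) not in cycle_edges
    )


def is_chorded(edges, cycle):
    """True if the cycle has at least one chord."""
    return count_chords(edges, cycle) >= 1


def is_doubly_chorded(edges, cycle):
    """True if the cycle has at least two chords."""
    return count_chords(edges, cycle) >= 2
```

For example, the 4-cycle 0-1-2-3 has no chord on its own, one chord once the edge (0, 2) is added, and two chords in the complete graph on four vertices.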

arXiv:2405.02876 [pdf, ps, other]

Exploring the Improvement of Evolutionary Computation via Large Language Models

Authors: Jinyu Cai, Jinglue Xu, Jialong Li, Takuto Ymauchi, Hitoshi Iba, Kenji Tei

Abstract: Evolutionary computation (EC), as a powerful optimization algorithm, has been applied across various domains. However, as the complexity of problems increases, the limitations of EC have become more apparent. The advent of large language models (LLMs) has not only transformed natural language processing but also extended their capabilities to diverse fields. By harnessing LLMs' vast knowledge and adaptive capabilities, we provide a forward-looking overview of potential improvements LLMs can bring to EC, focusing on the algorithms themselves, population design, and additional enhancements. This presents a promising direction for future research at the intersection of LLMs and EC.

Submitted 23 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

Comments: accepted by GECCO 2024

arXiv:2405.02858 [pdf, ps, other]

Language Evolution for Evading Social Media Regulation via LLM-based Multi-agent Simulation

Authors: Jinyu Cai, Jialong Li, Mingyue Zhang, Munan Li, Chen-Shu Wang, Kenji Tei

Abstract: Social media platforms such as Twitter, Reddit, and Sina Weibo play a crucial role in global communication but often encounter strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify their way of communicating, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally evolves under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for ensuring freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation framework using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: a supervisory agent who enforces dialogue supervision and participant agents who evolve their language strategies while engaging in conversation, simulating the evolution of communication styles under strict regulations aimed at evading social media regulation. The study evaluates the framework's effectiveness through a range of scenarios from abstract scenarios to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, it was found that LLM agents adopt different strategies for different scenarios.

Submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted by IEEE WCCI 2024

arXiv:2405.01047 [pdf, ps, other]

Optimal Pricing for Linear-Quadratic Games with Nonlinear Interaction Between Agents

Authors: Jiamin Cai, Chenyue Zhang, Hoi-To Wai

Abstract: This paper studies a class of network games with linear-quadratic payoffs and externalities exerted through a strictly concave interaction function. This class of game is motivated by the diminishing marginal effects with peer influences. We analyze the optimal pricing strategy for this class of network game. First, we prove the existence of a unique Nash Equilibrium (NE). Second, we study the optimal pricing strategy of a monopolist selling a divisible good to agents. We show that the optimal pricing strategy, found by solving a bilevel optimization problem, is strictly better when the monopolist knows the network structure as opposed to the best strategy agnostic to network structure. Numerical experiments demonstrate that in most cases, the maximum revenue is achieved with an asymmetric network. These results contrast with the previously studied case of linear interaction function, where a network-independent price is proven optimal with symmetric networks. Lastly, we describe an efficient algorithm to find the optimal pricing strategy.

Submitted 3 June, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: 7 pages, 2 figures, accepted by IEEE Control Systems Letters

arXiv:2404.18121 [pdf, ps, other]

Research on the Evaluation Index System of Enterprise Production Efficiency

Authors: W. Li, J. Cai, C. Wang, Y. Chen, J. Xu, J. Zhao, Y. Chen

Abstract: This paper focuses on studying the evaluation index system for the production efficiency of tobacco enterprises. Considering the limitations of existing evaluation methods in accurately assessing the production quality of cigarette enterprises, a mathematical model based on the Analytic Hierarchy Process (AHP) is established. This model constructs an evaluation framework for the production efficiency of cigarette enterprises and subsequently analyzes the significance of each index within this framework. To comprehensively analyze the multi-index and feasibility aspects of the selected projects, the AHP method is employed to establish a comprehensive feasibility research and evaluation structure model. The result of this feasibility study provides the conclusion that the construction of an evaluation index system for the production efficiency of cigarette enterprises can indeed promote the enhancement of their production efficiency.

Submitted 28 April, 2024; originally announced April 2024.
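
Illustration (not taken from the paper): the abstract above relies on the Analytic Hierarchy Process to weigh the indices of an evaluation framework. A minimal sketch of the standard AHP weighting step follows, assuming the common geometric-mean approximation of the priority vector and Saaty's usual random-index table for the consistency ratio. The function names and the three-index comparison matrix in the usage note are hypothetical; the paper's actual indices and judgments are not given in the abstract.

```python
import math

# Saaty's commonly cited random consistency index (RI) for matrix orders 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Approximate AHP priority weights via the row geometric-mean method.

    `matrix[i][j]` states how much more important index i is than index j,
    with matrix[j][i] == 1 / matrix[i][j] and ones on the diagonal.
    """
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Consistency ratio CR = CI / RI; values below 0.1 are usually accepted."""
    n = len(matrix)
    if n <= 2:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]
```

With the hypothetical, perfectly consistent matrix `[[1, 2, 6], [1/2, 1, 3], [1/6, 1/3, 1]]`, `ahp_weights` returns weights in the ratio 6 : 3 : 1 and `consistency_ratio` is zero up to rounding. Real expert judgments are rarely this consistent, which is why the ratio is checked before the weights are used.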

arXiv:2404.18033 [pdf, other]

Exposing Text-Image Inconsistency Using Diffusion Models

Authors: Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu

Abstract: In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets, act as "omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.15007 [pdf, ps, other]

Single-Spin Waved-Brim Flat-Top Hat in the Band Edge of GdIH Monolayer

Authors: Ningning Jia, Zhao Yang, Jiangtao Cai, Zhiheng Lv, Yongting Shi, Tielei Song, Xin Cui, Zhifeng Liu

Abstract: Exotic electronic bands, such as flat bands, linear crossing bands, spontaneously valley- or spin-polarized bands, in two-dimensional materials have been hot topics in condensed matter physics. Herein, we first propose a general dispersion model for possible hat-like electronic bands, and then identify an intriguing single-spin waved-brim flat-top hat in the valence band edge of a stable ferromagnetic semiconducting electrene (i.e., Janus GdIH monolayer), which can be well described by a simplified two-band Hamiltonian model. Specifically, the hat-band has a waved brim with six valleys along the boundary of the first Brillouin zone; meanwhile it holds a flat top close to the Fermi level, resulting in the emergence of single-spin van Hove singularities divergence and Lifshitz transitions. Owing to the breaking of both time-reversal and space inversion symmetries, a sizable spontaneous valley polarization is formed between the adjacent brim valleys, which provides the opportunity to realize the high-temperature anomalous valley Hall effect. Particularly, via modest strains and carrier doping, various conductive bipolar-states (spin-up vs. spin-down, K valley vs. $-$K valley, and ultra-low-speed vs. ultra-high-speed) can be modulated out from the distorted waved-brim flat-top hat of GdIH ML.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.14305 [pdf, other]

"I Upload...All Types of Different Things to Say, the World of Blindness Is More Than What They Think It Is": A Study of Blind TikTokers' Identity Work from a Flourishing Perspective

Authors: Yao Lyu, Jie Cai, Bryan Dosono, Davis Yadav, John M. Carroll

Abstract: Identity work in Human-Computer Interaction (HCI) has focused on marginalized groups to explore designs that support their assets (what they have). However, little has been explored on the identity work of people with disabilities, specifically visual impairments. In this study, we interviewed 45 BlindTokers (blind users on TikTok) from various backgrounds to understand their identity work from a positive design perspective. We found that BlindTokers leverage the affordance of the platform to create positive content, share their identities, and build the community with the desire to flourish. We proposed flourishing labor to present the work conducted by BlindTokers for their community's flourishing with implications to support the flourishing labor. This work contributes to understanding blind users' experience in short video platforms and highlights that flourishing is not just an activity for any single Blind user but also a job that needs serious and committed contribution from all stakeholders, including all user groups and the TikTok platform.

Submitted 22 April, 2024; originally announced April 2024.

Comments: ACM CSCW

arXiv:2404.12759 [pdf, other]

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

Authors: Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu

Abstract: Quantization emerges as one of the most promising compression technologies for deploying efficient large models for various real time application in recent years. Considering that the storage and IO of weights take up the vast majority of the overhead inside a large model, weight only quantization can lead to large gains. However, existing quantization schemes suffer from significant accuracy degradation at very low bits, or require some additional computational overhead when deployed, making it difficult to be applied to large-scale applications in industry. In this paper, we propose decoupleQ, achieving a substantial increase in model accuracy, especially at very low bits. decoupleQ abandons the traditional heuristic quantization paradigm and decouples the model parameters into integer and floating-point parts, thus transforming the quantization problem into a traditional mathematical optimization problem with constraints, which is then solved alternatively by off-the-shelf optimization methods. Quantization via decoupleQ is linear and uniform, making it hardware-friendlier than non-uniform counterpart, and enabling the idea to be migrated to high-bit quantization to enhance its robustness. Our method has achieved well on-line accuracy near fp16/bf16 on the 2-bit quantization of large speech models in ByteDance. The code is available at https://github.com/bytedance/decoupleQ

Submitted 19 April, 2024; originally announced April 2024.

Comments: quantization for deep models

arXiv:2404.10973 [pdf, other]

Quantum delocalization on correlation landscape: The key to exponentially fast multipartite entanglement generation

Authors: Yaoming Chu, Xiangbei Li, Jianming Cai

Abstract: Entanglement, a hallmark of quantum mechanics, is a vital resource for quantum technologies. Generating highly entangled multipartite states is a key goal in current quantum experiments. We unveil a novel framework for understanding entanglement generation dynamics in Hamiltonian systems by quantum delocalization of an effective operator wavefunction on a correlation landscape. Our framework establishes a profound connection between the exponentially fast generation of multipartite entanglement, witnessed by the quantum Fisher information, and the linearly increasing asymptotics of hopping amplitudes governing the delocalization dynamics in Krylov space. We illustrate this connection using the paradigmatic Lipkin-Meshkov-Glick model and highlight potential signatures in chaotic Feingold-Peres tops. Our results provide a transformative tool for understanding and harnessing rapid entanglement production in complex quantum systems, providing a pathway for quantum enhanced technologies by large-scale entanglement.

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.09000 [pdf, other]

MaSkel: A Model for Human Whole-body X-rays Generation from Human Masking Images

Authors: Yingjie Xi, Boyuan Cheng, Jingyao Cai, Jian Jun Zhang, Xiaosong Yang

Abstract: The human whole-body X-rays could offer a valuable reference for various applications, including medical diagnostics, digital animation modeling, and ergonomic design. The traditional method of obtaining X-ray information requires the use of CT (Computed Tomography) scan machines, which emit potentially harmful radiation. Thus it faces a significant limitation for realistic applications because it lacks adaptability and safety. In our work, we proposed a new method to directly generate the 2D human whole-body X-rays from the human masking images. The predicted images will be similar to the real ones with the same image style and anatomic structure. We employed a data-driven strategy. By leveraging advanced generative techniques, our model MaSkel (Masking image to Skeleton X-rays) could generate a high-quality X-ray image from a human masking image without the need for invasive and harmful radiation exposure, which not only provides a new path to generate highly anatomic and customized data but also reduces health risks. To our knowledge, our model MaSkel is the first work for predicting whole-body X-rays. In this paper, we did two parts of the work. The first one is to solve the data limitation problem, the diffusion-based techniques are utilized to make a data augmentation, which provides two synthetic datasets for preliminary pretraining. Then we designed a two-stage training strategy to train MaSkel. At last, we make qualitative and quantitative evaluations of the generated X-rays. In addition, we invite some professional doctors to assess our predicted data. These evaluations demonstrate the MaSkel's superior ability to generate anatomic X-rays from human masking images. The related code and links of the dataset are available at https://github.com/2022yingjie/MaSkel.

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2404.08521 [pdf, other]

The magnetism measurements of the two-dimensional van der Waals antiferromagnet CrPS4 using dynamic cantilever magnetometry

Authors: Qi Li, Weili Zhen, Ning Wang, Yang Yu, Senyang Pan, Lin Deng, Jiaqiang Cai, Kang Wang, Lvkuan Zou, Zhongming Zeng, Jinglei Zhang, Haifeng Du

Abstract: The exploration of van der Waals (vdWs) magnetic materials has sparked great interest in spintronics. However, conventional methods often face challenges in characterizing the magnetic properties of small-sized vdWs materials, especially for antiferromagnets with extremely small magnetic moments. Here, we demonstrate the efficacy of dynamic cantilever magnetometry (DCM) in characterizing the magnetic properties of vdWs magnets, using an antiferromagnetic semiconductor CrPS4. We observe continuous spin axis rotation under a magnetic field, accurately modelled by considering the existence of marked magnetic anisotropies. Furthermore, the dominance of out-of-plane magnetic anisotropy in spin reorientation behavior at low temperatures transitions to the prevalence of in-plane anisotropy with increasing temperature, leading to a sign reversal of the frequency shift in measurements. The peculiar magnetic phase transitions make CrPS4 an intriguing platform for studying two-dimensional magnetism. Our findings underscore the effectiveness of DCM in characterizing magnetic anisotropies and phase transitions in vdWs magnets.

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2404.07949 [pdf, other]

Taming Stable Diffusion for Text to 360° Panorama Image Generation

Authors: Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xiaoshui Huang, Dinh Phung, Wanli Ouyang, Jianfei Cai

Abstract: Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.

Submitted 11 April, 2024; originally announced April 2024.

Comments: CVPR 2024. Project Page: https://chengzhag.github.io/publication/panfusion Code: https://github.com/chengzhag/PanFusion

arXiv:2404.07362 [pdf, other]

doi 10.1145/3613905.3650756

"We Need Structured Output": Towards User-centered Constraints on Large Language Model Output

Authors: Michael Xieyang Liu, Frederick Liu, Alexander J. Fiannaca, Terry Koo, Lucas Dixon, Michael Terry, Carrie J. Cai

Abstract: Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adheres to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool.

Submitted 10 April, 2024; originally announced April 2024.

Journal ref: "We Need Structured Output": Towards User-centered Constraints on LLM Output. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), May 11-16, 2024, Honolulu, HI, USA

arXiv:2404.06395 [pdf, other]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Authors: Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun

Abstract: The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we are now able to efficiently study data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive the much higher compute optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cementing MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM .

Submitted 3 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: revise according to peer review

arXiv:2404.05016 [pdf, other]

Hyperbolic Learning with Synthetic Captions for Open-World Detection

Authors: Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

Abstract: Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manual annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector ``HyperLearner''. We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.

Submitted 7 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.04836 [pdf, ps, other]

Global strong solution to the inviscid liquid-gas two-phase flow model in $L^p$ framework

Authors: Zhigang Wu, Mengqian Liu, Juanzi Cai

Abstract: This paper is dedicated to the study of the inviscid liquid-gas two-phase flow model in $\mathbb{R}^d\ (d\geq1)$. We establish the global existence of strong solutions to this system with small initial data in hybrid Besov spaces based on general $L^p$-norms. Additionally, we obtain the decay estimates of solutions rely on the constructed Lyapunov functional.

Submitted 7 April, 2024; originally announced April 2024.

MSC Class: 35A09; 35B40; 35Q35

arXiv:2404.04801 [pdf, ps, other]

doi 10.1007/s41605-024-00467-8

LHAASO-KM2A detector simulation using Geant4

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (254 additional authors not shown)

Abstract: KM2A is one of the main sub-arrays of LHAASO, working on gamma ray astronomy and cosmic ray physics at energies above 10 TeV. Detector simulation is the important foundation for estimating detector performance and data analysis. It is a big challenge to simulate the KM2A detector in the framework of Geant4 due to the need to track numerous photons from a large number of detector units (>6000) with large altitude difference (30 m) and huge coverage (1.3 km^2). In this paper, the design of the KM2A simulation code G4KM2A based on Geant4 is introduced. The process of G4KM2A is optimized mainly in memory consumption to avoid memory overflow. Some simplifications are used to significantly speed up the execution of G4KM2A. The running time is reduced by at least 30 times compared to full detector simulation. The particle distributions and the core/angle resolution comparison between simulation and experimental data of the full KM2A array are also presented, which show good agreement.

Submitted 7 April, 2024; originally announced April 2024.

arXiv:2404.04629 [pdf, other]

DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

Authors: Duy-Tho Le, Hengcan Shi, Jianfei Cai, Hamid Rezatofighi

Abstract: Diffusion models have recently gained prominence as powerful deep generative models, demonstrating unmatched performance across various domains. However, their potential in multi-sensor fusion remains largely unexplored. In this work, we introduce DifFUSER, a novel approach that leverages diffusion models for multi-modal fusion in 3D object detection and BEV map segmentation. Benefiting from the inherent denoising property of diffusion, DifFUSER is able to refine or even synthesize sensor features in case of sensor malfunction, thereby improving the quality of the fused output. In terms of architecture, our DifFUSER blocks are chained together in a hierarchical BiFPN fashion, termed cMini-BiFPN, offering an alternative architecture for latent diffusion. We further introduce a Gated Self-conditioned Modulated (GSM) latent diffusion module together with a Progressive Sensor Dropout Training (PSDT) paradigm, designed to add stronger conditioning to the diffusion process and robustness to sensor failures. Our extensive evaluations on the Nuscenes dataset reveal that DifFUSER not only achieves state-of-the-art performance with a 69.1% mIOU in BEV map segmentation tasks but also competes effectively with leading transformer-based fusion techniques in 3D object detection.

Submitted 6 April, 2024; originally announced April 2024.

Comments: 23 pages

arXiv:2404.01686 [pdf, other]

JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments

Authors: Duy-Tho Le, Chenhui Gou, Stavya Datta, Hengcan Shi, Ian Reid, Jianfei Cai, Hamid Rezatofighi

Abstract: Autonomous robot systems have attracted increasing research attention in recent years, where environment understanding is a crucial step for robot navigation, human-robot interaction, and decision. Real-world robot systems usually collect visual data from multiple sensors and are required to recognize numerous objects and their movements in complex human-crowded settings. Traditional benchmarks, with their reliance on single sensors and limited object classes and scenarios, fail to provide the comprehensive environmental understanding robots need for accurate navigation, interaction, and decision-making. As an extension of JRDB dataset, we unveil JRDB-PanoTrack, a novel open-world panoptic segmentation and tracking benchmark, towards more comprehensive environmental perception. JRDB-PanoTrack includes (1) various data involving indoor and outdoor crowded scenes, as well as comprehensive 2D and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic segmentation and temporal tracking annotations, with additional 3D label projections for further spatial understanding; (3) diverse object classes for closed- and open-world recognition benchmarks, with OSPA-based metrics for evaluation. Extensive evaluation of leading methods shows significant challenges posed by our dataset.

Submitted 2 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.01078 [pdf, other]

Energy-based Model for Accurate Shapley Value Estimation in Interpretable Deep Learning Predictive Modeling

Authors: Cheng Lu, Jiusun Zeng, Yu Xia, Jinhui Cai, Shihua Luo

Abstract: As a favorable tool for explainable artificial intelligence (XAI), Shapley value has been widely used to interpret deep learning based predictive models. However, accurate and efficient estimation of Shapley value is difficult since the computation load grows exponentially with the increase of input features. Most existing accelerated estimation methods have to compromise on estimation accuracy with efficiency. In this article, we present EmSHAP(Energy-based model for Shapley value estimation) to estimate the expectation of Shapley contribution function under arbitrary subset of features given the rest. The energy-based model estimates the conditional density in the Shapley contribution function, which involves an energy network for approximating the unnormalized conditional density and a GRU (Gated Recurrent Unit) network for approximating the partition function. The GRU network maps the input features onto a hidden space to eliminate the impact of input orderings. In order to theoretically evaluate the performance of different Shapley value estimation methods, Theorems 1, 2 and 3 analyzed the error bounds of EmSHAP as well as two state-of-the-art methods, namely KernelSHAP and VAEAC. It is proved that EmSHAP has tighter error bound than KernelSHAP and VAEAC. Finally, case studies on two application examples show the enhanced estimation accuracy of EmSHAP.

Submitted 5 May, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.00905 [pdf, other]

doi 10.1063/5.0211557

Continuously tunable uniaxial strain control of van der Waals heterostructure devices

Authors: Zhaoyu Liu, Xuetao Ma, John Cenker, Jiaqi Cai, Zaiyao Fei, Paul Malinowski, Joshua Mutch, Yuzhou Zhao, Kyle Hwangbo, Zhong Lin, Arnab Manna, Jihui Yang, David Cobden, Xiaodong Xu, Matthew Yankowitz, Jiun-Haw Chu

Abstract: Uniaxial strain has been widely used as a powerful tool for investigating and controlling the properties of quantum materials. However, existing strain techniques have so far mostly been limited to use with bulk crystals. Although recent progress has been made in extending the application of strain to two-dimensional van der Waals (vdW) heterostructures, these techniques have been limited to optical characterization and extremely simple electrical device geometries. Here, we report a piezoelectric-based \textit{in situ} uniaxial strain technique enabling simultaneous electrical transport and optical spectroscopy characterization of dual-gated vdW heterostructure devices. Critically, our technique remains compatible with vdW heterostructure devices of arbitrary complexity fabricated on conventional silicon/silicon dioxide wafer substrates. We demonstrate a large and continuously tunable strain of up to $-0.15\%$ at millikelvin temperatures, with larger strain values also likely achievable. We quantify the strain transmission from the silicon wafer to the vdW heterostructure, and further demonstrate the ability of strain to modify the electronic properties of twisted bilayer graphene. Our technique provides a highly versatile new method for exploring the effect of uniaxial strain on both the electrical and optical properties of vdW heterostructures, and can be easily extended to include additional characterization techniques.

Submitted 23 May, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 9 pages, 6 figures, to appear in Journal of Applied Physics

Journal ref: J. Appl. Phys. 135, 204306 (2024)

arXiv:2404.00269 [pdf, other]

IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images

Authors: Yushuang Wu, Luyue Shi, Junhao Cai, Weihao Yuan, Lingteng Qiu, Zilong Dong, Liefeng Bo, Shuguang Cui, Xiaoguang Han

Abstract: Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhances the implicit representation's ability to delineate finer details. Besides, an additional self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD.

Submitted 30 March, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.19902 [pdf, other]

Heterogeneous Network Based Contrastive Learning Method for PolSAR Land Cover Classification

Authors: Jianfeng Cai, Yue Ma, Zhixi Feng, Shuyuan Yang

Abstract: Polarimetric synthetic aperture radar (PolSAR) image interpretation is widely used in various fields. Recently, deep learning has made significant progress in PolSAR image classification. Supervised learning (SL) requires a large amount of labeled PolSAR data with high quality to achieve better performance, however, manually labeled data is insufficient. This causes the SL to fall into overfitting and degrades its generalization performance. Furthermore, the scattering confusion problem is also a significant challenge that attracts more attention. To solve these problems, this article proposes a Heterogeneous Network based Contrastive Learning method (HCLNet). It aims to learn high-level representation from unlabeled PolSAR data for few-shot classification according to multi-features and superpixels. Beyond the conventional CL, HCLNet introduces the heterogeneous architecture for the first time to utilize heterogeneous PolSAR features better. And it develops two easy-to-use plugins to narrow the domain gap between optics and PolSAR, including feature filter and superpixel-based instance discrimination, of which the former is used to enhance the complementarity of multi-features, and the latter is used to increase the diversity of negative samples. Experiments demonstrate the superiority of HCLNet on three widely used PolSAR benchmark datasets compared with state-of-the-art methods. Ablation studies also verify the importance of each component. Besides, this work has implications for how to efficiently utilize the multi-features of PolSAR data to learn better high-level representation in CL and how to construct networks suitable for PolSAR data better.

Submitted 3 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Showing 1–50 of 802 results for author: Cai, J