-
Mitigating Interpretation Bias in Rock Records with Large Language Models: Insights from Paleoenvironmental Analysis
Authors:
Luoqi Wang,
Haipeng Li,
Linshu Hu,
Jiarui Cai,
Zhenhong Du
Abstract:
The reconstruction of Earth's history faces significant challenges due to the nonunique interpretations often derived from rock records. The problem has long been recognized, but there are no systematic solutions in practice. This study introduces an innovative approach that leverages Large Language Models (LLMs) along with retrieval augmented generation and real-time search capabilities to counteract interpretation biases, thereby enhancing the accuracy and reliability of geological analyses. By applying this framework to sedimentology and paleogeography, we demonstrate its effectiveness in mitigating interpretation biases through the generation and evaluation of multiple hypotheses for the same data, which can effectively reduce human bias. Our research illuminates the transformative potential of LLMs in refining paleoenvironmental studies and extends their applicability across various sub-disciplines of Earth sciences, enabling a deeper and more accurate depiction of Earth's evolution.
Submitted 17 May, 2024;
originally announced July 2024.
-
An Economic Framework for 6-DoF Grasp Detection
Authors:
Xiao-Ming Wu,
Jia-Feng Cai,
Jian-Jian Jiang,
Dian Zheng,
Yi-Lin Wei,
Wei-Shi Zheng
Abstract:
Robotic grasping in clutter is a fundamental task in robotic manipulation. In this work, we propose an economic framework for 6-DoF grasp detection, aiming to reduce the resource cost of training while maintaining effective grasp performance. To begin with, we find that dense supervision is the bottleneck of current SOTA methods: it severely increases the training overhead and makes training difficult to converge. To solve this problem, we first propose an economic supervision paradigm for efficient and effective grasping. This paradigm includes a well-designed supervision selection strategy, which selects key labels that are essentially free of ambiguity, and an economic pipeline that enables training after selection. Furthermore, benefiting from the economic supervision, we can focus on a specific grasp, and thus we devise a focal representation module, which comprises an interactive grasp head and a composite score estimation to generate the specific grasp more accurately. Combining these components, we propose the EconomicGrasp framework. Our extensive experiments show that EconomicGrasp surpasses the SOTA grasp method by about 3 AP on average at extremely low resource cost: about 1/4 of the training time, 1/8 of the memory, and 1/30 of the storage. Our code is available at https://github.com/iSEE-Laboratory/EconomicGrasp.
Submitted 11 July, 2024;
originally announced July 2024.
-
Experimental Demonstration of 16D Voronoi Constellation with Two-Level Coding over 50km Four-Core Fiber
Authors:
Can Zhao,
Bin Chen,
Jiaqi Cai,
Zhiwei Liang,
Yi Lei,
Junjie Xiong,
Lin Ma,
Daohui Hu,
Lin Sun,
Gangxiang Shen
Abstract:
A 16-dimensional Voronoi constellation concatenated with multilevel coding is experimentally demonstrated over a 50km four-core fiber transmission system. The proposed scheme reduces the required launch power by 6dB and provides a 17dB larger operating range than 16QAM with BICM at the outer HD-FEC BER threshold.
Submitted 9 July, 2024;
originally announced July 2024.
-
AI-based Automatic Segmentation of Prostate on Multi-modality Images: A Review
Authors:
Rui Jin,
Derun Li,
Dehui Xiang,
Lei Zhang,
Hailing Zhou,
Fei Shi,
Weifang Zhu,
Jing Cai,
Tao Peng,
Xinjian Chen
Abstract:
Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The advent of precision medicine and a significant increase in clinical capacity have spurred the need for various data-driven tasks in the field of medical imaging. Recently, numerous machine learning and data mining tools have been integrated into various medical areas, including image segmentation. This article proposes a new classification method that differentiates supervision types, either in number or kind, during the training phase. Subsequently, we conducted a survey on artificial intelligence (AI)-based automatic prostate segmentation methods, examining the advantages and limitations of each. Additionally, we introduce variants of evaluation metrics for the verification and performance assessment of the segmentation method and summarize the current challenges. Finally, future research directions and development trends are discussed, reflecting the outcomes of our literature survey, suggesting high-precision detection and treatment of prostate cancer as a promising avenue.
Submitted 9 July, 2024;
originally announced July 2024.
-
SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation
Authors:
Guoan Wang,
Jin Ye,
Junlong Cheng,
Tianbin Li,
Zhaolin Chen,
Jianfei Cai,
Junjun He,
Bohan Zhuang
Abstract:
Volumetric medical image segmentation is pivotal in enhancing disease diagnosis, treatment planning, and advancing medical research. While existing volumetric foundation models for medical image segmentation, such as SAM-Med3D and SegVol, have shown remarkable performance on general organs and tumors, their ability to segment certain categories in clinical downstream tasks remains limited. Supervised Finetuning (SFT) serves as an effective way to adapt such foundation models for task-specific downstream tasks, but at the cost of degrading the general knowledge previously stored in the original foundation model. To address this, we propose SAM-Med3D-MoE, a novel framework that seamlessly integrates task-specific finetuned models with the foundational model, creating a unified model at minimal additional training expense for an extra gating network. This gating network, in conjunction with a selection strategy, allows the unified model to achieve performance comparable to that of the original models on their respective tasks, both general and specialized, without updating any of their parameters. Our comprehensive experiments demonstrate the efficacy of SAM-Med3D-MoE, with an average Dice performance increase from 53 to 56.4 on 15 specific classes. It achieves especially large gains of 29.6, 8.5, and 11.2 on the spinal cord, esophagus, and right hip, respectively. Additionally, it achieves 48.9 Dice on the challenging SPPIN2023 Challenge, significantly surpassing the general expert's performance of 32.3. We anticipate that SAM-Med3D-MoE can serve as a new framework for adapting the foundation model to specific areas in medical image analysis. Codes and datasets will be publicly available.
Submitted 5 July, 2024;
originally announced July 2024.
-
Automatic Adaptation Rule Optimization via Large Language Models
Authors:
Yusei Ishimizu,
Jialong Li,
Jinglue Xu,
Jinyu Cai,
Hitoshi Iba,
Kenji Tei
Abstract:
Rule-based adaptation is a foundational approach to self-adaptation, characterized by its human readability and rapid response. However, building high-performance and robust adaptation rules is often a challenge because it essentially involves searching for the optimal design in a complex (variables) space. In response, this paper attempts to employ large language models (LLMs) as an optimizer to construct and optimize adaptation rules, leveraging the common sense and reasoning capabilities inherent in LLMs. Preliminary experiments conducted in SWIM have validated the effectiveness and limitations of our method.
Submitted 2 July, 2024;
originally announced July 2024.
-
Unrolling Plug-and-Play Gradient Graph Laplacian Regularizer for Image Restoration
Authors:
Jianghe Cai,
Gene Cheung,
Fei Chen
Abstract:
Generic deep learning (DL) networks for image restoration tasks such as denoising and interpolation lack mathematical interpretability, require voluminous training data to tune a large parameter set, and are fragile under covariate shift. To address these shortcomings, for a general linear image formation model, we first formulate a convex optimization problem with a new graph smoothness prior called gradient graph Laplacian regularizer (GGLR) that promotes piecewise planar (PWP) signal reconstruction. To solve the posed problem, we introduce a variable number of auxiliary variables to create a family of Plug-and-Play (PnP) ADMM algorithms and unroll them into variable-complexity feed-forward networks, amenable to parameter tuning via back-propagation. More complex unrolled networks require more labeled data to train more parameters, but have better potential performance. Experimental results show that our unrolled networks perform competitively with generic DL networks in image restoration quality while using a small fraction of the parameters, and demonstrate improved robustness to covariate shift.
Submitted 1 July, 2024;
originally announced July 2024.
-
Fine-grained, Multi-dimensional Summarization Evaluation with LLMs
Authors:
Hwanjun Song,
Hang Su,
Igor Shalyminov,
Jason Cai,
Saab Mansour
Abstract:
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.
Submitted 9 July, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
A Sanity Check for AI-generated Image Detection
Authors:
Shilin Yan,
Ouxiang Li,
Jiayin Cai,
Yanbin Hao,
Xiaolong Jiang,
Yao Hu,
Weidi Xie
Abstract:
With the rapid development of generative models, discerning AI-generated content has attracted increasing attention from both industry and academia. In this paper, we conduct a sanity check on "whether the task of AI-generated image detection has been solved". To start with, we present the Chameleon dataset, consisting of AI-generated images that are genuinely challenging for human perception. To quantify the generalization of existing methods, we evaluate 9 off-the-shelf AI-generated image detectors on the Chameleon dataset. Upon analysis, almost all models classify AI-generated images as real ones. We then propose AIDE (AI-generated Image DEtector with Hybrid Features), which leverages multiple experts to simultaneously extract visual artifacts and noise patterns. Specifically, to capture high-level semantics, we utilize CLIP to compute the visual embedding. This effectively enables the model to discern AI-generated images based on semantics or contextual information. Secondly, we select the highest-frequency and lowest-frequency patches in the image and compute low-level patchwise features, aiming to detect AI-generated images by low-level artifacts, for example, noise patterns and anti-aliasing. When evaluated on existing benchmarks, for example, AIGCDetectBenchmark and GenImage, AIDE achieves +3.5% and +4.6% improvements over state-of-the-art methods, and on our proposed challenging Chameleon benchmark it also achieves promising results, although the problem of detecting AI-generated images is far from being solved. The dataset, code, and pre-trained models will be published at https://github.com/shilinyan99/AIDE.
Submitted 27 June, 2024;
originally announced June 2024.
-
Gaussian-Informed Continuum for Physical Property Identification and Simulation
Authors:
Junhao Cai,
Yuji Yang,
Weihao Yuan,
Yisheng He,
Zilong Dong,
Liefeng Bo,
Hui Cheng,
Qifeng Chen
Abstract:
This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.
Submitted 21 June, 2024;
originally announced June 2024.
-
DrVideo: Document Retrieval Based Long Video Understanding
Authors:
Ziyu Ma,
Chenhui Gou,
Hengcan Shi,
Bin Sun,
Shutao Li,
Hamid Rezatofighi,
Jianfei Cai
Abstract:
Existing methods for long video understanding primarily focus on videos lasting only tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and in performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo transforms a long video into a text-based long document to initially retrieve key frames and augment the information of these frames, which is used as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions in a chain-of-thought manner once sufficient question-related information is gathered. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods with +3.8 accuracy on EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).
Submitted 18 June, 2024;
originally announced June 2024.
-
Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks
Authors:
Yingchao Yu,
Yuping Yan,
Jisong Cai,
Yaochu Jin
Abstract:
Federated learning (FL) has emerged as a promising paradigm for training models on decentralized data while safeguarding data privacy. Most existing FL systems, however, assume that all machine learning models are of the same type, although it is becoming more likely that different edge devices adopt different types of AI models, including both conventional analogue artificial neural networks (ANNs) and biologically more plausible spiking neural networks (SNNs). This diversity empowers the efficient handling of specific tasks and requirements, showcasing the adaptability and versatility of edge computing platforms. One main challenge of such a heterogeneous FL system lies in effectively aggregating models from the local devices in a privacy-preserving manner. To address this issue, this work benchmarks FL systems containing both convolutional neural networks (CNNs) and SNNs by comparing various aggregation approaches, including federated CNNs, federated SNNs, federated CNNs for SNNs, federated SNNs for CNNs, and federated CNNs with SNN fusion. Experimental results demonstrate that the CNN-SNN fusion framework exhibits the best performance among the above settings on the MNIST dataset. Additionally, intriguing phenomena of competitive suppression are noted during the convergence process of multi-model FL.
Submitted 13 June, 2024;
originally announced June 2024.
-
Ferromagnetism and Topology of the Higher Flat Band in a Fractional Chern Insulator
Authors:
Heonjoon Park,
Jiaqi Cai,
Eric Anderson,
Xiao-Wei Zhang,
Xiaoyu Liu,
William Holtzmann,
Weijie Li,
Chong Wang,
Chaowei Hu,
Yuzhou Zhao,
Takashi Taniguchi,
Kenji Watanabe,
Jihui Yang,
David Cobden,
Jiun-Haw Chu,
Nicolas Regnault,
B. Andrei Bernevig,
Liang Fu,
Ting Cao,
Di Xiao,
Xiaodong Xu
Abstract:
The recent observation of the fractional quantum anomalous Hall effect in moiré fractional Chern insulators (FCI) provides opportunities for investigating zero magnetic field anyons. So far, both experimental and theoretical results suggest that filling > 1/3 FCI states in the first Chern band share features with those of the lowest Landau level (LL). To create the possibility of realizing non-Abelian anyons, one route is to engineer higher flat Chern bands that mimic higher LLs. Here, we investigate the interaction, topology, and ferromagnetism of the second moiré miniband in twisted MoTe2 bilayer (tMoTe2). Around filling factor ν = -3, i.e., half-filling of the second miniband, we uncover spontaneous ferromagnetism and an incipient Chern insulator state. By measuring the anomalous Hall effect as a function of twist angle, we find that the Chern numbers (C) of the top two moiré flat bands have opposite sign (C = ∓1) at twist angles above 3.1° but the same sign (C = -1) around 2.6°. This observation is consistent with the recently predicted twist-angle dependent band topology, resulting from the competition between moiré ferroelectricity and piezoelectricity. As we increase the magnetic field, only the small twist-angle device (2.6°) experiences a topological phase transition with an emergent C = -2 state. This is attributed to a Zeeman field-induced band crossing between opposite valleys, with the determined C = -1 for the top two bands. Our results lay a firm foundation for understanding the higher flat Chern bands, which is essential for the prediction or discovery of non-Abelian FCIs.
Submitted 13 June, 2024;
originally announced June 2024.
-
ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
Authors:
Jing Liu,
Ruihao Gong,
Mingyang Zhang,
Yefei He,
Jianfei Cai,
Bohan Zhuang
Abstract:
The typical process for developing LLMs involves pre-training a general foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts poses challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests incurs substantial I/O costs, increasing latency and expenses. Previous approaches decompose expert weights into pre-trained model weights and residual delta weights, then quantize the delta weights to reduce model size. However, these methods often lead to significant quantization errors at extremely low bitwidths and assume the appropriate model for a user request is known in advance, which is not practical. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing non-salient input channels of delta weights to extremely low bits while keeping salient ones intact, significantly reducing storage demands while maintaining performance. Additionally, we develop a routing method that efficiently directs user queries to the most suitable expert by transforming the model selection problem into a domain classification problem. Extensive experiments show ME-Switch's promising memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces model size by 1.74x while maintaining nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Furthermore, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
Submitted 13 June, 2024;
originally announced June 2024.
-
Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations
Authors:
Zhen Cao,
F. Aharonian,
Q. An,
Axikegu,
Y. X. Bai,
Y. W. Bao,
D. Bastieri,
X. J. Bi,
Y. J. Bi,
J. T. Cai,
Q. Cao,
W. Y. Cao,
Zhe Cao,
J. Chang,
J. F. Chang,
A. M. Chen,
E. S. Chen,
Liang Chen,
Lin Chen,
Long Chen,
M. J. Chen,
M. L. Chen,
Q. H. Chen,
S. H. Chen,
S. Z. Chen
, et al. (255 additional authors not shown)
Abstract:
In this work we search for signals generated by ultra-heavy dark matter in data from the Large High Altitude Air Shower Observatory (LHAASO). We look for possible gamma rays from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter, as they have a low flux of astrophysical $γ$-ray background yet a large amount of dark matter. By analyzing more than 700 days of observational data from LHAASO, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly, we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.
Submitted 12 June, 2024;
originally announced June 2024.
-
PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction
Authors:
Shangyu Chen,
Zizheng Pan,
Jianfei Cai,
Dinh Phung
Abstract:
Personalizing a large-scale pretrained Text-to-Image (T2I) diffusion model is challenging as it typically struggles to make an appropriate trade-off between its training data distribution and the target distribution, i.e., learning a novel concept with only a few target images to achieve personalization (aligning with the personalized target) while preserving text editability (aligning with diverse text prompts). In this paper, we propose PaRa, an effective and efficient Parameter Rank Reduction approach for T2I model personalization by explicitly controlling the rank of the diffusion model parameters to restrict its initial diverse generation space into a small and well-balanced target space. Our design is motivated by the fact that taming a T2I model toward a novel concept such as a specific art style implies a small generation space. To this end, by reducing the rank of model parameters during finetuning, we can effectively constrain the space of the denoising sampling trajectories towards the target. With comprehensive experiments, we show that PaRa achieves great advantages over existing finetuning approaches on single/multi-subject generation as well as single-image editing. Notably, compared to the prevailing fine-tuning technique LoRA, PaRa achieves better parameter efficiency (2x fewer learnable parameters) and much better target image alignment.
Submitted 9 June, 2024;
originally announced June 2024.
-
CERET: Cost-Effective Extrinsic Refinement for Text Generation
Authors:
Jason Cai,
Hang Su,
Monica Sunkara,
Igor Shalyminov,
Saab Mansour
Abstract:
Large Language Models (LLMs) are powerful models for generation tasks, but they may not generate good quality outputs in their first attempt. Apart from model fine-tuning, existing approaches to improve prediction accuracy and quality typically involve LLM self-improvement / self-reflection that incorporate feedback from models themselves. Despite their effectiveness, these methods are hindered by their high computational cost and lack of scalability. In this work, we propose CERET, a method for refining text generations by considering semantic stability, entailment and inter-sample uncertainty measures. Experimental results show that CERET outperforms Self-consistency and Self-rerank baselines consistently under various task setups, by ~1.6% in Rouge-1 for abstractive summarization and ~3.5% in hit rate for question answering. Compared to LLM Self-rerank method, our approach only requires 9.4% of its latency and is more cost-effective.
Submitted 8 June, 2024;
originally announced June 2024.
-
How Far Can We Compress Instant-NGP-Based NeRF?
Authors:
Yihang Chen,
Qianyi Wu,
Mehrtash Harandi,
Jianfei Cai
Abstract:
In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with the implicit NeRF representation, which, however, results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100$\times$ and 70$\times$ with improved fidelity against the baseline Instant-NGP on the Synthetic-NeRF and Tanks and Temples datasets, respectively. Additionally, we attain 86.7\% and 82.3\% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: https://github.com/YihangChen-ee/CNC.
Submitted 6 June, 2024;
originally announced June 2024.
-
MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models
Authors:
Mingzhen Huang,
Jialing Cai,
Shan Jia,
Vishnu Suresh Lokhande,
Siwei Lyu
Abstract:
Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying these methods sequentially for multi-aspect edits increases computational demands and efficiency losses. In this paper, we address these challenges with significant contributions. Our main contribution is the development of MultiEdits, a method that seamlessly manages simultaneous edits across multiple attributes. In contrast to previous approaches, MultiEdits not only preserves the quality of single attribute edits but also significantly improves the performance of multitasking edits. This is achieved through an innovative attention distribution mechanism and a multi-branch design that operates across several processing heads. Additionally, we introduce the PIE-Bench++ dataset, an expansion of the original PIE-Bench dataset, to better support evaluating image-editing tasks involving multiple objects and attributes simultaneously. This dataset is a benchmark for evaluating text-driven image editing methods in multifaceted scenarios. Dataset and code are available at https://mingzhenhuang.com/projects/MultiEdits.html.
Submitted 3 June, 2024;
originally announced June 2024.
-
Visualizing the microscopic origins of topology in twisted molybdenum ditelluride
Authors:
Ellis Thompson,
Keng Tou Chu,
Florie Mesple,
Xiao-Wei Zhang,
Chaowei Hu,
Yuzhou Zhao,
Heonjoon Park,
Jiaqi Cai,
Eric Anderson,
Kenji Watanabe,
Takashi Taniguchi,
Jihui Yang,
Jiun-Haw Chu,
Xiaodong Xu,
Ting Cao,
Di Xiao,
Matthew Yankowitz
Abstract:
In moiré materials with flat electronic bands and suitable quantum geometry, strong correlations can give rise to novel topological states of matter. The nontrivial band topology of twisted molybdenum ditelluride (tMoTe$_2$) -- responsible for its fractional quantum anomalous Hall (FQAH) states -- is predicted to arise from a layer-pseudospin skyrmion lattice. Tracing the layer polarization of wavefunctions within the moiré unit cell can thus offer crucial insights into the band topology. Here, we use scanning tunneling microscopy and spectroscopy (STM/S) to probe the layer-pseudospin skyrmion textures of tMoTe$_2$. We do this by simultaneously visualizing the moiré lattice structure and the spatial localization of its electronic states. We find that the wavefunctions associated with the topological flat bands exhibit a spatially-dependent layer polarization within the moiré unit cell. This is in excellent agreement with our theoretical modeling, thereby revealing a direct microscopic connection between the structural properties of tMoTe$_2$ and its band topology. Our work enables new pathways for engineering FQAH states with strain, as well as future STM studies of the intertwined correlated and topological states arising in gate-tunable devices.
Submitted 29 May, 2024;
originally announced May 2024.
-
Direct magnetic imaging of fractional Chern insulators in twisted MoTe$_2$ with a superconducting sensor
Authors:
Evgeny Redekop,
Canxun Zhang,
Heonjoon Park,
Jiaqi Cai,
Eric Anderson,
Owen Sheekey,
Trevor Arp,
Grigory Babikyan,
Samuel Salters,
Kenji Watanabe,
Takashi Taniguchi,
Xiaodong Xu,
Andrea F. Young
Abstract:
In the absence of time reversal symmetry, orbital magnetization provides a sensitive probe of topology and interactions, with particularly rich phenomenology in Chern insulators where topological edge states carry large equilibrium currents. Here, we use a nanoscale superconducting sensor to map the magnetic fringe fields in twisted bilayers of MoTe$_2$, where transport and optical sensing experiments have revealed the formation of fractional Chern insulator (FCI) states at zero magnetic field. At a temperature of 1.6K, we observe oscillations in the local magnetic field associated with fillings $ν=-1,-2/3,-3/5,-4/7$ and $-5/9$ of the first moiré hole band, consistent with the formation of FCIs at these fillings. By quantitatively reconstructing the magnetization, we determine the local thermodynamic gaps of the most robust FCI state at $ν=-2/3$, finding $^{-2/3}Δ$ as large as 7 meV. Spatial mapping of the charge density- and displacement field-tuned magnetic phase diagram further allows us to characterize sample disorder, which we find to be dominated by both inhomogeneity in the effective unit cell area as well as inhomogeneity in the band edge offset and bound dipole moment. Our results highlight both the challenges posed by structural disorder in the study of twisted homobilayer moiré systems and the opportunities afforded by the remarkably robust nature of the underlying correlated topological states.
Submitted 16 May, 2024;
originally announced May 2024.
-
Gaze-DETR: Using Expert Gaze to Reduce False Positives in Vulvovaginal Candidiasis Screening
Authors:
Yan Kong,
Sheng Wang,
Jiangdong Cai,
Zihao Zhao,
Zhenrong Shen,
Yonghao Li,
Manman Fei,
Qian Wang
Abstract:
Accurate detection of vulvovaginal candidiasis is critical for women's health, yet its sparse distribution and visually ambiguous characteristics pose significant challenges for accurate identification by pathologists and neural networks alike. Our eye-tracking data reveals that areas garnering sustained attention - yet not marked by experts after deliberation - are often aligned with false positives of neural networks. Leveraging this finding, we introduce Gaze-DETR, a pioneering method that integrates gaze data to enhance neural network precision by diminishing false positives. Gaze-DETR incorporates a universal gaze-guided warm-up protocol applicable across various detection methods and a gaze-guided rectification strategy specifically designed for DETR-based models. Our comprehensive tests confirm that Gaze-DETR surpasses existing leading methods, showcasing remarkable improvements in detection accuracy and generalizability.
Submitted 15 May, 2024;
originally announced May 2024.
-
Adapting Abstract Meaning Representation Parsing to the Clinical Narrative -- the SPRING THYME parser
Authors:
Jon Z. Cai,
Kristin Wright-Bettner,
Martha Palmer,
Guergana K. Savova,
James H. Martin
Abstract:
This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale. Leveraging the colon cancer dataset from the Temporal Histories of Your Medical Events (THYME) corpus, we adapted a state-of-the-art AMR parser utilizing continuous training. Our approach incorporates data augmentation techniques to enhance the accuracy of AMR structure predictions. Notably, through this learning strategy, our parser achieved an impressive F1 score of 88% on the THYME corpus's colon cancer dataset. Moreover, our research delved into the efficacy of data required for domain adaptation within the realm of clinical notes, presenting domain adaptation data requirements for AMR parsing. This exploration not only underscores the parser's robust performance but also highlights its potential in facilitating a deeper understanding of clinical narratives through structured semantic representations.
Submitted 15 May, 2024;
originally announced May 2024.
-
Discovery of Very-high-energy Gamma-ray Emissions from the Low Luminosity AGN NGC 4278 by LHAASO
Authors:
Zhen Cao,
F. Aharonian,
Q. An,
Axikegu,
Y. X. Bai,
Y. W. Bao,
D. Bastieri,
X. J. Bi,
Y. J. Bi,
J. T. Cai,
Q. Cao,
W. Y. Cao,
Zhe Cao,
J. Chang,
J. F. Chang,
A. M. Chen,
E. S. Chen,
Liang Chen,
Lin Chen,
Long Chen,
M. J. Chen,
M. L. Chen,
Q. H. Chen,
S. H. Chen,
S. Z. Chen
, et al. (255 additional authors not shown)
Abstract:
The first source catalog of the Large High Altitude Air Shower Observatory reported the detection of a very-high-energy gamma-ray source, 1LHAASO J1219+2915. In this paper, a further detailed study of the spectral and temporal behavior of this point-like source has been carried out. The best-fit position of the TeV source ($\rm{RA}=185.05^{\circ}\pm0.04^{\circ}$, $\rm{Dec}=29.25^{\circ}\pm0.03^{\circ}$) is compatible with NGC 4278 within $\sim0.03$ degree. Variation analysis shows an indication of variability at the level of a few months in the TeV band, which is consistent with low-frequency observations. Based on these observations, we report the detection of TeV $γ$-ray emissions from this low-luminosity AGN NGC 4278. The observations by LHAASO-WCDA during the active period have a significance level of 8.8\,$σ$ with best-fit photon spectral index $\varGamma=2.56\pm0.14$ and a flux $f_{1-10\,\rm{TeV}}=(7.0\pm1.1_{\rm{sta}}\pm0.35_{\rm{syst}})\times10^{-13}\,\rm{photons\,cm^{-2}\,s^{-1}}$, or approximately $5\%$ of the Crab Nebula. The discovery of VHE emission from NGC 4278 indicates that the compact, weak radio jet can efficiently accelerate particles and emit TeV photons.
Submitted 13 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
In Situ AI Prototyping: Infusing Multimodal Prompts into Mobile Settings with MobileMaker
Authors:
Savvas Petridis,
Michael Xieyang Liu,
Alexander J. Fiannaca,
Vivian Tsai,
Michael Terry,
Carrie J. Cai
Abstract:
Recent advances in multimodal large language models (LLMs) have lowered the barriers to rapidly prototyping AI-powered features via prompting, especially for mobile-intended use cases. Despite the value of situated user feedback, the process of soliciting early, mobile-situated user feedback on AI prototypes remains challenging. The broad scope and flexibility of LLMs means that, for a given use-case-specific prototype, there is a crucial need to understand the wide range of in-the-wild input likely to be provided by the user, as well as their in-context expectations of the AI's behavior. To explore the concept of in situ AI prototyping and testing, we created MobileMaker: an AI prototyping tool that enables designers to rapidly create mobile AI prototypes that can be tested on-device, and enables testers to make on-device, in-the-field revisions of the prototype through natural language. In an exploratory study with 16 users, we explored how user feedback on prototypes created with MobileMaker compares to that of existing prototyping tools (e.g., Figma, prompt editors). We found that MobileMaker prototypes enabled more serendipitous discovery of: model input edge cases, discrepancies between AI's and user's in-context interpretation of the task, and contextual signals missed by the AI. Furthermore, we learned that while the ability to make in-the-wild revisions led users to feel more fulfilled as active participants in the design process, it might also constrain their feedback to the subset of changes perceived as more actionable or implementable by the prototyping tool.
Submitted 6 May, 2024;
originally announced May 2024.
-
Spectral conditions for the existence of (doubly) chorded cycles in graphs with fixed size
Authors:
Jin Cai,
Leyou Xu,
Bo Zhou
Abstract:
A chorded cycle is a cycle with at least one chord, and a doubly chorded cycle is a cycle with at least two chords. Gould asked in [Graphs Comb. 38 (2022) 189] the question: What spectral conditions imply a graph contains a chorded cycle? For a graph with fixed size, extremal spectral conditions are given to ensure that a graph contains a chorded cycle and a doubly chorded cycle, respectively, via spectral radius.
Submitted 6 May, 2024;
originally announced May 2024.
-
Exploring the Improvement of Evolutionary Computation via Large Language Models
Authors:
Jinyu Cai,
Jinglue Xu,
Jialong Li,
Takuto Ymauchi,
Hitoshi Iba,
Kenji Tei
Abstract:
Evolutionary computation (EC), as a powerful optimization algorithm, has been applied across various domains. However, as the complexity of problems increases, the limitations of EC have become more apparent. The advent of large language models (LLMs) has not only transformed natural language processing but also extended their capabilities to diverse fields. By harnessing LLMs' vast knowledge and adaptive capabilities, we provide a forward-looking overview of potential improvements LLMs can bring to EC, focusing on the algorithms themselves, population design, and additional enhancements. This presents a promising direction for future research at the intersection of LLMs and EC.
Submitted 23 May, 2024; v1 submitted 5 May, 2024;
originally announced May 2024.
-
Language Evolution for Evading Social Media Regulation via LLM-based Multi-agent Simulation
Authors:
Jinyu Cai,
Jialong Li,
Mingyue Zhang,
Munan Li,
Chen-Shu Wang,
Kenji Tei
Abstract:
Social media platforms such as Twitter, Reddit, and Sina Weibo play a crucial role in global communication but often encounter strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify their way of communicating, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally evolves under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for ensuring freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation framework using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: a supervisory agent that enforces dialogue supervision and participant agents that evolve their language strategies while engaging in conversation, simulating how communication styles evolve to evade strict social media regulation. The study evaluates the framework's effectiveness across a range of scenarios, from abstract settings to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, it was found that LLM agents adopt different strategies for different scenarios.
Submitted 5 May, 2024;
originally announced May 2024.
-
Optimal Pricing for Linear-Quadratic Games with Nonlinear Interaction Between Agents
Authors:
Jiamin Cai,
Chenyue Zhang,
Hoi-To Wai
Abstract:
This paper studies a class of network games with linear-quadratic payoffs and externalities exerted through a strictly concave interaction function. This class of game is motivated by the diminishing marginal effects with peer influences. We analyze the optimal pricing strategy for this class of network game. First, we prove the existence of a unique Nash Equilibrium (NE). Second, we study the optimal pricing strategy of a monopolist selling a divisible good to agents. We show that the optimal pricing strategy, found by solving a bilevel optimization problem, is strictly better when the monopolist knows the network structure as opposed to the best strategy agnostic to network structure. Numerical experiments demonstrate that in most cases, the maximum revenue is achieved with an asymmetric network. These results contrast with the previously studied case of linear interaction function, where a network-independent price is proven optimal with symmetric networks. Lastly, we describe an efficient algorithm to find the optimal pricing strategy.
Submitted 3 June, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
Research on the Evaluation Index System of Enterprise Production Efficiency
Authors:
W. Li,
J. Cai,
C. Wang,
Y. Chen,
J. Xu,
J. Zhao,
Y. Chen
Abstract:
This paper focuses on studying the evaluation index system for the production efficiency of tobacco enterprises. Considering the limitations of existing evaluation methods in accurately assessing the production quality of cigarette enterprises, a mathematical model based on the Analytic Hierarchy Process (AHP) is established. This model constructs an evaluation framework for the production efficiency of cigarette enterprises and subsequently analyzes the significance of each index within this framework. To comprehensively analyze the multi-index and feasibility aspects of the selected projects, the AHP method is employed to establish a comprehensive feasibility research and evaluation structure model. The result of this feasibility study provides the conclusion that the construction of an evaluation index system for the production efficiency of cigarette enterprises can indeed promote the enhancement of their production efficiency.
Submitted 28 April, 2024;
originally announced April 2024.
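The abstract above names the Analytic Hierarchy Process (AHP) but does not list the indices or comparison values it uses. As a rough illustration only, the following sketch shows the standard AHP step of turning a pairwise comparison matrix into priority weights via the principal eigenvector, with Saaty's consistency ratio as a sanity check. The function name `ahp_weights`, the three criteria, and the comparison values are hypothetical and are not taken from the paper.

```python
import numpy as np

# Commonly cited random consistency index (RI) values from Saaty, keyed by matrix size.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def ahp_weights(pairwise):
    """Return (weights, consistency_ratio) for a reciprocal pairwise comparison matrix."""
    a = np.asarray(pairwise, dtype=float)
    n = a.shape[0]
    if a.shape != (n, n) or n not in RANDOM_INDEX:
        raise ValueError("expected a square matrix with 1 to 10 criteria")

    # The principal eigenvector of the comparison matrix gives the priority weights.
    eigvals, eigvecs = np.linalg.eig(a)
    k = int(np.argmax(eigvals.real))
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()

    # Consistency index CI = (lambda_max - n) / (n - 1); ratio CR = CI / RI.
    ci = (lam_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return w, cr


# Hypothetical example: three made-up criteria compared on Saaty's 1-9 scale.
example_matrix = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
w_example, cr_example = ahp_weights(example_matrix)
print("weights:", np.round(w_example, 3), "consistency ratio:", round(cr_example, 4))
```

A consistency ratio below about 0.1 is conventionally treated as acceptable; larger values suggest the pairwise judgments should be revisited before the weights are used.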
-
Exposing Text-Image Inconsistency Using Diffusion Models
Authors:
Mingzhen Huang,
Shan Jia,
Zhou Zhou,
Yan Ju,
Jialing Cai,
Siwei Lyu
Abstract:
In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets, act as ``omniscient'' agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation.
Submitted 27 April, 2024;
originally announced April 2024.
-
Single-Spin Waved-Brim Flat-Top Hat in the Band Edge of GdIH Monolayer
Authors:
Ningning Jia,
Zhao Yang,
Jiangtao Cai,
Zhiheng Lv,
Yongting Shi,
Tielei Song,
Xin Cui,
Zhifeng Liu
Abstract:
Exotic electronic bands in two-dimensional materials, such as flat bands, linear crossing bands, and spontaneously valley- or spin-polarized bands, have been hot topics in condensed matter physics. Herein, we first propose a general dispersion model for possible hat-like electronic bands, and then identify an intriguing single-spin \emph{waved-brim flat-top hat} in the valence band edge of a stable ferromagnetic semiconducting electrene (i.e., Janus GdIH monolayer), which can be well described by a simplified two-band Hamiltonian model. Specifically, the hat-band has a waved brim with six valleys along the boundary of the first Brillouin zone; meanwhile, it holds a flat top close to the Fermi level, resulting in the emergence of a single-spin van Hove singularity divergence and Lifshitz transitions. Owing to the breaking of both time-reversal and space inversion symmetries, a sizable spontaneous valley polarization is formed between the adjacent brim valleys, which provides the opportunity to realize the high-temperature anomalous valley Hall effect. In particular, via modest strain and carrier doping, various conductive bipolar states (spin-up vs. spin-down, K valley vs. $-$K valley, and ultra-low-speed vs. ultra-high-speed) can be modulated out from the distorted waved-brim flat-top hat of GdIH ML.
Submitted 23 April, 2024;
originally announced April 2024.
-
"I Upload...All Types of Different Things to Say, the World of Blindness Is More Than What They Think It Is": A Study of Blind TikTokers' Identity Work from a Flourishing Perspective
Authors:
Yao Lyu,
Jie Cai,
Bryan Dosono,
Davis Yadav,
John M. Carroll
Abstract:
Identity work in Human-Computer Interaction (HCI) has focused on marginalized groups, exploring designs that support their assets (what they have). However, little work has examined the identity work of people with disabilities, specifically those with visual impairments. In this study, we interviewed 45 BlindTokers (blind users on TikTok) from various backgrounds to understand their identity work from a positive design perspective. We found that BlindTokers leverage the affordances of the platform to create positive content, share their identities, and build community with the desire to flourish. We propose the concept of flourishing labor to describe the work BlindTokers do for their community's flourishing, and we discuss implications for supporting it. This work contributes to understanding blind users' experience on short-video platforms and highlights that flourishing is not just an activity for any single blind user but also a job that requires serious and committed contribution from all stakeholders, including all user groups and the TikTok platform.
Submitted 22 April, 2024;
originally announced April 2024.
-
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points
Authors:
Yi Guo,
Fanliu Kong,
Xiaoyang Li,
Hui Li,
Wei Chen,
Xiaogang Tian,
Jinping Cai,
Yang Zhang,
Shouda Liu
Abstract:
Quantization has emerged in recent years as one of the most promising compression technologies for deploying efficient large models in various real-time applications. Considering that the storage and IO of weights take up the vast majority of the overhead inside a large model, weight-only quantization can lead to large gains. However, existing quantization schemes suffer from significant accuracy degradation at very low bits, or require additional computational overhead when deployed, making them difficult to apply to large-scale applications in industry. In this paper, we propose decoupleQ, which achieves a substantial increase in model accuracy, especially at very low bits. decoupleQ abandons the traditional heuristic quantization paradigm and decouples the model parameters into integer and floating-point parts, thus transforming the quantization problem into a traditional constrained mathematical optimization problem, which is then solved alternately by off-the-shelf optimization methods.
Quantization via decoupleQ is linear and uniform, making it more hardware-friendly than non-uniform counterparts and enabling the idea to be migrated to high-bit quantization to enhance its robustness. Our method achieves online accuracy close to fp16/bf16 on the 2-bit quantization of large speech models at ByteDance. The code is available at https://github.com/bytedance/decoupleQ
Submitted 19 April, 2024;
originally announced April 2024.
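To make the decoupleQ abstract's "integer and floating-point parts" concrete, here is a minimal sketch of plain linear, uniform quantization, where each weight is approximated as `scale * q + zero` with an integer `q`. It uses simple min-max round-to-nearest as a baseline, not decoupleQ's constrained optimization; the function names `quantize_uniform` and `dequantize` are hypothetical and not taken from the paper or its repository.

```python
def quantize_uniform(weights, bits=2):
    """Min-max uniform quantization (baseline sketch, NOT the decoupleQ solver).

    Returns the integer part q (values in [0, 2**bits - 1]) and the
    floating-point part (scale, zero) such that w ~= scale * q + zero.
    """
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant weight vector, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]
    # Clamp to the representable integer range.
    q = [min(max(v, 0), levels) for v in q]
    return q, scale, lo


def dequantize(q, scale, zero):
    """Reconstruct approximate weights from the integer and float parts."""
    return [scale * v + zero for v in q]


weights = [-1.0, -0.2, 0.3, 1.0]
q, scale, zero = quantize_uniform(weights, bits=2)
restored = dequantize(q, scale, zero)
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the step size `scale`. The abstract's approach instead treats the integer and floating-point parts as variables of an optimization problem.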
-
Quantum delocalization on correlation landscape: The key to exponentially fast multipartite entanglement generation
Authors:
Yaoming Chu,
Xiangbei Li,
Jianming Cai
Abstract:
Entanglement, a hallmark of quantum mechanics, is a vital resource for quantum technologies. Generating highly entangled multipartite states is a key goal in current quantum experiments. We unveil a novel framework for understanding entanglement generation dynamics in Hamiltonian systems by quantum delocalization of an effective operator wavefunction on a correlation landscape. Our framework establishes a profound connection between the exponentially fast generation of multipartite entanglement, witnessed by the quantum Fisher information, and the linearly increasing asymptotics of hopping amplitudes governing the delocalization dynamics in Krylov space. We illustrate this connection using the paradigmatic Lipkin-Meshkov-Glick model and highlight potential signatures in chaotic Feingold-Peres tops. Our results provide a transformative tool for understanding and harnessing rapid entanglement production in complex quantum systems, providing a pathway for quantum enhanced technologies by large-scale entanglement.
Submitted 16 April, 2024;
originally announced April 2024.
-
MaSkel: A Model for Human Whole-body X-rays Generation from Human Masking Images
Authors:
Yingjie Xi,
Boyuan Cheng,
Jingyao Cai,
Jian Jun Zhang,
Xiaosong Yang
Abstract:
Human whole-body X-rays could offer a valuable reference for various applications, including medical diagnostics, digital animation modeling, and ergonomic design. The traditional method of obtaining X-ray information requires the use of CT (Computed Tomography) scan machines, which emit potentially harmful radiation. It is therefore significantly limited in realistic applications because it lacks adaptability and safety. In our work, we propose a new method to directly generate 2D human whole-body X-rays from human masking images. The predicted images are similar to real ones, with the same image style and anatomic structure. We employ a data-driven strategy. By leveraging advanced generative techniques, our model MaSkel (Masking image to Skeleton X-rays) can generate a high-quality X-ray image from a human masking image without the need for invasive and harmful radiation exposure, which not only provides a new path to generate highly anatomic and customized data but also reduces health risks. To our knowledge, MaSkel is the first work on predicting whole-body X-rays. Our work has two parts. First, to address the data limitation problem, we use diffusion-based techniques for data augmentation, which provides two synthetic datasets for preliminary pretraining. Second, we design a two-stage training strategy to train MaSkel. Finally, we evaluate the generated X-rays qualitatively and quantitatively. In addition, we invited professional doctors to assess our predicted data. These evaluations demonstrate MaSkel's superior ability to generate anatomic X-rays from human masking images. The related code and links to the dataset are available at https://github.com/2022yingjie/MaSkel.
Submitted 13 April, 2024;
originally announced April 2024.
-
The magnetism measurements of the two-dimensional van der Waals antiferromagnet CrPS4 using dynamic cantilever magnetometry
Authors:
Qi Li,
Weili Zhen,
Ning Wang,
Yang Yu,
Senyang Pan,
Lin Deng,
Jiaqiang Cai,
Kang Wang,
Lvkuan Zou,
Zhongming Zeng,
Jinglei Zhang,
Haifeng Du
Abstract:
The exploration of van der Waals (vdWs) magnetic materials has sparked great interest in spintronics. However, conventional methods often face challenges in characterizing the magnetic properties of small-sized vdWs materials, especially for antiferromagnets with extremely small magnetic moments. Here, we demonstrate the efficacy of dynamic cantilever magnetometry (DCM) in characterizing the magnetic properties of vdWs magnets, using the antiferromagnetic semiconductor CrPS4. We observe continuous spin axis rotation under a magnetic field, accurately modelled by considering the existence of marked magnetic anisotropies. Furthermore, the dominance of out-of-plane magnetic anisotropy in spin reorientation behavior at low temperatures transitions to the prevalence of in-plane anisotropy with increasing temperature, leading to a sign reversal of the frequency shift in measurements. The peculiar magnetic phase transitions make CrPS4 an intriguing platform for studying two-dimensional magnetism. Our findings underscore the effectiveness of DCM in characterizing magnetic anisotropies and phase transitions in vdWs magnets.
Submitted 12 April, 2024;
originally announced April 2024.
-
Taming Stable Diffusion for Text to 360° Panorama Image Generation
Authors:
Cheng Zhang,
Qianyi Wu,
Camilo Cruz Gambardella,
Xiaoshui Huang,
Dinh Phung,
Wanli Ouyang,
Jianfei Cai
Abstract:
Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.
Submitted 11 April, 2024;
originally announced April 2024.
-
"We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
Authors:
Michael Xieyang Liu,
Frederick Liu,
Alexander J. Fiannaca,
Terry Koo,
Lucas Dixon,
Michael Terry,
Carrie J. Cai
Abstract:
Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adheres to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool.
Submitted 10 April, 2024;
originally announced April 2024.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Authors:
Shengding Hu,
Yuge Tu,
Xu Han,
Chaoqun He,
Ganqu Cui,
Xiang Long,
Zhi Zheng,
Yewei Fang,
Yuxiang Huang,
Weilin Zhao,
Xinrong Zhang,
Zheng Leng Thai,
Kaihuo Zhang,
Chongyi Wang,
Yuan Yao,
Chenyang Zhao,
Jie Zhou,
Jie Cai,
Zhongwu Zhai,
Ning Ding,
Chao Jia,
Guoyang Zeng,
Dahai Li,
Zhiyuan Liu,
Maosong Sun
Abstract:
The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur in the WSD LRS. With the WSD LRS, we are able to efficiently study the data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are publicly available at https://github.com/OpenBMB/MiniCPM.
Submitted 3 June, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
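The Warmup-Stable-Decay schedule named in the MiniCPM abstract has three phases: the learning rate ramps up, holds at a constant peak, and then decays. The sketch below is one possible reading and rests on assumptions: the linear warmup, the linear decay shape, and all names (`wsd_lr`, `peak_lr`, `warmup_steps`, `stable_steps`, `decay_steps`, `min_lr`) are illustrative choices, not details from the abstract.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at a given step under a three-phase WSD-style schedule.

    Assumed shapes (not from the paper): linear warmup, constant plateau,
    linear decay down to min_lr.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly so the final warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate.
        return peak_lr
    decay_step = step - warmup_steps - stable_steps
    if decay_step >= decay_steps:
        # Past the end of the schedule: stay at the floor.
        return min_lr
    # Decay: interpolate linearly from peak_lr to min_lr.
    frac = decay_step / decay_steps
    return peak_lr + (min_lr - peak_lr) * frac


# 10 warmup steps, 20 stable steps, 10 decay steps.
schedule = [wsd_lr(s, 1.0, 10, 20, 10) for s in range(45)]
```

Because the plateau is constant, a checkpoint taken anywhere in the stable phase can be branched into its own decay phase. This is consistent with the abstract's description of the scheduler as conducive to continuous training.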
-
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Authors:
Fanjie Kong,
Yanbei Chen,
Jiarui Cai,
Davide Modolo
Abstract:
Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector ``HyperLearner''. We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.
Submitted 7 April, 2024;
originally announced April 2024.
-
Global strong solution to the inviscid liquid-gas two-phase flow model in $L^p$ framework
Authors:
Zhigang Wu,
Mengqian Liu,
Juanzi Cai
Abstract:
This paper is dedicated to the study of the inviscid liquid-gas two-phase flow model in $\mathbb{R}^d\ (d\geq1)$. We establish the global existence of strong solutions to this system with small initial data in hybrid Besov spaces based on general $L^p$-norms. Additionally, we obtain decay estimates of the solutions by relying on a constructed Lyapunov functional.
Submitted 7 April, 2024;
originally announced April 2024.
-
LHAASO-KM2A detector simulation using Geant4
Authors:
Zhen Cao,
F. Aharonian,
Q. An,
Axikegu,
Y. X. Bai,
Y. W. Bao,
D. Bastieri,
X. J. Bi,
Y. J. Bi,
J. T. Cai,
Q. Cao,
W. Y. Cao,
Zhe Cao,
J. Chang,
J. F. Chang,
A. M. Chen,
E. S. Chen,
Liang Chen,
Lin Chen,
Long Chen,
M. J. Chen,
M. L. Chen,
Q. H. Chen,
S. H. Chen,
S. Z. Chen
, et al. (254 additional authors not shown)
Abstract:
KM2A is one of the main sub-arrays of LHAASO, working on gamma-ray astronomy and cosmic-ray physics at energies above 10 TeV. Detector simulation is an important foundation for estimating detector performance and for data analysis. Simulating the KM2A detector in the framework of Geant4 is a big challenge because of the need to track numerous photons from a large number of detector units (>6000) with a large altitude difference (30 m) and huge coverage (1.3 km^2). In this paper, the design of the KM2A simulation code G4KM2A, based on Geant4, is introduced. G4KM2A is optimized mainly in memory consumption to avoid memory overflow. Some simplifications are used to significantly speed up the execution of G4KM2A. The running time is reduced by a factor of at least 30 compared to full detector simulation. The particle distributions and the core/angle resolution comparison between simulation and experimental data of the full KM2A array are also presented, and show good agreement.
Submitted 7 April, 2024;
originally announced April 2024.
-
DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation
Authors:
Duy-Tho Le,
Hengcan Shi,
Jianfei Cai,
Hamid Rezatofighi
Abstract:
Diffusion models have recently gained prominence as powerful deep generative models, demonstrating unmatched performance across various domains. However, their potential in multi-sensor fusion remains largely unexplored. In this work, we introduce DifFUSER, a novel approach that leverages diffusion models for multi-modal fusion in 3D object detection and BEV map segmentation. Benefiting from the inherent denoising property of diffusion, DifFUSER is able to refine or even synthesize sensor features in case of sensor malfunction, thereby improving the quality of the fused output. In terms of architecture, our DifFUSER blocks are chained together in a hierarchical BiFPN fashion, termed cMini-BiFPN, offering an alternative architecture for latent diffusion. We further introduce a Gated Self-conditioned Modulated (GSM) latent diffusion module together with a Progressive Sensor Dropout Training (PSDT) paradigm, designed to add stronger conditioning to the diffusion process and robustness to sensor failures. Our extensive evaluations on the nuScenes dataset reveal that DifFUSER not only achieves state-of-the-art performance with a 69.1% mIoU in BEV map segmentation tasks but also competes effectively with leading transformer-based fusion techniques in 3D object detection.
Submitted 6 April, 2024;
originally announced April 2024.
-
JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments
Authors:
Duy-Tho Le,
Chenhui Gou,
Stavya Datta,
Hengcan Shi,
Ian Reid,
Jianfei Cai,
Hamid Rezatofighi
Abstract:
Autonomous robot systems have attracted increasing research attention in recent years, and environment understanding is a crucial step for robot navigation, human-robot interaction, and decision-making. Real-world robot systems usually collect visual data from multiple sensors and are required to recognize numerous objects and their movements in complex human-crowded settings. Traditional benchmarks, with their reliance on single sensors and limited object classes and scenarios, fail to provide the comprehensive environmental understanding robots need for accurate navigation, interaction, and decision-making. As an extension of the JRDB dataset, we unveil JRDB-PanoTrack, a novel open-world panoptic segmentation and tracking benchmark, towards more comprehensive environmental perception. JRDB-PanoTrack includes (1) various data involving indoor and outdoor crowded scenes, as well as comprehensive 2D and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic segmentation and temporal tracking annotations, with additional 3D label projections for further spatial understanding; (3) diverse object classes for closed- and open-world recognition benchmarks, with OSPA-based metrics for evaluation. Extensive evaluation of leading methods shows the significant challenges posed by our dataset.
Submitted 2 April, 2024;
originally announced April 2024.
-
Energy-based Model for Accurate Shapley Value Estimation in Interpretable Deep Learning Predictive Modeling
Authors:
Cheng Lu,
Jiusun Zeng,
Yu Xia,
Jinhui Cai,
Shihua Luo
Abstract:
As a favorable tool for explainable artificial intelligence (XAI), the Shapley value has been widely used to interpret deep-learning-based predictive models. However, accurate and efficient estimation of the Shapley value is difficult because the computational load grows exponentially with the number of input features. Most existing accelerated estimation methods have to trade estimation accuracy for efficiency. In this article, we present EmSHAP (Energy-based model for Shapley value estimation) to estimate the expectation of the Shapley contribution function under an arbitrary subset of features given the rest. The energy-based model estimates the conditional density in the Shapley contribution function, and involves an energy network for approximating the unnormalized conditional density and a GRU (Gated Recurrent Unit) network for approximating the partition function. The GRU network maps the input features onto a hidden space to eliminate the impact of input orderings. To theoretically evaluate the performance of different Shapley value estimation methods, Theorems 1, 2 and 3 analyze the error bounds of EmSHAP as well as two state-of-the-art methods, namely KernelSHAP and VAEAC. It is proved that EmSHAP has a tighter error bound than KernelSHAP and VAEAC. Finally, case studies on two application examples show the enhanced estimation accuracy of EmSHAP.
Submitted 5 May, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
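The exponential cost mentioned in the EmSHAP abstract is easiest to see in the exact Shapley formula, which averages a feature's marginal contribution over every subset of the other features. The sketch below is the textbook exact computation for a small cooperative game, not EmSHAP or any of the estimators the abstract compares; `shapley_values` and the `value` callback are hypothetical names introduced for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values for n players (features).

    `value` maps a frozenset of player indices to a number. The loops visit
    all 2**(n-1) subsets for each player, hence the exponential cost.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that does not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Additive toy game: a coalition is worth the sum of its members' weights,
# so each player's Shapley value should equal its own weight.
player_weights = [1.0, 2.0, 3.0]
phi = shapley_values(3, lambda s: sum(player_weights[j] for j in s))
```

For an additive game the result recovers each player's own weight, which makes a convenient sanity check. In a model-explanation setting, `value` would be replaced by an expectation of the model output given a subset of features, which is the quantity the abstract says EmSHAP estimates.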
-
Continuously tunable uniaxial strain control of van der Waals heterostructure devices
Authors:
Zhaoyu Liu,
Xuetao Ma,
John Cenker,
Jiaqi Cai,
Zaiyao Fei,
Paul Malinowski,
Joshua Mutch,
Yuzhou Zhao,
Kyle Hwangbo,
Zhong Lin,
Arnab Manna,
Jihui Yang,
David Cobden,
Xiaodong Xu,
Matthew Yankowitz,
Jiun-Haw Chu
Abstract:
Uniaxial strain has been widely used as a powerful tool for investigating and controlling the properties of quantum materials. However, existing strain techniques have so far mostly been limited to use with bulk crystals. Although recent progress has been made in extending the application of strain to two-dimensional van der Waals (vdW) heterostructures, these techniques have been limited to optical characterization and extremely simple electrical device geometries. Here, we report a piezoelectric-based \textit{in situ} uniaxial strain technique enabling simultaneous electrical transport and optical spectroscopy characterization of dual-gated vdW heterostructure devices. Critically, our technique remains compatible with vdW heterostructure devices of arbitrary complexity fabricated on conventional silicon/silicon dioxide wafer substrates. We demonstrate a large and continuously tunable strain of up to $-0.15\%$ at millikelvin temperatures, with larger strain values also likely achievable. We quantify the strain transmission from the silicon wafer to the vdW heterostructure, and further demonstrate the ability of strain to modify the electronic properties of twisted bilayer graphene. Our technique provides a highly versatile new method for exploring the effect of uniaxial strain on both the electrical and optical properties of vdW heterostructures, and can be easily extended to include additional characterization techniques.
Submitted 23 May, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images
Authors:
Yushuang Wu,
Luyue Shi,
Junhao Cai,
Weihao Yuan,
Lingteng Qiu,
Zilong Dong,
Liefeng Bo,
Shuguang Cui,
Xiaoguang Han
Abstract:
Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhance the implicit representation's ability to delineate finer details. In addition, a self-conditioning mechanism is designed to use implicit predictions as guidance for diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving a 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD.
Submitted 30 March, 2024;
originally announced April 2024.
-
Heterogeneous Network Based Contrastive Learning Method for PolSAR Land Cover Classification
Authors:
Jianfeng Cai,
Yue Ma,
Zhixi Feng,
Shuyuan Yang
Abstract:
Polarimetric synthetic aperture radar (PolSAR) image interpretation is widely used in various fields. Recently, deep learning has made significant progress in PolSAR image classification. Supervised learning (SL) requires a large amount of high-quality labeled PolSAR data to achieve better performance; however, manually labeled data is insufficient. This causes SL to fall into overfitting and degrades its generalization performance. Furthermore, the scattering confusion problem is also a significant challenge that is attracting more attention. To solve these problems, this article proposes a Heterogeneous Network based Contrastive Learning method (HCLNet). It aims to learn high-level representations from unlabeled PolSAR data for few-shot classification according to multi-features and superpixels. Beyond conventional CL, HCLNet introduces a heterogeneous architecture for the first time to better utilize heterogeneous PolSAR features. It also develops two easy-to-use plugins to narrow the domain gap between optics and PolSAR: a feature filter, which enhances the complementarity of multi-features, and superpixel-based instance discrimination, which increases the diversity of negative samples. Experiments demonstrate the superiority of HCLNet on three widely used PolSAR benchmark datasets compared with state-of-the-art methods. Ablation studies also verify the importance of each component. In addition, this work has implications for how to efficiently utilize the multi-features of PolSAR data to learn better high-level representations in CL and how to construct networks better suited to PolSAR data.
Submitted 3 May, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.