MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Authors: Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou

Abstract: Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.16942 [pdf, other]

Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images

Authors: Yuanyuan Peng, Aidi Lin, Meng Wang, Tian Lin, Ke Zou, Yinglin Cheng, Tingkun Shi, Xulong Liao, Lixia Feng, Zhen Liang, Xinjian Chen, Huazhu Fu, Haoyu Chen

Abstract: Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RETFound and UIOS, and got further improvement with thresholding strategy to 98.44%. In the external test sets obtained from other OCT devices, FMUE achieved an accuracy of 88.75% and 92.73% before and after thresholding. Our model is superior to two ophthalmologists with a higher F1 score (95.17% vs. 61.93% and 71.72%). Besides, our model correctly predicts high uncertainty scores for samples with ambiguous features, of non-target-category diseases, or with low quality, to prompt manual checks and prevent misdiagnosis. FMUE provides a trustworthy method for automatic retinal anomalies detection in the real-world clinical open set environment.

Submitted 17 June, 2024; originally announced June 2024.

Comments: All codes are available at https://github.com/yuanyuanpeng0129/FMUE

arXiv:2406.12479 [pdf, other]

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding

Authors: Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, Haifeng Li

Abstract: The remote sensing image intelligence understanding model is undergoing a new profound paradigm shift which has been promoted by multi-modal large language model (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to paradigm learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have led to advances in RSI intelligence understanding in the last decade, are no longer suitable for fire-new tasks. We argued that a new dataset must be designed to lighten tasks with the following features: 1) Generalization: training model to learn shared knowledge among tasks and to adapt to different tasks; 2) Understanding complex scenes: training model to understand the fine-grained attribute of the objects of interest, and to be able to describe the scene with natural language; 3) Reasoning: training model to be able to realize high-level visual reasoning. In this paper, we designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding produced by GPT-4V and existing datasets, which we called RS-GPT4V. To achieve generalization, we used a (Question, Answer) which was deduced from GPT-4V via instruction-following to unify the tasks such as captioning and localization; To achieve complex scene, we proposed a hierarchical instruction description with local strategy in which the fine-grained attributes of the objects and their spatial relationships are described and global strategy in which all the local information are integrated to yield detailed instruction descript; To achieve reasoning, we designed multiple-turn QA pair to provide the reasoning ability for a model. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 14 pages, 6 figures, 4 tables

arXiv:2406.09317 [pdf, other]

Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham, et al. (24 additional authors not shown)

Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. For RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered.

Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08835 [pdf, other]

A Single-Step Non-Autoregressive Automatic Speech Recognition Architecture with High Accuracy and Inference Speed

Authors: Ziyang Zhuang, Chenfeng Miao, Kun Zou, Shuai Gong, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao

Abstract: Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. To further narrow the gap between the NAR and AR models, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EfficientASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EfficientASR achieves competitive results on the AISHELL-1 and AISHELL-2 benchmarks compared to the state-of-the-art (SOTA) models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the SOTA AR Conformer with about 30x inference speedup.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2405.18167 [pdf, other]

Confidence-aware multi-modality learning for eye disease screening

Authors: Ke Zou, Tian Lin, Zongbo Han, Meng Wang, Xuedong Yuan, Haoyu Chen, Changqing Zhang, Xiaojing Shen, Huazhu Fu

Abstract: Multi-modal ophthalmic image classification plays a key role in diagnosing eye diseases, as it integrates information from different sources to complement their respective performances. However, recent improvements have mainly focused on accuracy, often neglecting the importance of confidence and robustness in predictions for diverse modalities. In this study, we propose a novel multi-modality evidential fusion pipeline for eye disease screening. It provides a measure of confidence for each modality and elegantly integrates the multi-modality information using a multi-distribution fusion perspective. Specifically, our method first utilizes normal inverse gamma prior distributions over pre-trained models to learn both aleatoric and epistemic uncertainty for uni-modality. Then, the normal inverse gamma distribution is analyzed as the Student's t distribution. Furthermore, within a confidence-aware fusion framework, we propose a mixture of Student's t distributions to effectively integrate different modalities, imparting the model with heavy-tailed properties and enhancing its robustness and reliability. More importantly, the confidence-aware multi-modality ranking regularization term induces the model to more reasonably rank the noisy single-modal and fused-modal confidence, leading to improved reliability and accuracy. Experimental results on both public and internal datasets demonstrate that our model excels in robustness, particularly in challenging scenarios involving Gaussian noise and modality missing conditions. Moreover, our model exhibits strong generalization capabilities to out-of-distribution data, underscoring its potential as a promising solution for multimodal eye disease screening.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 27 pages, 7 figures, 9 tables

arXiv:2405.16910 [pdf, other]

Temperature evolution of the Fermi surface of the FeSe monolayer on STO

Authors: Khalil Zakeri, Ryan Roemer, Ke Zou

Abstract: The origin of superconductivity in FeSe monolayer on SrTiO$_3$ belongs to one of the unresolved mysteries in condensed-matter physics. Here by investigation of the temperature evolution of the dynamic charge response of FeSe/SrTiO$_3$ we demonstrate that the response of the monolayer itself is nearly temperature independent. This indicates a constant Fermi surface over a wide range of temperature, in stark contrast to that of the bulk FeSe and other Fe-based superconductors. Our results, which manifest the peculiarity of the electronic structure of the FeSe monolayer, may help for a microscopic understanding of the superconductivity in Fe-chalcogenide monolayers on oxide surfaces in general.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 7 Pages, 3 Figures

arXiv:2405.16573 [pdf, other]

FRCNet Frequency and Region Consistency for Semi-supervised Medical Image Segmentation

Authors: Along He, Tao Li, Yanlin Wu, Ke Zou, Huazhu Fu

Abstract: Limited labeled data hinder the application of deep learning in medical domain. In clinical practice, there are sufficient unlabeled data that are not effectively used, and semi-supervised learning (SSL) is a promising way for leveraging these unlabeled data. However, existing SSL methods ignore frequency domain and region-level information and it is important for lesion regions located at low frequencies and with significant scale changes. In this paper, we introduce two consistency regularization strategies for semi-supervised medical image segmentation, including frequency domain consistency (FDC) to assist the feature learning in frequency domain and multi-granularity region similarity consistency (MRSC) to perform multi-scale region-level local context information feature learning. With the help of the proposed FDC and MRSC, we can leverage the powerful feature representation capability of them in an effective and efficient way. We perform comprehensive experiments on two datasets, and the results show that our method achieves large performance gains and exceeds other state-of-the-art methods.

Submitted 26 May, 2024; originally announced May 2024.

Comments: MICCAI 2024 Early Accept

arXiv:2405.16102 [pdf, other]

Reliable Source Approximation: Source-Free Unsupervised Domain Adaptation for Vestibular Schwannoma MRI Segmentation

Authors: Hongye Zeng, Ke Zou, Zhihao Chen, Rui Zheng, Huazhu Fu

Abstract: Source-Free Unsupervised Domain Adaptation (SFUDA) has recently become a focus in the medical image domain adaptation, as it only utilizes the source model and does not require annotated target data. However, current SFUDA approaches cannot tackle the complex segmentation task across different MRI sequences, such as the vestibular schwannoma segmentation. To address this problem, we proposed Reliable Source Approximation (RSA), which can generate source-like and structure-preserved images from the target domain for updating model parameters and adapting domain shifts. Specifically, RSA deploys a conditional diffusion model to generate multiple source-like images under the guidance of varying edges of one target image. An uncertainty estimation module is then introduced to predict and refine reliable pseudo labels of generated images, and the prediction consistency is developed to select the most reliable generations. Subsequently, all reliable generated images and their pseudo labels are utilized to update the model. Our RSA is validated on vestibular schwannoma segmentation across multi-modality MRI. The experimental results demonstrate that RSA consistently improves domain adaptation performance over other state-of-the-art SFUDA methods. Code is available at https://github.com/zenghy96/Reliable-Source-Approximation.

Submitted 25 May, 2024; originally announced May 2024.

Comments: Early accepted by MICCAI 2024

arXiv:2405.04294 [pdf, other]

Enhancing the Efficiency and Accuracy of Underlying Asset Reviews in Structured Finance: The Application of Multi-agent Framework

Authors: Xiangpeng Wan, Haicheng Deng, Kai Zou, Shiqi Xu

Abstract: Structured finance, which involves restructuring diverse assets into securities like MBS, ABS, and CDOs, enhances capital market efficiency but presents significant due diligence challenges. This study explores the integration of artificial intelligence (AI) with traditional asset review processes to improve efficiency and accuracy in structured finance. Using both open-sourced and close-sourced large language models (LLMs), we demonstrate that AI can automate the verification of information between loan applications and bank statements effectively. While close-sourced models such as GPT-4 show superior performance, open-sourced models like LLAMA3 offer a cost-effective alternative. Dual-agent systems further increase accuracy, though this comes with higher operational costs. This research highlights AI's potential to minimize manual errors and streamline due diligence, suggesting a broader application of AI in financial document analysis and risk management.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2404.06798 [pdf, other]

MedRG: Medical Report Grounding with Multi-modal Large Language Model

Authors: Ke Zou, Yang Bai, Zhihao Chen, Yang Zhou, Yidi Chen, Kai Ren, Meng Wang, Xuedong Yuan, Xiaojing Shen, Huazhu Fu

Abstract: Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect in medical image analysis and radiological diagnosis. However, prevailing visual grounding approaches necessitate the manual extraction of key phrases from medical reports, imposing substantial burdens on both system efficiency and physicians. In this paper, we introduce a novel framework, Medical Report Grounding (MedRG), an end-to-end solution for utilizing a multi-modal Large Language Model to predict key phrase by incorporating a unique token, BOX, into the vocabulary to serve as an embedding for unlocking detection capabilities. Subsequently, the vision encoder-decoder jointly decodes the hidden embedding and the input medical image, generating the corresponding grounding box. The experimental results validate the effectiveness of MedRG, surpassing the performance of the existing state-of-the-art medical phrase grounding methods. This study represents a pioneering exploration of the medical report grounding task, marking the first-ever endeavor in this domain.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 12 pages, 4 figures

arXiv:2403.18388 [pdf, other]

FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion

Authors: Xiaofeng Wu, Velibor Bojkovic, Bin Gu, Kun Suo, Kai Zou

Abstract: Spiking Neural Networks (SNNs) offer a promising avenue for energy-efficient computing compared with Artificial Neural Networks (ANNs), closely mirroring biological neural processes. However, this potential comes with inherent challenges in directly training SNNs through spatio-temporal backpropagation -- stemming from the temporal dynamics of spiking neurons and their discrete signal processing -- which necessitates alternative ways of training, most notably through ANN-SNN conversion. In this work, we introduce a lightweight Forward Temporal Bias Correction (FTBC) technique, aimed at enhancing conversion accuracy without the computational overhead. We ground our method on provided theoretical findings that through proper temporal bias calibration the expected error of ANN-SNN conversion can be reduced to be zero after each time step. We further propose a heuristic algorithm for finding the temporal bias only in the forward pass, thus eliminating the computational burden of backpropagation and we evaluate our method on CIFAR-10/100 and ImageNet datasets, achieving a notable increase in accuracy on all datasets. Codes are released at a GitHub repository.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2402.11211 [pdf, other]

Training-free image style alignment for self-adapting domain shift on handheld ultrasound devices

Authors: Hongye Zeng, Ke Zou, Zhihao Chen, Yuchong Gao, Hongbo Chen, Haibin Zhang, Kang Zhou, Meng Wang, Rick Siow Mong Goh, Yong Liu, Chang Jiang, Rui Zheng, Huazhu Fu

Abstract: Handheld ultrasound devices face usage limitations due to user inexperience and cannot benefit from supervised deep learning without extensive expert annotations. Moreover, the models trained on standard ultrasound device data are constrained by training data distribution and perform poorly when directly applied to handheld device data. In this study, we propose the Training-free Image Style Alignment (TISA) framework to align the style of handheld device data to those of standard devices. The proposed TISA can directly infer handheld device images without extra training and is suited for clinical applications. We show that TISA performs better and more stably in medical detection and segmentation tasks for handheld device data. We further validate TISA as the clinical model for automatic measurements of spinal curvature and carotid intima-media thickness. The automatic measurements agree well with manual measurements made by human experts and the measurement errors remain within clinically acceptable ranges. We demonstrate the potential for TISA to facilitate automatic diagnosis on handheld ultrasound devices and expedite their eventual widespread use.

Submitted 17 February, 2024; originally announced February 2024.

arXiv:2401.07502 [pdf, other]

Compositional Oil Spill Detection Based on Object Detector and Adapted Segment Anything Model from SAR Images

Authors: Wenhui Wu, Man Sing Wong, Xinyu Yu, Guoqiang Shi, Coco Yin Tung Kwok, Kang Zou

Abstract: Semantic segmentation-based methods have attracted extensive attention in oil spill detection from SAR images. However, the existing approaches require a large number of finely annotated segmentation samples in the training stage. To alleviate this issue, we propose a composite oil spill detection framework, SAM-OIL, comprising an object detector (e.g., YOLOv8), an adapted Segment Anything Model (SAM), and an Ordered Mask Fusion (OMF) module. SAM-OIL is the first application of the powerful SAM in oil spill detection. Specifically, the SAM-OIL strategy uses YOLOv8 to obtain the categories and bounding boxes of oil spill-related objects, then inputs bounding boxes into the adapted SAM to retrieve category-agnostic masks, and finally adopts the Ordered Mask Fusion (OMF) module to fuse the masks and categories. The adapted SAM, combining a frozen SAM with a learnable Adapter module, can enhance SAM's ability to segment ambiguous objects. The OMF module, a parameter-free method, can effectively resolve pixel category conflicts within SAM. Experimental results demonstrate that SAM-OIL surpasses existing semantic segmentation-based oil spill detection methods, achieving mIoU of 69.52%. The results also indicated that both OMF and Adapter modules can effectively improve the accuracy in SAM-OIL.

Submitted 15 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures

arXiv:2312.03042 [pdf, other]

Inherent limitations of LLMs regarding spatial information

Authors: He Yan, Xinyao Hu, Xiangpeng Wan, Chengyu Huang, Kai Zou, Shiqi Xu

Abstract: Despite the significant advancements in natural language processing capabilities demonstrated by large language models such as ChatGPT, their proficiency in comprehending and processing spatial information, especially within the domains of 2D and 3D route planning, remains notably underdeveloped. This paper investigates the inherent limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, an area critical for applications ranging from autonomous vehicle guidance to assistive technologies for the visually impaired. In this paper, we introduce a novel evaluation framework complemented by a baseline dataset, meticulously crafted for this study. This dataset is structured around three key tasks: plotting spatial points, planning routes in two-dimensional (2D) spaces, and devising pathways in three-dimensional (3D) environments. We specifically developed this dataset to assess the spatial reasoning abilities of ChatGPT. Our evaluation reveals key insights into the model's capabilities and limitations in spatial understanding.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2310.18827 [pdf, other]

All Things Considered: Detecting Partisan Events from News Media with Cross-Article Comparison

Authors: Yujian Liu, Xinliang Frederick Zhang, Kaijian Zou, Ruihong Huang, Nick Beauchamp, Lu Wang

Abstract: Public opinion is shaped by the information news media provide, and that information in turn may be shaped by the ideological preferences of media outlets. But while much attention has been devoted to media bias via overt ideological language or topic selection, a more unobtrusive way in which the media shape opinion is via the strategic inclusion or omission of partisan events that may support one side or the other. We develop a latent variable-based framework to predict the ideology of news articles by comparing multiple articles on the same story and identifying partisan events whose inclusion or omission reveals ideology. Our experiments first validate the existence of partisan event selection, and then show that article alignment and cross-document comparison detect partisan events and article ideology better than competitive baselines. Our results reveal the high-level form of media bias, which is present even among mainstream media with strong norms of objectivity and nonpartisanship. Our codebase and dataset are available at https://github.com/launchnlp/ATC.

Submitted 28 October, 2023; originally announced October 2023.

Comments: EMNLP'23 Main Conference

arXiv:2310.18768 [pdf, other]

Crossing the Aisle: Unveiling Partisan and Counter-Partisan Events in News Reporting

Authors: Kaijian Zou, Xinliang Frederick Zhang, Winston Wu, Nick Beauchamp, Lu Wang

Abstract: News media is expected to uphold unbiased reporting. Yet they may still affect public opinion by selectively including or omitting events that support or contradict their ideological positions. Prior work in NLP has only studied media bias via linguistic style and word usage. In this paper, we study to which degree media balances news reporting and affects consumers through event inclusion or omission. We first introduce the task of detecting both partisan and counter-partisan events: events that support or oppose the author's political ideology. To conduct our study, we annotate a high-quality dataset, PAC, containing 8,511 (counter-)partisan event annotations in 304 news articles from ideologically diverse media outlets. We benchmark PAC to highlight the challenges of this task. Our findings highlight both the ways in which the news subtly shapes opinion and the need for large language models that better understand events within a broader context. Our dataset can be found at https://github.com/launchnlp/Partisan-Event-Dataset.

Submitted 28 October, 2023; originally announced October 2023.

Comments: EMNLP'23 Findings

arXiv:2310.13800 [pdf, other]

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Authors: Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan

Abstract: Large Language Models (LLMs) evaluation is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping up with the pace of development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary and hybrid evaluation on a range of open and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using both automatic and human evaluation. We also explore the potential of the recently released GPT-4 to act as an evaluator. We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly when using classic automatic evaluation metrics. We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks. Finally, we find that GPT-4 is capable of ranking models' outputs in a way which aligns reasonably closely to human judgement despite task-specific variations, with a lower alignment in the GEC task.

Submitted 20 October, 2023; originally announced October 2023.

Comments: Accepted at EMNLP 2023

arXiv:2310.12111 [pdf, other]

DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification

Authors: Yuanyuan Wang, Yang Zhang, Zhiyong Wu, Zhihan Yang, Tao Wei, Kun Zou, Helen Meng

Abstract: Data augmentation is vital to the generalization ability and robustness of deep neural network (DNN) models. Existing augmentation methods for speaker verification manipulate the raw signal, which is time-consuming, and the augmented samples lack diversity. In this paper, we present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification, which can generate diversified training samples in speaker embedding space with negligible extra computing cost. Firstly, we augment training samples by perturbing speaker embeddings along semantic directions, which are obtained from speaker-wise covariance matrices. Secondly, accurate covariance matrices are estimated from robust speaker embeddings during training, so we introduce difficulty-aware additive margin softmax (DAAM-Softmax) to obtain optimal speaker embeddings. Finally, we assume the number of augmented samples goes to infinity and derive a closed-form upper bound of the expected loss with DASA, which achieves compatibility and efficiency. Extensive experiments demonstrate the proposed approach can achieve a remarkable performance improvement. The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.

Submitted 18 October, 2023; originally announced October 2023.

Comments: Accepted by ICASSP 2023

arXiv:2310.03170 [pdf]

Critical Role of Disorder for Superconductivity in the Series of Epitaxial Ti(O,N) Films

Authors: Fengmiao Li, Oliver Dicks, Myung-Geun Han, Solveig Aamlid, Giorgio Levy, Ronny Sutarto, Chong Liu, Hsiang-Hsi Kung, Oleksandr Foyevstov, Simon Godin, Bruce A. Davidson, Andrea Damascelli, Yimei Zhu, Christoph Heil, Ilya Elfimov, George A. Sawatzky, Ke Zou

Abstract: Experimental manipulation of superconductivity is of paramount importance, not only for practical applications but also for identifying the key factors involved in electron pairing. In this work, we have undertaken a meticulous study of the superconductivity in a series of titanium compounds with a rocksalt structure, synthesized as epitaxial films. We find that substituting nitrogen (N) for oxygen (O) in titanium monoxide (TiO) with the stoichiometry close to TiO$_{0.6}$N$_{0.4}$ leads to superconductivity with a transition temperature (T$_c$) of ~2.6 K, about five times higher than that of TiO at ~0.5 K and half as high as the T$_c$ of ~6 K in titanium nitride (TiN). However, Eliashberg theoretical calculations predict similar T$_c$ in TiO, Ti oxynitride and TiN. The analysis of electron mean free path suggests the presence of significant disorder in TiO and a remarkable reduction in the impact of disorder in oxynitrides. Density functional theory (DFT) calculations reveal that disorder decreases the coherence of electronic states for non-zero momenta, which would degrade the influence of electron-phonon. Our findings demonstrate the disorder and superconductivity depend strongly on the N/O ratio, highlighting the critical role of disorder for superconductivity in this series of Ti(O,N) materials.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2308.11213 [pdf]

Temporally and Longitudinally Tailored Dynamic Space-Time Wave Packets

Authors: Xinzhou Su, Kaiheng Zou, Huibin Zhou, Hao Song, Yingning Wang, Ruoyu Zeng, Zile Jiang, Yuxiang Duan, Maxim Karpov, Tobias J. Kippenberg, Moshe Tur, Demetrios N. Christodoulides, Alan E. Willner

Abstract: In general, space-time wave packets with correlations between transverse spatial fields and temporal frequency spectra can lead to unique spatiotemporal dynamics, thus enabling control of the instantaneous light properties. However, spatiotemporal dynamics generated in previous approaches manifest themselves at a given propagation distance yet are not arbitrarily tailored longitudinally. Here, we propose and demonstrate a new versatile class of judiciously synthesized wave packets whose spatiotemporal evolution can be arbitrarily engineered to take place at various predesigned distances along the longitudinal propagation path. Spatiotemporal synthesis is achieved by introducing a 2-dimensional spectrum comprising both temporal and longitudinal wavenumbers associated with specific transverse Bessel-Gaussian fields. The resulting spectra are then employed to produce wave packets evolving in both time and axial distance - in full accord with the theoretical analysis. In this respect, various light degrees of freedom can be independently manipulated, such as intensity, polarization, and transverse spatial distribution (e.g., orbital angular momentum). Through a temporal-longitudinal frequency comb spectrum, we simulate the synthesis of the aforementioned wave packet properties, indicating a decrease in relative error compared to the desired phenomena as more spectral components are incorporated. Additionally, we experimentally demonstrate tailorable spatiotemporal fields carrying time- and longitudinal-varying orbital angular momentum, such that the local topological charge evolves every ~1 ps in the time domain and 10 cm axially. We believe that our space-time wave packets can significantly expand the exploration of spatiotemporal dynamics in the longitudinal dimension, and potentially enable novel applications in ultrafast microscopy, light-matter interactions, and nonlinear optics.

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2307.04981 [pdf, other]

A Multi-view Impartial Decision Network for Frontotemporal Dementia Diagnosis

Authors: Guoyao Deng, Ke Zou, Meng Wang, Xuedong Yuan, Sancong Ying, Huazhu Fu

Abstract: Frontotemporal Dementia (FTD) diagnosis has progressed successfully with deep learning techniques. However, current FTD identification methods suffer from two limitations. Firstly, they do not exploit the potential of multi-view functional magnetic resonance imaging (fMRI) for classifying FTD. Secondly, they do not consider the reliability of the multi-view FTD diagnosis. To address these limitations, we propose a reliable multi-view impartial decision network (MID-Net) for FTD diagnosis in fMRI. Our MID-Net provides confidence for each view and generates a reliable prediction without any conflict. To achieve this, we employ multiple expert models to extract evidence from the abundant neural network information contained in fMRI images. We then introduce the Dirichlet Distribution to characterize the expert class probability distribution from an evidence level. Additionally, a novel Impartial Decision Maker (IDer) is proposed to combine the different opinions inductively to arrive at an unbiased prediction without additional computation cost. Overall, our MID-Net dynamically integrates the decisions of different experts on FTD disease, especially when dealing with multi-view high-conflict cases. Extensive experiments on a high-quality FTD fMRI dataset demonstrate that our model outperforms previous methods and provides high uncertainty for hard-to-classify examples. We believe that our approach represents a significant step toward the deployment of reliable FTD decision-making under multi-expert conditions. We will release the codes for reproduction after acceptance.

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2307.04973 [pdf, other]

SAM-U: Multi-box prompts triggered uncertainty estimation for reliable SAM in medical image

Authors: Guoyao Deng, Ke Zou, Kai Ren, Meng Wang, Xuedong Yuan, Sancong Ying, Huazhu Fu

Abstract: Recently, Segmenting Anything has taken an important step towards general artificial intelligence. At the same time, its reliability and fairness have also attracted great attention, especially in the field of health care. In this study, we propose multi-box prompts triggered uncertainty estimation for SAM cues to demonstrate the reliability of segmented lesions or tissues. We estimate the distribution of SAM predictions via Monte Carlo with prior distribution parameters, which employs different prompts as a formulation of test-time augmentation. Our experimental results show that multi-box prompt augmentation improves SAM performance and endows each pixel with uncertainty. This provides the first paradigm for a reliable SAM.
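
The idea of treating several prompts as test-time augmentation can be illustrated with a minimal sketch. It assumes each box prompt has already produced a per-pixel foreground probability map, and it uses the binary entropy of the averaged map as the uncertainty measure. Both the function name `aggregate_predictions` and the choice of entropy are assumptions made for illustration; the abstract does not state which uncertainty measure SAM-U uses.

```python
import math


def aggregate_predictions(prob_maps):
    """Average per-pixel foreground probabilities over prompts and
    return (mean_map, uncertainty_map), where uncertainty is the
    binary entropy (in bits) of the averaged probability."""
    n = len(prob_maps)
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])

    def binary_entropy(p):
        # Entropy is zero when the averaged prediction is fully certain.
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

    mean_map = [
        [sum(m[r][c] for m in prob_maps) / n for c in range(cols)]
        for r in range(rows)
    ]
    uncertainty_map = [[binary_entropy(p) for p in row] for row in mean_map]
    return mean_map, uncertainty_map


# Hypothetical 2x2 probability maps from two different box prompts.
maps = [
    [[1.0, 0.9], [0.0, 0.2]],
    [[1.0, 0.1], [0.0, 0.8]],
]
mean_map, uncertainty_map = aggregate_predictions(maps)
```

Pixels on which the prompts agree receive low uncertainty, while pixels on which they disagree receive high uncertainty.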

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2304.03981 [pdf, other]

Uncertainty-inspired Open Set Learning for Retinal Anomaly Identification

Authors: Meng Wang, Tian Lin, Lianyu Wang, Aidi Lin, Ke Zou, Xinxing Xu, Yi Zhou, Yuanyuan Peng, Qingquan Meng, Yiming Qian, Guoyao Deng, Zhiqun Wu, Junhong Chen, Jianhong Lin, Mingzhi Zhang, Weifang Zhu, Changqing Zhang, Daoqiang Zhang, Rick Siow Mong Goh, Yong Liu, Chi Pui Pang, Xinjian Chen, Haoyu Chen, Huazhu Fu

Abstract: Failure to recognize samples from the classes unseen during training is a major limitation of artificial intelligence in the real-world implementation for recognition and classification of retinal anomalies. We established an uncertainty-inspired open-set (UIOS) model, which was trained with fundus images of 9 retinal conditions. Besides assessing the probability of each category, UIOS also calculated an uncertainty score to express its confidence. Our UIOS model with thresholding strategy achieved an F1 score of 99.55%, 97.01% and 91.91% for the internal testing set, external target categories (TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1 score of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS correctly predicted high uncertainty scores, which would prompt the need for a manual check in the datasets of non-target categories retinal diseases, low-quality fundus images, and non-fundus images. UIOS provides a robust method for real-world screening of retinal anomalies.

Submitted 29 August, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

arXiv:2303.16611 [pdf, other]

doi 10.1145/3653455

4D Facial Expression Diffusion Model

Authors: Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo

Abstract: Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The challenging task, traditionally having relied heavily on digital craftspersons, remains yet to be explored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) Learning the generative model that is trained over a set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks of other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various condition signals. This allows us to efficiently develop several downstream tasks involving various conditional generation, by using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder to apply the geometrical deformation embedded in landmarks on a given facial mesh. Experiments show that our model has learned to generate realistic, quality expressions solely from the dataset of relatively small size, improving over the state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM.

Submitted 15 April, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

arXiv:2303.13033 [pdf, other]

doi 10.1007/978-3-031-43895-0_21

Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging

Authors: Meng Wang, Lianyu Wang, Xinxing Xu, Ke Zou, Yiming Qian, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: Deep learning models have shown promising performance in the field of diabetic retinopathy (DR) staging. However, collaboratively training a DR staging model across multiple institutions remains a challenge due to non-iid data, client reliability, and confidence evaluation of the prediction. To address these issues, we propose a novel federated uncertainty-aware aggregation paradigm (FedUAA), which considers the reliability of each client and produces a confidence estimation for the DR staging. In our FedUAA, an aggregated encoder is shared by all clients for learning a global representation of fundus images, while a novel temperature-warmed uncertainty head (TWEU) is utilized for each client for local personalized staging criteria. Our TWEU employs an evidential deep layer to produce the uncertainty score with the DR staging results for client reliability evaluation. Furthermore, we developed a novel uncertainty-aware weighting module (UAW) to dynamically adjust the weights of model aggregation based on the uncertainty score distribution of each client. In our experiments, we collect five publicly available datasets from different institutions to construct a dataset for federated DR staging that satisfies the real non-iid condition. The experimental results demonstrate that our FedUAA achieves better DR staging performance with higher reliability compared to other federated learning methods. Our proposed FedUAA paradigm effectively addresses the challenges of collaboratively training DR staging models across multiple institutions, and provides a robust and reliable solution for the deployment of DR diagnosis models in real-world clinical scenarios.

Submitted 22 July, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

Report number: 978-3-031-43894-3

Journal ref: Medical Image Computing and Computer Assisted Intervention (MICCAI 2023)

arXiv:2303.10049 [pdf, other]

Uncertainty-informed Mutual Learning for Joint Medical Image Classification and Segmentation

Authors: Kai Ren, Ke Zou, Xianjie Liu, Yidi Chen, Xuedong Yuan, Xiaojing Shen, Meng Wang, Huazhu Fu

Abstract: Classification and segmentation are crucial in medical image analysis as they enable accurate diagnosis and disease monitoring. However, current methods often prioritize the mutual learning features and shared model parameters, while neglecting the reliability of features and performances. In this paper, we propose a novel Uncertainty-informed Mutual Learning (UML) framework for reliable and interpretable medical image analysis. Our UML introduces reliability to joint classification and segmentation tasks, leveraging mutual learning with uncertainty to improve performance. To achieve this, we first use evidential deep learning to provide image-level and pixel-wise confidences. Then, an Uncertainty Navigator Decoder is constructed for better using mutual features and generating segmentation results. Besides, an Uncertainty Instructor is proposed to screen reliable masks for classification. Overall, UML could produce confidence estimation in features and performance for each link (classification and segmentation). The experiments on the public datasets demonstrate that our UML outperforms existing methods in terms of both accuracy and robustness. Our UML has the potential to explore the development of more reliable and explainable medical image analysis models. We will release the codes for reproduction after acceptance.

Submitted 2 August, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: 13 pages

arXiv:2303.09790 [pdf, other]

Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions

Authors: Ke Zou, Tian Lin, Xuedong Yuan, Haoyu Chen, Xiaojing Shen, Meng Wang, Huazhu Fu

Abstract: Multimodality eye disease screening is crucial in ophthalmology as it integrates information from diverse sources to complement their respective performances. However, the existing methods are weak in assessing the reliability of each unimodality, and directly fusing an unreliable modality may cause screening errors. To address this issue, we introduce a novel multimodality evidential fusion pipeline for eye disease screening, EyeMoSt, which provides a measure of confidence for unimodality and elegantly integrates the multimodality information from a multi-distribution fusion perspective. Specifically, our model estimates both local uncertainty for unimodality and global uncertainty for the fusion modality to produce reliable classification results. More importantly, the proposed mixture of Student's $t$ distributions adaptively integrates different modalities to endow the model with heavy-tailed properties, increasing robustness and reliability. Our experimental findings on both public and in-house datasets show that our model is more reliable than current methods. Additionally, EyeMoSt has the potential ability to serve as a data quality discriminator, enabling reliable decision-making for multimodality eye disease screening.

Submitted 29 August, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: MICCAI 2023 (Early accept): 11 pages, 4 figures

arXiv:2302.08119 [pdf, other]

A Review of Uncertainty Estimation and its Application in Medical Imaging

Authors: Ke Zou, Zhihao Chen, Xuedong Yuan, Xiaojing Shen, Meng Wang, Huazhu Fu

Abstract: The use of AI systems in healthcare for the early screening of diseases is of great clinical importance. Deep learning has shown great promise in medical imaging, but the reliability and trustworthiness of AI systems limit their deployment in real clinical scenes, where patient safety is at stake. Uncertainty estimation plays a pivotal role in producing a confidence evaluation along with the prediction of the deep model. This is particularly important in medical imaging, where the uncertainty in the model's predictions can be used to identify areas of concern or to provide additional information to the clinician. In this paper, we review the various types of uncertainty in deep learning, including aleatoric uncertainty and epistemic uncertainty. We further discuss how they can be estimated in medical imaging. More importantly, we review recent advances in deep learning models that incorporate uncertainty estimation in medical imaging. Finally, we discuss the challenges and future directions in uncertainty estimation in deep learning for medical imaging. We hope this review will ignite further interest in the community and provide researchers with an up-to-date reference regarding applications of uncertainty estimation models in medical imaging.

Submitted 15 May, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: 11 pages, 3 figures, 3 tables

arXiv:2301.12798 [pdf, other]

Reliable Federated Disentangling Network for Non-IID Domain Feature

Authors: Meng Wang, Kai Yu, Chun-Mei Feng, Yiming Qian, Ke Zou, Lianyu Wang, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: Federated learning (FL), as an effective decentralized distributed learning approach, enables multiple institutions to jointly train a model without sharing their local data. However, the domain feature shift caused by different acquisition devices/clients substantially degrades the performance of the FL model. Furthermore, most existing FL approaches aim to improve accuracy without considering reliability (e.g., confidence or uncertainty). The predictions are thus unreliable when deployed in safety-critical applications. Therefore, aiming to improve the performance of FL under non-IID domain features while making the model more reliable, we propose a novel reliable federated disentangling network, termed RFedDis, which utilizes feature disentangling to capture the global domain-invariant cross-client representation and preserve local client-specific feature learning. Meanwhile, to effectively integrate the decoupled features, an uncertainty-aware decision fusion is also introduced to guide the network for dynamically integrating the decoupled features at the evidence level, while producing a reliable prediction with an estimated uncertainty. To the best of our knowledge, our proposed RFedDis is the first work to develop an FL approach based on evidential uncertainty combined with feature disentangling, which enhances the performance and reliability of FL in non-IID domain features. Extensive experimental results show that our proposed RFedDis provides outstanding performance with a high degree of reliability as compared to other state-of-the-art FL approaches.

Submitted 19 September, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

arXiv:2301.00349 [pdf, other]

Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty

Authors: Ke Zou, Yidi Chen, Ling Huang, Xuedong Yuan, Xiaojing Shen, Meng Wang, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: Medical image segmentation is critical for disease diagnosis and treatment assessment. However, concerns regarding the reliability of segmentation regions persist among clinicians, mainly attributed to the absence of confidence assessment, robustness, and calibration to accuracy. To address this, we introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. DEviS not only enhances the calibration and robustness of baseline segmentation accuracy but also provides high-efficiency uncertainty estimation for reliable predictions. By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation. Here, the Dirichlet distribution parameterizes the distribution of probabilities for different classes of the segmentation results. To generate calibrated predictions and uncertainty, we develop a trainable calibrated uncertainty penalty. Furthermore, DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data within the dataset. We conducted validation studies to assess both the accuracy and robustness of DEviS segmentation, along with evaluating the efficiency and reliability of uncertainty estimation. These evaluations were performed using publicly available datasets including ISIC2018, LiTS2017, and BraTS2019. Additionally, two potential clinical trials are being conducted on the Johns Hopkins OCT, Duke-OCT-DME, and FIVES datasets to demonstrate their efficacy in filtering high-quality or out-of-distribution data. Our code has been released at https://github.com/Cocofeat/DEviS.
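
As a rough illustration of the subjective-logic formulation mentioned in the abstract, the sketch below maps non-negative per-class evidence to Dirichlet parameters, expected class probabilities, and a scalar uncertainty. It follows the commonly used evidential deep learning recipe (alpha = evidence + 1, uncertainty = K / sum(alpha) for K classes). The function name `dirichlet_opinion` and the example values are hypothetical and are not taken from the DEviS code.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence to (probabilities, uncertainty).

    Assumes the usual subjective-logic convention: alpha_k = e_k + 1,
    S = sum(alpha), expected probability p_k = alpha_k / S, and
    uncertainty u = K / S, where K is the number of classes.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)  # Dirichlet strength S
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return probs, uncertainty


# Strong evidence for class 0 gives low uncertainty;
# no evidence at all gives the maximum uncertainty of 1.
confident = dirichlet_opinion([9.0, 0.0, 0.0])
unsure = dirichlet_opinion([0.0, 0.0, 0.0])
```

In a segmentation setting the same computation would be applied independently at every pixel, giving one uncertainty value per pixel.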

Submitted 13 April, 2024; v1 submitted 1 January, 2023; originally announced January 2023.

Comments: 34 pages, 11 figures

arXiv:2212.00330

Reliable Joint Segmentation of Retinal Edema Lesions in OCT Images

Authors: Meng Wang, Kai Yu, Chun-Mei Feng, Ke Zou, Yanyu Xu, Qingquan Meng, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: To handle the complicated pathological features (such as blurred boundaries, severe scale differences between symptoms, and background noise interference) in the task of joint segmentation of retinal edema lesions from OCT images, and to make the segmentation results more reliable, we propose a novel reliable multi-scale wavelet-enhanced transformer network, which can provide accurate segmentation results with reliability assessment. Specifically, aiming at improving the model's ability to learn the complex pathological features of retinal edema lesions in OCT images, we develop a novel segmentation backbone that integrates a wavelet-enhanced feature extractor network and a newly designed multi-scale transformer module. Meanwhile, to make the segmentation results more reliable, a novel uncertainty segmentation head based on the subjective logical evidential theory is introduced to generate the final segmentation results with a corresponding overall uncertainty evaluation score map. We conduct comprehensive experiments on the public database of AI-Challenge 2018 for retinal edema lesions segmentation, and the results show that our proposed method achieves better segmentation accuracy with a high degree of reliability as compared to other state-of-the-art segmentation approaches. The code will be released on: https://github.com/LooKing9218/ReliableRESeg.

Submitted 1 January, 2024; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: Improving algorithm

arXiv:2210.14793 [pdf, other]

M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

Authors: Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang

Abstract: Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to just execute a single task. Yet most real systems demand only one or two tasks at each moment, and switch between tasks as needed: therefore such all-tasks-activated inference is also highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training. Then at inference with any task of interest, the same design allows for activating only the task-corresponding sparse expert pathway, instead of the full model. Our new model design is further enhanced by hardware-level innovations, in particular, a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. When executing single-task inference, M$^3$ViT achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times, while achieving energy efficiency up to 9.23 times higher than a comparable FPGA baseline. Code is available at: https://github.com/VITA-Group/M3ViT.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.11273 [pdf]

doi 10.1088/2040-8986/ace4dc

Roadmap on spatiotemporal light fields

Authors: Yijie Shen, Qiwen Zhan, Logan G. Wright, Demetrios N. Christodoulides, Frank W. Wise, Alan E. Willner, Zhe Zhao, Kai-heng Zou, Chen-Ting Liao, Carlos Hernández-García, Margaret Murnane, Miguel A. Porras, Andy Chong, Chenhao Wan, Konstantin Y. Bliokh, Murat Yessenov, Ayman F. Abouraddy, Liang Jie Wong, Michael Go, Suraj Kumar, Cheng Guo, Shanhui Fan, Nikitas Papasimakis, Nikolay I. Zheludev, Lu Chen, et al. (20 additional authors not shown)

Abstract: Spatiotemporal sculpturing of light pulses with ultimately sophisticated structures represents the holy grail of humanity's everlasting pursuit of ultrafast information transmission and processing as well as ultra-intense energy concentration and extraction. It also holds the key to unlocking new extraordinary fundamental physical effects. Traditionally, spatiotemporal light pulses are always treated as spatiotemporally separable wave packets as solutions of Maxwell's equations. In the past decade, however, more generalized forms of spatiotemporally nonseparable solutions started to emerge with growing importance for their striking physical effects. This roadmap intends to highlight the recent advances in the creation and control of increasingly complex spatiotemporally sculptured pulses, from spatiotemporally separable to complex nonseparable states, with diverse geometric and topological structures, presenting a bird's eye viewpoint on the zoology of spatiotemporal light fields and the outlook of future trends and open challenges.

Submitted 20 October, 2022; originally announced October 2022.

Comments: This is the version of the article before peer review or editing, as submitted by an author to Journal of Optics. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it

arXiv:2207.06642 [pdf]

doi 10.1116/6.0001899

Realistic simulation of reflection high-energy electron diffraction patterns for two-dimensional lattices using Ewald construction

Authors: Chong Liu, Kai Chang, Ke Zou

Abstract: Reflection high-energy electron diffraction (RHEED) is a powerful tool for characterizing crystal surface structures. However, the setup geometry leads to distorted and complicated patterns, which are not straightforward to link to the real-space structures. A program with a graphical user interface is provided here to simulate the RHEED patterns. Following the Ewald construction in the kinematic theory, we find out the exact geometric transformation in this model that determines the positions of diffraction spots. The program can deal with many forms of surface structures, including surface reconstructions or domains. The simulations exhibit great agreement with the experimental results in various cases. This program will benefit the structure analysis in thin film growth and surface science studies.

Submitted 24 August, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: 15 pages, 5 figures. This article may be downloaded for personal use only. Any other use requires prior permission of the author and AIP Publishing. This article appeared in Journal of Vacuum Science & Technology B

Journal ref: Journal of Vacuum Science & Technology B 40 (2022) 054002
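
The kinematic Ewald construction named in the RHEED abstract above can be illustrated with a short calculation. The sketch below is not the authors' program; it is a textbook-style construction under assumed conventions (beam along +x, surface normal along +z, a flat screen perpendicular to the beam direction), and the function names are hypothetical. Each reciprocal-lattice rod of a two-dimensional lattice is intersected with the Ewald sphere, and the outgoing wave vector is projected onto the screen.

```python
import math

H = 6.62607015e-34       # Planck constant, J s
M_E = 9.1093837015e-31   # electron rest mass, kg
Q_E = 1.602176634e-19    # elementary charge, C
C = 299792458.0          # speed of light, m/s


def electron_wavenumber(voltage_v):
    """Return k = 2*pi/lambda in 1/angstrom for electrons accelerated through voltage_v volts."""
    energy = Q_E * voltage_v
    # Relativistic momentum: p^2 = 2 m E (1 + E / (2 m c^2))
    momentum = math.sqrt(2.0 * M_E * energy * (1.0 + energy / (2.0 * M_E * C ** 2)))
    wavelength_m = H / momentum
    return 2.0 * math.pi / (wavelength_m * 1e10)


def rheed_spot(g_x, g_y, k, grazing_deg, screen_distance):
    """Screen position (y, z) of the spot from rod (g_x, g_y), or None if there is none.

    Assumed geometry: the beam travels along +x, the surface normal is +z, and a
    flat screen sits perpendicular to x at screen_distance from the sample.
    """
    theta = math.radians(grazing_deg)
    # The in-plane wave vector is conserved up to a reciprocal-lattice vector.
    kx = k * math.cos(theta) + g_x
    ky = g_y
    kz_sq = k * k - kx * kx - ky * ky  # elastic scattering: |k_out| = |k_in|
    if kz_sq < 0.0 or kx <= 0.0:
        return None  # rod misses the Ewald sphere, or the beam never reaches the screen
    kz = math.sqrt(kz_sq)  # keep only the branch that leaves the surface
    return (screen_distance * ky / kx, screen_distance * kz / kx)


if __name__ == "__main__":
    k = electron_wavenumber(15000.0)   # 15 kV beam, an assumed example value
    g = 2.0 * math.pi / 3.905          # square lattice, a = 3.905 angstrom (assumed)
    for h_index in (-1, 0, 1):
        print(h_index, rheed_spot(0.0, h_index * g, k, 2.0, 300.0))
```

For the specular rod (0, 0) the construction reduces to z = L tan θ, which is a convenient sanity check on the geometry.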

arXiv:2207.00592 [pdf, other]

Dissecting Service Mesh Overheads

Authors: Xiangfeng Zhu, Guozhen She, Bowen Xue, Yu Zhang, Yongsu Zhang, Xuan Kelvin Zou, Xiongchun Duan, Peng He, Arvind Krishnamurthy, Matthew Lentz, Danyang Zhuo, Ratul Mahajan

Abstract: Service meshes play a central role in the modern application ecosystem by providing an easy and flexible way to connect different services that form a distributed application. However, because of the way they interpose on application traffic, they can substantially increase application latency and resource consumption. We develop a decompositional approach and a tool, called MeshInsight, to systematically characterize the overhead of service meshes and to help developers quantify overhead in deployment scenarios of interest. Using MeshInsight, we confirm that service meshes can have high overhead -- up to 185% higher latency and up to 92% more virtual CPU cores for our benchmark applications -- but the severity is intimately tied to how they are configured and the application workload. The primary contributors to overhead vary based on the configuration too. IPC (inter-process communication) and socket writes dominate when the service mesh operates as a TCP proxy, but protocol parsing dominates when it operates as an HTTP proxy. MeshInsight also enables us to study the end-to-end impact of optimizations to service meshes. We show that not all seemingly-promising optimizations lead to a notable overhead reduction in realistic settings.

Submitted 2 July, 2022; originally announced July 2022.

arXiv:2206.09309 [pdf, other]

TBraTS: Trusted Brain Tumor Segmentation

Authors: Ke Zou, Xuedong Yuan, Xiaojing Shen, Meng Wang, Huazhu Fu

Abstract: Despite recent improvements in the accuracy of brain tumor segmentation, the results still exhibit low levels of confidence and robustness. Uncertainty estimation is one effective way to change this situation, as it provides a measure of confidence in the segmentation results. In this paper, we propose a trusted brain tumor segmentation network which can generate robust segmentation results and reliable uncertainty estimations without excessive computational burden and modification of the backbone network. In our method, uncertainty is modeled explicitly using subjective logic theory, which treats the predictions of backbone neural network as subjective opinions by parameterizing the class probabilities of the segmentation as a Dirichlet distribution. Meanwhile, the trusted segmentation framework learns the function that gathers reliable evidence from the feature leading to the final segmentation results. Overall, our unified trusted segmentation framework endows the model with reliability and robustness to out-of-distribution samples. To evaluate the effectiveness of our model in robustness and reliability, qualitative and quantitative experiments are conducted on the BraTS 2019 dataset.

Submitted 28 July, 2022; v1 submitted 18 June, 2022; originally announced June 2022.

Comments: 11 pages, 4 figures, Accepted by MICCAI 2022
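
The TBraTS abstract above describes treating network outputs as subjective opinions parameterized by a Dirichlet distribution. As a rough illustration, the sketch below applies the standard subjective-logic mapping from non-negative per-class evidence to belief masses and an uncertainty mass. This mapping is an assumption here: the abstract does not spell out the formulas, so the paper's exact formulation may differ, and the function name is hypothetical. In a segmentation setting the same mapping would be applied independently at every voxel.

```python
def dirichlet_opinion(evidence):
    """Map non-negative per-class evidence to a subjective-logic opinion.

    Returns (belief, uncertainty, expected_prob), where the belief masses and
    the uncertainty mass sum to 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]            # Dirichlet concentration parameters
    strength = sum(alpha)                          # Dirichlet strength S
    belief = [e / strength for e in evidence]      # belief mass per class
    uncertainty = num_classes / strength           # mass not committed to any class
    expected_prob = [a / strength for a in alpha]  # mean of the Dirichlet
    return belief, uncertainty, expected_prob


if __name__ == "__main__":
    # Strong evidence for class 0 gives low uncertainty; no evidence gives uncertainty 1.
    print(dirichlet_opinion([40.0, 1.0, 1.0, 0.0]))
    print(dirichlet_opinion([0.0, 0.0, 0.0, 0.0]))
```

With no evidence at all the uncertainty mass equals 1 and the expected probabilities are uniform, which is the behaviour one would want for out-of-distribution inputs.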

arXiv:2203.13005 [pdf, other]

GX-Plug: a Middleware for Plugging Accelerators to Distributed Graph Processing

Authors: Kai Zou, Xike Xie, Qi Li, Deyu Kong

Abstract: Recently, research communities have highlighted the necessity of formulating a scalability continuum for large-scale graph processing, which gains the scale-out benefits from distributed graph systems, and the scale-up benefits from high-performance accelerators. To this end, we propose a middleware, called the GX-plug, for the ease of integrating the merits of both. As a middleware, the GX-plug is versatile in supporting different runtime environments, computation models, and programming models. Moreover, to improve the middleware performance, we study a series of techniques, including pipeline shuffle, synchronization caching and skipping, and workload balancing, for intra-, inter-, and beyond-iteration optimizations, respectively. Experiments show that our middleware efficiently plugs accelerators to representative distributed graph systems, e.g., GraphX and Powergraph, with up to 20x acceleration ratio.

Submitted 31 March, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: 13 pages

arXiv:2203.03367 [pdf, other]

Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

Authors: Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jian Xu, Guanjun Jiang, Luxi Xing, Ping Yang

Abstract: Passage retrieval is a fundamental task in information retrieval (IR) research, which has drawn much attention recently. In the English field, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have resulted in a substantial improvement of existing passage retrieval systems. However, in the Chinese field, especially for specific domains, passage retrieval systems are still immature due to quality-annotated datasets being limited by scale. Therefore, in this paper, we present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains, including E-commerce, Entertainment video and Medical. Each dataset contains millions of passages and a certain amount of human annotated query-passage related pairs. We implement various representative passage retrieval methods as baselines. We find that the performance of retrieval models trained on datasets from the general domain will inevitably decrease on specific domains. Nevertheless, a passage retrieval system built on an in-domain annotated dataset can achieve significant improvement, which indeed demonstrates the necessity of domain labeled data for further optimization. We hope the release of the Multi-CPR dataset could benchmark the Chinese passage retrieval task in specific domains and also make advances for future studies.

Submitted 24 April, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: SIGIR 2022 Resource Track

arXiv:2202.13238 [pdf, other]

The categorical form of Fargues' conjecture for tori

Authors: Konrad Zou

Abstract: We prove the main conjecture of arXiv:2102.13459 for integral coefficients in the case of tori. Along the way we prove that the spectral action as constructed in that manuscript is compatible with the action of the excursion algebra and preserves the grading by $π_1(G)_Q$ on both sides. We additionally develop a (non-solidified) version of condensed group (co)homology and show that many constructions from classical group (co)homology extend to that case.

Submitted 26 February, 2022; originally announced February 2022.

Report number: MPIM-Bonn-2022

arXiv:2201.07660 [pdf]

doi 10.1007/978-3-030-89029-2_29

DSNet: Dynamic Skin Deformation Prediction by Recurrent Neural Network

Authors: Hyewon Seo, Kaifeng Zou, Frederic Cordier

Abstract: Skin dynamics contributes to the enriched realism of human body models in rendered scenes. Traditional methods rely on physics-based simulations to accurately reproduce the dynamic behavior of soft tissues. Due to the model complexity and thus the heavy computation, however, they do not directly offer practical solutions to domains where real-time performance is desirable. The quality shapes obtained by physics-based simulations are not fully exploited by example-based or more recent data-driven methods either, with most of them having focused on the modeling of static skin shapes by leveraging quality data. To address these limitations, we present a learning-based method for dynamic skin deformation. At the core of our work is a recurrent neural network that learns to predict the nonlinear, dynamics-dependent shape change over time from pre-existing mesh deformation sequence data. Our network also learns to predict the variation of skin dynamics across different individuals with varying body shapes. After training, the network delivers realistic, high-quality skin dynamics that is specific to a person in a real-time course. We obtain results that significantly save computational time, while maintaining comparable prediction quality compared to state-of-the-art results.

Submitted 26 November, 2021; originally announced January 2022.

Journal ref: Lecture Notes in Computer Science, Springer, 2021, Lecture Notes in Computer Science, 13002, pp.365-377

arXiv:2201.05307 [pdf, other]

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Authors: Daizong Liu, Xiaoye Qu, Yinzhen Wang, Xing Di, Kai Zou, Yu Cheng, Zichuan Xu, Pan Zhou

Abstract: Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work trying to address TVG in an unsupervised setting. Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as the guidance to compose the activity in video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out the redundant background activities and refine the grounding results. To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.

Submitted 14 January, 2022; originally announced January 2022.

Comments: Accepted by AAAI2022

arXiv:2112.09414 [pdf, other]

doi 10.1007/s00371-022-02755-0

Disentangled representations: towards interpretation of sex determination from hip bone

Authors: Kaifeng Zou, Sylvain Faisan, Fabrice Heitz, Marie Epain, Pierre Croisille, Laurent Fanton, Sébastien Valette

Abstract: By highlighting the regions of the input image that contribute the most to the decision, saliency maps have become a popular method to make neural networks interpretable. In medical imaging, they are particularly well-suited to explain neural networks in the context of abnormality localization. However, from our experiments, they are less suited to classification problems where the features that allow one to distinguish between the different classes are spatially correlated, scattered and definitely non-trivial. In this paper we thus propose a new paradigm for better interpretability. To this end we provide the user with relevant and easily interpretable information so that he can form his own opinion. We use Disentangled Variational Auto-Encoders whose latent representation is divided into two components: the non-interpretable part and the disentangled part. The latter accounts for the categorical variables explicitly representing the different classes of interest. In addition to providing the class of a given input sample, such a model offers the possibility to transform the sample from a given class to a sample of another class, by modifying the value of the categorical variables in the latent representation. This paves the way to easier interpretation of class differences. We illustrate the relevance of this approach in the context of automatic sex determination from hip bones in forensic medicine. The features encoded by the model that distinguish the different classes were found to be consistent with expert knowledge.

Submitted 17 December, 2021; originally announced December 2021.

Journal ref: The Visual Computer (2023)

arXiv:2105.09596 [pdf, other]

AGSFCOS: Based on attention mechanism and Scale-Equalizing pyramid network of object detection

Authors: Li Wang, Wei Xiang, Ruhui Xue, Kaida Zou, Laili Zhu

Abstract: Recently, the anchor-free object detection model has shown great potential for accuracy and speed to exceed anchor-based object detection. Therefore, two issues are mainly studied in this article: (1) How to let the backbone network in the anchor-free object detection model learn feature extraction? (2) How to make better use of the feature pyramid network? In order to solve the above problems, Experiments show that our model has a certain improvement in accuracy compared with the current popular detection models on the COCO dataset, the designed attention mechanism module can capture contextual information well, improve detection accuracy, and use sepc network to help balance abstract and detailed information, and reduce the problem of semantic gap in the feature pyramid network. Whether it is anchor-based network model YOLOv3, Faster RCNN, or anchor-free network model Foveabox, FSAF, FCOS. Our optimal model can get 39.5% COCO AP under the background of ResNet50.

Submitted 20 May, 2021; originally announced May 2021.

Comments: 9 pages, 9 figures

arXiv:2103.14139 [pdf, other]

Inverse-designed multi-dimensional silicon photonic transmitters

Authors: Ki Youl Yang, Alexander D. White, Farshid Ashtiani, Chinmay Shirpurkar, Srinivas V. Pericherla, Lin Chang, Hao Song, Kaiheng Zou, Huibin Zhou, Kai Pang, Joshua Yang, Melissa A. Guidry, Daniil M. Lukin, Han Hao, Lawrence Trask, Geun Ho Ahn, Andy Netherton, Travis C. Briles, Jordan R. Stone, Lior Rechtman, Jeffery S. Stone, Kasper Van Gasse, Jinhie L. Skarda, Logan Su, Dries Vercruysse, et al. (11 additional authors not shown)

Abstract: Modern microelectronic processors have migrated towards parallel computing architectures with many-core processors. However, such expansion comes with diminishing returns exacted by the high cost of data movement between individual processors. The use of optical interconnects has burgeoned as a promising technology that can address the limits of this data transfer. While recent pushes to enhance optical communication have focused on developing wavelength-division multiplexing technology, this approach will eventually saturate the usable bandwidth, and new dimensions of data transfer will be paramount to fulfill the ever-growing need for speed. Here we demonstrate an integrated intra- and inter-chip multi-dimensional communication scheme enabled by photonic inverse design. Using inverse-designed mode-division multiplexers, we combine wavelength- and mode- multiplexing and send massively parallel data through nano-photonic waveguides and optical fibres. Crucially, as we take advantage of an orthogonal optical basis, our approach is inherently scalable to a multiplicative enhancement over the current state of the art.

Submitted 10 October, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

Comments: Fig.2-4 present new experimental results -- (i) demonstration of a broadband, low cross-talk multiplexer, (ii) a silicon photonic mode-division multiplexing with a chip-scale soliton microcomb source, and (iii) a chip-to-chip optical interconnect using a multimode-matched fibre and inverse-designed beam couplers

arXiv:2101.09967 [pdf]

Turbulence-Resilient Coherent Free-Space Optical Communications using Automatic Power-Efficient Pilot-Assisted Optoelectronic Beam Mixing of Many Modes

Authors: Runzhou Zhang, Nanzhe Hu, Huibin Zhou, Kaiheng Zou, Xinzhou Su, Yiyu Zhou, Haoqian Song, Kai Pang, Hao Song, Amir Minoofar, Zhe Zhao, Cong Liu, Karapet Manukyan, Ahmed Almaiman, Brittany Lynn, Robert W. Boyd, Moshe Tur, Alan E. Willner

Abstract: Atmospheric turbulence generally limits free-space optical (FSO) communications, and this problem is severely exacerbated when implementing highly sensitive and spectrally efficient coherent detection. Specifically, turbulence induces power coupling from the transmitted Gaussian mode to higher-order Laguerre-Gaussian (LG) modes, resulting in a significant decrease of the power that mixes with a single-mode local oscillator (LO). Instead, we transmit a frequency-offset Gaussian pilot tone along with the data signal, such that both experience similar turbulence and modal power coupling. Subsequently, the photodetector (PD) optoelectronically mixes all corresponding pairs of the beams' modes. During mixing, a conjugate of the turbulence experienced by the pilot tone is automatically generated and compensates the turbulence experienced by the data, and nearly all orders of the same corresponding modes efficiently mix. We demonstrate a 12-Gbit/s 16-quadrature-amplitude-modulation (16-QAM) polarization-multiplexed (PolM) FSO link that exhibits resilience to emulated turbulence. Experimental results for turbulence D/r_0~5.5 show up to ~20 dB reduction in the mixing power loss over a conventional coherent receiver. Therefore, our approach automatically recovers nearly all the captured data power to enable high-performance coherent FSO systems.

Submitted 25 January, 2021; originally announced January 2021.
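
The pilot-assisted mixing argument in the abstract above rests on mode orthogonality: if the data beam and the pilot pick up the same modal coupling, their beat term sums the power in every mode, whereas a single-mode LO only recovers the power left in one mode. The toy model below is a sketch of that argument only, not the authors' experiment. It assumes discrete Fourier vectors as stand-ins for the LG modes, identical coupling for data and pilot, full capture of the beam, and it drops the pilot's frequency offset (everything is at baseband). All names are hypothetical.

```python
import cmath
import math
import random


def fourier_mode(index, num_samples):
    """Orthonormal stand-in for a spatial mode, sampled on num_samples points."""
    scale = 1.0 / math.sqrt(num_samples)
    return [scale * cmath.exp(2j * math.pi * index * n / num_samples)
            for n in range(num_samples)]


def distorted_field(coeffs, modes):
    """Superpose the modes with the complex coupling coefficients set by turbulence."""
    num_samples = len(modes[0])
    return [sum(c * m[n] for c, m in zip(coeffs, modes)) for n in range(num_samples)]


def mixing_term(field_a, field_b):
    """Beat term a detector integrates over its aperture: sum of a * conj(b)."""
    return sum(a * b.conjugate() for a, b in zip(field_a, field_b))


def compare_receivers(symbol, num_modes=8, num_samples=64, seed=0):
    """Return (single-mode-LO beat term, pilot-assisted beat term) for one symbol."""
    rng = random.Random(seed)
    modes = [fourier_mode(k, num_samples) for k in range(num_modes)]
    # Random coupling from the launched mode into num_modes modes, unit total power.
    raw = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_modes)]
    norm = math.sqrt(sum(abs(c) ** 2 for c in raw))
    coeffs = [c / norm for c in raw]
    unit_field = distorted_field(coeffs, modes)
    data = [symbol * v for v in unit_field]  # data beam after turbulence
    pilot = unit_field                       # pilot sees the same distortion
    single_mode_lo = modes[0]                # conventional local oscillator
    return mixing_term(data, single_mode_lo), mixing_term(data, pilot)


if __name__ == "__main__":
    symbol = complex(3, -1) / math.sqrt(10)  # one 16-QAM point, unit-scaled
    conventional, pilot_assisted = compare_receivers(symbol)
    print(abs(conventional), abs(pilot_assisted), abs(symbol))
```

In this idealized model the pilot-assisted beat term equals the transmitted symbol regardless of the random coupling, while the single-mode-LO term is scaled by the coefficient of the one mode the LO matches.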

arXiv:2012.06730 [pdf, other]

doi 10.1021/acsphotonics.1c00730

Fractal superconducting nanowires detect infrared single photons with 84% system detection efficiency, 1.02 polarization sensitivity, and 20.8 ps timing resolution

Authors: Yun Meng, Kai Zou, Nan Hu, Liang Xu, Xiaojian Lan, Stephan Steinhauer, Samuel Gyger, Val Zwiller, Xiaolong Hu

Abstract: The near-unity system detection efficiency (SDE) and excellent timing resolution of superconducting nanowire single-photon detectors (SNSPDs), combined with their other merits, have enabled many classical and quantum photonic applications. However, the prevalent design based on meandering nanowires makes SDE dependent on the polarization states of the incident photons; for unpolarized light, the major merit of high SDE would get compromised, which could be detrimental for photon-starved applications. Here, we create SNSPDs with an arced fractal geometry that almost completely eliminates this polarization dependence of the SDE, and we experimentally demonstrate 84$\pm$3$\%$ SDE, 1.02$^{+0.06}_{-0.02}$ polarization sensitivity at the wavelength of 1575 nm, and 20.8 ps timing jitter in a 0.1-W closed-cycle Gifford-McMahon cryocooler, at the base temperature of 2.0 K. This demonstration provides a novel, practical device structure of SNSPDs, allowing for operation in the visible, near-, and mid-infrared spectral ranges, and paves the way for polarization-insensitive single-photon detection with high SDE and high timing resolution.

Submitted 31 March, 2022; v1 submitted 12 December, 2020; originally announced December 2020.

Comments: 8 pages, 4 figures

arXiv:2005.09581 [pdf]

doi 10.1103/PhysRevB.101.214105

Controlling the electrical and magnetic ground states by doping in the complete phase diagram of titanate Eu1-xLaxTiO3 thin films

Authors: Hyungki Shin, Chong Liu, Fengmiao Li, Ronny Sutarto, Bruce A. Davidson, Ke Zou

Abstract: EuTiO3, a band insulator, and LaTiO3, a Mott insulator, are both antiferromagnetic with transition temperatures ~ 5.5 K and ~ 160 K, respectively. Here, we report the synthesis of Eu1-xLaxTiO3 thin films with x = 0 to 1 by oxide molecular beam epitaxy. The films in the full range have high crystalline quality and show no phase segregation, allowing us to carry out transport measurements to study their electrical and magnetic properties. From x = 0.03 to 0.95, Eu1-xLaxTiO3 films show conduction by electrons as charge carriers, with differences in carrier densities and mobilities, contrary to the insulating nature of pure EuTiO3 and LaTiO3. Following a rich phase diagram, the magnetic ground states of the films vary with increasing La-doping level, changing Eu1-xLaxTiO3 from an antiferromagnetic insulator to an antiferromagnetic metal, a ferromagnetic metal, a paramagnetic metal, and back to an antiferromagnetic insulator. These emergent properties reflect the evolution of the band structure, mainly at the Ti t2g bands near the Fermi level, when Eu2+ ions are gradually replaced by La3+. This work sheds light on this method for designing the electrical and magnetic properties in strongly-correlated oxides and completes the phase diagram of the titanate Eu1-xLaxTiO3.

Submitted 19 May, 2020; originally announced May 2020.

Journal ref: Physical Review B, 2020

arXiv:2003.09916 [pdf, other]

A platform for high performance photon correlation measurements

Authors: Iman Esmaeil Zadeh, Johannes W. N. Los, Ronan B. M. Gourgues, Jin Chang, Ali W. Elshaari, Julien Zichi, Yuri J. van Staaden, Jeroen Swens, Nima Kalhor, Antonio Guardiani, Yun Meng, Kai Zou, Sergiy Dobrovolskiy, Andreas W. Fognini, Dennis R. Schaart, Dan Dalacu, Philip J. Poole, Michael E. Reimer, Xiaolong Hu, Silvania F. Pereira, Val Zwiller, Sander N. Dorenbos

Abstract: A broad range of scientific and industrial disciplines require precise optical measurements at very low light levels. Single-photon detectors combining high efficiency and high time resolution are pivotal in such experiments. By using relatively thick films of NbTiN (8-11\,nm) and improving the pattern fidelity of the nano-structure of the superconducting nanowire single-photon detectors (SNSPD), we fabricated devices demonstrating superior performance over all previously reported detectors in the combination of efficiency and time resolution. Our findings prove that small variations in the nanowire width, on the order of a few nanometers, can lead to a significant penalty on their temporal response. Addressing these issues, we consistently achieved high time resolution (best device 7.7\,ps, other devices $\sim$10-16\,ps) simultaneously with high system detection efficiencies ($80-90\%$) in the wavelength range of 780-1000\,nm, as well as in the telecom bands (1310-1550\,nm). The use of thicker films allowed us to fabricate large-area multi-pixel devices with homogeneous pixel performance. We first fabricated and characterized a $100\times100\, μm^2$ 16-pixel detector and showed there was little variation among individual pixels. Additionally, to showcase the power of our platform, we fabricated and characterized 4-pixel multimode fiber-coupled detectors and carried out photon correlation experiments on a nanowire quantum dot resulting in $g^2(0)$ values lower than 0.04.
The multi-pixel detectors alleviate the need for beamsplitters and can be used for higher order correlations with promising prospects not only in the field of quantum optics, but also in bio-imaging applications, such as fluorescence microscopy and positron emission tomography.

Submitted 22 March, 2020; originally announced March 2020.

arXiv:2002.06314 [pdf]

doi 10.1103/PhysRevB.101.140502

Tuning stoichiometry and its impact on superconductivity of monolayer and multilayer FeSe on SrTiO3

Authors: Chong Liu, Ke Zou

Abstract: Synthesis of monolayer FeSe on SrTiO3, with greatly enhanced superconductivity compared to bulk FeSe, remains difficult. Lengthy annealing within a certain temperature window is always required to achieve superconducting samples as reported by different groups around the world, but the mechanism of annealing in inducing superconductivity has not been elucidated. We grow FeSe films on SrTiO3 by molecular beam epitaxy and adjust the stoichiometry by depositing additional small amounts of Fe atoms. The monolayer films become superconducting after the Fe deposition without any annealing, and show superconducting transition temperatures similar to those of the annealed films in transport measurements. We also demonstrate on the 5-unit-cell films that the FeSe multilayer can be reversibly tuned between the non-superconducting $\sqrt{5} \times \sqrt{5}$ phase with Fe vacancies and the superconducting $1 \times 1$ phase. Our results reveal that the traditional annealing process in essence removes Fe vacancies and the additional Fe deposition serves as a more efficient way to achieve superconductivity. This work highlights the significance of stoichiometry in the superconductivity of FeSe thin films and provides an easy path for superconducting samples.

Submitted 14 February, 2020; originally announced February 2020.

Comments: 14 pages, 5 figures

Journal ref: Phys. Rev. B 101, 140502 (2020)

Showing 1–50 of 78 results for author: Zou, K