subscribe to arXiv mailings

arXiv:2407.05256 [pdf, other]

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Authors: Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang

Abstract: Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneer… ▽ More Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language models (VLM) knowledge into OV-3DDet learning, the full potential of these foundational models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize a object detection vision foundation model to enable the zero-shot discovery of objects in images, which serves as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios. △ Less

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2406.06216 [pdf, other]

Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

Authors: Xin Jin, Pengyi Jiao, Zheng-Peng Duan, Xingchao Yang, Chun-Le Guo, Bo Ren, Chongyi Li

Abstract: Volumetric rendering based methods, like NeRF, excel in HDR view synthesis from RAWimages, especially for nighttime scenes. While, they suffer from long training times and cannot perform real-time rendering due to dense sampling requirements. The advent of 3D Gaussian Splatting (3DGS) enables real-time rendering and faster training. However, implementing RAW image-based view synthesis directly usi… ▽ More Volumetric rendering based methods, like NeRF, excel in HDR view synthesis from RAWimages, especially for nighttime scenes. While, they suffer from long training times and cannot perform real-time rendering due to dense sampling requirements. The advent of 3D Gaussian Splatting (3DGS) enables real-time rendering and faster training. However, implementing RAW image-based view synthesis directly using 3DGS is challenging due to its inherent drawbacks: 1) in nighttime scenes, extremely low SNR leads to poor structure-from-motion (SfM) estimation in distant views; 2) the limited representation capacity of spherical harmonics (SH) function is unsuitable for RAW linear color space; and 3) inaccurate scene structure hampers downstream tasks such as refocusing. To address these issues, we propose LE3D (Lighting Every darkness with 3DGS). Our method proposes Cone Scatter Initialization to enrich the estimation of SfM, and replaces SH with a Color MLP to represent the RAW linear color space. Additionally, we introduce depth distortion and near-far regularizations to improve the accuracy of scene structure for downstream tasks. These designs enable LE3D to perform real-time novel view synthesis, HDR rendering, refocusing, and tone-mapping changes. Compared to previous volumetric rendering based methods, LE3D reduces training time to 1% and improves rendering speed by up to 4,000 times for 2K resolution images in terms of FPS. Code and viewer can be found in https://github.com/Srameo/LE3D . △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.02147 [pdf, other]

UA-Track: Uncertainty-Aware End-to-End 3D Multi-Object Tracking

Authors: Lijun Zhou, Tao Tang, Pengkun Hao, Zihang He, Kalok Ho, Shuo Gu, Wenbo Hou, Zhihui Hao, Haiyang Sun, Kun Zhan, Peng Jia, Xianpeng Lang, Xiaodan Liang

Abstract: 3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods overlook the uncertainty issue, which refers to the lack of precise confidence about the state and location of tracked objects. Uncertainty arises… ▽ More 3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods overlook the uncertainty issue, which refers to the lack of precise confidence about the state and location of tracked objects. Uncertainty arises owing to various factors during motion observation by cameras, especially occlusions and the small size of target objects, resulting in an inaccurate estimation of the object's position, label, and identity. To this end, we propose an Uncertainty-Aware 3D MOT framework, UA-Track, which tackles the uncertainty problem from multiple aspects. Specifically, we first introduce an Uncertainty-aware Probabilistic Decoder to capture the uncertainty in object prediction with probabilistic attention. Secondly, we propose an Uncertainty-guided Query Denoising strategy to further enhance the training process. We also utilize Uncertainty-reduced Query Initialization, which leverages predicted 2D object location and depth information to reduce query uncertainty. As a result, our UA-Track achieves state-of-the-art performance on the nuScenes benchmark, i.e., 66.3% AMOTA on the test split, surpassing the previous best end-to-end solution by a significant margin of 8.9% AMOTA. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01349 [pdf, other]

Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

Authors: Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, Di Lin, Kaicheng Yu

Abstract: Using generative models to synthesize new data has become a de-facto standard in autonomous driving to address the data scarcity issue. Though existing approaches are able to boost perception models, we discover that these approaches fail to improve the performance of planning of end-to-end autonomous driving models as the generated videos are usually less than 8 frames and the spatial and tempora… ▽ More Using generative models to synthesize new data has become a de-facto standard in autonomous driving to address the data scarcity issue. Though existing approaches are able to boost perception models, we discover that these approaches fail to improve the performance of planning of end-to-end autonomous driving models as the generated videos are usually less than 8 frames and the spatial and temporal inconsistencies are not negligible. To this end, we propose Delphi, a novel diffusion-based long video generation method with a shared noise modeling mechanism across the multi-views to increase spatial consistency, and a feature-aligned module to achieves both precise controllability and temporal consistency. Our method can generate up to 40 frames of video without loss of consistency which is about 5 times longer compared with state-of-the-art methods. Instead of randomly generating new data, we further design a sampling policy to let Delphi generate new data that are similar to those failure cases to improve the sample efficiency. This is achieved by building a failure-case driven framework with the help of pre-trained visual language models. Our extensive experiment demonstrates that our Delphi generates a higher quality of long videos surpassing previous state-of-the-art methods. Consequentially, with only generating 4% of the training dataset size, our framework is able to go beyond perception and prediction tasks, for the first time to the best of our knowledge, boost the planning performance of the end-to-end autonomous driving model by a margin of 25%. △ Less

Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: Project Page: https://westlake-autolab.github.io/delphi.github.io/, 8 figures

arXiv:2405.14702 [pdf, other]

G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models

Authors: Pengyue Jia, Yiding Liu, Xiaopeng Li, Xiangyu Zhao, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, Dawei Yin

Abstract: Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily con… ▽ More Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily confuse distant images with similar visual contents, or cannot adapt to various locations worldwide with different amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG). In particular, G3 consists of three steps, i.e., Geo-alignment, Geo-diversification, and Geo-verification to optimize both retrieval and generation phases of worldwide geolocalization. During Geo-alignment, our solution jointly learns expressive multi-modal representations for images, GPS and textual descriptions, which allows us to capture location-aware semantics for retrieving nearby images for a given query. During Geo-diversification, we leverage a prompt ensembling method that is robust to inconsistent retrieval performance for different image queries. Finally, we combine both retrieved and generated GPS candidates in Geo-verification for location prediction. Experiments on two well-established datasets IM2GPS3k and YFCC4k verify the superiority of G3 compared to other state-of-the-art methods. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.10890 [pdf, other]

A Versatile Framework for Analyzing Galaxy Image Data by Implanting Human-in-the-loop on a Large Vision Model

Authors: Mingxiang Fu, Yu Song, Jiameng Lv, Liang Cao, Peng Jia, Nan Li, Xiangru Li, Jifeng Liu, A-Li Luo, Bo Qiu, Shiyin Shen, Liangping Tu, Lili Wang, Shoulin Wei, Haifeng Yang, Zhenping Yi, Zhiqiang Zou

Abstract: The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. Astronomers are turning to deep learning techniques to address this, but the methods are limited by their specific training sets, leading to considerable duplicate workloads too. He… ▽ More The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. Astronomers are turning to deep learning techniques to address this, but the methods are limited by their specific training sets, leading to considerable duplicate workloads too. Hence, as an example to present how to overcome the issue, we built a framework for general analysis of galaxy images, based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratio of galaxy images and the imbalanced distribution of galaxy categories, we have incorporated a Human-in-the-loop (HITL) module into our large vision model, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability to all the abovementioned tasks on galaxy images in the DESI legacy imaging surveys. Expressly, for object detection, trained by 1000 data points, our DST upon the LVM achieves an accuracy of 96.7%, while ResNet50 plus Mask R-CNN gives an accuracy of 93.1%; for morphology classification, to obtain AUC ~0.9, LVM plus DST and HITL only requests 1/50 training sets compared to ResNet18. Expectedly, multimodal data can be integrated similarly, which opens up possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-message astronomy. △ Less

Submitted 17 May, 2024; originally announced May 2024.

Comments: 26 pages, 10 figures, to be published on Chinese Physics C

arXiv:2405.03408 [pdf, other]

An Image Quality Evaluation and Masking Algorithm Based On Pre-trained Deep Neural Networks

Authors: Peng Jia, Yu Song, Jiameng Lv, Runyu Ning

Abstract: With the growing amount of astronomical data, there is an increasing need for automated data processing pipelines, which can extract scientific information from observation data without human interventions. A critical aspect of these pipelines is the image quality evaluation and masking algorithm, which evaluates image qualities based on various factors such as cloud coverage, sky brightness, scat… ▽ More With the growing amount of astronomical data, there is an increasing need for automated data processing pipelines, which can extract scientific information from observation data without human interventions. A critical aspect of these pipelines is the image quality evaluation and masking algorithm, which evaluates image qualities based on various factors such as cloud coverage, sky brightness, scattering light from the optical system, point spread function size and shape, and read-out noise. Occasionally, the algorithm requires masking of areas severely affected by noise. However, the algorithm often necessitates significant human interventions, reducing data processing efficiency. In this study, we present a deep learning based image quality evaluation algorithm that uses an autoencoder to learn features of high quality astronomical images. The trained autoencoder enables automatic evaluation of image quality and masking of noise affected areas. We have evaluated the performance of our algorithm using two test cases: images with point spread functions of varying full width half magnitude, and images with complex backgrounds. In the first scenario, our algorithm could effectively identify variations of the point spread functions, which can provide valuable reference information for photometry. In the second scenario, our method could successfully mask regions affected by complex regions, which could significantly increase the photometry accuracy. Our algorithm can be employed to automatically evaluate image quality obtained by different sky surveying projects, further increasing the speed and robustness of data processing pipelines. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: Accepted by the AJ. The code could be downloaded from: https://nadc.china-vo.org/res/r101415/ with DOI of: 10.12149/101415

arXiv:2405.02008 [pdf, other]

DiffMap: Enhancing Map Segmentation with Map Prior Using Diffusion Model

Authors: Peijin Jia, Tuopu Wen, Ziang Luo, Mengmeng Yang, Kun Jiang, Zhiquan Lei, Xuewei Tang, Ziyuan Liu, Le Cui, Kehua Sheng, Bo Zhang, Diange Yang

Abstract: Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited… ▽ More Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited utilization of structured priors inherent in map segmentation masks. In light of this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using latent diffusion model. By incorporating this technique, the performance of existing semantic segmentation methods can be significantly enhanced and certain structural errors present in the segmentation outputs can be effectively rectified. Notably, the proposed module can be seamlessly integrated into any map segmentation model, thereby augmenting its capability to accurately delineate semantic information. Furthermore, through extensive visualization analysis, our model demonstrates superior proficiency in generating results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2404.17820 [pdf, other]

doi 10.1002/rob.22345

Motion planning for off-road autonomous driving based on human-like cognition and weight adaptation

Authors: Yuchun Wang, Cheng Gong, Jianwei Gong, Peng Jia

Abstract: Driving in an off-road environment is challenging for autonomous vehicles due to the complex and varied terrain. To ensure stable and efficient travel, the vehicle requires consideration and balancing of environmental factors, such as undulations, roughness, and obstacles, to generate optimal trajectories that can adapt to changing scenarios. However, traditional motion planners often utilize a fi… ▽ More Driving in an off-road environment is challenging for autonomous vehicles due to the complex and varied terrain. To ensure stable and efficient travel, the vehicle requires consideration and balancing of environmental factors, such as undulations, roughness, and obstacles, to generate optimal trajectories that can adapt to changing scenarios. However, traditional motion planners often utilize a fixed cost function for trajectory optimization, making it difficult to adapt to different driving strategies in challenging irregular terrains and uncommon scenarios. To address these issues, we propose an adaptive motion planner based on human-like cognition and cost evaluation for off-road driving. First, we construct a multi-layer map describing different features of off-road terrains, including terrain elevation, roughness, obstacle, and artificial potential field map. Subsequently, we employ a CNN-LSTM network to learn the trajectories planned by human drivers in various off-road scenarios. Then, based on human-like generated trajectories in different environments, we design a primitive-based trajectory planner that aims to mimic human trajectories and cost weight selection, generating trajectories that are consistent with the dynamics of off-road vehicles. Finally, we compute optimal cost weights and select and extend behavioral primitives to generate highly adaptive, stable, and efficient trajectories. We validate the effectiveness of the proposed method through experiments in a desert off-road environment with complex terrain and varying road conditions. The experimental results show that the proposed human-like motion planner has excellent adaptability to different off-road conditions. It shows real-time operation, greater stability, and more human-like planning ability in diverse and challenging scenarios. △ Less

Submitted 27 April, 2024; originally announced April 2024.

Journal ref: Journal of Field Robotics,2024,1-22

arXiv:2404.01780 [pdf, other]

CSST Strong Lensing Preparation: a Framework for Detecting Strong Lenses in the Multi-color Imaging Survey by the China Survey Space Telescope (CSST)

Authors: Xu Li, Ruiqi Sun, Jiameng Lv, Peng Jia, Nan Li, Chengliang Wei, Zou Hu, Xinzhong Er, Yun Chen, Zhang Ban, Yuedong Fang, Qi Guo, Dezi Liu, Guoliang Li, Lin Lin, Ming Li, Ran Li, Xiaobo Li, Yu Luo, Xianmin Meng, Jundan Nie, Zhaoxiang Qi, Yisheng Qiu, Li Shao, Hao Tian , et al. (7 additional authors not shown)

Abstract: Strong gravitational lensing is a powerful tool for investigating dark matter and dark energy properties. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and applied to… ▽ More Strong gravitational lensing is a powerful tool for investigating dark matter and dark energy properties. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and applied to cut-out-centered galaxies. However, according to the design and survey strategy of optical surveys by CSST, preparing cutouts with multiple bands requires considerable efforts. To overcome these challenges, we have developed a framework based on a hierarchical visual Transformer with a sliding window technique to search for strong lensing systems within entire images. Moreover, given that multi-color images of strong lensing systems can provide insights into their physical characteristics, our framework is specifically crafted to identify strong lensing systems in images with any number of channels. As evaluated using CSST mock data based on an Semi-Analytic Model named CosmoDC2, our framework achieves precision and recall rates of 0.98 and 0.90, respectively. To evaluate the effectiveness of our method in real observations, we have applied it to a subset of images from the DESI Legacy Imaging Surveys and media images from Euclid Early Release Observations. 61 new strong lensing system candidates are discovered by our method. However, we also identified false positives arising primarily from the simplified galaxy morphology assumptions within the simulation. This underscores the practical limitations of our approach while simultaneously highlighting potential avenues for future improvements. △ Less

Submitted 2 April, 2024; originally announced April 2024.

Comments: The paper is accepted by the AJ. The complete code could be downloaded with DOI of: 10.12149/101393. Comments are welcome

arXiv:2403.19589 [pdf, other]

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Authors: Bu Jin, Yupeng Zheng, Pengfei Li, Weize Li, Yuhang Zheng, Sujie Hu, Xinyu Liu, Jinwei Zhu, Zhijie Yan, Haiyang Sun, Kun Zhan, Peng Jia, Xiaoxiao Long, Yilun Chen, Hao Zhao

Abstract: 3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual… ▽ More 3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes. To this end, we introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR point cloud and a set of RGB images captured by the panoramic camera rig. The expected output is a set of object boxes with captions. To tackle this task, we propose the TOD3Cap network, which leverages the BEV representation to generate object box proposals and integrates Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. We also introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes. Notably, our TOD3Cap network can effectively localize and caption 3D objects in outdoor scenes, which outperforms baseline methods by a significant margin (+9.6 CiDEr@0.5IoU). Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap. △ Less

Submitted 5 June, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap

arXiv:2403.17297 [pdf, other]

InternLM2 Technical Report

Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang , et al. (75 additional authors not shown)

Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m… ▽ More The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k ``Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.12660 [pdf, other]

ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems

Authors: Pengyue Jia, Yejing Wang, Zhaocheng Du, Xiangyu Zhao, Yichao Wang, Bo Chen, Wanyu Wang, Huifeng Guo, Ruiming Tang

Abstract: Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core chal… ▽ More Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction. △ Less

Submitted 19 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: Accepted to KDD 2024

arXiv:2403.12398 [pdf, other]

Hierarchical Digital Twin for Efficient 6G Network Orchestration via Adaptive Attribute Selection and Scalable Network Modeling

Authors: Pengyi Jia, Xianbin Wang, Xuemin Shen

Abstract: Achieving a holistic and long-term understanding through accurate network modeling is essential for orchestrating future networks with increasing service diversity and infrastructure complexities. However, due to unselective data collection and uniform processing, traditional modeling approaches undermine the efficacy and timeliness of network orchestration. Additionally, temporal disparities aris… ▽ More Achieving a holistic and long-term understanding through accurate network modeling is essential for orchestrating future networks with increasing service diversity and infrastructure complexities. However, due to unselective data collection and uniform processing, traditional modeling approaches undermine the efficacy and timeliness of network orchestration. Additionally, temporal disparities arising from various modeling delays further impair the centralized decision-making with distributed models. In this paper, we propose a new hierarchical digital twin paradigm adapting to real-time network situations for problem-centered model construction. Specifically, we introduce an adaptive attribute selection mechanism that evaluates the distinct modeling values of diverse network attributes, considering their relevance to current network scenarios and inherent modeling complexity. By prioritizing critical attributes at higher layers, an efficient evaluation of network situations is achieved to identify target areas. Subsequently, scalable network modeling facilitates the inclusion of all identified elements at the lower layers, where more fine-grained digital twins are developed to generate targeted solutions for user association and power allocation. Furthermore, virtual-physical domain synchronization is implemented to maintain accurate temporal alignment between the digital twins and their physical counterparts, spanning from the construction to the utilization of the proposed paradigm. Extensive simulations validate the proposed approach, demonstrating its effectiveness in efficiently identifying pressing issues and delivering network orchestration solutions in complex 6G HetNets. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.10206 [pdf, other]

A Data-Driven Approach for Mitigating Dark Current Noise and Bad Pixels in Complementary Metal Oxide Semiconductor Cameras for Space-based Telescopes

Authors: Peng Jia, Chao Lv, Yushan Li, Yongyang Sun, Shu Niu, Zhuoxiao Wang

Abstract: In recent years, there has been a gradual increase in the performance of Complementary Metal Oxide Semiconductor (CMOS) cameras. These cameras have gained popularity as a viable alternative to charge-coupled device (CCD) cameras in a wide range of applications. One particular application is the CMOS camera installed in small space telescopes. However, the limited power and spatial resources availa… ▽ More In recent years, there has been a gradual increase in the performance of Complementary Metal Oxide Semiconductor (CMOS) cameras. These cameras have gained popularity as a viable alternative to charge-coupled device (CCD) cameras in a wide range of applications. One particular application is the CMOS camera installed in small space telescopes. However, the limited power and spatial resources available on satellites present challenges in maintaining ideal observation conditions, including temperature and radiation environment. Consequently, images captured by CMOS cameras are susceptible to issues such as dark current noise and defective pixels. In this paper, we introduce a data-driven framework for mitigating dark current noise and bad pixels for CMOS cameras. Our approach involves two key steps: pixel clustering and function fitting. During pixel clustering step, we identify and group pixels exhibiting similar dark current noise properties. Subsequently, in the function fitting step, we formulate functions that capture the relationship between dark current and temperature, as dictated by the Arrhenius law. Our framework leverages ground-based test data to establish distinct temperature-dark current relations for pixels within different clusters. The cluster results could then be utilized to estimate the dark current noise level and detect bad pixels from real observational data. To assess the effectiveness of our approach, we have conducted tests using real observation data obtained from the Yangwang-1 satellite, equipped with a near-ultraviolet telescope and an optical telescope. The results show a considerable improvement in the detection efficiency of space-based telescopes. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted by the AJ, comments are welcome. The complete code could be downloaded from: DOI: 10.12149/101387

arXiv:2402.17487 [pdf, other]

Bit Rate Matching Algorithm Optimization in JPEG-AI Verification Model

Authors: Panqi Jia, A. Burakhan Koyuncu, Jue Mao, Ze Cui, Yi Ma, Tiansheng Guo, Timofey Solovyev, Alexander Karabutov, Yin Zhao, Jing Wang, Elena Alshina, Andre Kaup

Abstract: The research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn the non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices over their classical counterparts. Those properties… ▽ More The research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn the non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices over their classical counterparts. Those properties evoked the attention of both scientific and industrial communities, resulting in the standardization activity JPEG-AI. The verification model for the standardization process of JPEG-AI is already in development and has surpassed the advanced VVC intra codec. To generate reconstructed images with the desired bits per pixel and assess the BD-rate performance of both the JPEG-AI verification model and VVC intra, bit rate matching is employed. However, the current state of the JPEG-AI verification model experiences significant slowdowns during bit rate matching, resulting in suboptimal performance due to an unsuitable model. The proposed methodology offers a gradual algorithmic optimization for matching bit rates, resulting in a fourfold acceleration and over 1% improvement in BD-rate at the base operation point. At the high operation point, the acceleration increases up to sixfold. △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted at (IEEE) PCS 2024; 6 pages

arXiv:2402.17470 [pdf, other]

Bit Distribution Study and Implementation of Spatial Quality Map in the JPEG-AI Standardization

Authors: Panqi Jia, Jue Mao, Esin Koyuncu, A. Burakhan Koyuncu, Timofey Solovyev, Alexander Karabutov, Yin Zhao, Elena Alshina, Andre Kaup

Abstract: Currently, there is a high demand for neural network-based image compression codecs. These codecs employ non-linear transforms to create compact bit representations and facilitate faster coding speeds on devices compared to the hand-crafted transforms used in classical frameworks. The scientific and industrial communities are highly interested in these properties, leading to the standardization ef… ▽ More Currently, there is a high demand for neural network-based image compression codecs. These codecs employ non-linear transforms to create compact bit representations and facilitate faster coding speeds on devices compared to the hand-crafted transforms used in classical frameworks. The scientific and industrial communities are highly interested in these properties, leading to the standardization effort of JPEG-AI. The JPEG-AI verification model has been released and is currently under development for standardization. Utilizing neural networks, it can outperform the classic codec VVC intra by over 10% BD-rate operating at base operation point. Researchers attribute this success to the flexible bit distribution in the spatial domain, in contrast to VVC intra's anchor that is generated with a constant quality point. However, our study reveals that VVC intra displays a more adaptable bit distribution structure through the implementation of various block sizes. As a result of our observations, we have proposed a spatial bit allocation method to optimize the JPEG-AI verification model's bit distribution and enhance the visual quality. Furthermore, by applying the VVC bit distribution strategy, the objective performance of JPEG-AI verification mode can be further improved, resulting in a maximum gain of 0.45 dB in PSNR-Y. △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: 5 pages, 3 figures, 4 tables

arXiv:2402.12289 [pdf, other]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Authors: Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

Abstract: A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scen… ▽ More A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments. △ Less

Submitted 25 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: Project Page: https://tsinghua-mars-lab.github.io/DriveVLM/

arXiv:2402.09325 [pdf, other]

PC-NeRF: Parent-Child Neural Radiance Fields Using Sparse LiDAR Frames in Autonomous Driving Environments

Authors: Xiuzhong Hu, Guangming Xiong, Zheng Zang, Peng Jia, Yuxuan Han, Junyi Ma

Abstract: Large-scale 3D scene reconstruction and novel view synthesis are vital for autonomous vehicles, especially utilizing temporally sparse LiDAR frames. However, conventional explicit representations remain a significant bottleneck towards representing the reconstructed and synthetic scenes at unlimited resolution. Although the recently developed neural radiance fields (NeRF) have shown compelling res… ▽ More Large-scale 3D scene reconstruction and novel view synthesis are vital for autonomous vehicles, especially utilizing temporally sparse LiDAR frames. However, conventional explicit representations remain a significant bottleneck towards representing the reconstructed and synthetic scenes at unlimited resolution. Although the recently developed neural radiance fields (NeRF) have shown compelling results in implicit representations, the problem of large-scale 3D scene reconstruction and novel view synthesis using sparse LiDAR frames remains unexplored. To bridge this gap, we propose a 3D scene reconstruction and novel view synthesis framework called parent-child neural radiance field (PC-NeRF). Based on its two modules, parent NeRF and child NeRF, the framework implements hierarchical spatial partitioning and multi-level scene representation, including scene, segment, and point levels. The multi-level scene representation enhances the efficient utilization of sparse LiDAR point cloud data and enables the rapid acquisition of an approximate volumetric scene representation. With extensive experiments, PC-NeRF is proven to achieve high-precision novel LiDAR view synthesis and 3D reconstruction in large-scale scenes. Moreover, PC-NeRF can effectively handle situations with sparse LiDAR frames and demonstrate high deployment efficiency with limited training epochs. Our approach implementation and the pre-trained models are available at https://github.com/biter0088/pc-nerf. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2310.00874

arXiv:2401.01065 [pdf, other]

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

Authors: Tao Tang, Dafeng Wei, Zhengyu Jia, Tian Gao, Changwei Cai, Chengkai Hou, Peng Jia, Kun Zhan, Haiyang Sun, Jingchen Fan, Yixing Zhao, Fu Liu, Xiaodan Liang, Xianpeng Lang, Yang Wang

Abstract: The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there comes a growing demand for retrieving data to provide specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of global feature representation and inadequate text retrieval ability for… ▽ More The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there comes a growing demand for retrieving data to provide specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of global feature representation and inadequate text retrieval ability for complex driving scenes. To address these issues, firstly, we propose the BEV-TSR framework which leverages descriptive text as an input to retrieve corresponding scenes in the Bird's Eye View (BEV) space. Then to facilitate complex scene retrieval with extensive text descriptions, we employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding. To achieve feature alignment between the BEV feature and language embedding, we propose Shared Cross-modal Embedding with a set of shared learnable embeddings to bridge the gap between these two modalities, and employ a caption generation task to further enhance the alignment. Furthermore, there lack of well-formed retrieval datasets for effective evaluation. To this end, we establish a multi-level retrieval dataset, nuScenes-Retrieval, based on the widely adopted nuScenes dataset. Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-text and text-to-scene retrieval respectively. Codes and datasets will be available. △ Less

Submitted 18 June, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

arXiv:2312.16108 [pdf, other]

LaneSegNet: Map Learning with Lane Segment Perception for Autonomous Driving

Authors: Tianyu Li, Peijin Jia, Bangjun Wang, Li Chen, Kun Jiang, Junchi Yan, Hongyang Li

Abstract: A map, as crucial information for downstream applications of an autonomous driving system, is usually represented in lanelines or centerlines. However, existing literature on map learning primarily focuses on either detecting geometry-based lanelines or perceiving topology relationships of centerlines. Both of these methods ignore the intrinsic relationship of lanelines and centerlines, that lanel… ▽ More A map, as crucial information for downstream applications of an autonomous driving system, is usually represented in lanelines or centerlines. However, existing literature on map learning primarily focuses on either detecting geometry-based lanelines or perceiving topology relationships of centerlines. Both of these methods ignore the intrinsic relationship of lanelines and centerlines, that lanelines bind centerlines. While simply predicting both types of lane in one model is mutually excluded in learning objective, we advocate lane segment as a new representation that seamlessly incorporates both geometry and topology information. Thus, we introduce LaneSegNet, the first end-to-end mapping network generating lane segments to obtain a complete representation of the road structure. Our algorithm features two key modifications. One is a lane attention module to capture pivotal region details within the long-range feature space. Another is an identical initialization strategy for reference points, which enhances the learning of positional priors for lane attention. On the OpenLane-V2 dataset, LaneSegNet outperforms previous counterparts by a substantial gain across three tasks, \textit{i.e.}, map element detection (+4.8 mAP), centerline perception (+6.9 DET$_l$), and the newly defined one, lane segment perception (+5.6 mAP). Furthermore, it obtains a real-time inference speed of 14.7 FPS. Code is accessible at https://github.com/OpenDriveLab/LaneSegNet. △ Less

Submitted 26 February, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

Comments: Accepted in ICLR 2024

arXiv:2312.15450 [pdf, other]

Agent4Ranking: Semantic Robust Ranking via Personalized Query Rewriting Using Multi-agent LLM

Authors: Xiaopeng Li, Lixin Su, Pengyue Jia, Xiangyu Zhao, Suqi Cheng, Junfeng Wang, Dawei Yin

Abstract: Search engines are crucial as they provide an efficient and easy way to access vast amounts of information on the internet for diverse information needs. User queries, even with a specific need, can differ significantly. Prior research has explored the resilience of ranking models against typical query variations like paraphrasing, misspellings, and order changes. Yet, these works overlook how div… ▽ More Search engines are crucial as they provide an efficient and easy way to access vast amounts of information on the internet for diverse information needs. User queries, even with a specific need, can differ significantly. Prior research has explored the resilience of ranking models against typical query variations like paraphrasing, misspellings, and order changes. Yet, these works overlook how diverse demographics uniquely formulate identical queries. For instance, older individuals tend to construct queries more naturally and in varied order compared to other groups. This demographic diversity necessitates enhancing the adaptability of ranking models to diverse query formulations. To this end, in this paper, we propose a framework that integrates a novel rewriting pipeline that rewrites queries from various demographic perspectives and a novel framework to enhance ranking robustness. To be specific, we use Chain of Thought (CoT) technology to utilize Large Language Models (LLMs) as agents to emulate various demographic profiles, then use them for efficient query rewriting, and we innovate a robust Multi-gate Mixture of Experts (MMoE) architecture coupled with a hybrid loss function, collectively strengthening the ranking models' robustness. Our extensive experimentation on both public and industrial datasets assesses the efficacy of our query rewriting approach and the enhanced accuracy and robustness of the ranking model. The findings highlight the sophistication and effectiveness of our proposed model. △ Less

Submitted 24 December, 2023; originally announced December 2023.

arXiv:2311.18214 [pdf, other]

Perception of Misalignment States for Sky Survey Telescopes with the Digital Twin and the Deep Neural Networks

Authors: Miao Zhang, Peng Jia, Zhengyang Li, Wennan Xiang, Jiameng Lv, Rui Sun

Abstract: Sky survey telescopes play a critical role in modern astronomy, but misalignment of their optical elements can introduce significant variations in point spread functions, leading to reduced data quality. To address this, we need a method to obtain misalignment states, aiding in the reconstruction of accurate point spread functions for data processing methods or facilitating adjustments of optical… ▽ More Sky survey telescopes play a critical role in modern astronomy, but misalignment of their optical elements can introduce significant variations in point spread functions, leading to reduced data quality. To address this, we need a method to obtain misalignment states, aiding in the reconstruction of accurate point spread functions for data processing methods or facilitating adjustments of optical components for improved image quality. Since sky survey telescopes consist of many optical elements, they result in a vast array of potential misalignment states, some of which are intricately coupled, posing detection challenges. However, by continuously adjusting the misalignment states of optical elements, we can disentangle coupled states. Based on this principle, we propose a deep neural network to extract misalignment states from continuously varying point spread functions in different field of views. To ensure sufficient and diverse training data, we recommend employing a digital twin to obtain data for neural network training. Additionally, we introduce the state graph to store misalignment data and explore complex relationships between misalignment states and corresponding point spread functions, guiding the generation of training data from experiments. Once trained, the neural network estimates misalignment states from observation data, regardless of the impacts caused by atmospheric turbulence, noise, and limited spatial sampling rates in the detector. The method proposed in this paper could be used to provide prior information for the active optics system and the optical system alignment. △ Less

Submitted 29 November, 2023; originally announced November 2023.

Comments: The aforementioned submission has been accepted by Optics Express. We kindly request any feedback or comments to be directed to the corresponding author, Peng Jia (robinmartin20@gmail.com), or the second corresponding author, Zhengyang Li (lizy@niaot.ac.cn). Please note that Zhengyang is currently stationed in the South Antarctica and will not be available until after February 1st, 2024

arXiv:2311.16974 [pdf, other]

COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design

Authors: Peidong Jia, Chenxuan Li, Yuhui Yuan, Zeyu Liu, Yichao Shen, Bohan Chen, Xingru Chen, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo

Abstract: Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands design-oriented planning, reasoning, and layer-wise generation. Unlike the recent CanvaGPT, which integrates GPT-4 with existing design templates to build a custom GPT, this paper introduces the COLE system - a hierarchical generation framework designed… ▽ More Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands design-oriented planning, reasoning, and layer-wise generation. Unlike the recent CanvaGPT, which integrates GPT-4 with existing design templates to build a custom GPT, this paper introduces the COLE system - a hierarchical generation framework designed to comprehensively address these challenges. This COLE system can transform a vague intention prompt into a high-quality multi-layered graphic design, while also supporting flexible editing based on user input. Examples of such input might include directives like ``design a poster for Hisaishi's concert.'' The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text. Furthermore, we construct the DESIGNINTENTION benchmark to demonstrate the superiority of our COLE system over existing methods in generating high-quality graphic designs from user intent. Last, we present a Canva-like multi-layered image editing tool to support flexible editing of the generated multi-layered graphic design images. We perceive our COLE system as an important step towards addressing more complex and multi-layered graphic design generation tasks in the future. △ Less

Submitted 18 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: Technical report. Project page: https://graphic-design-generation-github-io.vercel.app/

arXiv:2311.15643 [pdf, other]

A Survey on Monocular Re-Localization: From the Perspective of Scene Map Representation

Authors: Jinyu Miao, Kun Jiang, Tuopu Wen, Yunlong Wang, Peijing Jia, Xuhe Zhao, Qian Cheng, Zhongyang Xiao, Jin Huang, Zhihua Zhong, Diange Yang

Abstract: Monocular Re-Localization (MRL) is a critical component in autonomous applications, estimating 6 degree-of-freedom ego poses w.r.t. the scene map based on monocular images. In recent decades, significant progress has been made in the development of MRL techniques. Numerous algorithms have accomplished extraordinary success in terms of localization accuracy and robustness. In MRL, scene maps are re… ▽ More Monocular Re-Localization (MRL) is a critical component in autonomous applications, estimating 6 degree-of-freedom ego poses w.r.t. the scene map based on monocular images. In recent decades, significant progress has been made in the development of MRL techniques. Numerous algorithms have accomplished extraordinary success in terms of localization accuracy and robustness. In MRL, scene maps are represented in various forms, and they determine how MRL methods work and how MRL methods perform. However, to the best of our knowledge, existing surveys do not provide systematic reviews about the relationship between MRL solutions and their used scene map representation. This survey fills the gap by comprehensively reviewing MRL methods from such a perspective, promoting further research. 1) We commence by delving into the problem definition of MRL, exploring current challenges, and comparing ours with existing surveys. 2) Many well-known MRL methods are categorized and reviewed into five classes according to the representation forms of utilized map, i.e., geo-tagged frames, visual landmarks, point clouds, vectorized semantic map, and neural network-based map. 3) To quantitatively and fairly compare MRL methods with various map, we introduce some public datasets and provide the performances of some state-of-the-art MRL methods. The strengths and weakness of MRL methods with different map are analyzed. 4) We finally introduce some topics of interest in this field and give personal opinions. This survey can serve as a valuable referenced materials for MRL, and a continuously updated summary of this survey is publicly available to the community at: https://github.com/jinyummiao/map-in-mono-reloc. △ Less

Submitted 12 January, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: 33 pages, 10 tables, 16 figures, under review

arXiv:2311.10380 [pdf, other]

MSE-Nets: Multi-annotated Semi-supervised Ensemble Networks for Improving Segmentation of Medical Image with Ambiguous Boundaries

Authors: Shuai Wang, Tengjin Weng, Jingyi Wang, Yang Shen, Zhidong Zhao, Yixiu Liu, Pengfei Jiao, Zhiming Cheng, Yaqi Wang

Abstract: Medical image segmentation annotations exhibit variations among experts due to the ambiguous boundaries of segmented objects and backgrounds in medical images. Although using multiple annotations for each image in the fully-supervised has been extensively studied for training deep models, obtaining a large amount of multi-annotated data is challenging due to the substantial time and manpower costs… ▽ More Medical image segmentation annotations exhibit variations among experts due to the ambiguous boundaries of segmented objects and backgrounds in medical images. Although using multiple annotations for each image in the fully-supervised has been extensively studied for training deep models, obtaining a large amount of multi-annotated data is challenging due to the substantial time and manpower costs required for segmentation annotations, resulting in most images lacking any annotations. To address this, we propose Multi-annotated Semi-supervised Ensemble Networks (MSE-Nets) for learning segmentation from limited multi-annotated and abundant unannotated data. Specifically, we introduce the Network Pairwise Consistency Enhancement (NPCE) module and Multi-Network Pseudo Supervised (MNPS) module to enhance MSE-Nets for the segmentation task by considering two major factors: (1) to optimize the utilization of all accessible multi-annotated data, the NPCE separates (dis)agreement annotations of multi-annotated data at the pixel level and handles agreement and disagreement annotations in different ways, (2) to mitigate the introduction of imprecise pseudo-labels, the MNPS extends the training data by leveraging consistent pseudo-labels from unannotated data. Finally, we improve confidence calibration by averaging the predictions of base networks. Experiments on the ISIC dataset show that we reduced the demand for multi-annotated data by 97.75\% and narrowed the gap with the best fully-supervised baseline to just a Jaccard index of 4\%. Furthermore, compared to other semi-supervised methods that rely only on a single annotation or a combined fusion approach, the comprehensive experimental results on ISIC and RIGA datasets demonstrate the superior performance of our proposed method in medical image segmentation with ambiguous boundaries. △ Less

Submitted 17 November, 2023; originally announced November 2023.

arXiv:2311.03897 [pdf, other]

doi 10.1007/978-3-031-43415-0_40

Temporal Graph Representation Learning with Adaptive Augmentation Contrastive

Authors: Hongjiang Chen, Pengfei Jiao, Huijun Tang, Huaming Wu

Abstract: Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks often focus on capturing fine-grained information, which may lead to the model capturing random noise instead of essential semantic information. While graph contr… ▽ More Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks often focus on capturing fine-grained information, which may lead to the model capturing random noise instead of essential semantic information. While graph contrastive learning has shown promise in dealing with noise, it only applies to static graphs or snapshots and may not be suitable for handling time-dependent noise. To alleviate the above challenge, we propose a novel Temporal Graph representation learning with Adaptive augmentation Contrastive (TGAC) model. The adaptive augmentation on the temporal graph is made by combining prior knowledge with temporal information, and the contrastive objective function is constructed by defining the augmented inter-view contrast and intra-view contrast. To complement TGAC, we propose three adaptive augmentation strategies that modify topological features to reduce noise from the network. Our extensive experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Journal ref: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 683-699. 2023

arXiv:2311.00186 [pdf, other]

Image Restoration with Point Spread Function Regularization and Active Learning

Authors: Peng Jia, Jiameng Lv, Runyu Ning, Yu Song, Nan Li, Kaifan Ji, Chenzhou Cui, Shanshan Li

Abstract: Large-scale astronomical surveys can capture numerous images of celestial objects, including galaxies and nebulae. Analysing and processing these images can reveal intricate internal structures of these objects, allowing researchers to conduct comprehensive studies on their morphology, evolution, and physical properties. However, varying noise levels and point spread functions can hamper the accur… ▽ More Large-scale astronomical surveys can capture numerous images of celestial objects, including galaxies and nebulae. Analysing and processing these images can reveal intricate internal structures of these objects, allowing researchers to conduct comprehensive studies on their morphology, evolution, and physical properties. However, varying noise levels and point spread functions can hamper the accuracy and efficiency of information extraction from these images. To mitigate these effects, we propose a novel image restoration algorithm that connects a deep learning-based restoration algorithm with a high-fidelity telescope simulator. During the training stage, the simulator generates images with different levels of blur and noise to train the neural network based on the quality of restored images. After training, the neural network can directly restore images obtained by the telescope, as represented by the simulator. We have tested the algorithm using real and simulated observation data and have found that it effectively enhances fine structures in blurry images and increases the quality of observation images. This algorithm can be applied to large-scale sky survey data, such as data obtained by LSST, Euclid, and CSST, to further improve the accuracy and efficiency of information extraction, promoting advances in the field of astronomical research. △ Less

Submitted 31 October, 2023; originally announced November 2023.

Comments: To be published in the MNRAS

arXiv:2310.19056 [pdf, other]

MILL: Mutual Verification with Large Language Models for Zero-Shot Query Expansion

Authors: Pengyue Jia, Yiding Liu, Xiangyu Zhao, Xiaopeng Li, Changying Hao, Shuaiqiang Wang, Dawei Yin

Abstract: Query expansion, pivotal in search engines, enhances the representation of user information needs with additional terms. While existing methods expand queries using retrieved or generated contextual documents, each approach has notable limitations. Retrieval-based methods often fail to accurately capture search intent, particularly with brief or ambiguous queries. Generation-based methods, utilizi… ▽ More Query expansion, pivotal in search engines, enhances the representation of user information needs with additional terms. While existing methods expand queries using retrieved or generated contextual documents, each approach has notable limitations. Retrieval-based methods often fail to accurately capture search intent, particularly with brief or ambiguous queries. Generation-based methods, utilizing large language models (LLMs), generally lack corpus-specific knowledge and entail high fine-tuning costs. To address these gaps, we propose a novel zero-shot query expansion framework utilizing LLMs for mutual verification. Specifically, we first design a query-query-document generation method, leveraging LLMs' zero-shot reasoning ability to produce diverse sub-queries and corresponding documents. Then, a mutual verification process synergizes generated and retrieved documents for optimal expansion. Our proposed method is fully zero-shot, and extensive experiments on three public benchmark datasets are conducted to demonstrate its effectiveness over existing methods. Our code is available online at https://github.com/Applied-Machine-Learning-Lab/MILL to ease reproduction. △ Less

Submitted 28 March, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

Comments: Accepted to NAACL 2024

arXiv:2310.00874 [pdf, other]

PC-NeRF: Parent-Child Neural Radiance Fields under Partial Sensor Data Loss in Autonomous Driving Environments

Authors: Xiuzhong Hu, Guangming Xiong, Zheng Zang, Peng Jia, Yuxuan Han, Junyi Ma

Abstract: Reconstructing large-scale 3D scenes is essential for autonomous vehicles, especially when partial sensor data is lost. Although the recently developed neural radiance fields (NeRF) have shown compelling results in implicit representations, the large-scale 3D scene reconstruction using partially lost LiDAR point cloud data still needs to be explored. To bridge this gap, we propose a novel 3D scene… ▽ More Reconstructing large-scale 3D scenes is essential for autonomous vehicles, especially when partial sensor data is lost. Although the recently developed neural radiance fields (NeRF) have shown compelling results in implicit representations, the large-scale 3D scene reconstruction using partially lost LiDAR point cloud data still needs to be explored. To bridge this gap, we propose a novel 3D scene reconstruction framework called parent-child neural radiance field (PC-NeRF). The framework comprises two modules, the parent NeRF and the child NeRF, to simultaneously optimize scene-level, segment-level, and point-level scene representations. Sensor data can be utilized more efficiently by leveraging the segment-level representation capabilities of child NeRFs, and an approximate volumetric representation of the scene can be quickly obtained even with limited observations. With extensive experiments, our proposed PC-NeRF is proven to achieve high-precision 3D reconstruction in large-scale scenes. Moreover, PC-NeRF can effectively tackle situations where partial sensor data is lost and has high deployment efficiency with limited training time. Our approach implementation and the pre-trained models will be available at https://github.com/biter0088/pc-nerf. △ Less

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2309.03539 [pdf, other]

YOLO series target detection algorithms for underwater environments

Authors: Chenjie Zhang, Pengcheng Jiao

Abstract: You Only Look Once (YOLO) algorithm is a representative target detection algorithm emerging in 2016, which is known for its balance of computing speed and accuracy, and now plays an important role in various fields of human production and life. However, there are still many limitations in the application of YOLO algorithm in underwater environments due to problems such as dim light and turbid wate… ▽ More You Only Look Once (YOLO) algorithm is a representative target detection algorithm emerging in 2016, which is known for its balance of computing speed and accuracy, and now plays an important role in various fields of human production and life. However, there are still many limitations in the application of YOLO algorithm in underwater environments due to problems such as dim light and turbid water. With limited land area resources, the ocean must have great potential for future human development. In this paper, starting from the actual needs of marine engineering applications, taking underwater structural health monitoring (SHM) and underwater biological detection as examples, we propose improved methods for the application of underwater YOLO algorithms, and point out the problems that still exist. △ Less

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2307.16289 [pdf]

Implementing Edge Based Object Detection For Microplastic Debris

Authors: Amardeep Singh, Prof. Charles Jia, Prof. Donald Kirk

Abstract: Plastic has imbibed itself as an indispensable part of our day to day activities, becoming a source of problems due to its non-biodegradable nature and cheaper production prices. With these problems, comes the challenge of mitigating and responding to the aftereffects of disposal or the lack of proper disposal which leads to waste concentrating in locations and disturbing ecosystems for both plant… ▽ More Plastic has imbibed itself as an indispensable part of our day to day activities, becoming a source of problems due to its non-biodegradable nature and cheaper production prices. With these problems, comes the challenge of mitigating and responding to the aftereffects of disposal or the lack of proper disposal which leads to waste concentrating in locations and disturbing ecosystems for both plants and animals. As plastic debris levels continue to rise with the accumulation of waste in garbage patches in landfills and more hazardously in natural water bodies, swift action is necessary to plug or cease this flow. While manual sorting operations and detection can offer a solution, they can be augmented using highly advanced computer imagery linked with robotic appendages for removing wastes. The primary application of focus in this report are the much-discussed Computer Vision and Open Vision which have gained novelty for their light dependence on internet and ability to relay information in remote areas. These applications can be applied to the creation of edge-based mobility devices that can as a counter to the growing problem of plastic debris in oceans and rivers, demanding little connectivity and still offering the same results with reasonably timed maintenance. The principal findings of this project cover the various methods that were tested and deployed to detect waste in images, as well as comparing them against different waste types. The project has been able to produce workable models that can perform on time detection of sampled images using an augmented CNN approach. Latter portions of the project have also achieved a better interpretation of the necessary preprocessing steps required to arrive at the best accuracies, including the best hardware for expanding waste detection studies to larger environments. △ Less

Submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.00313 [pdf, other]

PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers

Authors: Peidong Jia, Jiaming Liu, Senqiao Yang, Jiarui Wu, Xiaodong Xie, Shanghang Zhang

Abstract: The Transformer-based detectors (i.e., DETR) have demonstrated impressive performance on end-to-end object detection. However, transferring DETR to different data distributions may lead to a significant performance degradation. Existing adaptation techniques focus on model-based approaches, which aim to leverage feature alignment to narrow the distribution shift between different domains. In this… ▽ More The Transformer-based detectors (i.e., DETR) have demonstrated impressive performance on end-to-end object detection. However, transferring DETR to different data distributions may lead to a significant performance degradation. Existing adaptation techniques focus on model-based approaches, which aim to leverage feature alignment to narrow the distribution shift between different domains. In this study, we propose a hierarchical Prompt Domain Memory (PDM) for adapting detection transformers to different distributions. PDM comprehensively leverages the prompt memory to extract domain-specific knowledge and explicitly constructs a long-term memory space for the data distribution, which represents better domain diversity compared to existing methods. Specifically, each prompt and its corresponding distribution value are paired in the memory space, and we inject top M distribution-similar prompts into the input and multi-level embeddings of DETR. Additionally, we introduce the Prompt Memory Alignment (PMA) to reduce the discrepancy between the source and target domains by fully leveraging the domain-specific knowledge extracted from the prompt domain memory. Extensive experiments demonstrate that our method outperforms state-of-the-art domain adaptive object detection methods on three benchmarks, including scene, synthetic to real, and weather adaptation. Codes will be released. △ Less

Submitted 1 July, 2023; originally announced July 2023.

Comments: cs.cv

MSC Class: 68T07 ACM Class: I.5.1

arXiv:2306.14287 [pdf, other]

Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression

Authors: A. Burakhan Koyuncu, Panqi Jia, Atanas Boev, Elena Alshina, Eckehard Steinbach

Abstract: Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, however, at the expense of a significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer) - a computationally efficient transformer-based autor… ▽ More Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, however, at the expense of a significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer) - a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache the previous attention computations, drastically reducing the model and runtime complexity. Compared to the non-parallel approach, our proposal has ~145x lower model complexity and ~210x faster decoding speed, and achieves higher average bit savings on Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models. △ Less

Submitted 27 February, 2024; v1 submitted 25 June, 2023; originally announced June 2023.

Comments: Accepted for IEEE TCSVT (14 pages, 10 figures, 9 tables)

arXiv:2306.04344 [pdf, other]

ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation

Authors: Jiaming Liu, Senqiao Yang, Peidong Jia, Renrui Zhang, Ming Lu, Yandong Guo, Wei Xue, Shanghang Zhang

Abstract: Since real-world machine systems are running in non-stationary environments, Continual Test-Time Adaptation (CTTA) task is proposed to adapt the pre-trained model to continually changing target domains. Recently, existing methods mainly focus on model-based adaptation, which aims to leverage a self-training manner to extract the target domain knowledge. However, pseudo labels can be noisy and the… ▽ More Since real-world machine systems are running in non-stationary environments, Continual Test-Time Adaptation (CTTA) task is proposed to adapt the pre-trained model to continually changing target domains. Recently, existing methods mainly focus on model-based adaptation, which aims to leverage a self-training manner to extract the target domain knowledge. However, pseudo labels can be noisy and the updated model parameters are unreliable under dynamic data distributions, leading to error accumulation and catastrophic forgetting in the continual adaptation process. To tackle these challenges and maintain the model plasticity, we design a Visual Domain Adapter (ViDA) for CTTA, explicitly handling both domain-specific and domain-shared knowledge. Specifically, we first comprehensively explore the different domain representations of the adapters with trainable high-rank or low-rank embedding spaces. Then we inject ViDAs into the pre-trained model, which leverages high-rank and low-rank features to adapt the current domain distribution and maintain the continual domain-shared knowledge, respectively. To exploit the low-rank and high-rank ViDAs more effectively, we further propose a Homeostatic Knowledge Allotment (HKA) strategy, which adaptively combines different knowledge from each ViDA. Extensive experiments conducted on four widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance in both classification and segmentation CTTA tasks. Note that, our method can be regarded as a novel transfer paradigm for large-scale models, delivering promising results in adaptation to continually changing distributions. Project page: https://sites.google.com/view/iclr2024-vida/home. △ Less

Submitted 27 March, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted by ICLR2024

arXiv:2304.10440 [pdf, other]

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

Authors: Huijie Wang, Tianyu Li, Yang Li, Li Chen, Chonghao Sima, Zhenbo Liu, Bangjun Wang, Peijin Jia, Yuting Wang, Shengyin Jiang, Feng Wen, Hang Xu, Ping Luo, Junchi Yan, Wei Zhang, Hongyang Li

Abstract: Accurately depicting the complex traffic scene is a vital component for autonomous vehicles to execute correct judgments. However, existing benchmarks tend to oversimplify the scene by solely focusing on lane perception tasks. Observing that human drivers rely on both lanes and traffic signals to operate their vehicles safely, we present OpenLane-V2, the first dataset on topology reasoning for tra… ▽ More Accurately depicting the complex traffic scene is a vital component for autonomous vehicles to execute correct judgments. However, existing benchmarks tend to oversimplify the scene by solely focusing on lane perception tasks. Observing that human drivers rely on both lanes and traffic signals to operate their vehicles safely, we present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure. The objective of the presented dataset is to advance research in understanding the structure of road scenes by examining the relationship between perceived entities, such as traffic elements and lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes. It comprises three primary sub-tasks, including the 3D lane detection inherited from OpenLane, accompanied by corresponding metrics to evaluate the model's performance. We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes. △ Less

Submitted 28 October, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

Comments: Accepted by NeurIPS 2023 Track on Datasets and Benchmarks | OpenLane-V2 Dataset: https://github.com/OpenDriveLab/OpenLane-V2

arXiv:2302.12647 [pdf]

doi 10.1109/RBME.2022.3210015

Bioinspired soft robotics: How do we learn from creatures?

Authors: Yang Yang, Zhiguo He, Pengcheng Jiao, Hongliang Ren

Abstract: Soft robotics has opened a unique path to flexibility and environmental adaptability, learning from nature and reproducing biological behaviors. Nature implies answers for how to apply robots to real life. To find out how we learn from creatures to design and apply soft robots, in this Review, we propose a classification method to summarize soft robots based on different functions of biological sy… ▽ More Soft robotics has opened a unique path to flexibility and environmental adaptability, learning from nature and reproducing biological behaviors. Nature implies answers for how to apply robots to real life. To find out how we learn from creatures to design and apply soft robots, in this Review, we propose a classification method to summarize soft robots based on different functions of biological systems: self-growing, self-healing, self-responsive, and self-circulatory. The bio-function based classification logic is presented to explain why we learn from creatures. State-of-art technologies, characteristics, pros, cons, challenges, and potential applications of these categories are analyzed to illustrate what we learned from creatures. By intersecting these categories, the existing and potential bio-inspired applications are overviewed and outlooked to finally find the answer, that is, how we learn from creatures. △ Less

Submitted 24 February, 2023; originally announced February 2023.

Comments: 15 pages, 6 figures, journal article IEEE Reviews in biomedical engineering(2022), Early Access

arXiv:2302.11747 [pdf, other]

Amos-SLAM: An Anti-Dynamics Two-stage SLAM Approach

Authors: Yaoming Zhuang, Pengrun Jia, Zheng Liu, Li Li, Chengdong Wu, Wei cui, Zhanlin Liu

Abstract: The traditional Simultaneous Localization And Mapping (SLAM) systems rely on the assumption of a static environment and fail to accurately estimate the system's location when dynamic objects are present in the background. While learning-based dynamic SLAM systems have difficulties in handling unknown moving objects, geometry-based methods have limited success in addressing the residual effects of… ▽ More The traditional Simultaneous Localization And Mapping (SLAM) systems rely on the assumption of a static environment and fail to accurately estimate the system's location when dynamic objects are present in the background. While learning-based dynamic SLAM systems have difficulties in handling unknown moving objects, geometry-based methods have limited success in addressing the residual effects of unidentified dynamic objects on location estimation. To address these issues, we propose an anti-dynamics two-stage SLAM approach. Firstly, the potential motion regions of both prior and non-prior dynamic objects are extracted and pose estimates for dynamic discrimination are quickly obtained using optical flow tracking and model generation methods. Secondly, dynamic points in each frame are removed through dynamic judgment. For non-prior dynamic objects, we present a approach that uses super-pixel extraction and geometric clustering to determine the potential motion regions based on color and geometric information in the image. Evaluations on multiple low and high dynamic sequences in a public RGB-D dataset show that our proposed method outperforms state-of-the-art dynamic SLAM methods. △ Less

Submitted 22 February, 2023; originally announced February 2023.

arXiv:2302.08217 [pdf]

doi 10.1089/soro.2021.0158

Copebot: Underwater soft robot with copepod-like locomotion

Authors: Zhiguo He, Yang Yang, Pengcheng Jiao, Haipeng Wang, Guanzheng Lin, Thomas Pähtz

Abstract: It has been a great challenge to develop robots that are able to perform complex movement patterns with high speed and, simultaneously, high accuracy. Copepods are animals found in freshwater and saltwater habitats that can have extremely fast escape responses when a predator is sensed by performing explosive curved jumps. Here, we present a design and build prototypes of a combustion-driven under… ▽ More It has been a great challenge to develop robots that are able to perform complex movement patterns with high speed and, simultaneously, high accuracy. Copepods are animals found in freshwater and saltwater habitats that can have extremely fast escape responses when a predator is sensed by performing explosive curved jumps. Here, we present a design and build prototypes of a combustion-driven underwater soft robot, the "copebot", that, like copepods, is able to accurately reach nearby predefined locations in space within a single curved jump. Because of an improved thrust force transmission unit, causing a large initial acceleration peak (850 Bodylength*s-2), the copebot is 8 times faster than previous combustion-driven underwater soft robots, whilst able to perform a complete 360° rotation during the jump. Thrusts generated by the copebot are tested to quantitatively determine the actuation performance, and parametric studies are conducted to investigate the sensitivities of the input parameters to the kinematic performance of the copebot. We demonstrate the utility of our design by building a prototype that rapidly jumps out of the water, accurately lands on its feet on a small platform, wirelessly transmits data, and jumps back into the water. Our copebot design opens the way toward high-performance biomimetic robots for multifunctional applications. △ Less

Submitted 29 April, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Journal ref: Soft Robotics 10 (2), 314-325 (2023)

arXiv:2212.05497 [pdf]

doi 10.3847/1538-4365/acab02

Target Detection Framework for Lobster Eye X-Ray Telescopes with Machine Learning Algorithms

Authors: Peng Jia, Wenbo Liu, Yuan Liu, Haiwu Pan

Abstract: Lobster eye telescopes are ideal monitors to detect X-ray transients, because they could observe celestial objects over a wide field of view in X-ray band. However, images obtained by lobster eye telescopes are modified by their unique point spread functions, making it hard to design a high efficiency target detection algorithm. In this paper, we integrate several machine learning algorithms to bu… ▽ More Lobster eye telescopes are ideal monitors to detect X-ray transients, because they could observe celestial objects over a wide field of view in X-ray band. However, images obtained by lobster eye telescopes are modified by their unique point spread functions, making it hard to design a high efficiency target detection algorithm. In this paper, we integrate several machine learning algorithms to build a target detection framework for data obtained by lobster eye telescopes. Our framework would firstly generate two 2D images with different pixel scales according to positions of photons on the detector. Then an algorithm based on morphological operations and two neural networks would be used to detect candidates of celestial objects with different flux from these 2D images. At last, a random forest algorithm will be used to pick up final detection results from candidates obtained by previous steps. Tested with simulated data of the Wide-field X-ray Telescope onboard the Einstein Probe, our detection framework could achieve over 94% purity and over 90% completeness for targets with flux more than 3 mCrab (9.6 * 10-11 erg/cm2/s) and more than 94% purity and moderate completeness for targets with lower flux at acceptable time cost. The framework proposed in this paper could be used as references for data processing methods developed for other lobster eye X-ray telescopes. △ Less

Submitted 11 December, 2022; originally announced December 2022.

Comments: Accepted by the APJS Journal. Full source code could be downloaded from the China VO with DOI of https://doi.org/10.12149/101175. Docker version of the code could be obtained under request to the corresponding author

arXiv:2211.05972 [pdf, other]

doi 10.3847/1538-3881/aca1c2

Detection of Strongly Lensed Arcs in Galaxy Clusters with Transformers

Authors: Peng Jia, Ruiqi Sun, Nan Li, Yu Song, Runyu Ning, Hongyan Wei, Rui Luo

Abstract: Strong lensing in galaxy clusters probes properties of dense cores of dark matter halos in mass, studies the distant universe at flux levels and spatial resolutions otherwise unavailable, and constrains cosmological models independently. The next-generation large scale sky imaging surveys are expected to discover thousands of cluster-scale strong lenses, which would lead to unprecedented opportuni… ▽ More Strong lensing in galaxy clusters probes properties of dense cores of dark matter halos in mass, studies the distant universe at flux levels and spatial resolutions otherwise unavailable, and constrains cosmological models independently. The next-generation large scale sky imaging surveys are expected to discover thousands of cluster-scale strong lenses, which would lead to unprecedented opportunities for applying cluster-scale strong lenses to solve astrophysical and cosmological problems. However, the large dataset challenges astronomers to identify and extract strong lensing signals, particularly strongly lensed arcs, because of their complexity and variety. Hence, we propose a framework to detect cluster-scale strongly lensed arcs, which contains a transformer-based detection algorithm and an image simulation algorithm. We embed prior information of strongly lensed arcs at cluster-scale into the training data through simulation and then train the detection algorithm with simulated images. We use the trained transformer to detect strongly lensed arcs from simulated and real data. Results show that our approach could achieve 99.63 % accuracy rate, 90.32 % recall rate, 85.37 % precision rate and 0.23 % false positive rate in detection of strongly lensed arcs from simulated images and could detect almost all strongly lensed arcs in real observation images. Besides, with an interpretation method, we have shown that our method could identify important information embedded in simulated data. Next step, to test the reliability and usability of our approach, we will apply it to available observations (e.g., DESI Legacy Imaging Surveys) and simulated data of upcoming large-scale sky surveys, such as the Euclid and the CSST. △ Less

Submitted 10 November, 2022; originally announced November 2022.

Comments: Submitted to the Astronomical Journal, source code could be obtained from PaperData sponsored by China-VO group with DOI of 10.12149/101172. Cloud computing resources would be released under request

arXiv:2210.07482 [pdf, other]

Cargo Ecosystem Dependency-Vulnerability Knowledge Graph Construction and Vulnerability Propagation Study

Authors: Peiyang Jia, Chengwei Liu, Hongyu Sun, Chengyi Sun, Mianxue Gu, Yang Liu, Yuqing Zhang

Abstract: Currently, little is known about the structure of the Cargo ecosystem and the potential for vulnerability propagation. Many empirical studies generalize third-party dependency governance strategies from a single software ecosystem to other ecosystems but ignore the differences in the technical structures of different software ecosystems, making it difficult to directly generalize security governan… ▽ More Currently, little is known about the structure of the Cargo ecosystem and the potential for vulnerability propagation. Many empirical studies generalize third-party dependency governance strategies from a single software ecosystem to other ecosystems but ignore the differences in the technical structures of different software ecosystems, making it difficult to directly generalize security governance strategies from other ecosystems to the Cargo ecosystem. To fill the gap in this area, this paper constructs a knowledge graph of dependency vulnerabilities for the Cargo ecosystem using techniques related to knowledge graphs to address this challenge. This paper is the first large-scale empirical study in a related research area to address vulnerability propagation in the Cargo ecosystem. This paper proposes a dependency-vulnerability knowledge graph parsing algorithm to determine the vulnerability propagation path and propagation range and empirically studies the characteristics of vulnerabilities in the Cargo ecosystem, the propagation range, and the factors that cause vulnerability propagation. Our research has found that the Cargo ecosystem's security vulnerabilities are primarily memory-related. 18% of the libraries affected by the vulnerability is still affected by the vulnerability in the latest version of the library. The number of versions affected by the propagation of the vulnerabilities is 19.78% in the entire Cargo ecosystem. This paper looks at the characteristics and propagation factors triggering vulnerabilities in the Cargo ecosystem. It provides some practical resolution strategies for administrators of the Cargo community, developers who use Cargo to manage third-party libraries, and library owners. This paper provides new ideas for improving the overall security of the Cargo ecosystem. △ Less

Submitted 13 October, 2022; originally announced October 2022.

arXiv:2205.10199 [pdf, other]

A Novel Underwater Image Enhancement and Improved Underwater Biological Detection Pipeline

Authors: Zheng Liu, Yaoming Zhuang, Pengrun Jia, Chengdong Wu, Hongli Xu ang Zhanlin Liu

Abstract: For aquaculture resource evaluation and ecological environment monitoring, automatic detection and identification of marine organisms is critical. However, due to the low quality of underwater images and the characteristics of underwater biological, a lack of abundant features may impede traditional hand-designed feature extraction approaches or CNN-based object detection algorithms, particularly… ▽ More For aquaculture resource evaluation and ecological environment monitoring, automatic detection and identification of marine organisms is critical. However, due to the low quality of underwater images and the characteristics of underwater biological, a lack of abundant features may impede traditional hand-designed feature extraction approaches or CNN-based object detection algorithms, particularly in complex underwater environment. Therefore, the goal of this paper is to perform object detection in the underwater environment. This paper proposed a novel method for capturing feature information, which adds the convolutional block attention module (CBAM) to the YOLOv5 backbone. The interference of underwater creature characteristics on object characteristics is decreased, and the output of the backbone network to object information is enhanced. In addition, the self-adaptive global histogram stretching algorithm (SAGHS) is designed to eliminate the degradation problems such as low contrast and color loss caused by underwater environmental information to better restore image quality. Extensive experiments and comprehensive evaluation on the URPC2021 benchmark dataset demonstrate the effectiveness and adaptivity of our methods. Beyond that, this paper conducts an exhaustive analysis of the role of training data on performance. △ Less

Submitted 20 May, 2022; originally announced May 2022.

Comments: 14 pages,14 figures

arXiv:2202.06257 [pdf]

Fine-Grained Population Mobility Data-Based Community-Level COVID-19 Prediction Model

Authors: Pengyue Jia, Ling Chen, Dandan Lyu

Abstract: Predicting the number of infections in the anti-epidemic process is extremely beneficial to the government in developing anti-epidemic strategies, especially in fine-grained geographic units. Previous works focus on low spatial resolution prediction, e.g., county-level, and preprocess data to the same geographic level, which loses some useful information. In this paper, we propose a fine-grained p… ▽ More Predicting the number of infections in the anti-epidemic process is extremely beneficial to the government in developing anti-epidemic strategies, especially in fine-grained geographic units. Previous works focus on low spatial resolution prediction, e.g., county-level, and preprocess data to the same geographic level, which loses some useful information. In this paper, we propose a fine-grained population mobility data-based model (FGC-COVID) utilizing data of two geographic levels for community-level COVID-19 prediction. We use the population mobility data between Census Block Groups (CBGs), which is a finer-grained geographic level than community, to build the graph and capture the dependencies between CBGs using graph neural networks (GNNs). To mine as finer-grained patterns as possible for prediction, a spatial weighted aggregation module is introduced to aggregate the embeddings of CBGs to community level based on their geographic affiliation and spatial autocorrelation. Extensive experiments on 300 days LA city COVID-19 data indicate our model outperforms existing forecasting models on community-level COVID-19 prediction. △ Less

Submitted 15 July, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: Accepted by Cybernetics and Systems

arXiv:2201.06972 [pdf, other]

Representation Learning on Heterostructures via Heterogeneous Anonymous Walks

Authors: Xuan Guo, Pengfei Jiao, Ting Pan, Wang Zhang, Mengyu Jia, Danyang Shi, Wenjun Wang

Abstract: Capturing structural similarity has been a hot topic in the field of network embedding recently due to its great help in understanding the node functions and behaviors. However, existing works have paid very much attention to learning structures on homogeneous networks while the related study on heterogeneous networks is still a void. In this paper, we try to take the first step for representation… ▽ More Capturing structural similarity has been a hot topic in the field of network embedding recently due to its great help in understanding the node functions and behaviors. However, existing works have paid very much attention to learning structures on homogeneous networks while the related study on heterogeneous networks is still a void. In this paper, we try to take the first step for representation learning on heterostructures, which is very challenging due to their highly diverse combinations of node types and underlying structures. To effectively distinguish diverse heterostructures, we firstly propose a theoretically guaranteed technique called heterogeneous anonymous walk (HAW) and its variant coarse HAW (CHAW). Then, we devise the heterogeneous anonymous walk embedding (HAWE) and its variant coarse HAWE in a data-driven manner to circumvent using an extremely large number of possible walks and train embeddings by predicting occurring walks in the neighborhood of each node. Finally, we design and apply extensive and illustrative experiments on synthetic and real-world networks to build a benchmark on heterostructure learning and evaluate the effectiveness of our methods. The results demonstrate our methods achieve outstanding performance compared with both homogeneous and heterogeneous classic methods, and can be applied on large-scale networks. △ Less

Submitted 18 January, 2022; originally announced January 2022.

Comments: 13 pages, 6 figures, 5 tables

MSC Class: 68T30 ACM Class: J.4.3

arXiv:2112.13507 [pdf, other]

Block Modeling-Guided Graph Convolutional Neural Networks

Authors: Dongxiao He, Chundong Liang, Huixin Liu, Mingxiang Wen, Pengfei Jiao, Zhiyong Feng

Abstract: Graph Convolutional Network (GCN) has shown remarkable potential of exploring graph representation. However, the GCN aggregating mechanism fails to generalize to networks with heterophily where most nodes have neighbors from different classes, which commonly exists in real-world networks. In order to make the propagation and aggregation mechanism of GCN suitable for both homophily and heterophily… ▽ More Graph Convolutional Network (GCN) has shown remarkable potential of exploring graph representation. However, the GCN aggregating mechanism fails to generalize to networks with heterophily where most nodes have neighbors from different classes, which commonly exists in real-world networks. In order to make the propagation and aggregation mechanism of GCN suitable for both homophily and heterophily (or even their mixture), we introduce block modeling into the framework of GCN so that it can realize "block-guided classified aggregation", and automatically learn the corresponding aggregation rules for neighbors of different classes. By incorporating block modeling into the aggregation process, GCN is able to aggregate information from homophilic and heterophilic neighbors discriminately according to their homophily degree. We compared our algorithm with state-of-art methods which deal with the heterophily problem. Empirical results demonstrate the superiority of our new approach over existing methods in heterophilic datasets while maintaining a competitive performance in homophilic datasets. △ Less

Submitted 27 December, 2021; v1 submitted 26 December, 2021; originally announced December 2021.

Comments: Accepted by Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

arXiv:2111.11030 [pdf, ps, other]

doi 10.1016/j.neucom.2023.126689

Reinforcement Learning for Few-Shot Text Generation Adaptation

Authors: Pengsen Cheng, Jinqiao Dai, Jiamiao Liu, Jiayong Liu, Peng Jia

Abstract: Controlling the generative model to adapt a new domain with limited samples is a difficult challenge and it is receiving increasing attention. Recently, methods based on meta-learning have shown promising results for few-shot domain adaptation. However, meta-learning-based methods usually suffer from the problem of overfitting, which results in a lack of diversity in the generated texts. To avoid… ▽ More Controlling the generative model to adapt a new domain with limited samples is a difficult challenge and it is receiving increasing attention. Recently, methods based on meta-learning have shown promising results for few-shot domain adaptation. However, meta-learning-based methods usually suffer from the problem of overfitting, which results in a lack of diversity in the generated texts. To avoid this problem, in this study, a novel framework based on reinforcement learning (RL) is proposed. In this framework, to increase the sample utilization of RL and decrease its sample requirement, maximum likelihood estimation learning is incorporated into the RL process. When there are only a few in-domain samples available, experimental results on five target domains in two few-shot configurations show that this framework performs better than baselines. △ Less

Submitted 7 December, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

Journal ref: Neurocomputing, 2023, 126689

arXiv:2111.08264 [pdf, other]

Analysis of 5G academic Network based on graph representation learning method

Authors: Xiaoming Li, Guangquan Xu, Wei Yu, Pengfei Jiao, Xiangyu Song

Abstract: With the rapid development of 5th Generation Mobile Communication Technology (5G), the diverse forms of collaboration and extensive data in academic social networks constructed by 5G papers make the management and analysis of academic social networks increasingly challenging. Despite the particular success achieved by representation learning in analyzing academic and social networks, most present… ▽ More With the rapid development of 5th Generation Mobile Communication Technology (5G), the diverse forms of collaboration and extensive data in academic social networks constructed by 5G papers make the management and analysis of academic social networks increasingly challenging. Despite the particular success achieved by representation learning in analyzing academic and social networks, most present presentation learning models focus on maintaining the first-order and second-order similarity of nodes. They rarely possess similar structural characteristics of spatial independence in the network. This paper proposes a Low-order Network representation Learning Model (LNLM) based on Non-negative Matrix Factorization (NMF) to solve these problems. The model uses the random walk method to extract low-order features of nodes and map multiple components to a low-dimensional space, effectively maintaining the internal correlation between members. This paper verifies the performance of this model, conducts comparative experiments on four test datasets and four real network datasets through downstream tasks such as multi-label classification, clustering, and link prediction. Comparing eight mainstream network representation learning models shows that the proposed model can significantly improve the detection efficiency and learning methods and effectively extract local and low-order features of the network. △ Less

Submitted 16 November, 2021; originally announced November 2021.

Comments: 38 pages, 9 figures

arXiv:2107.08379 [pdf, other]

A Survey on Role-Oriented Network Embedding

Authors: Pengfei Jiao, Xuan Guo, Ting Pan, Wang Zhang, Yulong Pei

Abstract: Recently, Network Embedding (NE) has become one of the most attractive research topics in machine learning and data mining. NE approaches have achieved promising performance in various of graph mining tasks including link prediction and node clustering and classification. A wide variety of NE methods focus on the proximity of networks. They learn community-oriented embedding for each node, where t… ▽ More Recently, Network Embedding (NE) has become one of the most attractive research topics in machine learning and data mining. NE approaches have achieved promising performance in various of graph mining tasks including link prediction and node clustering and classification. A wide variety of NE methods focus on the proximity of networks. They learn community-oriented embedding for each node, where the corresponding representations are similar if two nodes are closer to each other in the network. Meanwhile, there is another type of structural similarity, i.e., role-based similarity, which is usually complementary and completely different from the proximity. In order to preserve the role-based structural similarity, the problem of role-oriented NE is raised. However, compared to community-oriented NE problem, there are only a few role-oriented embedding approaches proposed recently. Although less explored, considering the importance of roles in analyzing networks and many applications that role-oriented NE can shed light on, it is necessary and timely to provide a comprehensive overview of existing role-oriented NE methods. In this review, we first clarify the differences between community-oriented and role-oriented network embedding. Afterwards, we propose a general framework for understanding role-oriented NE and a two-level categorization to better classify existing methods. Then, we select some representative methods according to the proposed categorization and briefly introduce them by discussing their motivation, development and differences. Moreover, we conduct comprehensive experiments to empirically evaluate these methods on a variety of role-related tasks including node classification and clustering (role discovery), top-k similarity search and visualization using some widely used synthetic and real-world datasets... △ Less

Submitted 18 July, 2021; originally announced July 2021.

Comments: 20 pages,9 figures, 5 tables

ACM Class: J.4.3

arXiv:2107.07957 [pdf, other]

Automatic Task Requirements Writing Evaluation via Machine Reading Comprehension

Authors: Shiting Xu, Guowei Xu, Peilei Jia, Wenbiao Ding, Zhongqin Wu, Zitao Liu

Abstract: Task requirements (TRs) writing is an important question type in Key English Test and Preliminary English Test. A TR writing question may include multiple requirements and a high-quality essay must respond to each requirement thoroughly and accurately. However, the limited teacher resources prevent students from getting detailed grading instantly. The majority of existing automatic essay scoring s… ▽ More Task requirements (TRs) writing is an important question type in Key English Test and Preliminary English Test. A TR writing question may include multiple requirements and a high-quality essay must respond to each requirement thoroughly and accurately. However, the limited teacher resources prevent students from getting detailed grading instantly. The majority of existing automatic essay scoring systems focus on giving a holistic score but rarely provide reasons to support it. In this paper, we proposed an end-to-end framework based on machine reading comprehension (MRC) to address this problem to some extent. The framework not only detects whether an essay responds to a requirement question, but clearly marks where the essay answers the question. Our framework consists of three modules: question normalization module, ELECTRA based MRC module and response locating module. We extensively explore state-of-the-art MRC methods. Our approach achieves 0.93 accuracy score and 0.85 F1 score on a real-world educational dataset. To encourage reproducible results, we make our code publicly available at \url{https://github.com/aied2021TRMRC/AIED_2021_TRMRC_code}. △ Less

Submitted 15 July, 2021; originally announced July 2021.

Comments: AIED'21: The 22nd International Conference on Artificial Intelligence in Education, 2021

Showing 1–50 of 71 results for author: Jia, P