subscribe to arXiv mailings

NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer

Authors: Meng You, Zhiyu Zhu, Hui Liu, Junhui Hou

Abstract: By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS) paradigm that operates \textit{without} the need for training. NVS-Solver adaptively modulates the diffusion sampling process with the given views to enable the creation of remarkable visual experiences from single or multiple views of static scenes… ▽ More By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS) paradigm that operates \textit{without} the need for training. NVS-Solver adaptively modulates the diffusion sampling process with the given views to enable the creation of remarkable visual experiences from single or multiple views of static scenes or monocular videos of dynamic scenes. Specifically, built upon our theoretical modeling, we iteratively modulate the score function with the given scene priors represented with warped input views to control the video diffusion process. Moreover, by theoretically exploring the boundary of the estimation error, we achieve the modulation in an adaptive fashion according to the view pose and the number of diffusion steps. Extensive evaluations on both static and dynamic scenes substantiate the significant superiority of our NVS-Solver over state-of-the-art methods both quantitatively and qualitatively. \textit{ Source code in } \href{https://github.com/ZHU-Zhiyu/NVS_Solver}{https://github.com/ZHU-Zhiyu/NVS$\_$Solver}. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: Technical Report

arXiv:2404.01954 [pdf, other]

HyperCLOVA X Technical Report

Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han , et al. (371 additional authors not shown)

Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t… ▽ More We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs. △ Less

Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 44 pages; updated authors list and fixed author names

arXiv:2401.12998 [pdf]

Evaluating and Enhancing Large Language Models Performance in Domain-specific Medicine: Osteoarthritis Management with DocOA

Authors: Xi Chen, MingKe You, Li Wang, WeiZhi Liu, Yu Fu, Jie Xu, Shaoting Zhang, Gang Chen, Kang Li, Jian Li

Abstract: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. This study focused on evaluating and enhancing the clinical capabilities of LLMs in specific domains, using osteoarthritis (OA) management as a case study. A domain specific benchmark framework was developed, which evaluate LL… ▽ More The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. This study focused on evaluating and enhancing the clinical capabilities of LLMs in specific domains, using osteoarthritis (OA) management as a case study. A domain specific benchmark framework was developed, which evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM tailored for OA management that integrates retrieval-augmented generation (RAG) and instruction prompts, was developed. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations. Results showed that general LLMs like GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. This study introduces a novel benchmark framework which assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: 16 Pages, 7 Figures

arXiv:2307.10846 [pdf, other]

Goal-Conditioned Reinforcement Learning with Disentanglement-based Reachability Planning

Authors: Zhifeng Qian, Mingyu You, Hongjun Zhou, Xuanhui Xu, Bin He

Abstract: Goal-Conditioned Reinforcement Learning (GCRL) can enable agents to spontaneously set diverse goals to learn a set of skills. Despite the excellent works proposed in various fields, reaching distant goals in temporally extended tasks remains a challenge for GCRL. Current works tackled this problem by leveraging planning algorithms to plan intermediate subgoals to augment GCRL. Their methods need t… ▽ More Goal-Conditioned Reinforcement Learning (GCRL) can enable agents to spontaneously set diverse goals to learn a set of skills. Despite the excellent works proposed in various fields, reaching distant goals in temporally extended tasks remains a challenge for GCRL. Current works tackled this problem by leveraging planning algorithms to plan intermediate subgoals to augment GCRL. Their methods need two crucial requirements: (i) a state representation space to search valid subgoals, and (ii) a distance function to measure the reachability of subgoals. However, they struggle to scale to high-dimensional state space due to their non-compact representations. Moreover, they cannot collect high-quality training data through standard GC policies, which results in an inaccurate distance function. Both affect the efficiency and performance of planning and policy learning. In the paper, we propose a goal-conditioned RL algorithm combined with Disentanglement-based Reachability Planning (REPlan) to solve temporally extended tasks. In REPlan, a Disentangled Representation Module (DRM) is proposed to learn compact representations which disentangle robot poses and object positions from high-dimensional observations in a self-supervised manner. A simple REachability discrimination Module (REM) is also designed to determine the temporal distance of subgoals. Moreover, REM computes intrinsic bonuses to encourage the collection of novel states for training. We evaluate our REPlan in three vision-based simulation tasks and one real-world task. The experiments demonstrate that our REPlan significantly outperforms the prior state-of-the-art methods in solving temporally extended tasks. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted by 2023 RAL with ICRA

arXiv:2304.01716 [pdf, other]

Decoupling Dynamic Monocular Videos for Dynamic View Synthesis

Authors: Meng You, Junhui Hou

Abstract: The challenge of dynamic view synthesis from dynamic monocular videos, i.e., synthesizing novel views for free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the \textbf{dynamic objects} of a scene using limited 2D frames, each with a varying timestamp and viewpoint. Existing methods usually require pre-processed 2D optical flo… ▽ More The challenge of dynamic view synthesis from dynamic monocular videos, i.e., synthesizing novel views for free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the \textbf{dynamic objects} of a scene using limited 2D frames, each with a varying timestamp and viewpoint. Existing methods usually require pre-processed 2D optical flow and depth maps by off-the-shelf methods to supervise the network, making them suffer from the inaccuracy of the pre-processed supervision and the ambiguity when lifting the 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, respectively regularized by proposed unsupervised surface consistency and patch-based multi-view constraints. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearances to be consistent across different viewpoints. Such a fine-grained motion formulation can alleviate the learning difficulty for the network, thus enabling it to produce not only novel views with higher quality but also more accurate scene flows and depth than existing methods requiring extra supervision. △ Less

Submitted 30 May, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

arXiv:2302.10561 [pdf, ps, other]

doi 10.1109/LWC.2023.3337709

Model-free Optimization and Experimental Validation of RIS-assisted Wireless Communications under Rich Multipath Fading

Authors: Tianrui Chen, Minglei You, Yangyishi Zhang, Gan Zheng, Jean Baptiste Gros, Geoffroy Lerosey, Youssef Nasser, Fraser Burton, Gabriele Gradoni

Abstract: Reconfigurable intelligent surface (RIS) devices have emerged as an effective way to control the propagation channels for enhancing the end-users' performance. However, RIS optimization involves configuring the radio frequency response of a large number of radiating elements, which is challenging in real-world applications due to high computational complexity. In this paper, a model-free cross-ent… ▽ More Reconfigurable intelligent surface (RIS) devices have emerged as an effective way to control the propagation channels for enhancing the end-users' performance. However, RIS optimization involves configuring the radio frequency response of a large number of radiating elements, which is challenging in real-world applications due to high computational complexity. In this paper, a model-free cross-entropy (CE) algorithm is proposed to optimize the binary RIS configuration for improving the signal-to-noise ratio (SNR) at the receiver. One key advantage of the proposed method is that it only requires system performance indicators, e.g., the received SNR, without the need for channel models or channel state information. Both simulations and experiments are conducted to evaluate the performance of the proposed CE algorithm. This study provides an experimental demonstration of the channel hardening effect in a multi-antenna RIS-assisted wireless system under rich multipath fading. △ Less

Submitted 15 February, 2024; v1 submitted 21 February, 2023; originally announced February 2023.

Comments: accepted by IEEE Wireless Communications Letters

arXiv:2209.05013 [pdf, other]

Learning A Locally Unified 3D Point Cloud for View Synthesis

Authors: Meng You, Mantang Guo, Xianqiang Lyu, Hui Liu, Junhui Hou

Abstract: In this paper, we explore the problem of 3D point cloud representation-based view synthesis from a set of sparse source views. To tackle this challenging problem, we propose a new deep learning-based view synthesis paradigm that learns a locally unified 3D point cloud from source views. Specifically, we first construct sub-point clouds by projecting source views to 3D space based on their depth ma… ▽ More In this paper, we explore the problem of 3D point cloud representation-based view synthesis from a set of sparse source views. To tackle this challenging problem, we propose a new deep learning-based view synthesis paradigm that learns a locally unified 3D point cloud from source views. Specifically, we first construct sub-point clouds by projecting source views to 3D space based on their depth maps. Then, we learn the locally unified 3D point cloud by adaptively fusing points at a local neighborhood defined on the union of the sub-point clouds. Besides, we also propose a 3D geometry-guided image restoration module to fill the holes and recover high-frequency details of the rendered novel views. Experimental results on three benchmark datasets demonstrate that our method can improve the average PSNR by more than 4 dB while preserving more accurate visual details, compared with state-of-the-art view synthesis methods. △ Less

Submitted 30 September, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

Comments: Accepted to TIP

arXiv:2209.01996 [pdf, other]

Bridging Music and Text with Crowdsourced Music Comments: A Sequence-to-Sequence Framework for Thematic Music Comments Generation

Authors: Peining Zhang, Junliang Guo, Linli Xu, Mu You, Junming Yin

Abstract: We consider a novel task of automatically generating text descriptions of music. Compared with other well-established text generation tasks such as image caption, the scarcity of well-paired music and text datasets makes it a much more challenging task. In this paper, we exploit the crowd-sourced music comments to construct a new dataset and propose a sequence-to-sequence model to generate text de… ▽ More We consider a novel task of automatically generating text descriptions of music. Compared with other well-established text generation tasks such as image caption, the scarcity of well-paired music and text datasets makes it a much more challenging task. In this paper, we exploit the crowd-sourced music comments to construct a new dataset and propose a sequence-to-sequence model to generate text descriptions of music. More concretely, we use the dilated convolutional layer as the basic component of the encoder and a memory based recurrent neural network as the decoder. To enhance the authenticity and thematicity of generated texts, we further propose to fine-tune the model with a discriminator as well as a novel topic evaluator. To measure the quality of generated texts, we also propose two new evaluation metrics, which are more aligned with human evaluation than traditional metrics such as BLEU. Experimental results verify that our model is capable of generating fluent and meaningful comments while containing thematic and content information of the original music. △ Less

Submitted 5 September, 2022; originally announced September 2022.

arXiv:2207.01779 [pdf, other]

doi 10.1109/LRA.2022.3188098

3D Part Assembly Generation with Instance Encoded Transformer

Authors: Rufeng Zhang, Tao Kong, Weihao Wang, Xuan Han, Mingyu You

Abstract: It is desirable to enable robots capable of automatic assembly. Structural understanding of object parts plays a crucial role in this task yet remains relatively unexplored. In this paper, we focus on the setting of furniture assembly from a complete set of part geometries, which is essentially a 6-DoF part pose estimation problem. We propose a multi-layer transformer-based framework that involves… ▽ More It is desirable to enable robots capable of automatic assembly. Structural understanding of object parts plays a crucial role in this task yet remains relatively unexplored. In this paper, we focus on the setting of furniture assembly from a complete set of part geometries, which is essentially a 6-DoF part pose estimation problem. We propose a multi-layer transformer-based framework that involves geometric and relational reasoning between parts to update the part poses iteratively. We carefully design a unique instance encoding to solve the ambiguity between geometrically-similar parts so that all parts can be distinguished. In addition to assembling from scratch, we extend our framework to a new task called in-process part assembly. Analogous to furniture maintenance, it requires robots to continue with unfinished products and assemble the remaining parts into appropriate positions. Our method achieves far more than 10% improvements over the current state-of-the-art in multiple metrics on the public PartNet dataset. Extensive experiments and quantitative comparisons demonstrate the effectiveness of the proposed framework. △ Less

Submitted 4 July, 2022; originally announced July 2022.

Comments: 8 pages, 7 figures

Journal ref: IROS 2022 and IEEE Robotics and Automation Letters (RA-L), 2022

arXiv:2202.13624 [pdf, other]

doi 10.1109/LRA.2022.3141148

Weakly Supervised Disentangled Representation for Goal-conditioned Reinforcement Learning

Authors: Zhifeng Qian, Mingyu You, Hongjun Zhou, Bin He

Abstract: Goal-conditioned reinforcement learning is a crucial yet challenging algorithm which enables agents to achieve multiple user-specified goals when learning a set of skills in a dynamic environment. However, it typically requires millions of the environmental interactions explored by agents, which is sample-inefficient. In the paper, we propose a skill learning framework DR-GRL that aims to improve… ▽ More Goal-conditioned reinforcement learning is a crucial yet challenging algorithm which enables agents to achieve multiple user-specified goals when learning a set of skills in a dynamic environment. However, it typically requires millions of the environmental interactions explored by agents, which is sample-inefficient. In the paper, we propose a skill learning framework DR-GRL that aims to improve the sample efficiency and policy generalization by combining the Disentangled Representation learning and Goal-conditioned visual Reinforcement Learning. In a weakly supervised manner, we propose a Spatial Transform AutoEncoder (STAE) to learn an interpretable and controllable representation in which different parts correspond to different object attributes (shape, color, position). Due to the high controllability of the representations, STAE can simply recombine and recode the representations to generate unseen goals for agents to practice themselves. The manifold structure of the learned representation maintains consistency with the physical position, which is beneficial for reward calculation. We empirically demonstrate that DR-GRL significantly outperforms the previous methods in sample efficiency and policy generalization. In addition, DR-GRL is also easy to expand to the real robot. △ Less

Submitted 28 February, 2022; originally announced February 2022.

Comments: 8 pages; Accepted by RAL with ICRA 2022

Journal ref: IEEE RAL 2022

arXiv:2109.14423 [pdf, ps, other]

doi 10.1016/j.apenergy.2021.117899

Digital Twins based Day-ahead Integrated Energy System Scheduling under Load and Renewable Energy Uncertainties

Authors: Minglei You, Qian Wang, Hongjian Sun, Ivan Castro, Jing Jiang

Abstract: By constructing digital twins (DT) of an integrated energy system (IES), one can benefit from DT's predictive capabilities to improve coordinations among various energy converters, hence enhancing energy efficiency, cost savings and carbon emission reduction. This paper is motivated by the fact that practical IESs suffer from multiple uncertainty sources, and complicated surrounding environment. T… ▽ More By constructing digital twins (DT) of an integrated energy system (IES), one can benefit from DT's predictive capabilities to improve coordinations among various energy converters, hence enhancing energy efficiency, cost savings and carbon emission reduction. This paper is motivated by the fact that practical IESs suffer from multiple uncertainty sources, and complicated surrounding environment. To address this problem, a novel DT-based day-ahead scheduling method is proposed. The physical IES is modelled as a multi-vector energy system in its virtual space that interacts with the physical IES to manipulate its operations. A deep neural network is trained to make statistical cost-saving scheduling by learning from both historical forecasting errors and day-ahead forecasts. Case studies of IESs show that the proposed DT-based method is able to reduce the operating cost of IES by 63.5%, comparing to the existing forecast-based scheduling methods. It is also found that both electric vehicles and thermal energy storages play proactive roles in the proposed method, highlighting their importance in future energy system integration and decarbonisation. △ Less

Submitted 29 September, 2021; originally announced September 2021.

Comments: 28 pages, 8 figures, journal paper accepted by Applied Energy

Journal ref: Applied Energy, 2022, vol 305, 117899

arXiv:2109.07819 [pdf, other]

Model-driven Learning for Generic MIMO Downlink Beamforming With Uplink Channel Information

Authors: Juping Zhang, Minglei You, Gan Zheng, Ioannis Krikidis, Liqiang Zhao

Abstract: Accurate downlink channel information is crucial to the beamforming design, but it is difficult to obtain in practice. This paper investigates a deep learning-based optimization approach of the downlink beamforming to maximize the system sum rate, when only the uplink channel information is available. Our main contribution is to propose a model-driven learning technique that exploits the structure… ▽ More Accurate downlink channel information is crucial to the beamforming design, but it is difficult to obtain in practice. This paper investigates a deep learning-based optimization approach of the downlink beamforming to maximize the system sum rate, when only the uplink channel information is available. Our main contribution is to propose a model-driven learning technique that exploits the structure of the optimal downlink beamforming to design an effective hybrid learning strategy with the aim to maximize the sum rate performance. This is achieved by jointly considering the learning performance of the downlink channel, the power and the sum rate in the training stage. The proposed approach applies to generic cases in which the uplink channel information is available, but its relation to the downlink channel is unknown and does not require an explicit downlink channel estimation. We further extend the developed technique to massive multiple-input multiple-output scenarios and achieve a distributed learning strategy for multicell systems without an inter-cell signalling overhead. Simulation results verify that our proposed method provides the performance close to the state of the art numerical algorithms with perfect downlink channel information and significantly outperforms existing data-driven methods in terms of the sum rate. △ Less

Submitted 16 September, 2021; originally announced September 2021.

Comments: Accepted in IEEE Transactions on Wireless Communications

arXiv:2104.11026 [pdf, other]

doi 10.1016/j.knosys.2021.107534

MeSIN: Multilevel Selective and Interactive Network for Medication Recommendation

Authors: Yang An, Liang Zhang, Mao You, Xueqing Tian, Bo Jin, Xiaopeng Wei

Abstract: Recommending medications for patients using electronic health records (EHRs) is a crucial data mining task for an intelligent healthcare system. It can assist doctors in making clinical decisions more efficiently. However, the inherent complexity of the EHR data renders it as a challenging task: (1) Multilevel structures: the EHR data typically contains multilevel structures which are closely rela… ▽ More Recommending medications for patients using electronic health records (EHRs) is a crucial data mining task for an intelligent healthcare system. It can assist doctors in making clinical decisions more efficiently. However, the inherent complexity of the EHR data renders it as a challenging task: (1) Multilevel structures: the EHR data typically contains multilevel structures which are closely related with the decision-making pathways, e.g., laboratory results lead to disease diagnoses, and then contribute to the prescribed medications; (2) Multiple sequences interactions: multiple sequences in EHR data are usually closely correlated with each other; (3) Abundant noise: lots of task-unrelated features or noise information within EHR data generally result in suboptimal performance. To tackle the above challenges, we propose a multilevel selective and interactive network (MeSIN) for medication recommendation. Specifically, MeSIN is designed with three components. First, an attentional selective module (ASM) is applied to assign flexible attention scores to different medical codes embeddings by their relevance to the recommended medications in every admission. Second, we incorporate a novel interactive long-short term memory network (InLSTM) to reinforce the interactions of multilevel medical sequences in EHR data with the help of the calibrated memory-augmented cell and an enhanced input gate. Finally, we employ a global selective fusion module (GSFM) to infuse the multi-sourced information embeddings into final patient representations for medications recommendation. To validate our method, extensive experiments have been conducted on a real-world clinical dataset. The results demonstrate a consistent superiority of our framework over several baselines and testify the effectiveness of our proposed approach. △ Less

Submitted 22 April, 2021; originally announced April 2021.

Comments: 15 pages, 6 figures

arXiv:2012.11187 [pdf, other]

Diverse Knowledge Distillation for End-to-End Person Search

Authors: Xinyu Zhang, Xinlong Wang, Jia-Wang Bian, Chunhua Shen, Mingyu You

Abstract: Person search aims to localize and identify a specific person from a gallery of images. Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches. The former views person search as two independent tasks and achieves dominant results using separately trained person detection and re-identification (Re-ID) models. The latter performs person search in an end-to-end fa… ▽ More Person search aims to localize and identify a specific person from a gallery of images. Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches. The former views person search as two independent tasks and achieves dominant results using separately trained person detection and re-identification (Re-ID) models. The latter performs person search in an end-to-end fashion. Although the end-to-end approaches yield higher inference efficiency, they largely lag behind those two-step counterparts in terms of accuracy. In this paper, we argue that the gap between the two kinds of methods is mainly caused by the Re-ID sub-networks of end-to-end methods. To this end, we propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck. We also design a spatial-invariant augmentation to assist model to be invariant to inaccurate detection results. Experimental results on the CUHK-SYSU and PRW datasets demonstrate the superiority of our method against existing approaches -- it achieves on par accuracy with state-of-the-art two-step methods while maintaining high efficiency due to the single joint model. Code is available at: https://git.io/DKD-PersonSearch. △ Less

Submitted 21 December, 2020; originally announced December 2020.

Comments: Accepted to AAAI, 2021. Code is available at: https://git.io/DKD-PersonSearch

arXiv:2003.11712 [pdf, other]

Mask Encoding for Single Shot Instance Segmentation

Authors: Rufeng Zhang, Zhi Tian, Chunhua Shen, Mingyu You, Youliang Yan

Abstract: To date, instance segmentation is dominated by twostage methods, as pioneered by Mask R-CNN. In contrast, one-stage alternatives cannot compete with Mask R-CNN in mask AP, mainly due to the difficulty of compactly representing masks, making the design of one-stage methods very challenging. In this work, we propose a simple singleshot instance segmentation framework, termed mask encoding based inst… ▽ More To date, instance segmentation is dominated by twostage methods, as pioneered by Mask R-CNN. In contrast, one-stage alternatives cannot compete with Mask R-CNN in mask AP, mainly due to the difficulty of compactly representing masks, making the design of one-stage methods very challenging. In this work, we propose a simple singleshot instance segmentation framework, termed mask encoding based instance segmentation (MEInst). Instead of predicting the two-dimensional mask directly, MEInst distills it into a compact and fixed-dimensional representation vector, which allows the instance segmentation task to be incorporated into one-stage bounding-box detectors and results in a simple yet efficient instance segmentation framework. The proposed one-stage MEInst achieves 36.4% in mask AP with single-model (ResNeXt-101-FPN backbone) and single-scale testing on the MS-COCO benchmark. We show that the much simpler and flexible one-stage instance segmentation method, can also achieve competitive performance. This framework can be easily adapted for other instance-level recognition tasks. Code is available at: https://git.io/AdelaiDet △ Less

Submitted 6 May, 2020; v1 submitted 25 March, 2020; originally announced March 2020.

Comments: Accepted to Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2020

arXiv:2002.12589 [pdf, other]

doi 10.1109/TWC.2020.2977340

Deep Learning Enabled Optimization of Downlink Beamforming Under Per-Antenna Power Constraints: Algorithms and Experimental Demonstration

Authors: Juping Zhang, Wenchao Xia, Minglei You, Gan Zheng, Sangarapillai Lambotharan, Kai-Kit Wong

Abstract: This paper studies fast downlink beamforming algorithms using deep learning in multiuser multiple-input-single-output systems where each transmit antenna at the base station has its own power constraint. We focus on the signal-to-interference-plus-noise ratio (SINR) balancing problem which is quasi-convex but there is no efficient solution available. We first design a fast subgradient algorithm th… ▽ More This paper studies fast downlink beamforming algorithms using deep learning in multiuser multiple-input-single-output systems where each transmit antenna at the base station has its own power constraint. We focus on the signal-to-interference-plus-noise ratio (SINR) balancing problem which is quasi-convex but there is no efficient solution available. We first design a fast subgradient algorithm that can achieve near-optimal solution with reduced complexity. We then propose a deep neural network structure to learn the optimal beamforming based on convolutional networks and exploitation of the duality of the original problem. Two strategies of learning various dual variables are investigated with different accuracies, and the corresponding recovery of the original solution is facilitated by the subgradient algorithm. We also develop a generalization method of the proposed algorithms so that they can adapt to the varying number of users and antennas without re-training. We carry out intensive numerical simulations and testbed experiments to evaluate the performance of the proposed algorithms. Results show that the proposed algorithms achieve close to optimal solution in simulations with perfect channel information and outperform the alleged theoretically optimal solution in experiments, illustrating a better performance-complexity tradeoff than existing schemes. △ Less

Submitted 28 February, 2020; originally announced February 2020.

Comments: This paper was accepted for publication in IEEE Transactions on Wireless Communications

arXiv:1909.06023 [pdf, other]

Part-Guided Attention Learning for Vehicle Instance Retrieval

Authors: Xinyu Zhang, Rufeng Zhang, Jiewei Cao, Dong Gong, Mingyu You, Chunhua Shen

Abstract: Vehicle instance retrieval often requires one to recognize the fine-grained visual differences between vehicles. Besides the holistic appearance of vehicles which is easily affected by the viewpoint variation and distortion, vehicle parts also provide crucial cues to differentiate near-identical vehicles. Motivated by these observations, we introduce a Part-Guided Attention Network (PGAN) to pinpo… ▽ More Vehicle instance retrieval often requires one to recognize the fine-grained visual differences between vehicles. Besides the holistic appearance of vehicles which is easily affected by the viewpoint variation and distortion, vehicle parts also provide crucial cues to differentiate near-identical vehicles. Motivated by these observations, we introduce a Part-Guided Attention Network (PGAN) to pinpoint the prominent part regions and effectively combine the global and part information for discriminative feature learning. PGAN first detects the locations of different part components and salient regions regardless of the vehicle identity, which serve as the bottom-up attention to narrow down the possible searching regions. To estimate the importance of detected parts, we propose a Part Attention Module (PAM) to adaptively locate the most discriminative regions with high-attention weights and suppress the distraction of irrelevant parts with relatively low weights. The PAM is guided by the instance retrieval loss and therefore provides top-down attention that enables attention to be calculated at the level of car parts and other salient regions. Finally, we aggregate the global appearance and part features to improve the feature performance further. The PGAN combines part-guided bottom-up and top-down attention, global and part visual features in an end-to-end framework. Extensive experiments demonstrate that the proposed method achieves new state-of-the-art vehicle instance retrieval performance on four large-scale benchmark datasets. △ Less

Submitted 26 September, 2020; v1 submitted 12 September, 2019; originally announced September 2019.

Comments: 12 pages

arXiv:1907.13315 [pdf, other]

Self-training with progressive augmentation for unsupervised cross-domain person re-identification

Authors: Xinyu Zhang, Jiewei Cao, Chunhua Shen, Mingyu You

Abstract: Person re-identification (Re-ID) has achieved great improvement with deep learning and a large amount of labelled training data. However, it remains a challenging task for adapting a model trained in a source domain of labelled data to a target domain of only unlabelled data available. In this work, we develop a self-training method with progressive augmentation framework (PAST) to promote the mod… ▽ More Person re-identification (Re-ID) has achieved great improvement with deep learning and a large amount of labelled training data. However, it remains a challenging task for adapting a model trained in a source domain of labelled data to a target domain of only unlabelled data available. In this work, we develop a self-training method with progressive augmentation framework (PAST) to promote the model performance progressively on the target dataset. Specially, our PAST framework consists of two stages, namely, conservative stage and promoting stage. The conservative stage captures the local structure of target-domain data points with triplet-based loss functions, leading to improved feature representations. The promoting stage continuously optimizes the network by appending a changeable classification layer to the last layer of the model, enabling the use of global information about the data distribution. Importantly, we propose a new self-training strategy that progressively augments the model capability by adopting conservative and promoting stages alternately. Furthermore, to improve the reliability of selected triplet samples, we introduce a ranking-based triplet loss in the conservative stage, which is a label-free objective function basing on the similarities between data pairs. Experiments demonstrate that the proposed method achieves state-of-the-art person Re-ID performance under the unsupervised cross-domain setting. Code is available at: https://tinyurl.com/PASTReID △ Less

Submitted 31 July, 2019; originally announced July 2019.

Comments: Accepted to Proc. Int. Conf. Computer Vision, 2019. Code is available at: https://tinyurl.com/PASTReID

arXiv:1809.06551 [pdf]

Block Chain based Intelligent Industrial Network (DSDIN)

Authors: Barco You, Matthias Hub, Mengzhe You, Bo Xu, Mingzhi Yu, Ivan Uemlianin

Abstract: The manufacturing industry featured centralization in the past due to technical limitations, and factories (especially large manufacturers) gathered almost all of the resources for manufacturing, including: technologies, raw materials, equipment, workers, market information, etc. However, such centralized production is costly, inefficient and inflexible, and difficult to respond to rapidly changin… ▽ More The manufacturing industry featured centralization in the past due to technical limitations, and factories (especially large manufacturers) gathered almost all of the resources for manufacturing, including: technologies, raw materials, equipment, workers, market information, etc. However, such centralized production is costly, inefficient and inflexible, and difficult to respond to rapidly changing, diverse and personalized user needs. This paper introduces an Intelligent Industrial Network (DSDIN), which provides a fully distributed manufacturing network where everyone can participate in manufacturing due to decentralization and no intermediate links, allowing them to quickly get the products or services they want and also to be authorized, recognized and get returns in a low-cost way due to their efforts (such as providing creative ideas, designs or equipment, raw materials or physical strength). DSDIN is a blockchain based IoT and AI technology platform, and also an IoT based intelligent service standard. Due to the intelligent network formed by DSDIN, the manufacturing center is no longer a factory, and actually there are no manufacturing centers. DSDIN provides a multi-participation peer-to-peer network for people and things (including raw materials, equipment, finished / semi-finished products, etc.). The information transmitted through the network is called Intelligent Service Algorithm (ISA). The user can send a process model, formula or control parameter to a device via an ISA, and every transaction in DSDIN is an intelligent service defined by ISA. △ Less

Submitted 18 September, 2018; originally announced September 2018.

arXiv:1707.03124 [pdf, other]

Adversarial Generation of Training Examples: Applications to Moving Vehicle License Plate Recognition

Authors: Xinlong Wang, Zhipeng Man, Mingyu You, Chunhua Shen

Abstract: Generative Adversarial Networks (GAN) have attracted much research attention recently, leading to impressive results for natural image generation. However, to date little success was observed in using GAN generated images for improving classification tasks. Here we attempt to explore, in the context of car license plate recognition, whether it is possible to generate synthetic training data using… ▽ More Generative Adversarial Networks (GAN) have attracted much research attention recently, leading to impressive results for natural image generation. However, to date little success was observed in using GAN generated images for improving classification tasks. Here we attempt to explore, in the context of car license plate recognition, whether it is possible to generate synthetic training data using GAN to improve recognition accuracy. With a carefully-designed pipeline, we show that the answer is affirmative. First, a large-scale image set is generated using the generator of GAN, without manual annotation. Then, these images are fed to a deep convolutional neural network (DCNN) followed by a bidirectional recurrent neural network (BRNN) with long short-term memory (LSTM), which performs the feature learning and sequence labelling. Finally, the pre-trained model is fine-tuned on real images. Our experimental results on a few data sets demonstrate the effectiveness of using GAN images: an improvement of 7.5% over a strong baseline with moderate-sized real data being available. We show that the proposed framework achieves competitive recognition accuracy on challenging test datasets. We also leverage the depthwise separate convolution to construct a lightweight convolutional RNN, which is about half size and 2x faster on CPU. Combining this framework and the proposed pipeline, we make progress in performing accurate recognition on mobile and embedded devices. △ Less

Submitted 10 November, 2017; v1 submitted 11 July, 2017; originally announced July 2017.

arXiv:1612.03882 [pdf, ps, other]

Unified Framework for the Effective Rate Analysis of Wireless Communication Systems over MISO Fading Channels

Authors: Minglei You, Hongjian Sun, Jing Jiang, Jiayi Zhang

Abstract: This paper proposes a unified framework for the effective rate analysis over arbitrary correlated and not necessarily identical multiple inputs single output (MISO) fading channels, which uses moment generating function (MGF) based approach and H transform representation. The proposed framework has the potential to simplify the cumbersome analysis procedure compared to the probability density func… ▽ More This paper proposes a unified framework for the effective rate analysis over arbitrary correlated and not necessarily identical multiple inputs single output (MISO) fading channels, which uses moment generating function (MGF) based approach and H transform representation. The proposed framework has the potential to simplify the cumbersome analysis procedure compared to the probability density function (PDF) based approach. Moreover, the effective rates over two specific fading scenarios are investigated, namely independent but not necessarily identical distributed (i.n.i.d.) MISO hyper Fox's H fading channels and arbitrary correlated generalized K fading channels. The exact analytical representations for these two scenarios are also presented. By substituting corresponding parameters, the effective rates in various practical fading scenarios, such as Rayleigh, Nakagami-m, Weibull/Gamma and generalized K fading channels, are readily available. In addition, asymptotic approximations are provided for the proposed H transform and MGF based approach as well as for the effective rate over i.n.i.d. MISO hyper Fox's H fading channels. Simulations under various fading scenarios are also presented, which support the validity of the proposed method. △ Less

Submitted 12 December, 2016; originally announced December 2016.

Showing 1–21 of 21 results for author: You, M