doi 10.1109/JIOT.2024.3398548

MADRL-Based Rate Adaptation for 360° Video Streaming with Multi-Viewpoint Prediction

Authors: Haopeng Wang, Zijian Long, Haiwei Dong, Abdulmotaleb El Saddik

Abstract: Over the last few years, 360° video traffic on the network has grown significantly. A key challenge of 360° video playback is ensuring a high quality of experience (QoE) with limited network bandwidth. Currently, most studies focus on tile-based adaptive bitrate (ABR) streaming based on single viewport prediction to reduce bandwidth consumption. However, the performance of models for single-viewpoint prediction is severely limited by the inherent uncertainty in head movement, and such models cannot cope well with sudden user movements. This paper first presents a multimodal spatial-temporal attention transformer to generate multiple viewpoint trajectories with their probabilities given a historical trajectory. The proposed method models viewpoint prediction as a classification problem and uses attention mechanisms to capture the spatial and temporal characteristics of input video frames and viewpoint trajectories for multi-viewpoint prediction. After that, a multi-agent deep reinforcement learning (MADRL)-based ABR algorithm utilizing multi-viewpoint prediction for 360° video streaming is proposed for maximizing different QoE objectives under various network conditions. We formulate the ABR problem as a decentralized partially observable Markov decision process (Dec-POMDP) problem and present a MAPPO algorithm based on the centralized training and decentralized execution (CTDE) framework to solve the problem. The experimental results show that our proposed method improves the defined QoE metric by up to 85.5% compared to existing ABR methods.

Submitted 17 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted by IEEE Internet of Things Journal

arXiv:2404.11336 [pdf, other]

Vision-based control for landing an aerial vehicle on a marine vessel

Authors: Haohua Dong

Abstract: This work addresses the landing problem of an aerial vehicle, exemplified by a simple quadrotor, on a moving platform using image-based visual servo control. First, the mathematical model of the quadrotor aircraft is introduced, followed by the design of the inner-loop control. In the second stage, the image features on the textured target plane are exploited to derive a vision-based control law. The image of the spherical centroid of a set of landmarks present in the landing target is used as a position measurement, whereas the translational optical flow is used as a velocity measurement. The kinematics of the vision-based system is expressed in terms of the observable features, and the proposed control law guarantees convergence without estimating the unknown distance between the vision system and the target, which is also guaranteed to remain strictly positive, avoiding undesired collisions. The performance of the proposed control law is evaluated in MATLAB and the 3-D simulation software Gazebo. Simulation results for a quadrotor UAV are provided for different velocity profiles of the moving target, showcasing the robustness of the proposed controller.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.07318 [pdf, other]

Rethinking Perceptual Metrics for Medical Image Translation

Authors: Nicholas Konz, Yuwen Chen, Hanxue Gu, Haoyu Dong, Maciej A. Mazurowski

Abstract: Modern medical image translation methods use generative models for tasks such as the conversion of CT images to MRI. Evaluating these methods typically relies on some chosen downstream task in the target domain, such as segmentation. On the other hand, task-agnostic metrics are attractive, such as the network feature-based perceptual metrics (e.g., FID) that are common to image translation in general computer vision. In this paper, we investigate evaluation metrics for medical image translation on two medical image translation tasks (GE breast MRI to Siemens breast MRI and lumbar spine MRI to CT), tested on various state-of-the-art translation methods. We show that perceptual metrics do not generally correlate with segmentation metrics due to them extending poorly to the anatomical constraints of this sub-field, with FID being especially inconsistent. However, we find that the lesser-used pixel-level SWD metric may be useful for subtle intra-modality translation. Our results demonstrate the need for further research into helpful metrics for medical image translation.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2403.10786 [pdf, other]

ContourDiff: Unpaired Image Translation with Contour-Guided Diffusion Models

Authors: Yuwen Chen, Nicholas Konz, Hanxue Gu, Haoyu Dong, Yaqian Chen, Lin Li, Jisoo Lee, Maciej A. Mazurowski

Abstract: Accurately translating medical images across different modalities (e.g., CT to MRI) has numerous downstream clinical and machine learning applications. While several methods have been proposed to achieve this, they often prioritize perceptual quality with respect to output domain features over preserving anatomical fidelity. However, maintaining anatomy during translation is essential for many tasks, e.g., when leveraging masks from the input domain to develop a segmentation model with images translated to the output domain. To address these challenges, we propose ContourDiff, a novel framework that leverages domain-invariant anatomical contour representations of images. These representations are simple to extract from images, yet form precise spatial constraints on their anatomical content. We introduce a diffusion model that converts contour representations of images from arbitrary input domains into images in the output domain of interest. By applying the contour as a constraint at every diffusion sampling step, we ensure the preservation of anatomical content. We evaluate our method by training a segmentation model on images translated from CT to MRI with their original CT masks and testing its performance on real MRIs. Our method outperforms other unpaired image translation methods by a significant margin, furthermore without the need to access any input domain information during training.

Submitted 15 March, 2024; originally announced March 2024.

Comments: Code will be released on GitHub

arXiv:2402.05210 [pdf, other]

Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion Models

Authors: Nicholas Konz, Yuwen Chen, Haoyu Dong, Maciej A. Mazurowski

Abstract: Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our method ("SegGuidedDiff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that our method reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space. SegGuidedDiff has many applications, including cross-modality translation, and the generation of paired or counterfactual data. Our code is available at https://github.com/mazurowski-lab/segmentation-guided-diffusion.

Submitted 19 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted at MICCAI 2024. Code and synthetic dataset: https://github.com/mazurowski-lab/segmentation-guided-diffusion

arXiv:2401.12974 [pdf, other]

SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI

Authors: Hanxue Gu, Roy Colglazier, Haoyu Dong, Jikai Zhang, Yaqian Chen, Zafer Yildiz, Yuwen Chen, Lin Li, Jichen Yang, Jay Willhite, Alex M. Meyer, Brian Guo, Yashvi Atul Shah, Emily Luo, Shipra Rajput, Sally Kuehn, Clark Bulleit, Kevin A. Wu, Jisoo Lee, Brandon Ramirez, Darui Lu, Jay M. Levin, Maciej A. Mazurowski

Abstract: Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment planning. Specifically, segmenting bones in MRI would allow for more quantitative assessments of musculoskeletal conditions, while such assessments are largely absent in current radiological practice. The difficulty of bone MRI segmentation is illustrated by the fact that limited algorithms are publicly available for use, and those contained in the literature typically address a specific anatomic area. In our study, we propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations. The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation.
Our contributions include (1) collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions; (2) investigating several standard network architectures and strategies for automated segmentation; (3) introducing SegmentAnyBone, an innovative foundational model-based approach that extends Segment Anything Model (SAM); (4) comparative analysis of our algorithm and previous approaches; and (5) generalization analysis of our algorithm across different anatomical locations and MRI sequences, as well as an external dataset. We publicly release our model at https://github.com/mazurowski-lab/SegmentAnyBone.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 15 pages, 15 figures

arXiv:2311.12257 [pdf, other]

Equipping Pretrained Unconditional Music Transformers with Instrument and Genre Controls

Authors: Weihan Xu, Julian McAuley, Shlomo Dubnov, Hao-Wen Dong

Abstract: The "pretraining-and-finetuning" paradigm has become a norm for training domain-specific models in natural language processing and computer vision. In this work, we aim to examine this paradigm for symbolic music generation through leveraging the largest ever symbolic music dataset sourced from the MuseScore forum. We first pretrain a large unconditional transformer model using 1.5 million songs. We then propose a simple technique to equip this pretrained unconditional music transformer model with instrument and genre controls by finetuning the model with additional control tokens. Our proposed representation offers improved high-level controllability and expressiveness against two existing representations. The experimental results show that the proposed model can successfully generate music with user-specified instruments and genre. In a subjective listening test, the proposed model outperforms the pretrained baseline model in terms of coherence, harmony, arrangement and overall quality.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2308.09693 [pdf, other]

A Lightweight Transformer for Faster and Robust EBSD Data Collection

Authors: Harry Dong, Sean Donegan, Megna Shah, Yuejie Chi

Abstract: Three-dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures that have made breakthroughs in a plethora of domains, for data processing and recovery. To be more robust to errors and accelerate this 3D EBSD data collection, we introduce a two-step method that recovers missing slices in a 3D EBSD volume, using an efficient transformer model and a projection algorithm to process the transformer's outputs. Overcoming the computational and practical hurdles of deep learning with scarce high dimensional data, we train this model using only synthetic 3D EBSD data with self-supervision and obtain superior recovery accuracy on real 3D EBSD data, compared to existing methods.

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.00507 [pdf, other]

Improved Prognostic Prediction of Pancreatic Cancer Using Multi-Phase CT by Integrating Neural Distance and Texture-Aware Transformer

Authors: Hexin Dong, Jiawen Yao, Yuxing Tang, Mingze Yuan, Yingda Xia, Jian Zhou, Hong Lu, Jingren Zhou, Bin Dong, Le Lu, Li Zhang, Zaiyi Liu, Yu Shi, Ling Zhang

Abstract: Pancreatic ductal adenocarcinoma (PDAC) is a highly lethal cancer in which the tumor-vascular involvement greatly affects the resectability and, thus, overall survival of patients. However, current prognostic prediction methods fail to explicitly and accurately investigate relationships between the tumor and nearby important vessels. This paper proposes a novel learnable neural distance that describes the precise relationship between the tumor and vessels in CT images of different patients, adopting it as a major feature for prognosis prediction. Besides, different from existing models that used CNNs or LSTMs to exploit tumor enhancement patterns on dynamic contrast-enhanced CT imaging, we improved the extraction of dynamic tumor-related texture features in multi-phase contrast-enhanced CT by fusing local and global features using CNN and transformer modules, further enhancing the features extracted across multi-phase CT images. We extensively evaluated and compared the proposed method with existing methods in the multi-center (n=4) dataset with 1,070 patients with PDAC, and statistical analysis confirmed its clinical effectiveness in the external test set consisting of three centers. The developed risk marker was the strongest predictor of overall survival among preoperative factors and it has the potential to be combined with established clinical factors to select patients at higher risk who might benefit from neoadjuvant therapy.

Submitted 13 September, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

Comments: MICCAI 2023

arXiv:2307.08208 [pdf, other]

Towards Stealthy Backdoor Attacks against Speech Recognition via Elements of Sound

Authors: Hanbo Cai, Pengcheng Zhang, Hai Dong, Yan Xiao, Stefanos Koffas, Yiming Li

Abstract: Deep neural networks (DNNs) have been widely and successfully adopted and deployed in various applications of speech recognition. Recently, a few works revealed that these models are vulnerable to backdoor attacks, where the adversaries can implant malicious prediction behaviors into victim models by poisoning their training process. In this paper, we revisit poison-only backdoor attacks against speech recognition. We reveal that existing methods are not stealthy since their trigger patterns are perceptible to humans or machine detection. This limitation is mostly because their trigger patterns are simple noises or separable and distinctive clips. Motivated by these findings, we propose to exploit elements of sound (e.g., pitch and timbre) to design more stealthy yet effective poison-only backdoor attacks. Specifically, we insert a short-duration high-pitched signal as the trigger and increase the pitch of remaining audio clips to 'mask' it for designing stealthy pitch-based triggers. We manipulate timbre features of victim audios to design the stealthy timbre-based attack and design a voiceprint selection module to facilitate the multi-backdoor attack. Our attacks can generate more 'natural' poisoned samples and therefore are more stealthy. Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our attacks under different settings (e.g., all-to-one, all-to-all, clean-label, physical, and multi-backdoor settings) and their stealthiness.
The code for reproducing the main experiments is available at https://github.com/HanboCai/BadSpeech_SoE.

Submitted 16 July, 2023; originally announced July 2023.

Comments: 13 pages

arXiv:2307.04525 [pdf, other]

Cluster-Induced Mask Transformers for Effective Opportunistic Gastric Cancer Screening on Non-contrast CT Scans

Authors: Mingze Yuan, Yingda Xia, Xin Chen, Jiawen Yao, Junli Wang, Mingyan Qiu, Hexin Dong, Jingren Zhou, Bin Dong, Le Lu, Li Zhang, Zaiyi Liu, Ling Zhang

Abstract: Gastric cancer is the third leading cause of cancer-related mortality worldwide, but no guideline-recommended screening test exists. Existing methods can be invasive, expensive, and lack sensitivity to identify early-stage gastric cancer. In this study, we explore the feasibility of using a deep learning approach on non-contrast CT scans for gastric cancer detection. We propose a novel cluster-induced Mask Transformer that jointly segments the tumor and classifies abnormality in a multi-task manner. Our model incorporates learnable clusters that encode the texture and shape prototypes of gastric cancer, utilizing self- and cross-attention to interact with convolutional features. In our experiments, the proposed method achieves a sensitivity of 85.0% and specificity of 92.6% for detecting gastric tumors on a hold-out test set consisting of 100 patients with cancer and 148 normal. In comparison, two radiologists have an average sensitivity of 73.5% and specificity of 84.3%. We also obtain a specificity of 97.7% on an external test set with 903 normal cases. Our approach performs comparably to established state-of-the-art gastric cancer screening tools like blood testing and endoscopy, while also being more sensitive in detecting early-stage cancer. This demonstrates the potential of our approach as a novel, non-invasive, low-cost, and accurate method for opportunistic gastric cancer screening.

Submitted 15 July, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

Comments: MICCAI 2023

arXiv:2306.09736 [pdf]

Overtaking-enabled Eco-approach Control at Signalized Intersections for Connected and Automated Vehicles

Authors: Haoxuan Dong, Weichao Zhuang, Guoyuan Wu, Zhaojian Li, Guodong Yin, Ziyou Song

Abstract: Preceding vehicles typically dominate the movement of following vehicles in traffic systems, thereby significantly influencing the efficacy of eco-driving control that concentrates on vehicle speed optimization. To potentially mitigate the negative effect of preceding vehicles on eco-driving control at the signalized intersection, this paper proposes an overtaking-enabled eco-approach control (OEAC) strategy. It combines driving lane planning and speed optimization for connected and automated vehicles to relax the first-in-first-out queuing policy at the signalized intersection, minimizing the target vehicle's energy consumption and travel delay. The OEAC adopts a receding horizon two-stage control framework to derive optimal driving trajectories for adapting to dynamic traffic conditions. In the first stage, the driving lane optimization problem is formulated as a Markov decision process and solved using dynamic programming, which takes into account the uncertain disturbance from preceding vehicles. In the second stage, the vehicle's speed trajectory with the minimal driving cost is optimized rapidly using Pontryagin's minimum principle to obtain the closed-form analytical optimal solution. Extensive simulations are conducted to evaluate the effectiveness of the OEAC. The results show that the OEAC is excellent in driving cost reduction over constant speed and regular eco-approach and departure strategies in various traffic scenarios, with an average improvement of 20.91% and 5.62%, respectively.

Submitted 16 June, 2023; originally announced June 2023.

arXiv:2306.09635 [pdf, other]

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Authors: Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley

Abstract: Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction of approaching text-to-audio synthesis that leverages the naturally-occurring audio-visual correspondence in videos and the power of pretrained language-vision models.

Submitted 23 July, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted by WASPAA 2023. Demo: https://salu133445.github.io/clipsonic/

arXiv:2306.01387 [pdf, other]

doi 10.1109/TCST.2024.3361393

Physics-Augmented Data-EnablEd Predictive Control for Eco-driving of Mixed Traffic Considering Diverse Human Behaviors

Authors: Dongjun Li, Kaixiang Zhang, Haoxuan Dong, Qun Wang, Zhaojian Li, Ziyou Song

Abstract: Data-driven cooperative control of connected and automated vehicles (CAVs) has gained extensive research interest as it can utilize collected data to generate control actions without relying on parametric system models that are generally challenging to obtain. Existing methods mainly focused on improving traffic safety and stability, while less emphasis has been placed on energy efficiency in the presence of uncertainties and diversities of human-driven vehicles (HDVs). In this paper, we employ a data-enabled predictive control (DeePC) scheme to address the eco-driving of mixed traffic flows with diverse behaviors of human drivers. Specifically, by incorporating the physical relationship of the studied system and the Hankel matrix update from the generalized behavior representation to a particular one, we develop a new Physics-Augmented Data-EnablEd Predictive Control (PA-DeePC) approach to handle human driver diversities. In particular, a power consumption term is added to the DeePC cost function to reduce the holistic energy consumption of both CAVs and HDVs. Simulation results demonstrate the effectiveness of our approach in accurately capturing random human driver behaviors and addressing the complex dynamics of mixed traffic flows, while ensuring driving safety and traffic efficiency. Furthermore, the proposed optimization framework achieves substantial reductions in energy consumption, i.e., average reductions of 4.83% and 9.16% when compared to the benchmark algorithms.

Submitted 1 February, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.03098 [pdf, other]

doi 10.1016/j.media.2023.102836

Unsupervised anomaly localization in high-resolution breast scans using deep pluralistic image completion

Authors: Nicholas Konz, Haoyu Dong, Maciej A. Mazurowski

Abstract: Automated tumor detection in Digital Breast Tomosynthesis (DBT) is a difficult task due to natural tumor rarity, breast tissue variability, and high resolution. Given the scarcity of abnormal images and the abundance of normal images for this problem, an anomaly detection/localization approach could be well-suited. However, most anomaly localization research in machine learning focuses on non-medical datasets, and we find that these methods fall short when adapted to medical imaging datasets. The problem is alleviated when we solve the task from the image completion perspective, in which the presence of anomalies can be indicated by a discrepancy between the original appearance and its auto-completion conditioned on the surroundings. However, there are often many valid normal completions given the same surroundings, especially in the DBT dataset, making this evaluation criterion less precise. To address such an issue, we consider pluralistic image completion by exploring the distribution of possible completions instead of generating fixed predictions. This is achieved through our novel application of spatial dropout on the completion network during inference time only, which requires no additional training cost and is effective at generating diverse completions. We further propose minimum completion distance (MCD), a new metric for detecting anomalies, thanks to these stochastic completions. We provide theoretical as well as empirical support for the superiority over existing methods of using the proposed method for anomaly localization.
On the DBT dataset, our model outperforms other state-of-the-art methods by at least 10% AUROC for pixel-level detection.

Submitted 4 May, 2023; originally announced May 2023.

Comments: Accepted in Medical Image Analysis (2023). Our code is at https://github.com/mazurowski-lab/picard

Journal ref: Medical Image Analysis, 102836 (2023)
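
Note: The abstract above names minimum completion distance (MCD) but does not define it. The sketch below is only one plausible reading, under the assumption that the score for a region is the smallest Euclidean distance between the original patch and any of the sampled completions. The function name `minimum_completion_distance`, the flat-list patch representation, and the choice of Euclidean distance are illustrative assumptions, not the authors' implementation.

```python
import math


def minimum_completion_distance(original, completions):
    """Smallest Euclidean distance from `original` to any completion.

    ASSUMPTION: patches are flat lists of pixel intensities and the distance
    is Euclidean; the paper's exact definition may differ.
    """
    if not completions:
        raise ValueError("at least one completion is required")

    def l2(a, b):
        if len(a) != len(b):
            raise ValueError("patches must have the same length")
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # A region is "explained" if at least one sampled completion is close to
    # it, so the score is the minimum over completions, not the mean.
    return min(l2(original, c) for c in completions)
```

Taking the minimum rather than the mean reflects the abstract's point that many valid normal completions can exist for the same surroundings: a normal patch only needs to match one of them to receive a low score.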

arXiv:2304.07572 [pdf, other]

doi 10.1145/3570361.3592511

GPSMirror: Expanding Accurate GPS Positioning to Shadowed and Indoor Regions with Backscatter

Authors: Huixin Dong, Yirong Xie, Xianan Zhang, Wei Wang, Xinyu Zhang, Jianhua He

Abstract: Despite the prevalence of GPS services, they still suffer from intermittent positioning with poor accuracy in partially shadowed regions like urban canyons, flyover shadows, and factories' indoor areas. Existing wisdom relies on hardware modifications of GPS receivers or power-hungry infrastructures requiring continuous plug-in power supply which is hard to provide in outdoor regions and some factories. This paper fills the gap with GPSMirror, the first GPS-strengthening system that works for unmodified smartphones with the assistance of newly-designed GPS backscatter tags. The key enabling techniques in GPSMirror include: (i) a careful hardware design with microwatt-level power consumption that pushes the limit of backscatter sensitivity to re-radiate extremely weak GPS signals with enough coverage approaching the regulation limit; and (ii) a novel GPS positioning algorithm achieving meter-level accuracy in shadowed regions as well as expanding locatable regions under inadequate satellites where conventional algorithms fail. We build a prototype of the GPSMirror tags and conduct comprehensive experiments to evaluate them. Our results show that a GPSMirror tag can provide coverage up to 27.7 m. GPSMirror achieves median positioning accuracy of 3.7 m indoors and 4.6 m in urban canyon environments, respectively.

Submitted 15 April, 2023; originally announced April 2023.

Comments: 13 pages, 26 figures, to appear in MobiCom 2023

arXiv:2301.03413 [pdf]

doi 10.1109/LES.2015.2440761

A Framework of Reconfigurable Transducer Nodes for Smart Home Environments

Authors: Basim Hafidh, Hussein Al Osman, Haiwei Dong, Abdulmotaleb El Saddik

Abstract: This letter presents a transducer network framework that supports the amalgamation of multiple transducers into single wireless nodes. This approach is aimed at decreasing energy consumption by reducing the number of wireless transceivers involved in such networks. To make wireless nodes easily reconfigurable, a plug and play mechanism is applied to enable the clustering of any number of transducers. Furthermore, an algorithm is proposed to dynamically detect added and removed transducers from a node. Lastly, an XML based protocol is devised to allow nodes to communicate a description of their layout, measured data and control information. To verify the proposed framework, multiple reconfigurable wireless nodes are used to monitor the dynamic condition of multiple rooms during a period of 24 hours in order to emulate a smart home scenario.

Submitted 25 December, 2022; originally announced January 2023.

Journal ref: IEEE Embedded Systems Letters, vol. 7, no. 3, pp. 81-84, 2015

arXiv:2212.12908 [pdf, other]

doi 10.1109/JSEN.2020.3016611

Sitting Posture Recognition Using a Spiking Neural Network

Authors: Jianquan Wang, Basim Hafidh, Haiwei Dong, Abdulmotaleb El Saddik

Abstract: To increase the quality of citizens' lives, we designed a personalized smart chair system to recognize sitting behaviors. The system can receive surface pressure data from the designed sensor and provide feedback for guiding the user towards proper sitting postures. We used a liquid state machine and a logistic regression classifier to construct a spiking neural network for classifying 15 sitting postures. To allow this system to read our pressure data into the spiking neurons, we designed an algorithm to encode map-like data into cosine-rank sparsity data. The experimental results consisting of 15 sitting postures from 19 participants show that the prediction precision of our SNN is 88.52%.

Submitted 25 December, 2022; originally announced December 2022.

Journal ref: IEEE Sensors Journal, vol. 21, no. 2, pp. 1779-1786, 2021

arXiv:2212.10103 [pdf, ps, other]

VSVC: Backdoor attack against Keyword Spotting based on Voiceprint Selection and Voice Conversion

Authors: Hanbo Cai, Pengcheng Zhang, Hai Dong, Yan Xiao, Shunhui Ji

Abstract: Keyword spotting (KWS) based on deep neural networks (DNNs) has achieved massive success in voice control scenarios. However, training of such DNN-based KWS systems often requires significant data and hardware resources. Manufacturers often entrust this process to a third-party platform. This makes the training process uncontrollable, where attackers can implant backdoors in the model by manipulating third-party training data. An effective backdoor attack can force the model to make specified judgments under certain conditions, i.e., triggers. In this paper, we design a backdoor attack scheme based on Voiceprint Selection and Voice Conversion, abbreviated as VSVC. Experimental results demonstrate that VSVC can achieve an average attack success rate close to 97% in four victim models when poisoning less than 1% of the training data.

Submitted 20 December, 2022; originally announced December 2022.

Comments: 7 pages, 5 figures

arXiv:2212.07065 [pdf, other]

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick

Abstract: Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.

Submitted 3 March, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: Accepted by ICLR 2023. Audio samples can be found at https://sony.github.io/CLIPSep/

arXiv:2211.08697 [pdf, ps, other]

PBSM: Backdoor attack against Keyword spotting based on pitch boosting and sound masking

Authors: Hanbo Cai, Pengcheng Zhang, Hai Dong, Yan Xiao, Shunhui Ji

Abstract: Keyword spotting (KWS) has been widely used in various speech control scenarios. The training of KWS is usually based on deep neural networks and requires a large amount of data. Manufacturers often use third-party data to train KWS. However, deep neural networks are not sufficiently interpretable to manufacturers, and attackers can manipulate third-party training data to plant backdoors during the model training. An effective backdoor attack can force the model to make specified judgments under certain conditions, i.e., triggers. In this paper, we design a backdoor attack scheme based on Pitch Boosting and Sound Masking for KWS, called PBSM. Experimental results demonstrate that PBSM can achieve an average attack success rate close to 90% in three victim models when poisoning less than 1% of the training data.

Submitted 16 November, 2022; originally announced November 2022.

Comments: 5 pages, 4 figures

arXiv:2210.02147 [pdf, other]

Adaptive Leading Cruise Control in Mixed Traffic Considering Human Behavioral Diversity

Authors: Qun Wang, Haoxuan Dong, Fei Ju, Weichao Zhuang, Chen Lv, Liangmo Wang, Ziyou Song

Abstract: This paper presents an adaptive leading cruise control strategy for the connected and automated vehicle (CAV) and first considers its impact on the following human-driven vehicle (HDV) with diverse driving characteristics in the unified optimization framework for improved holistic energy efficiency. The car-following behaviors of HDV are statistically calibrated using the Next Generation Simulation dataset. In a typical single-lane car-following scenario where CAVs and HDVs share the road, the longitudinal speed control of CAVs can substantially reduce the energy consumption of the following HDV by avoiding unnecessary acceleration and braking. Moreover, apart from the objectives including car-following safety and traffic efficiency, the energy efficiencies of both CAV and HDV are incorporated into the reward function of reinforcement learning. The specific driving pattern of the following HDV is learned in real-time from historical speed information to predict its acceleration and power consumption in the optimization horizon. A comprehensive simulation is conducted to statistically verify the positive impacts of CAV on the holistic energy efficiency of the mixed traffic flow with uncertain and diverse human driving behaviors. Simulation results indicate that the holistic energy efficiency is improved by 4.38% on average.

Submitted 5 October, 2022; originally announced October 2022.

arXiv:2209.02871 [pdf, other]

Improving Choral Music Separation through Expressive Synthesized Data from Sampled Instruments

Authors: Ke Chen, Hao-Wen Dong, Yi Luo, Julian McAuley, Taylor Berg-Kirkpatrick, Miller Puckette, Shlomo Dubnov

Abstract: Choral music separation refers to the task of extracting tracks of voice parts (e.g., soprano, alto, tenor, and bass) from mixed audio. The lack of datasets has impeded research on this topic as previous work has only been able to train and evaluate models on a few minutes of choral music data due to copyright issues and dataset collection difficulties. In this paper, we investigate the use of synthesized training data for the source separation task on real choral music. We make three contributions: first, we provide an automated pipeline for synthesizing choral music data from sampled instrument plugins with controllable options for instrument expressiveness. This produces an 8.2-hour-long choral music dataset from the JSB Chorales Dataset, and one can easily synthesize additional data. Second, we conduct an experiment to evaluate multiple separation models on available choral music separation datasets from previous work. To the best of our knowledge, this is the first experiment to comprehensively evaluate choral music separation. Third, experiments demonstrate that the synthesized choral data is of sufficient quality to improve the model's performance on real choral music datasets. This provides additional experimental statistics and data support for the choral music separation study.

Submitted 6 September, 2022; originally announced September 2022.

Comments: Camera Ready for Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022

Journal ref: The 23rd International Society for Music Information Retrieval Conference, 2022

arXiv:2208.12517 [pdf]

Enabling Massage Actions: An Interactive Parallel Robot with Compliant Joints

Authors: Huixu Dong, Yue Feng, Chen Qiu, Ye Pan, Miaoying He, I-Ming Chen

Abstract: We propose a parallel massage robot with compliant joints based on the series elastic actuator (SEA), offering a unified force-position control approach. First, the kinematic and static force models are established for obtaining the corresponding control variables. Then, a novel force-position control strategy is proposed to separately control the force-position along the normal direction of the surface and another two-direction displacement, without the requirement of a robotic dynamics model. To evaluate its performance, we implement a series of robotic massage experiments. The results demonstrate that the proposed massage manipulator can successfully achieve desired forces and motion patterns of massage tasks, arriving at a high-score user experience.

Submitted 26 August, 2022; originally announced August 2022.

arXiv:2207.06983 [pdf, other]

Multitrack Music Transformer

Authors: Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, Taylor Berg-Kirkpatrick

Abstract: Existing approaches for generating multitrack music with transformer models have been limited in terms of the number of instruments, the length of the music segments and slow inference. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations. In this work, we propose a new multitrack music representation that allows a diverse set of instruments while keeping a short sequence length. Our proposed Multitrack Music Transformer (MMT) achieves comparable performance with state-of-the-art systems, landing in between two recently proposed models in a subjective listening test, while achieving substantial speedups and memory reductions over both, making the method attractive for real time improvisation or near real time creative applications. Further, we propose a new measure for analyzing musical self-attention and show that the trained model attends more to notes that form a consonant interval with the current note and to notes that are 4N beats away from the current step.

Submitted 24 May, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

Comments: Accepted by ICASSP 2023. Demo: https://salu133445.github.io/mmt/ . Code: https://github.com/salu133445/mmt

arXiv:2207.02797 [pdf, other]

doi 10.1007/978-3-031-16452-1_65

The Intrinsic Manifolds of Radiological Images and their Role in Deep Learning

Authors: Nicholas Konz, Hanxue Gu, Haoyu Dong, Maciej A. Mazurowski

Abstract: The manifold hypothesis is a core mechanism behind the success of deep learning, so understanding the intrinsic manifold structure of image data is central to studying how neural networks learn from the data. Intrinsic dataset manifolds and their relationship to learning difficulty have recently begun to be studied for the common domain of natural images, but little such research has been attempted for radiological images. We address this here. First, we compare the intrinsic manifold dimensionality of radiological and natural images. We also investigate the relationship between intrinsic dimensionality and generalization ability over a wide range of datasets. Our analysis shows that natural image datasets generally have a higher number of intrinsic dimensions than radiological images. However, the relationship between generalization ability and intrinsic dimensionality is much stronger for medical images, which could be explained as radiological images having intrinsic features that are more difficult to learn. These results give a more principled underpinning for the intuition that radiological images can be more challenging to apply deep learning to than natural image datasets common to machine learning research. We believe that, rather than directly applying models developed for natural images to the radiological imaging domain, more care should be taken in developing architectures and algorithms that are more tailored to the specific characteristics of this domain. The research shown in our paper, demonstrating these characteristics and the differences from natural images, is an important first step in this direction.

Submitted 6 July, 2022; originally announced July 2022.

Comments: preprint version, accepted for MICCAI 2022 (25th International Conference on Medical Image Computing and Computer Assisted Intervention). 8 pages (+ author names + references + supplementary), 4 figures. Code available at https://github.com/mazurowski-lab/radiologyintrinsicmanifolds

arXiv:2206.09109 [pdf, other]

Fast and Provable Tensor Robust Principal Component Analysis via Scaled Gradient Descent

Authors: Harry Dong, Tian Tong, Cong Ma, Yuejie Chi

Abstract: An increasing number of data science and machine learning problems rely on computation with tensors, which better capture the multi-way relationships and interactions of data than matrices. When tapping into this critical advantage, a key challenge is to develop computationally efficient and provably correct algorithms for extracting useful information from tensor data that are simultaneously robust to corruptions and ill-conditioning. This paper tackles tensor robust principal component analysis (RPCA), which aims to recover a low-rank tensor from its observations contaminated by sparse corruptions, under the Tucker decomposition. To minimize the computation and memory footprints, we propose to directly recover the low-dimensional tensor factors -- starting from a tailored spectral initialization -- via scaled gradient descent (ScaledGD), coupled with an iteration-varying thresholding operation to adaptively remove the impact of corruptions. Theoretically, we establish that the proposed algorithm converges linearly to the true low-rank tensor at a constant rate that is independent of its condition number, as long as the level of corruptions is not too large. Empirically, we demonstrate that the proposed algorithm achieves better and more scalable performance than state-of-the-art matrix and tensor RPCA algorithms through synthetic experiments and real-world applications.

Submitted 22 February, 2023; v1 submitted 18 June, 2022; originally announced June 2022.

arXiv:2206.00390 [pdf, other]

doi 10.1109/TIM.2023.3259031

Attention-embedded Quadratic Network (Qttention) for Effective and Interpretable Bearing Fault Diagnosis

Authors: Jing-Xiao Liao, Hang-Cheng Dong, Zhi-Qi Sun, Jinwei Sun, Shiping Zhang, Feng-Lei Fan

Abstract: Bearing fault diagnosis is of great importance to decrease the damage risk of rotating machines and further improve economic profits. Recently, machine learning, represented by deep learning, has made great progress in bearing fault diagnosis. However, applying deep learning to such a task still faces a major problem. A deep network is notoriously a black box. It is difficult to know how a model distinguishes faulty signals from normal ones and what physical principle underlies the classification. To solve the interpretability issue, first, we prototype a convolutional network with recently-invented quadratic neurons. This quadratic-neuron-empowered network can qualify the noisy bearing data due to the strong feature representation ability of quadratic neurons. Moreover, we independently derive the attention mechanism from a quadratic neuron, referred to as qttention, by factorizing the learned quadratic function in analogy to attention, making the model with quadratic neurons inherently interpretable. Experiments on the public and our datasets demonstrate that the proposed network can facilitate effective and interpretable bearing fault diagnosis.

Submitted 7 August, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: update abstract; add experiments in classification results; delete small data experiment; add comparison experiments of qttention and convolution

Report number: Art no. 3511113

Journal ref: IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1-13, 2023

arXiv:2202.10690 [pdf]

An Energy-concentrated Wavelet Transform for Time Frequency Analysis of Transient Signals

Authors: Haoran Dong, Gang Yu

Abstract: Transient signals are often composed of a series of modes that have multivalued time-dependent instantaneous frequency (IF), which brings challenges to the development of signal processing technology. Fortunately, the group delay (GD) of such signals can be well expressed as a single-valued function of frequency. By considering the frequency-domain signal model, we present a postprocessing method called wavelet transform (WT)-based time-reassigned synchrosqueezing transform (WTSST). Our proposed method embeds a two-dimensional GD operator into a synchrosqueezing framework to generate a time-frequency representation (TFR) of transient signals with high energy concentration and allows retrieval of the whole or part of the signal. The theoretical analyses of the WTSST are provided, including the analysis of GD candidate accuracy and signal reconstruction accuracy. Moreover, based on WTSST, the WT-based time-reassigned multisynchrosqueezing transform (WTMSST) is proposed by introducing a stepwise refinement scheme, which addresses the inability of the WTSST method to deal with strongly frequency-varying signals. Simulation and real signal analysis illustrate that the proposed methods have the capacity to appropriately describe the features of transient signals.

Submitted 22 February, 2022; originally announced February 2022.

arXiv:2202.06034 [pdf, other]

Deep Performer: Score-to-Audio Music Performance Synthesis

Authors: Hao-Wen Dong, Cong Zhou, Taylor Berg-Kirkpatrick, Julian McAuley

Abstract: Music performance synthesis aims to synthesize a musical score into a natural performance. In this paper, we borrow recent advances in text-to-speech synthesis and present the Deep Performer -- a novel system for score-to-audio music performance synthesis. Unlike speech, music often contains polyphony and long notes. Hence, we propose two new techniques for handling polyphonic inputs and providing a fine-grained conditioning in a transformer encoder-decoder model. To train our proposed system, we present a new violin dataset consisting of paired recordings and scores along with estimated alignments between them. We show that our proposed model can synthesize music with clear polyphony and harmonic structures. In a listening test, we achieve competitive quality against the baseline model, a conditional generative audio model, in terms of pitch accuracy, timbre and noise level. Moreover, our proposed model significantly outperforms the baseline on an existing piano dataset in overall quality.

Submitted 20 February, 2022; v1 submitted 12 February, 2022; originally announced February 2022.

Comments: ICASSP 2022 final version with appendix

arXiv:2201.02831 [pdf, other]

doi 10.1016/j.media.2022.102628

CrossMoDA 2021 challenge: Benchmark of Cross-Modality Domain Adaptation techniques for Vestibular Schwannoma and Cochlea Segmentation

Authors: Reuben Dorent, Aaron Kujawa, Marina Ivory, Spyridon Bakas, Nicola Rieke, Samuel Joutard, Ben Glocker, Jorge Cardoso, Marc Modat, Kayhan Batmanghelich, Arseniy Belkov, Maria Baldeon Calisto, Jae Won Choi, Benoit M. Dawant, Hexin Dong, Sergio Escalera, Yubo Fan, Lasse Hansen, Mattias P. Heinrich, Smriti Joshi, Victoriya Kashtanova, Hyeon Gyu Kim, Satoshi Kondo, Christian N. Kruse, Susana K. Lai-Yuen, et al. (15 additional authors not shown)

Abstract: Domain Adaptation (DA) has recently attracted strong interest in the medical imaging community. While a large variety of DA techniques has been proposed for image segmentation, most of these techniques have been validated either on private datasets or on small publicly available datasets. Moreover, these datasets mostly addressed single-class problems. To tackle these limitations, the Cross-Modality Domain Adaptation (crossMoDA) challenge was organised in conjunction with the 24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2021). CrossMoDA is the first large and multi-class benchmark for unsupervised cross-modality DA. The challenge's goal is to segment two key brain structures involved in the follow-up and treatment planning of vestibular schwannoma (VS): the VS and the cochleas. Currently, the diagnosis and surveillance in patients with VS are performed using contrast-enhanced T1 (ceT1) MRI. However, there is growing interest in using non-contrast sequences such as high-resolution T2 (hrT2) MRI. Therefore, we created an unsupervised cross-modality segmentation benchmark. The training set provides annotated ceT1 (N=105) and unpaired non-annotated hrT2 (N=105). The aim was to automatically perform unilateral VS and bilateral cochlea segmentation on hrT2 as provided in the testing set (N=137). A total of 16 teams submitted their algorithm for the evaluation phase. The level of performance reached by the top-performing teams is strikingly high (best median Dice - VS: 88.4%; Cochleas: 85.7%) and close to full supervision (median Dice - VS: 92.5%; Cochleas: 87.7%). All top-performing methods made use of an image-to-image translation approach to transform the source-domain images into pseudo-target-domain images. A segmentation network was then trained using these generated images and the manual annotations provided for the source image.

Submitted 14 December, 2022; v1 submitted 8 January, 2022; originally announced January 2022.

Comments: In Medical Image Analysis
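
The Dice percentages quoted in the crossMoDA abstract above measure the overlap between a predicted and a reference segmentation mask. The following is a minimal sketch of that metric, assuming binary masks flattened into lists of 0/1 values. The function name `dice_score` and the choice to return 1.0 when both masks are empty are illustrative assumptions, not the challenge's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A n B| / (|A| + |B|) for two binary masks.

    `pred` and `target` are equal-length sequences of 0/1 values
    (hypothetical input format chosen for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(f"Dice: {dice_score(predicted, reference):.3f}")
```

A benchmark would typically compute this per structure and per case, then report the median across cases, as the abstract does for VS and cochleas.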

arXiv:2112.05758 [pdf, other]

Edge-Enhanced Dual Discriminator Generative Adversarial Network for Fast MRI with Parallel Imaging Using Multi-view Information

Authors: Jiahao Huang, Weiping Ding, Jun Lv, Jingwen Yang, Hao Dong, Javier Del Ser, Jun Xia, Tiaojuan Ren, Stephen Wong, Guang Yang

Abstract: In clinical medicine, magnetic resonance imaging (MRI) is one of the most important tools for diagnosis, triage, prognosis, and treatment planning. However, MRI suffers from an inherently slow data acquisition process because data is collected sequentially in k-space. In recent years, most MRI reconstruction methods proposed in the literature focus on holistic image reconstruction rather than enhancing the edge information. This work steps aside from this general trend by elaborating on the enhancement of edge information. Specifically, we introduce a novel parallel imaging coupled dual discriminator generative adversarial network (PIDD-GAN) for fast multi-channel MRI reconstruction by incorporating multi-view information. The dual discriminator design aims to improve the edge information in MRI reconstruction. One discriminator is used for holistic image reconstruction, whereas the other one is responsible for enhancing edge information. An improved U-Net with local and global residual learning is proposed for the generator. Frequency channel attention blocks (FCA Blocks) are embedded in the generator for incorporating attention mechanisms. Content loss is introduced to train the generator for better reconstruction quality. We performed comprehensive experiments on the Calgary-Campinas public brain MR dataset and compared our method with state-of-the-art MRI reconstruction methods. Ablation studies of residual learning were conducted on the MICCAI13 dataset to validate the proposed modules. Results show that our PIDD-GAN provides high-quality reconstructed MR images, with well-preserved edge information. The time of single-image reconstruction is below 5 ms, which meets the demand of faster processing.

Submitted 10 December, 2021; originally announced December 2021.

Comments: 33 pages, 13 figures, Applied Intelligence

arXiv:2112.05150 [pdf, other]

Deep Recurrent Neural Network with Multi-scale Bi-directional Propagation for Video Deblurring

Authors: Chao Zhu, Hang Dong, Jinshan Pan, Boyang Liang, Yuhao Huang, Lean Fu, Fei Wang

Abstract: The success of the state-of-the-art video deblurring methods stems mainly from implicit or explicit estimation of alignment among the adjacent frames for latent video restoration. However, due to the influence of the blur effect, estimating the alignment information from the blurry adjacent frames is not a trivial task. Inaccurate estimations will interfere with the subsequent frame restoration. Instead of estimating alignment information, we propose a simple and effective deep Recurrent Neural Network with Multi-scale Bi-directional Propagation (RNN-MBP) to effectively propagate and gather the information from unaligned neighboring frames for better video deblurring. Specifically, we build a Multi-scale Bi-directional Propagation (MBP) module with two U-Net RNN cells which can directly exploit the inter-frame information from unaligned neighboring hidden states by integrating them in different scales. Moreover, to better evaluate the proposed algorithm and existing state-of-the-art methods on real-world blurry scenes, we also create a Real-World Blurry Video Dataset (RBVD) by a well-designed Digital Video Acquisition System (DVAS) and use it as the training and evaluation dataset. Extensive experimental results demonstrate that the proposed RBVD dataset effectively improves the performance of existing algorithms on real-world blurry videos, and the proposed algorithm performs favorably against the state-of-the-art methods on three typical benchmarks. The code is available at https://github.com/XJTU-CVLAB-LOWLEVEL/RNN-MBP.

Submitted 9 December, 2021; originally announced December 2021.

Comments: Accepted by AAAI-2022

arXiv:2111.04046 [pdf]

GSG: A Granary Soft Gripper with Mechanical Force Sensing via 3-Dimensional Snap-Through Structure

Authors: Huixu Dong, Chao-Yu Chen, Chen Qiu, Chen-Hua Yeow, Haoyong Yu

Abstract: Grasping is an essential capability for most robots in practical applications. Soft robotic grippers are considered a critical part of robotic grasping and have attracted considerable attention because of their high compliance and robustness to variance in object geometry; however, they are still limited by the corresponding sensing capabilities and actuation mechanisms. We propose a novel soft gripper that looks like a granary with a compliant snap-through bistable mechanism fabricated by integrated mold technology, achieving sensing and actuation purely mechanically. In particular, the snap-through bistable structure in the proposed gripper allows us to reduce the complexity of the mechanism, control, and sensing designs since the grasping and sensing behaviors are completely passive. The grasping behaviors are automatically activated once the trigger position of the gripper touches an object and applies sufficient force. To grasp objects with various profiles, the proposed granary soft gripper (GSG) is designed to be capable of enveloping, pinching and caging grasps. The gripper consists of a chamber palm, a palm cap and three fingers. First, the design of the gripper is analyzed. Then, after the theoretical model is constructed, finite element (FE) simulations are conducted to verify the built model. Finally, a series of grasping experiments is carried out to assess the snap-through behavior of the proposed gripper on grasping and sensing. The experimental results illustrate that the proposed gripper can manipulate a variety of soft and rigid objects and remain stable even when subjected to external disturbances.

Submitted 7 November, 2021; originally announced November 2021.

arXiv:2108.01769 [pdf, other]

An Empirical Evaluation of End-to-End Polyphonic Optical Music Recognition

Authors: Sachinda Edirisooriya, Hao-Wen Dong, Julian McAuley, Taylor Berg-Kirkpatrick

Abstract: Previous work has shown that neural architectures are able to perform optical music recognition (OMR) on monophonic and homophonic music with high accuracy. However, piano and orchestral scores frequently exhibit polyphonic passages, which add a second dimension to the task. Monophonic and homophonic music can be described as homorhythmic, or having a single musical rhythm. Polyphonic music, on the other hand, can be seen as having multiple rhythmic sequences, or voices, concurrently. We first introduce a workflow for creating large-scale polyphonic datasets suitable for end-to-end recognition from sheet music publicly available on the MuseScore forum. We then propose two novel formulations for end-to-end polyphonic OMR -- one treating the problem as a type of multi-task binary classification, and the other treating it as multi-sequence detection. Building upon the encoder-decoder architecture and an image encoder proposed in past work on end-to-end OMR, we propose two novel decoder models -- FlagDecoder and RNNDecoder -- that correspond to the two formulations. Finally, we compare the empirical performance of these end-to-end approaches to polyphonic OMR and observe a new state-of-the-art performance with our multi-sequence detection decoder, RNNDecoder.

Submitted 3 August, 2021; originally announced August 2021.

Comments: Accepted to ISMIR 2021

arXiv:2108.00184 [pdf, other]

Performance assessment and tuning of PID control using TLBO: the single-loop case and PI/P cascade case

Authors: Wei Zhang, He Dong, Yunlang Xu, Xiaoping Li

Abstract: Proportional-integral-derivative (PID) control, the most common control strategy in the industry, always suffers from health problems resulting from external disturbances, improper tuning, etc. Therefore, there have been many studies on control performance assessment (CPA) and optimal tuning. Minimum output variance (MOV) is used as a benchmark for CPA of PID, but it is difficult to find due to the associated non-convex optimization problem. For the optimal tuning, many different objective functions have been proposed, but few consider the stochastic disturbance rejection. In this paper, a multi-objective function simultaneously considering integral of absolute error (IAE) and MOV is proposed to optimize PID for better disturbance rejection. The non-convex problem and multi-objective problem are solved by teaching-learning-based optimization (TLBO). This stochastic optimization algorithm can guarantee a tighter lower bound for MOV due to its excellent capability of local optima avoidance and needs less calculation time due to its low complexity. Furthermore, CPA and the tuning method are extended to the PI/P cascade case. The results of several numerical examples of CPA problems show that TLBO can generate better MOV than existing methods within one second on most examples. The simulation results of the tuning method applied to two temperature control systems reveal that the weight of the multi-objective function can compromise other performance criteria such as overshoot and settling time to improve the disturbance rejection. They also indicate that the tuning method can be applied to a multi-stage PID control strategy to resolve the contradiction between disturbance rejection and other performance criteria.

Submitted 31 July, 2021; originally announced August 2021.
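
The integral of absolute error (IAE) named in the abstract above is the time integral of |e(t)|, where e(t) is the difference between setpoint and output. The following is a rough sketch of how it could be evaluated for a candidate tuning. It is not the paper's method: the first-order plant, the forward-Euler simulation, and the names `simulate_pi_step` and `iae` are all assumptions made for illustration, and neither the MOV term nor the TLBO optimizer is implemented.

```python
def simulate_pi_step(kp, ki, tau=1.0, dt=0.01, t_end=10.0, setpoint=1.0):
    """Return the error samples of a PI loop responding to a setpoint step.

    Assumed plant (illustrative only): tau * dy/dt = -y + u,
    integrated with forward Euler at step size dt.
    """
    y = 0.0          # plant output
    integral = 0.0   # running integral of the error
    errors = []
    for _ in range(int(round(t_end / dt))):
        e = setpoint - y
        errors.append(e)
        integral += e * dt
        u = kp * e + ki * integral   # PI control law
        y += dt * (-y + u) / tau     # Euler step of the plant
    return errors


def iae(errors, dt):
    """Rectangle-rule approximation of the integral of |e(t)| dt."""
    return sum(abs(e) for e in errors) * dt


if __name__ == "__main__":
    dt = 0.01
    for kp, ki in [(0.5, 0.1), (2.0, 1.0)]:
        score = iae(simulate_pi_step(kp, ki, dt=dt), dt)
        print(f"kp={kp}, ki={ki}: IAE ~ {score:.3f}")
```

A tuner in the spirit of the abstract would combine a score like this with an output-variance term in a weighted objective and search over the controller gains.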

arXiv:2107.05916 [pdf, other]

Towards Automatic Instrumentation by Learning to Separate Parts in Symbolic Multitrack Music

Authors: Hao-Wen Dong, Chris Donahue, Taylor Berg-Kirkpatrick, Julian McAuley

Abstract: Modern keyboards allow a musician to play multiple instruments at the same time by assigning zones -- fixed pitch ranges of the keyboard -- to different instruments. In this paper, we aim to further extend this idea and examine the feasibility of automatic instrumentation -- dynamically assigning instruments to notes in solo music during performance. In addition to the online, real-time-capable setting for performative use cases, automatic instrumentation can also find applications in assistive composing tools in an offline setting. Due to the lack of paired data of original solo music and their full arrangements, we approach automatic instrumentation by learning to separate parts (e.g., voices, instruments and tracks) from their mixture in symbolic multitrack music, assuming that the mixture is to be played on a keyboard. We frame the task of part separation as a sequential multi-class classification problem and adopt machine learning to map sequences of notes into sequences of part labels. To examine the effectiveness of our proposed models, we conduct a comprehensive empirical evaluation over four diverse datasets of different genres and ensembles -- Bach chorales, string quartets, game music and pop music. Our experiments show that the proposed models outperform various baselines. We also demonstrate the potential for our proposed models to produce alternative convincing instrumentations for an existing arrangement by separating its mixture into parts. All source code and audio samples can be found at https://salu133445.github.io/arranger/ .

Submitted 21 October, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: ISMIR 2021 camera ready

arXiv:2105.13296 [pdf, other]

doi 10.1109/JSTSP.2022.3144020

Federated Meta Learning Enhanced Acoustic Radio Cooperative Framework for Ocean of Things Underwater Acoustic Communications

Authors: Hao Zhao, Fei Ji, Quansheng Guan, Qiang Li, Shuai Wang, Hefeng Dong, Miaowen Wen

Abstract: Sixth-generation wireless communication (6G) will be an integrated architecture of "space, air, ground and sea". One of the most difficult parts of this architecture is underwater information acquisition, which needs to transmit information across the interface between water and air. In this scenario, ocean of things (OoT) will play an important role, because it can serve as a hub connecting Internet of things (IoT) and Internet of underwater things (IoUT). An OoT device can not only collect data through underwater methods, but can also utilize radio frequency over the air. For underwater communications, underwater acoustic communications (UWA COMMs) is the most effective way for OoT devices to exchange information, but it is always tormented by Doppler shift and synchronization errors. In this paper, in order to overcome tough UWA conditions, a deep neural network based receiver for underwater acoustic chirp communication, called C-DNN, is proposed. Moreover, to improve the performance of the DL model and solve the problem of model generalization, we also propose a novel federated meta learning (FML) enhanced acoustic radio cooperative (ARC) framework, dubbed ARC/FML, to perform transfer. In particular, tractable expressions are derived for the convergence rate of FML in a wireless setting, accounting for the effects of the scheduling ratio, local epochs, and the data amount on a single node. Our analysis and simulation results show that the proposed C-DNN can provide better BER performance and lower complexity than the classical matched filter (MF) in underwater acoustic communication scenarios. The ARC/FML framework converges better than federated learning (FL) under a variety of channels. In summary, the proposed ARC/FML for OoT is a promising scheme for information exchange across water and air.

Submitted 23 May, 2021; originally announced May 2021.

arXiv:2011.12754 [pdf, other]

Feature Selection based on Principal Component Analysis for Underwater Source Localization by Deep Learning

Authors: Xiaoyu Zhu, Hefeng Dong, Pierluigi Salvo Rossi, Martin Landrø

Abstract: In this paper, we propose an interpretable feature selection method based on principal component analysis (PCA) and principal component regression (PCR), which can extract important features for underwater source localization by only introducing the source location without other prior information. This feature selection method is combined with a two-step framework for underwater source localization based on the semi-supervised learning scheme. In the framework, the first step utilizes a convolutional autoencoder to extract the latent features from the whole available dataset. The second step performs source localization via an encoder multi-layer perceptron (MLP) trained on a limited labeled portion of the dataset. The proposed approach has been validated on the public dataset SWellEx-96 Event S5. The results show that the framework has appealing accuracy and robustness on unseen data, especially as the amount of training data gradually decreases. After feature selection, not only is the training stage accelerated by 95%, but the framework also becomes more robust in depth and more accurate when the amount of labeled training data is extremely limited.

Submitted 25 November, 2020; originally announced November 2020.

arXiv:2009.09361 [pdf, other]

Lyapunov-Based Reinforcement Learning for Decentralized Multi-Agent Control

Authors: Qingrui Zhang, Hao Dong, Wei Pan

Abstract: Decentralized multi-agent control has broad applications, ranging from multi-robot cooperation to distributed sensor networks. In decentralized multi-agent control, systems are complex with unknown or highly uncertain dynamics, where traditional model-based control methods can hardly be applied. Compared with model-based control in control theory, deep reinforcement learning (DRL) is promising for learning the controller/policy from data without knowing the system dynamics. However, directly applying DRL to decentralized multi-agent control is challenging, as interactions among agents make the learning environment non-stationary. More importantly, the existing multi-agent reinforcement learning (MARL) algorithms cannot ensure the closed-loop stability of a multi-agent system from a control-theoretic perspective, so the learned control policies are highly likely to generate abnormal or dangerous behaviors in real applications. Hence, without a stability guarantee, the application of the existing MARL algorithms to real multi-agent systems, e.g., UAVs, robots, and power systems, is of great concern. In this paper, we aim to propose a new MARL algorithm for decentralized multi-agent control with a stability guarantee. The new MARL algorithm, termed multi-agent soft actor-critic (MASAC), is proposed under the well-known framework of "centralized-training-with-decentralized-execution". The closed-loop stability is guaranteed by the introduction of a stability constraint during the policy improvement in our MASAC algorithm. The stability constraint is designed based on Lyapunov's method in control theory. To demonstrate its effectiveness, we present a multi-agent navigation example to show the efficiency of the proposed MASAC algorithm.

Submitted 20 September, 2020; originally announced September 2020.

Comments: Accepted to The 2nd International Conference on Distributed Artificial Intelligence

arXiv:2009.00072 [pdf]

Under Water Waste Cleaning by Mobile Edge Computing and Intelligent Image Processing Based Robotic Fish

Authors: Subhadeep Sahoo, Xiao Han Dong, Zi Qian Liu, Joydeep Sahoo

Abstract: As water pollution is a serious threat to underwater resources, i.e., underwater plants and species, we focus on protecting the resources by cleaning the non-biodegradable waste from the water. The waste can be recycled for further usage. Here we design a robotic fish which mainly comprises an optical biosensor, a camera module, a piston module, and a wireless transceiver. By exploiting the LTE and 5G network architecture, the fish stores the information about the underwater waste in the nearest mobile edge computing server as well as in the centralized cloud server. Finally, when the fish clears the underwater waste, it offloads the captured image of the located object to the mobile edge computing server or sometimes to the cloud server for making a decision. The servers employ intelligent image processing technology and an adaptive learning process to make a decision. However, if the servers fail to make a decision, then the fish utilizes its optical biosensor. With this scheme, the time delay for clearing any water body is minimized and the waste collection capacity of the fish is maximized. This technique can effectively help government or municipal personnel clean water bodies without manual effort.

Submitted 31 August, 2020; originally announced September 2020.

Comments: This is an innovative project report awarded by Ericsson Innovation Award 2019

arXiv:2008.01951 [pdf, other]

MusPy: A Toolkit for Symbolic Music Generation

Authors: Hao-Wen Dong, Ke Chen, Julian McAuley, Taylor Berg-Kirkpatrick

Abstract: In this paper, we present MusPy, an open source Python library for symbolic music generation. MusPy provides easy-to-use tools for essential components in a music generation system, including dataset management, data I/O, data preprocessing and model evaluation. In order to showcase its potential, we present statistical analysis of the eleven datasets currently supported by MusPy. Moreover, we conduct a cross-dataset generalizability experiment by training an autoregressive model on each dataset and measuring held-out likelihood on the others, a process which is made easier by MusPy's dataset management system. The results provide a map of domain overlap between various commonly used datasets and show that some datasets contain more representative cross-genre samples than others. Along with the dataset analysis, these results might serve as a guide for choosing datasets in future research. Source code and documentation are available at https://github.com/salu133445/muspy .

Submitted 5 August, 2020; originally announced August 2020.

Comments: Accepted by International Society for Music Information Retrieval Conference (ISMIR), 2020

arXiv:2001.03831 [pdf, other]

A Comparative Study for Non-rigid Image Registration and Rigid Image Registration

Authors: Xiaoran Zhang, Hexiang Dong, Di Gao, Xiao Zhao

Abstract: Image registration algorithms can be generally categorized into two groups: non-rigid and rigid. Recently, many deep learning-based algorithms employ a neural net to characterize the non-rigid image registration function. However, do they always perform better? In this study, we compare the state-of-the-art deep learning-based non-rigid registration approach with a rigid registration approach. The data is generated from the Kaggle Dog vs Cat Competition (https://www.kaggle.com/c/dogs-vs-cats/) and we test the algorithms' performance on rigid transformation including translation, rotation, scaling, shearing and pixelwise non-rigid transformation. The Voxelmorph is trained on rigidset and nonrigidset separately for comparison and we also add a Gaussian blur layer to its original architecture to improve registration performance. The best quantitative results in both root-mean-square error (RMSE) and mean absolute error (MAE) metrics for rigid registration are produced by SimpleElastix and non-rigid registration by Voxelmorph. We select representative samples for visual assessment.

Submitted 11 January, 2020; originally announced January 2020.
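
The two metrics named in the abstract above, root-mean-square error (RMSE) and mean absolute error (MAE), can be computed between a registered image and its reference as in the following minimal sketch. It assumes both images are flattened into equal-length lists of pixel intensities. The helper names `rmse` and `mae` are illustrative and are not taken from the paper's code.

```python
import math


def _check(a, b):
    # Both images must be non-empty and contain the same number of pixels.
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(registered, reference):
    """Root-mean-square error over corresponding pixel intensities."""
    _check(registered, reference)
    squared = sum((x - y) ** 2 for x, y in zip(registered, reference))
    return math.sqrt(squared / len(registered))


def mae(registered, reference):
    """Mean absolute error over corresponding pixel intensities."""
    _check(registered, reference)
    return sum(abs(x - y) for x, y in zip(registered, reference)) / len(registered)


if __name__ == "__main__":
    moved = [0.0, 0.0]
    fixed = [3.0, 4.0]
    print(f"RMSE: {rmse(moved, fixed):.4f}, MAE: {mae(moved, fixed):.4f}")
```

RMSE penalizes large per-pixel deviations more heavily than MAE, which is one reason a comparison might report both.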

arXiv:2001.02360 [pdf, other]

Automatic Melody Harmonization with Triad Chords: A Comparative Study

Authors: Yin-Cheng Yeh, Wen-Yi Hsiao, Satoru Fukayama, Tetsuro Kitahara, Benjamin Genchel, Hao-Min Liu, Hao-Wen Dong, Yian Chen, Terence Leong, Yi-Hsuan Yang

Abstract: Several prior works have proposed various methods for the task of automatic melody harmonization, in which a model aims to generate a sequence of chords to serve as the harmonic accompaniment of a given multiple-bar melody sequence. In this paper, we present a comparative study evaluating and comparing the performance of a set of canonical approaches to this task, including a template matching based model, a hidden Markov based model, a genetic algorithm based model, and two deep learning based models. The evaluation is conducted on a dataset of 9,226 melody/chord pairs we newly collect for this study, considering up to 48 triad chords, using a standardized training/test split. We report the result of an objective evaluation using six different metrics and a subjective study with 202 participants.

Submitted 27 April, 2021; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: 20 pages, 6 figures, published in Journal of New Music Research (JNMR), Volume 50 Issue 1

arXiv:1911.03461 [pdf, other]

AIM 2019 Challenge on Image Demoireing: Methods and Results

Authors: Shanxin Yuan, Radu Timofte, Gregory Slabaugh, Ales Leonardis, Bolun Zheng, Xin Ye, Xiang Tian, Yaowu Chen, Xi Cheng, Zhenyong Fu, Jian Yang, Ming Hong, Wenying Lin, Wenjin Yang, Yanyun Qu, Hong-Kyu Shin, Joon-Yeon Kim, Sung-Jea Ko, Hang Dong, Yu Guo, Jie Wang, Xuan Ding, Zongyan Han, Sourya Dipta Das, Kuldeep Purohit, et al. (3 additional authors not shown)

Abstract: This paper reviews the first-ever image demoireing challenge that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ICCV 2019. This paper describes the challenge, and focuses on the proposed solutions and their results. Demoireing is a difficult task of removing moire patterns from an image to reveal an underlying clean image. A new dataset, called LCDMoire, was created for this challenge, and consists of 10,200 synthetically generated image pairs (moire and clean ground truth). The challenge was divided into 2 tracks. Track 1 targeted fidelity, measuring the ability of demoire methods to obtain a moire-free image compared with the ground truth, while Track 2 examined the perceptual quality of demoire methods. The tracks had 60 and 39 registered participants, respectively. A total of eight teams competed in the final testing phase. The entries span the current state of the art in the image demoireing problem.

Submitted 8 November, 2019; originally announced November 2019.

Comments: arXiv admin note: text overlap with arXiv:1911.02498

arXiv:1906.00884 [pdf, other]

Fashion Editing with Adversarial Parsing Learning

Authors: Haoye Dong, Xiaodan Liang, Yixuan Zhang, Xujie Zhang, Zhenyu Xie, Bowen Wu, Ziqi Zhang, Xiaohui Shen, Jian Yin

Abstract: Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value. Existing works often treat it as a general inpainting task and do not fully leverage the semantic structural information in fashion images. Moreover, they directly utilize conventional convolution and normalization layers to restore the incomplete image, which tends to wash away the sketch and color information. In this paper, we propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which is capable of manipulating fashion images by free-form sketches and sparse color strokes. FE-GAN consists of two modules: 1) a free-form parsing network that learns to control the human parsing generation by manipulating sketch and color; 2) a parsing-aware inpainting network that renders detailed textures with semantic guidance from the human parsing map. A new attention normalization layer is further applied at multiple scales in the decoder of the inpainting network to enhance the quality of the synthesized image. Extensive experiments on high-resolution fashion image datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods on image manipulation.

Submitted 28 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

Comments: 22 pages, 18 figures

arXiv:1808.00312 [pdf, ps, other]

On a hierarchical control strategy for multi-agent formation without reflection

Authors: Toshiharu Sugie, Brian D. O. Anderson, Zhiyong Sun, Huichao Dong

Abstract: This paper considers a formation shape control problem for point agents in a two-dimensional ambient space, where the control is distributed, is based on achieving desired distances between nominated agent pairs, and avoids the possibility of reflection ambiguities. This has potential applications for large-scale multi-agent systems having simple information exchange structure. One solution to this type of problem, applicable to formations with just three or four agents, was recently given by considering a potential function which consists of both distance error and signed triangle area terms. However, it seems to be challenging to apply it to formations with more than four agents. This paper shows a hierarchical control strategy which can be applicable to any number of agents based on the above type of potential function and a formation shaping incorporating a grouping of equilateral triangles, so that all controlled distances are in fact the same. A key analytical result and some numerical results are shown to demonstrate the effectiveness of the proposed method.

Submitted 1 August, 2018; originally announced August 2018.

Comments: Accepted by the 57th IEEE Conference on Decision and Control

arXiv:1804.09399 [pdf, other]

Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation

Authors: Hao-Wen Dong, Yi-Hsuan Yang

Abstract: It has been shown recently that deep convolutional generative adversarial networks (GANs) can learn to generate music in the form of piano-rolls, which represent music by binary-valued time-pitch matrices. However, existing models can only generate real-valued piano-rolls and require further post-processing, such as hard thresholding (HT) or Bernoulli sampling (BS), to obtain the final binary-valued results. In this paper, we study whether we can have a convolutional GAN model that directly creates binary-valued piano-rolls by using binary neurons. Specifically, we propose to append to the generator an additional refiner network, which uses binary neurons at the output layer. The whole network is trained in two stages. Firstly, the generator and the discriminator are pretrained. Then, the refiner network is trained along with the discriminator to learn to binarize the real-valued piano-rolls the pretrained generator creates. Experimental results show that using binary neurons instead of HT or BS indeed leads to better results in a number of objective measures. Moreover, deterministic binary neurons perform better than stochastic ones in both objective measures and a subjective test. The source code, training data and audio examples of the generated results can be found at https://salu133445.github.io/bmusegan/.

Submitted 6 October, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

Comments: A preliminary version of this paper appeared in ISMIR 2018. In this version, we added an appendix to provide figures of sample results and remarks on the end-to-end models

arXiv:1709.06298 [pdf, other]

MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment

Authors: Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, Yi-Hsuan Yang

Abstract: Generating music has a few notable differences from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks with their own temporal dynamics, but collectively they unfold over time interdependently. Lastly, musical notes are often grouped into chords, arpeggios or melodies in polyphonic music, and thereby introducing a chronological ordering of notes is not naturally suitable. In this paper, we propose three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs). The three models, which differ in the underlying assumptions and accordingly the network architectures, are referred to as the jamming model, the composer model and the hybrid model. We trained the proposed models on a dataset of over one hundred thousand bars of rock music and applied them to generate piano-rolls of five tracks: bass, drums, guitar, piano and strings. A few intra-track and inter-track objective metrics are also proposed to evaluate the generative results, in addition to a subjective user study. We show that our models can generate coherent music of four bars right from scratch (i.e. without human inputs). We also extend our models to human-AI cooperative music generation: given a specific track composed by a human, we can generate four additional tracks to accompany it. All code, the dataset and the rendered audio samples are available at https://salu133445.github.io/musegan/.

Submitted 24 November, 2017; v1 submitted 19 September, 2017; originally announced September 2017.

Comments: to appear at AAAI 2018

Showing 1–49 of 49 results for author: Dong, H