MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

Authors: Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji

Abstract: This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2112.00702 [pdf, other]

Semi-supervised music emotion recognition using noisy student training and harmonic pitch class profiles

Authors: Hao Hao Tan

Abstract: We present Mirable's submission to the 2021 Emotions and Themes in Music challenge. In this work, we intend to address the question: can we leverage semi-supervised learning techniques on music emotion recognition? With that, we experiment with noisy student training, which has improved model performance in the image classification domain. As the noisy student method requires a strong teacher model, we further delve into factors including (i) input training length and (ii) complementary music representations to further boost the performance of the teacher model. For (i), we find that models trained with short input length perform better in PR-AUC, whereas those trained with long input length perform better in ROC-AUC. For (ii), we find that using harmonic pitch class profiles (HPCP) consistently improves tagging performance, which suggests that harmonic representation is useful for music emotion tagging. Finally, we find that the noisy student method only improves tagging results for the case of long training length. Additionally, we find that ensembling representations trained with different training lengths can improve tagging results significantly, which suggests a possible direction for future work: incorporating multiple temporal resolutions in the network architecture.

Submitted 9 December, 2021; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: MediaEval 2021 submission for Emotion and Themes in Music

arXiv:2109.07099 [pdf]

Self-powered InP Nanowire Photodetector for Single Photon Level Detection at Room Temperature

Authors: Yi Zhu, Vidur Raj, Ziyuan Li, Hark Hoe Tan, Chennupati Jagadish, Lan Fu

Abstract: Highly sensitive photodetectors with single photon level detection are key components of a range of emerging technologies, in particular the ever-growing fields of optical communication, remote sensing, and quantum computing. Currently, most single-photon detection technologies require external biasing at high voltages and/or cooling to low temperatures, posing great limitations for wider applications. Here, we demonstrate InP nanowire array photodetectors that can achieve single-photon level light detection at room temperature without an external bias. We use top-down etched, heavily doped p-type InP nanowires and n-type AZO/ZnO carrier selective contact to form a radial p-n junction with a built-in electric field exceeding 3x10^5 V/cm at 0 V. The device exhibits broadband light sensitivity and can distinguish a single photon per pulse from the dark noise at 0 V, enabled by its design to realize near-ideal broadband absorption, extremely low dark current, and highly efficient charge carrier separation. Meanwhile, the bandwidth of the device reaches above 600 MHz with a timing jitter of 538 ps. The proposed device design provides a new pathway towards low-cost, high-sensitivity, self-powered photodetectors for numerous future applications.

Submitted 15 September, 2021; originally announced September 2021.

arXiv:2007.15474 [pdf, other]

Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling

Authors: Hao Hao Tan, Dorien Herremans

Abstract: High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, which is inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t. the modelled low-level attributes of the generated output music. Furthermore, we demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes (rhythm and note density), with only 1% of the training set being labelled. Finally, using the learnt high-level feature representations, we explore the application of our framework in style transfer tasks across different arousal states. The effectiveness of this approach is verified through a subjective listening test.

Submitted 29 July, 2020; originally announced July 2020.

Journal ref: Proc. of 21st International Society of Music Information Retrieval Conference, ISMIR 2020

arXiv:2006.09833 [pdf, other]

Generative Modelling for Controllable Audio Synthesis of Expressive Piano Performance

Authors: Hao Hao Tan, Yin-Jyun Luo, Dorien Herremans

Abstract: We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follow temporal conditions of two essential style features for piano performances: articulation and dynamics. We demonstrate how the model is able to apply fine-grained style morphing over the course of synthesizing the audio. This is based on conditions which are latent variables that can be sampled from the prior or inferred from other pieces. One of the envisioned use cases is to inspire creative and brand new interpretations for existing pieces of piano music.

Submitted 12 July, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

Journal ref: Published at ICML Workshop on Machine Learning for Media Discovery (ML4MD) 2020

arXiv:1902.06996 [pdf]

Agent Madoff: A Heuristic-Based Negotiation Agent For The Diplomacy Strategy Game

Authors: Hao Hao Tan

Abstract: In this paper, we present the strategy of Agent Madoff, which is a heuristic-based negotiation agent that won 2nd place at the Automated Negotiating Agents Competition (ANAC 2017). Agent Madoff is implemented to play the game Diplomacy, which is a strategic board game that mimics the situation during World War I. Each player represents a major European power which has to negotiate with other forces and win possession of a majority of the supply centers on the map. We propose a design architecture which consists of 3 components: heuristic module, acceptance strategy and bidding strategy. The heuristic module, responsible for evaluating which regions on the graph are more worthy, considers the type of region and the number of supply centers adjacent to the region, and returns a utility value for each region on the map. The acceptance strategy is done on a case-by-case basis according to the type of the order by calculating the acceptance probability using a composite function. The bidding strategy adopts a defensive approach aimed to neutralize attacks and resolve conflict moves with other players to minimize our loss on supply centers.

Submitted 19 February, 2019; originally announced February 2019.

Showing 1–6 of 6 results for author: Tan, H H