Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

Abstract: Recent deep music generation studies have put much emphasis on long-term generation with structures. However, we are yet to see high-quality, well-structured whole-song generation. In this paper, we make the first attempt to model a full music piece under the realization of compositional hierarchy. With a focus on symbolic representations of pop songs, we define a hierarchical language, in which each level of hierarchy focuses on the semantics and context dependency at a certain music scope. The high-level languages reveal whole-song form, phrase, and cadence, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, where each level is conditioned on its upper levels. Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and the music quality is higher than the baselines. Additionally, we show that the proposed model is controllable in a flexible way. By sampling from the interpretable hierarchical languages or adjusting pre-trained external representations, users can control the music flow via various features such as phrase harmonic structures, rhythmic patterns, and accompaniment texture.

Submitted 16 May, 2024; originally announced May 2024.

Comments: Proceedings of the International Conference on Learning Representations (ICLR 2024)

MSC Class: 68Txx

arXiv:2404.06393 [pdf, other]

MuPT: A Generative Symbolic Music Pretrained Transformer

Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (4 additional authors not shown)

Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2307.10304 [pdf, other]

Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls

Authors: Lejun Min, Junyan Jiang, Gus Xia, Jingwei Zhao

Abstract: We propose Polyffusion, a diffusion model that generates polyphonic music scores by regarding music as image-like piano roll representations. The model is capable of controllable music generation with two paradigms: internal control and external control. Internal control refers to the process in which users pre-define a part of the music and then let the model infill the rest, similar to the task of masked music generation (or music inpainting). External control conditions the model with external yet related information, such as chord, texture, or other features, via the cross-attention mechanism. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks, including melody generation given accompaniment, accompaniment generation given melody, arbitrary music segment inpainting, and music arrangement given chords or textures. Experimental results show that our model significantly outperforms existing Transformer and sampling-based baselines, and using pre-trained disentangled representations as external conditions yields more effective controls.

Submitted 19 July, 2023; originally announced July 2023.

Comments: In Proceedings of the 24th Conference of the International Society for Music Information Retrieval (ISMIR 2023), Milan, Italy

arXiv:2307.03918 [pdf]

VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation

Authors: Congqi Cao, Ze Sun, Qinyi Lv, Lingtong Min, Yanning Zhang

Abstract: Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network to boost the anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework in this paper. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time. We propose to use the semantic features generated based on the class labels or directly from the visual observations to augment the original visual features. Secondly, an effective visual-semantic fusion module is proposed to make up for the semantic gap and fully utilize the complementarity of different modalities. Thirdly, to take advantage of both the parallel and autoregressive models, we design a Transformer based encoder for long-term sequential modeling and a GRU-based decoder for flexible iteration decoding. Extensive experiments on two large-scale first-person view datasets, i.e., EPIC-Kitchens and EGTEA Gaze+, validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin.

Submitted 8 July, 2023; originally announced July 2023.

Comments: 12 pages, 7 figures

arXiv:2307.02974 [pdf, other]

Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution

Authors: Yuting Lu, Lingtong Min, Binglu Wang, Le Zheng, Xiaoxu Wang, Yongqiang Zhao, Teng Long

Abstract: Remote sensing image super-resolution (RSISR) plays a vital role in enhancing spatial details and improving the quality of satellite imagery. Recently, Transformer-based models have shown competitive performance in RSISR. To mitigate the quadratic computational complexity resulting from global self-attention, various methods constrain attention to a local window, enhancing its efficiency. Consequently, the receptive fields in a single attention layer are inadequate, leading to insufficient context modeling. Furthermore, while most Transformer-based approaches reuse shallow features through skip connections, relying solely on these connections treats shallow and deep features equally, impeding the model's ability to characterize them. To address these issues, we propose a novel transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR. Our proposed model effectively enhances global cognition and understanding of the entire image, facilitating efficient integration of features across stages. The model incorporates cross-spatial pixel integration attention (CSPIA) to introduce contextual information into a local window, while cross-stage feature fusion attention (CSFFA) adaptively fuses features from the previous stage to improve feature expression in line with the requirements of the current stage. We conducted comprehensive experiments on multiple benchmark datasets, demonstrating the superior performance of our proposed SPIFFNet in terms of both quantitative metrics and visual quality when compared to state-of-the-art methods.

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2112.09939 [pdf, other]

Syntactic-GCN Bert based Chinese Event Extraction

Authors: Jiangwei Liu, Jingshu Zhang, Xiaohong Huang, Liangyu Min

Abstract: With the rapid development of information technology, online platforms (e.g., news portals and social media) generate enormous web information every moment. Therefore, it is crucial to extract structured representations of events from social streams. Generally, existing event extraction research utilizes pattern matching, machine learning, or deep learning methods to perform event extraction tasks. However, the performance of Chinese event extraction is not as good as English due to the unique characteristics of the Chinese language. In this paper, we propose an integrated framework to perform Chinese event extraction. The proposed approach is a multiple channel input neural framework that integrates semantic features and syntactic features. The semantic features are captured by BERT architecture. The Part of Speech (POS) features and Dependency Parsing (DP) features are captured by profiling embeddings and Graph Convolutional Network (GCN), respectively. We also evaluate our model on a real-world dataset. Experimental results show that the proposed method outperforms the benchmark approaches significantly.

Submitted 18 December, 2021; originally announced December 2021.

Comments: 9 pages, 4 figures, 3 tables. arXiv admin note: text overlap with arXiv:2111.03212

arXiv:2111.11350 [pdf]

ShufaNet: Classification method for calligraphers who have reached the professional level

Authors: Ge Yunfei, Diao Changyu, Li Min, Yu Ruohan, Qiu Linshan, Xu Duanqing

Abstract: Authenticating calligraphy is a significant but difficult task in the realm of art, where the key problem is the few-shot classification of calligraphy. We propose a novel method, ShufaNet ("Shufa" is the pinyin of Chinese calligraphy), to classify Chinese calligraphers' styles based on metric learning in the few-shot setting, whose classification accuracy exceeds the level of students majoring in calligraphy. We present a new network architecture, including a unique expression of the style of handwriting fonts called ShufaLoss and the calligraphy category information as prior knowledge. Meanwhile, we modify the spatial attention module and create ShufaAttention for handwriting fonts based on the traditional Chinese Nine Palace thought. For the training of the model, we build a calligraphers' data set. Our method achieved a 65% accuracy rate on our data set for few-shot learning, surpassing ResNet and other mainstream CNNs. Meanwhile, we compared our method against students majoring in calligraphy, and it surpassed them. This is the first attempt at applying deep learning to calligrapher classification, and we expect to provide ideas for subsequent research.

Submitted 22 November, 2021; originally announced November 2021.

Comments: 10 pages, 11 figures

arXiv:2111.03212 [pdf, other]

An overview of event extraction and its applications

Authors: Jiangwei Liu, Liangyu Min, Xiaohong Huang

Abstract: With the rapid development of information technology, online platforms have produced enormous text resources. As a particular form of Information Extraction (IE), Event Extraction (EE) has gained increasing popularity due to its ability to automatically extract events from human language. However, there are limited literature surveys on event extraction. Existing review works either spend much effort describing the details of various approaches or focus on a particular field. This study provides a comprehensive overview of the state-of-the-art event extraction methods and their applications from text, including closed-domain and open-domain event extraction. A trait of this survey is that it provides an overview of moderate complexity, avoiding too many details of particular approaches. This study focuses on discussing the common characteristics, application fields, advantages, and disadvantages of representative works, ignoring the specificities of individual approaches. Finally, we summarize the common issues, current solutions, and future research directions. We hope this work could help researchers and practitioners obtain a quick overview of recent event extraction.

Submitted 4 November, 2021; originally announced November 2021.

arXiv:2105.06284 [pdf, other]

Ergodic Capacity of High Throughput Satellite Systems With Mixed FSO-RF Transmission

Authors: Kong Huaicong, Lin Min, Wang Zining, Ouyang Jian, Cheng Julian

Abstract: We study a high throughput satellite system, where the feeder link uses free-space optical (FSO) and the user link uses radio frequency (RF) communication. In particular, we first propose a transmit diversity using Alamouti space time block coding to mitigate the atmospheric turbulence in the feeder link. Then, based on the concept of average virtual signal-to-interference-plus-noise ratio and one-bit feedback, we propose a beamforming algorithm for the user link to maximize the ergodic capacity (EC). Moreover, by assuming that the FSO links follow the Malaga distribution whereas RF links undergo the shadowed-Rician fading, we derive a closed-form EC expression of the considered system. Finally, numerical simulations validate the accuracy of our theoretical analysis, and show that the proposed schemes can achieve higher capacity compared with the reference schemes.

Submitted 13 May, 2021; originally announced May 2021.

arXiv:2005.07225 [pdf, other]

SAGE: Sequential Attribute Generator for Analyzing Glioblastomas using Limited Dataset

Authors: Padmaja Jonnalagedda, Brent Weinberg, Jason Allen, Taejin L. Min, Shiv Bhanu, Bir Bhanu

Abstract: While deep learning approaches have shown remarkable performance in many imaging tasks, most of these methods rely on the availability of large quantities of data. Medical image data, however, is scarce and fragmented. Generative Adversarial Networks (GANs) have recently been very effective in handling such datasets by generating more data. If the datasets are very small, however, GANs cannot learn the data distribution properly, resulting in less diverse or low-quality results. One such limited dataset is that for the concurrent gain of chromosomes 19 and 20 (19/20 co-gain), a mutation with positive prognostic value in Glioblastomas (GBM). In this paper, we detect imaging biomarkers for the mutation to streamline the extensive and invasive prognosis pipeline. Since this mutation is relatively rare, i.e., the dataset is small, we propose a novel generative framework, the Sequential Attribute GEnerator (SAGE), that generates detailed tumor imaging features while learning from a limited dataset. Experiments show that not only does SAGE generate high quality tumors when compared to standard Deep Convolutional GAN (DC-GAN) and Wasserstein GAN with Gradient Penalty (WGAN-GP), it also captures the imaging biomarkers accurately.

Submitted 3 June, 2022; v1 submitted 14 May, 2020; originally announced May 2020.

arXiv:1704.03168 [pdf, other]

FMMU: A Hardware-Automated Flash Map Management Unit for Scalable Performance of NAND Flash-Based SSDs

Authors: Yeong-Jae Woo, Sang Lyul Min

Abstract: NAND flash-based Solid State Drives (SSDs), which are widely used from embedded systems to enterprise servers, are enhancing performance by exploiting the parallelism of NAND flash memories. To cope with the performance improvement of SSDs, storage systems have rapidly moved the host interface for SSDs from Serial-ATA, which is used for existing hard disk drives, to high-speed PCI express. Since NAND flash memory does not allow in-place updates, it requires special software called Flash Translation Layer (FTL), and SSDs are equipped with embedded processors to run FTL. Existing SSDs increase the clock frequency of embedded processors or increase the number of embedded processors in order to prevent FTL from acting as a bottleneck of SSD performance, but these approaches are not scalable. This paper proposes a hardware-automated Flash Map Management Unit, called FMMU, that handles the address translation process dominating the execution time of the FTL by hardware automation. FMMU provides methods for exploiting the parallelism of flash memory by processing outstanding requests in a non-blocking manner while reducing the number of flash operations. The experimental results show that the FMMU reduces the FTL execution time in the map cache hit case and the miss case by 44% and 37%, respectively, compared with the existing software-based approach operating in 4-core. FMMU also prevents FTL from acting as a performance bottleneck for up to 32-channel, 8-way SSD using PCIe 3.0 x32 host interface.

Submitted 11 April, 2017; originally announced April 2017.

arXiv:1612.04277 [pdf]

Copycat: A High Precision Real Time NAND Simulator

Authors: Juyong Shin, Jongbo Bae, Ansu Na, Sang Lyul Min

Abstract: In this paper, we describe the design and implementation of a high precision real time NAND simulator called Copycat that runs on a commodity multi-core desktop environment. This NAND simulator facilitates the development of embedded flash memory management software such as the flash translation layer (FTL). The simulator also allows a comprehensive fault injection for testing the reliability of the FTL. Compared against a real FPGA implementation, the simulator's response time deviation is under 0.28% on average, with a maximum of 10.12%.

Submitted 11 December, 2016; originally announced December 2016.

Showing 1–12 of 12 results for author: Min, L