Human-Centered LLM-Agent User Interface: A Position Paper

Authors: Daniel Chin, Yuxuan Wang, Gus Xia

Abstract: Large Language Model (LLM) -in-the-loop applications have been shown to effectively interpret the human user's commands, make plans, and operate external tools/systems accordingly. Still, the operation scope of the LLM agent is limited to passively following the user, requiring the user to frame his/her needs with regard to the underlying tools/systems. We note that the potential of an LLM-Agent User Interface (LAUI) is much greater. A user mostly ignorant to the underlying tools/systems should be able to work with a LAUI to discover an emergent workflow. Contrary to the conventional way of designing an explorable GUI to teach the user a predefined set of ways to use the system, in the ideal LAUI, the LLM agent is initialized to be proficient with the system, proactively studies the user and his/her needs, and proposes new interaction schemes to the user. To illustrate LAUI, we present Flute X GPT, a concrete example using an LLM agent, a prompt manager, and a flute-tutoring multi-modal software-hardware system to facilitate the complex, real-time user experience of learning to play the flute.

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2310.02383 [pdf, other]

Automatic Multi-Path Web Story Creation from a Structural Article

Authors: Daniel Nkemelu, Peggy Chi, Daniel Castro Chin, Krishna Srinivasan, Irfan Essa

Abstract: Web articles such as Wikipedia serve as one of the major sources of knowledge dissemination and online learning. However, their in-depth information--often in a dense text format--may not be suitable for mobile browsing, even in a responsive UI. We propose an automatic approach that converts a structural article of any length into a set of interactive Web Stories that are ideal for mobile experiences. We focused on Wikipedia articles and developed Wiki2Story, a pipeline based on language and layout models, to demonstrate the concept. Wiki2Story dynamically slices an article and plans one to multiple Story paths according to the document hierarchy. For each slice, it generates a multi-page summary Story composed of text and image pairs in visually-appealing layouts. We derived design principles from an analysis of manually-created Story practices. We executed our pipeline on 500 Wikipedia documents and conducted user studies to review selected outputs. Results showed that Wiki2Story effectively captured and presented salient content from the original articles and sparked interest in viewers.

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2306.01683 [pdf, other]

Balancing Exploration and Exploitation: Disentangled $β$-CVAE in De Novo Drug Design

Authors: Guang Jun Nicholas Ang, De Tao Irwin Chin, Bingquan Shen

Abstract: Deep generative models have recently emerged as a promising de novo drug design method. In this respect, deep generative conditional variational autoencoder (CVAE) models are a powerful approach for generating novel molecules with desired drug-like properties. However, molecular graph-based models with disentanglement and multivariate explicit latent conditioning have not been fully elucidated. To address this, we proposed a molecular-graph $β$-CVAE model for de novo drug design. Here, we empirically tuned the value of disentanglement and assessed its ability to generate molecules with optimised univariate- or-multivariate properties. In particular, we optimised the octanol-water partition coefficient (ClogP), molar refractivity (CMR), quantitative estimate of drug-likeness (QED), and synthetic accessibility score (SAS). Results suggest that a lower $β$ value increases the uniqueness of generated molecules (exploration). Univariate optimisation results showed our model generated molecular property averages of ClogP = 41.07% $\pm$ 0.01% and CMR 66.76% $\pm$ 0.01% by the Ghose filter. Multivariate property optimisation results showed that our model generated an average of 30.07% $\pm$ 0.01% molecules for both desired properties. Furthermore, our model improved the QED and SAS (exploitation) of molecules generated. Together, these results suggest that the $β$-CVAE could balance exploration and exploitation through disentanglement and is a promising model for de novo drug design, thus providing a basis for future studies.

Submitted 17 August, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

arXiv:2306.00983 [pdf, other]

StyleDrop: Text-to-Image Generation in Any Style

Authors: Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan

Abstract: Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than $1\%$ of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io

Submitted 1 June, 2023; originally announced June 2023.

Comments: Preprint. Project page at https://styledrop.github.io

arXiv:2302.10890 [pdf, other]

Learning Interpretable Low-dimensional Representation via Physical Symmetry

Authors: Xuanjie Liu, Daniel Chin, Yichen Huang, Gus Xia

Abstract: We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord and texture. However, most methods rely heavily on music domain knowledge. It remains an open question what general computational principles give rise to interpretable representations, especially low-dim factors that agree with human perception. In this study, we take inspiration from modern physics and use physical symmetry as a self consistency constraint for the latent space of time-series data. Specifically, it requires the prior model that characterises the dynamics of the latent states to be equivariant with respect to certain group transformations. We show that physical symmetry leads the model to learn a linear pitch factor from unlabelled monophonic music audio in a self-supervised fashion. In addition, the same methodology can be applied to computer vision, learning a 3D Cartesian space from videos of a simple moving object without labels. Furthermore, physical symmetry naturally leads to counterfactual representation augmentation, a new technique which improves sample efficiency.

Submitted 9 February, 2024; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2209.10259 [pdf, other]

Learning Hierarchical Metrical Structure Beyond Measures

Authors: Junyan Jiang, Daniel Chin, Yixiao Zhang, Gus Xia

Abstract: Music contains hierarchical structures beyond beats and measures. While hierarchical structure annotations are helpful for music information retrieval and computer musicology, such annotations are scarce in current digital music databases. In this paper, we explore a data-driven approach to automatically extract hierarchical metrical structures from scores. We propose a new model with a Temporal Convolutional Network-Conditional Random Field (TCN-CRF) architecture. Given a symbolic music score, our model takes in an arbitrary number of voices in a beat-quantized form, and predicts a 4-level hierarchical metrical structure from downbeat-level to section-level. We also annotate a dataset using RWC-POP MIDI files to facilitate training and evaluation. We show by experiments that the proposed method performs better than the rule-based approach under different orchestration settings. We also perform some simple musicological analysis on the model predictions. All demos, datasets and pre-trained models are publicly available on Github.

Submitted 21 September, 2022; originally announced September 2022.

Comments: Accepted at the International Society for Music Information Retrieval (ISMIR), 2022

arXiv:2107.08727 [pdf]

Measuring a Six-hole Recorder Flute's Response to Breath Pressure Variations and Fitting a Model

Authors: Daniel Chin, Gus Xia

Abstract: We propose the Siamese-flute method that measures the breath pressure and the acoustic sound in parallel. We fit a 6-DoF model to describe how the breath pressure affects the octave and the microtonal pitch bend, revealing the octave hysteresis. We release both our model parameters and our data analysis tools.

Submitted 19 July, 2021; originally announced July 2021.

arXiv:2004.13908 [pdf]

Interactive Rainbow Score: A Visual-centered Multimodal Flute Tutoring System

Authors: Daniel Chin, Yian Zhang, Tianyu Zhang, Jake Zhao, Gus G. Xia

Abstract: Learning to play an instrument is intrinsically multimodal, and we have seen a trend of applying visual and haptic feedback in music games and computer-aided music tutoring systems. However, most current systems are still designed to master individual pieces of music; it is unclear how well the learned skills can be generalized to new pieces. We aim to explore this question. In this study, we contribute Interactive Rainbow Score, an interactive visual system to boost the learning of sight-playing, the general musical skill to read music and map the visual representations to performance motions. The key design of Interactive Rainbow Score is to associate pitches (and the corresponding motions) with colored notation and further strengthen such association via real-time interactions. Quantitative results show that the interactive feature on average increases the learning efficiency by 31.1%. Further analysis indicates that it is critical to apply the interaction in the early period of learning.

Submitted 28 April, 2020; originally announced April 2020.

Comments: NIME 2020 poster presentation. 6 pages

arXiv:1906.01197 [pdf]

Adaptive Multimodal Music Learning via Interactive-haptic Instrument

Authors: Yian Zhang, Yinmiao Li, Daniel Chin, Gus Xia

Abstract: Haptic interfaces have untapped the sense of touch to assist multimodal music learning. We have recently seen various improvements of interface design on tactile feedback and force guidance aiming to make instrument learning more effective. However, most interfaces are still quite static; they cannot yet sense the learning progress and adjust the tutoring strategy accordingly. To solve this problem, we contribute an adaptive haptic interface based on the latest design of haptic flute. We first adopted a clutch mechanism to enable the interface to turn on and off the haptic control flexibly in real time. The interactive tutor is then able to follow human performances and apply the "teacher force" only when the software instructs so. Finally, we incorporated the adaptive interface with a step-by-step dynamic learning strategy. Experimental results showed that dynamic learning dramatically outperforms static learning, which boosts the learning rate by 45.3% and shrinks the forgetting chance by 86%.

Submitted 4 June, 2019; originally announced June 2019.

Comments: 6 pages, 14 figures, 2 tables. This paper was accepted by NIME 2019 (New Interfaces for Musical Expression)

ACM Class: H.5.5; I.2.9; I.2.6

Showing 1–9 of 9 results for author: Chin, D