
Showing 1–50 of 94 results for author: Mo, S

  1. arXiv:2407.09801  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.MM

    IoT-LM: Large Multisensory Language Models for the Internet of Things

    Authors: Shentong Mo, Russ Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang

    Abstract: The Internet of Things (IoT) network integrating billions of smart physical devices embedded with sensors, software, and communication technologies is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, and audio to recognize the states of humans and physical…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.06217

  2. arXiv:2407.03736  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    Semantic Grouping Network for Audio Source Separation

    Authors: Shentong Mo, Yapeng Tian

    Abstract: Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extracted high-level semantics from visual inputs as the guidance to help disentangle sound representation for individual sources. Can we directly learn to disentangle the individual semantics from the sound itself? The dilem…

    Submitted 4 July, 2024; originally announced July 2024.

  3. arXiv:2407.01552  [pdf]

    cs.NI physics.optics

    High Spectral-Efficiency, Ultra-low MIMO SDM Transmission over a Field-Deployed Multi-Core OAM Fiber

    Authors: Junyi Liu, Zengquan Xu, Shuqi Mo, Yuming Huang, Yining Huang, Zhenhua Li, Yuying Guo, Lei Shen, Shuo Xu, Ran Gao, Cheng Du, Qian Feng, Jie Luo, Jie Liu, Siyuan Yu

    Abstract: Few-mode multi-core fiber (FM-MCF) based Space-Division Multiplexing (SDM) systems possess the potential to maximize the number of multiplexed spatial channels per fiber by harnessing both the space (fiber cores) and mode (optical mode per core) dimensions. However, to date, no SDM transmissions over field-deployed FM-MCFs in realistic outdoor settings have been reported, which contrasts with SDM…

    Submitted 29 April, 2024; originally announced July 2024.

    Comments: 17 pages, 8 figures

  4. arXiv:2406.09386  [pdf, other]

    cs.CV

    SimGen: Simulator-conditioned Driving Scene Generation

    Authors: Yunsong Zhou, Michael Simon, Zhenghao Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou

    Abstract: Controllable synthetic data generation can substantially lower the annotation cost of training data in autonomous driving research and development. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, the trained models can only…

    Submitted 13 June, 2024; originally announced June 2024.

  5. arXiv:2406.07540  [pdf, other]

    cs.CV cs.LG

    Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

    Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

    Abstract: Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexib…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 18 pages, 11 figures, see project page at https://genforce.github.io/ctrl-x

  6. arXiv:2406.05038  [pdf, other]

    cs.CV cs.AI cs.LG

    Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

    Authors: Shentong Mo

    Abstract: Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their…

    Submitted 7 June, 2024; originally announced June 2024.

  7. arXiv:2406.04930  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

    Authors: Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

    Abstract: Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignmen…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted in Efficient Deep Learning for Computer Vision CVPR Workshop 2024

  8. arXiv:2405.17995  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

    Authors: Shentong Mo, Sukmin Yun

    Abstract: The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, resulting in a reduction of discriminative power and can…

    Submitted 28 May, 2024; originally announced May 2024.

  9. arXiv:2405.15881  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

    Authors: Shentong Mo, Yapeng Tian

    Abstract: In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, l…

    Submitted 24 May, 2024; originally announced May 2024.

  10. arXiv:2405.07202  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD eess.AS

    Unified Video-Language Pre-training with Synchronized Audio

    Authors: Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

    Abstract: Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either captured the correspondence of image-text pairs or utilized temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two m…

    Submitted 12 May, 2024; originally announced May 2024.

  11. arXiv:2404.17808  [pdf, other]

    cs.CL

    Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal

    Authors: Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Peng Liu, Hui Chen, Guiguang Ding

    Abstract: Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus while keeping all tokens that have be…

    Submitted 27 April, 2024; originally announced April 2024.

  12. arXiv:2404.13081  [pdf, other]

    cs.CL cs.AI cs.LG

    SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs

    Authors: Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, Jinwoo Shin

    Abstract: Large language models (LLMs) have made significant advancements in various natural language processing tasks, including question answering (QA) tasks. While incorporating new information with the retrieval of relevant passages is a promising way to improve QA with LLMs, the existing methods often require additional fine-tuning which becomes infeasible with recent LLMs. Augmenting retrieved passage…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Accepted at ICLR 2024

  13. arXiv:2404.12876  [pdf, other]

    cs.CV cs.AI cs.LG

    A Large-scale Medical Visual Task Adaptation Benchmark

    Authors: Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li

    Abstract: Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens. However, there is yet a large-scale benchmark to fully explore the effect of visual task adaptation on the realistic and important medical domain, particularly across diverse medical visual modalities, such…

    Submitted 19 April, 2024; originally announced April 2024.

  14. arXiv:2404.10308  [pdf, other]

    cs.LG cs.AI

    Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

    Authors: Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin

    Abstract: Large language models (LLMs) have shown remarkable performance in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications in positional encoding to relax the constraint, but they often require expensive training or do not address…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Accepted to ICLR 2024. The first two authors contributed equally

  15. arXiv:2404.02257  [pdf, other]

    cs.CV

    SnAG: Scalable and Accurate Video Grounding

    Authors: Fangzhou Mu, Sicheng Mo, Yin Li

    Abstract: Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the sca…

    Submitted 5 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Code available at https://github.com/fmu2/snag_release

  16. arXiv:2404.00509  [pdf, other]

    cs.LG cs.CV

    DailyMAE: Towards Pretraining Masked Autoencoders in One Day

    Authors: Jiantao Wu, Shentong Mo, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais

    Abstract: Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting how models pretrained on extensive datasets can enhance the performance of downstream tasks. However, the high computational demands of pretraining po…

    Submitted 30 March, 2024; originally announced April 2024.

  17. arXiv:2403.07938  [pdf, other]

    cs.SD cs.AI cs.CV cs.LG cs.MM eess.AS

    Text-to-Audio Generation Synchronized with Videos

    Authors: Shentong Mo, Jing Shi, Yapeng Tian

    Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2305.12903

  18. arXiv:2403.05659  [pdf, other]

    cs.CV

    Audio-Synchronized Visual Animation

    Authors: Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado

    Abstract: Current visual generation methods can produce high quality videos guided by texts. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. We introduce Audio Synchronized Visual Animation (ASVA), a task animating a static image to demonstrate motion dynamics, temporally guided by audio clips acros…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: 15 pages

  19. arXiv:2402.17406  [pdf, other]

    cs.CV cs.AI cs.LG

    LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

    Authors: Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li

    Abstract: Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed as prompts. Contemporary VPT methodologies, especially when employed with self-supervised vision transformers, often default to the introduction of new learnable prompts or gated prompt tokens predominan…

    Submitted 27 February, 2024; originally announced February 2024.

  20. arXiv:2402.14299  [pdf, other]

    cs.RO cs.AI

    We Choose to Go to Space: Agent-driven Human and Multi-Robot Collaboration in Microgravity

    Authors: Miao Xin, Zhongrui You, Zihan Zhang, Taoran Jiang, Tingjia Xu, Haotian Liang, Guojing Ge, Yuchen Ji, Shentong Mo, Jian Cheng

    Abstract: We present SpaceAgents-1, a system for learning human and multi-robot collaboration (HMRC) strategies under microgravity conditions. Future space exploration requires humans to work together with robots. However, acquiring proficient robot skills and adept collaboration under microgravity conditions poses significant challenges within ground laboratories. To address this issue, we develop a microg…

    Submitted 22 February, 2024; originally announced February 2024.

  21. arXiv:2401.03494  [pdf]

    cs.LG cs.CE physics.app-ph

    Pre-insertion resistors temperature prediction based on improved WOA-SVR

    Authors: Honghe Dai, Site Mo, Haoxin Wang, Nan Yin, Songhai Fan, Bixiong Li

    Abstract: The pre-insertion resistors (PIR) within high-voltage circuit breakers are critical components and warm up by generating Joule heat when an electric current flows through them. Elevated temperature can lead to temporary closure failure and, in severe cases, the rupture of PIR. To accurately predict the temperature of PIR, this study combines finite element simulation techniques with Support Vector…

    Submitted 7 January, 2024; originally announced January 2024.

  22. arXiv:2312.07536  [pdf, other]

    cs.CV

    FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

    Authors: Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou

    Abstract: Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each type of spatial condition, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Project Page: https://genforce.github.io/freecontrol/

  23. arXiv:2312.07231  [pdf, other]

    cs.CV cs.AI cs.LG

    Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

    Authors: Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li

    Abstract: Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose Fas…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Project Page: https://dit-3d.github.io/FastDiT-3D/

  24. arXiv:2312.06220  [pdf, other]

    cs.LG cs.AI

    Dance of Channel and Sequence: An Efficient Attention-Based Approach for Multivariate Time Series Forecasting

    Authors: Haoxin Wang, Yipeng Mo, Nan Yin, Honghe Dai, Bixiong Li, Songhai Fan, Site Mo

    Abstract: In recent developments, predictive models for multivariate time series analysis have exhibited commendable performance through the adoption of the prevalent principle of channel independence. Nevertheless, it is imperative to acknowledge the intricate interplay among channels, which fundamentally influences the outcomes of multivariate predictions. Consequently, the notion of channel independence,…

    Submitted 11 December, 2023; originally announced December 2023.

  25. arXiv:2312.01118  [pdf, other]

    cs.CV

    Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning

    Authors: Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, Zhenhua Feng, Muhammad Awais

    Abstract: Recently, self-supervised metric learning has raised attention for the potential to learn a generic distance function. It overcomes the limitations of conventional supervised one, e.g., scalability and label biases. Despite progress in this domain, current benchmarks, incorporating a narrow scope of classes, stop the nuanced evaluation of semantic representations. To bridge this gap, we introduce…

    Submitted 2 December, 2023; originally announced December 2023.

  26. arXiv:2312.01017  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD

    Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

    Authors: Shentong Mo, Pedro Morgado

    Abstract: Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challe…

    Submitted 1 December, 2023; originally announced December 2023.

  27. arXiv:2311.15080  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD eess.AS

    Weakly-Supervised Audio-Visual Segmentation

    Authors: Shentong Mo, Bhiksha Raj

    Abstract: Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotat…

    Submitted 25 November, 2023; originally announced November 2023.

  28. arXiv:2311.11285  [pdf, other]

    cs.LG

    TimeSQL: Improving Multivariate Time Series Forecasting with Multi-Scale Patching and Smooth Quadratic Loss

    Authors: Site Mo, Haoxin Wang, Bixiong Li, Songhai Fan, Yuankai Wu, Xianggen Liu

    Abstract: Time series is a special type of sequence data, a sequence of real-valued random variables collected at even intervals of time. The real-world multivariate time series comes with noises and contains complicated local and global temporal dynamics, making it difficult to forecast the future time series given the historical observations. This work proposes a simple and effective framework, coined as…

    Submitted 19 November, 2023; originally announced November 2023.

  29. arXiv:2311.06217  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.MM

    MultiIoT: Benchmarking Machine Learning for the Internet of Things

    Authors: Shentong Mo, Louis-Philippe Morency, Russ Salakhutdinov, Paul Pu Liang

    Abstract: The next generation of machine learning systems must be adept at perceiving and interacting with the physical world through a diverse array of sensory channels. Commonly referred to as the 'Internet of Things (IoT)' ecosystem, sensory data from motion, thermal, geolocation, depth, wireless signals, video, and audio are increasingly used to model the states of physical environments and the humans i…

    Submitted 4 July, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

  30. arXiv:2310.18850  [pdf, other]

    cs.CV cs.AI cs.LG

    Exploring Data Augmentations on Self-/Semi-/Fully- Supervised Pre-trained Models

    Authors: Shentong Mo, Zhun Sun, Chao Li

    Abstract: Data augmentation has become a standard component of vision pre-trained models to capture the invariance between augmented views. In practice, augmentation techniques that mask regions of a sample with zero/mean values or patches from other samples are commonly employed in pre-trained models with self-/semi-/fully-supervised contrastive losses. However, the underlying mechanism behind the effectiv…

    Submitted 28 October, 2023; originally announced October 2023.

  31. arXiv:2309.07694  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Tree of Uncertain Thoughts Reasoning for Large Language Models

    Authors: Shentong Mo, Miao Xin

    Abstract: While the recently introduced Tree of Thoughts (ToT) has heralded advancements in allowing Large Language Models (LLMs) to reason through foresight and backtracking for global decision-making, it has overlooked the inherent local uncertainties in intermediate decision points or "thoughts". These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significan…

    Submitted 14 September, 2023; originally announced September 2023.

  32. arXiv:2309.05281  [pdf, other]

    cs.CV cs.LG cs.MM

    Class-Incremental Grouping Network for Continual Audio-Visual Learning

    Authors: Shentong Mo, Weiguo Pian, Yapeng Tian

    Abstract: Continual learning is a challenging problem in which models need to be trained on non-stationary data across sequential tasks for class-incremental learning. While previous methods have focused on using either regularization or rehearsal-based frameworks to alleviate catastrophic forgetting in image classification, they are limited to a single modality and cannot learn compact class-aware cross-mo…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: ICCV 2023. arXiv admin note: text overlap with arXiv:2303.17056

  33. arXiv:2308.11448  [pdf, other]

    cs.CV cs.LG

    Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

    Authors: Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Zhenhua Feng, Josef Kittler

    Abstract: Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to…

    Submitted 22 August, 2023; originally announced August 2023.

  34. arXiv:2308.11073  [pdf, other]

    cs.CV

    Audio-Visual Class-Incremental Learning

    Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian

    Abstract: In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as incremental step grows. Furthermore, we observe that audio-visual correlati…

    Submitted 14 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  35. arXiv:2307.12679  [pdf, other]

    cs.LG math.NA

    An Estimator for the Sensitivity to Perturbations of Deep Neural Networks

    Authors: Naman Maheshwari, Nicholas Malaya, Scott Moe, Jaydeep P. Kulkarni, Sudhanva Gurumurthi

    Abstract: For Deep Neural Networks (DNNs) to become useful in safety-critical applications, such as self-driving cars and disease diagnosis, they must be stable to perturbations in input and model parameters. Characterizing the sensitivity of a DNN to perturbations is necessary to determine minimal bit-width precision that may be used to safely represent the network. However, no general result exists that i…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: Actual work and paper concluded in January 2019

  36. arXiv:2307.01831  [pdf, other]

    cs.CV cs.AI cs.LG

    DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

    Authors: Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li

    Abstract: Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape genera…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Project Page: https://dit-3d.github.io/

  37. arXiv:2306.16329  [pdf, other]

    cs.CV

    DiffComplete: Diffusion-based Generative 3D Shape Completion

    Authors: Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, Jiaya Jia

    Abstract: We introduce a new diffusion-based approach for shape completion on 3D range scans. Compared with prior deterministic and probabilistic methods, we strike a balance between realism, multi-modality, and high fidelity. We propose DiffComplete by casting shape completion as a generative task conditioned on the incomplete shape. Our key designs are two-fold. First, we devise a hierarchical feature agg…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: Project Page: https://ruihangchu.com/diffcomplete.html

  38. arXiv:2306.14490  [pdf, other]

    cs.CV cs.AI

    TaiChi Action Capture and Performance Analysis with Multi-view RGB Cameras

    Authors: Jianwei Li, Siyu Mo, Yanfei Shen

    Abstract: Recent advances in computer vision and deep learning have influenced the field of sports performance analysis for researchers to track and reconstruct freely moving humans without any marker attachment. However, there are few works for vision-based motion capture and intelligent analysis for professional TaiChi movement. In this paper, we propose a framework for TaiChi performance capture and anal…

    Submitted 26 June, 2023; originally announced June 2023.

  39. arXiv:2305.19458  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

    Authors: Shentong Mo, Pedro Morgado

    Abstract: The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as t…

    Submitted 30 May, 2023; originally announced May 2023.

  40. arXiv:2305.14095  [pdf, other]

    cs.CV cs.LG

    S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

    Authors: Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin

    Abstract: Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and adapting to such domains is challenging due to the limited number of image-text pairs available for training. To address this, we propose S-CLIP, a semi-superv…

    Submitted 25 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  41. arXiv:2305.12903  [pdf, other]

    cs.CV cs.LG cs.MM

    DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

    Authors: Shentong Mo, Jing Shi, Yapeng Tian

    Abstract: Text-to-audio (TTA) generation is a recent popular problem that aims to synthesize general audio given text descriptions. Previous methods utilized latent diffusion models to learn audio embedding in a latent space with text embedding as the condition. However, they ignored the synchronization between audio and visual content in the video, and tended to generate audio mismatching from video frames…

    Submitted 22 May, 2023; originally announced May 2023.

  42. arXiv:2305.01836  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

    Authors: Shentong Mo, Yapeng Tian

    Abstract: Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks. However, there is less exploration concerning how SAM works on audio-visual tasks, such as visual sound localization and segmentation. In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that ca…

    Submitted 2 May, 2023; originally announced May 2023.

  43. arXiv:2304.04399  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

    Authors: Shentong Mo, Jingfei Xia, Ihor Markevych

    Abstract: Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks. However, there exists semantic confusion between language and vision during the pre-training stage. Moreover, current pre-trained models tend to take lots of computation resources for fine-tuning when transferred to downstream tasks. In this…

    Submitted 10 April, 2023; originally announced April 2023.

  44. arXiv:2303.17056  [pdf, other]

    cs.CV cs.LG cs.MM

    Audio-Visual Grouping Network for Sound Localization from Mixtures

    Authors: Shentong Mo, Yapeng Tian

    Abstract: Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image. Due to the mixed property of multiple sound sources in the original space, there exist rare multi-source approaches to localizing multiple sources simultaneous…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  45. arXiv:2303.12959  [pdf, other]

    cs.LG cs.AI

    Variantional autoencoder with decremental information bottleneck for disentanglement

    Authors: Jiantao Wu, Shentong Mo, Xiang Yang, Muhammad Awais, Sara Atito, Xingshen Zhang, Lin Wang, Xiang Yang

    Abstract: One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous studies, which increase the information bottleneck during training, tend to lose the constraint of disentanglement, leading to the information diffusion problem. In this paper, we present a novel framework for disentangled representation learn…

    Submitted 4 October, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

  46. arXiv:2302.14483  [pdf, other]

    cs.LG cs.CV stat.ML

    RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data

    Authors: Sangwoo Mo, Jong-Chyi Su, Chih-Yao Ma, Mido Assran, Ishan Misra, Licheng Yu, Sean Bell

    Abstract: Semi-supervised learning aims to train a model using limited labels. State-of-the-art semi-supervised methods for image classification such as PAWS rely on self-supervised representations learned with large-scale unlabeled but curated data. However, PAWS is often less effective when using real-world unlabeled data that is uncurated, e.g., contains out-of-class data. We propose RoPAWS, a robust ext…

    Submitted 28 February, 2023; originally announced February 2023.

    Comments: ICLR 2023

  47. arXiv:2302.10506  [pdf, other]

    cs.LG

    Diffusion Probabilistic Models for Structured Node Classification

    Authors: Hyosoon Jang, Seonghyun Park, Sangwoo Mo, Sungsoo Ahn

    Abstract: This paper studies structured node classification on graphs, where the predictions should consider dependencies between the node labels. In particular, we focus on solving the problem for partially labeled graphs where it is essential to incorporate the information in the known label for predicting the unknown labels. To address this issue, we propose a novel framework leveraging the diffusion pro…

    Submitted 18 June, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

  48. arXiv:2301.11104  [pdf, other]

    cs.LG cs.CV

    Discovering and Mitigating Visual Biases through Keyword Explanation

    Authors: Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, Jinwoo Shin

    Abstract: Addressing biases in computer vision models is crucial for real-world AI deployments. However, mitigating visual biases is challenging due to their unexplainable nature, often identified indirectly through visualization or sample statistics, which necessitates additional human supervision for interpretation. To tackle this issue, we propose the Bias-to-Text (B2T) framework, which interprets visual…

    Submitted 26 March, 2024; v1 submitted 26 January, 2023; originally announced January 2023.

    Comments: CVPR 2024. First two authors contributed equally

  49. arXiv:2212.06595  [pdf, other]

    cs.CV cs.LG

    OAMixer: Object-aware Mixing Layer for Vision Transformers

    Authors: Hyunwoo Kang, Sangwoo Mo, Jinwoo Shin

    Abstract: Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks, alternating classic convolutional networks. While the initial patch-based models (ViTs) treated all patches equally, recent studies reveal that incorporating inductive bias like spatiality benefits the representations. However, most prior works solely focused on the l…

    Submitted 13 December, 2022; originally announced December 2022.

    Comments: CVPR Transformers for Vision Workshop 2022. First two authors contributed equally

  50. arXiv:2212.02090  [pdf, other]

    cs.CV cs.AI cs.LG

    Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling

    Authors: Junhyun Nam, Sangwoo Mo, Jaeho Lee, Jinwoo Shin

    Abstract: To capture the relationship between samples and labels, conditional generative models often inherit spurious correlations from the training dataset. This can result in label-conditional distributions that are imbalanced with respect to another latent attribute. To mitigate this issue, which we call spurious causality of conditional generation, we propose a general two-step strategy. (a) Fairness I…

    Submitted 4 July, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

    Comments: TMLR 2023