-
3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects
Authors:
Weiming Zhi,
Haozhan Tang,
Tianyi Zhang,
Matthew Johnson-Roberson
Abstract:
Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera's frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
Submitted 14 July, 2024;
originally announced July 2024.
-
InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation
Authors:
Zeyu Zhang,
Akide Liu,
Qi Chen,
Feng Chen,
Ian Reid,
Richard Hartley,
Bohan Zhuang,
Hao Tang
Abstract:
Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) current methods struggle to handle long motion sequences as a single input due to prohibitively high computational cost; (2) breaking down the generation of long motion sequences into shorter segments can result in inconsistent transitions and requires interpolation or inpainting, which lacks entire-sequence modeling. To solve these challenges, we propose InfiniMotion, a method that generates continuous motion sequences of arbitrary length within an autoregressive framework. We highlight its groundbreaking capability by generating a continuous 1-hour human motion with around 80,000 frames. Specifically, we introduce the Motion Memory Transformer with Bidirectional Mamba Memory, enhancing the transformer's memory to process long motion sequences effectively without overwhelming computational resources. Notably, our method achieves over 30% improvement in FID and a 6 times longer demonstration compared to previous state-of-the-art methods, showcasing significant advancements in long motion generation. See project webpage: https://steve-zeyu-zhang.github.io/InfiniMotion/
Submitted 13 July, 2024;
originally announced July 2024.
-
3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance
Authors:
Xiaoxu Xu,
Yitian Yuan,
Jinlong Li,
Qiudan Zhang,
Zequn Jie,
Lin Ma,
Hao Tang,
Nicu Sebe,
Xu Wang
Abstract:
In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, in which a 3D model predicts a dense embedding for each point that is co-embedded with both the aligned image and text spaces of the 2D vision-language model. Specifically, our method exploits the superior generalization ability of 2D vision-language models and proposes the Embeddings Soft-Guidance Stage to utilize it to implicitly align 3D embeddings and text embeddings. Moreover, we introduce the Embeddings Specialization Stage to purify the feature representation with the help of a given scene-level label, specifying a better feature supervised by the corresponding text embedding. Thus, the 3D model is able to gain informative supervision from both the image embedding and the text embedding, leading to competitive segmentation performance. To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels. Moreover, with extensive quantitative and qualitative experiments, we show that our 3DSS-VLG not only achieves state-of-the-art performance on both the S3DIS and ScanNet datasets but also maintains strong generalization capability.
Submitted 13 July, 2024;
originally announced July 2024.
-
3x2: 3D Object Part Segmentation by 2D Semantic Correspondences
Authors:
Anh Thai,
Weiyao Wang,
Hao Tang,
Stefan Stojanov,
Matt Feiszli,
James M. Rehg
Abstract:
3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2, that achieves SOTA performance on different benchmarks with various granularity levels. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we are able to overcome the challenges of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. Our method 3-By-2 can accommodate various part taxonomies and granularities, demonstrating interesting part label transfer ability across different object categories. Project website: https://ngailapdi.github.io/projects/3by2/
Submitted 12 July, 2024;
originally announced July 2024.
-
Symmetric Second-Harmonic Generation in Sub-wavelength Periodically Poled Thin Film Lithium Niobate
Authors:
Fengyan Yang,
Juanjuan Lu,
Mohan Shen,
Guangcanlan Yang,
Hong X. Tang
Abstract:
Second harmonic generation (SHG) extensively employs periodically poled nonlinear crystals through forward quasi-phase-matching to achieve efficient frequency conversion. As poling periods approach sub-micrometers, backward quasi-phase-matching has also been demonstrated, albeit by utilizing pulsed laser drives. The realization of symmetric second harmonic generation, characterized by counterpropagating pumps, however, has remained elusive despite theoretical predictions. The main challenge lies in achieving strong nonlinear coupling with a poling period below half the wavelength of the second-harmonic light. The recent emergence of high-quality ferroelectric lithium niobate thin films provides an opportunity for achieving precise domain control at submicron dimensions. In this article, we demonstrate reliable control of ferroelectric domains in a thin-film lithium niobate waveguide with a poling period down to 370 nm, thereby realizing highly efficient continuous-wave pumped symmetric SHG. This demonstration not only validates the feasibility of achieving subwavelength periodic poling on waveguides but also opens new avenues for leveraging submicron ferroelectric domain structures in integrated photonics and nonlinear optics research.
Submitted 12 July, 2024;
originally announced July 2024.
-
Detect, Investigate, Judge and Determine: A Novel LLM-based Framework for Few-shot Fake News Detection
Authors:
Ye Liu,
Jiajun Zhu,
Kai Zhang,
Haoyu Tang,
Yanghai Zhang,
Xukai Liu,
Qi Liu,
Enhong Chen
Abstract:
Few-Shot Fake News Detection (FS-FND) aims to distinguish inaccurate news from real news in extremely low-resource scenarios. This task has garnered increased attention due to the widespread dissemination and harmful impact of fake news on social media. Large Language Models (LLMs) have demonstrated competitive performance with the help of their rich prior knowledge and excellent in-context learning abilities. However, existing methods face significant limitations, such as Understanding Ambiguity and Information Scarcity, which significantly undermine the potential of LLMs. To address these shortcomings, we propose a Dual-perspective Augmented Fake News Detection (DAFND) model, designed to enhance LLMs from both inside and outside perspectives. Specifically, DAFND first identifies the keywords of each news article through a Detection Module. Subsequently, DAFND uses an Investigation Module to retrieve valuable inside and outside information concerning the current news, followed by a Judge Module that derives the two respective predictions. Finally, a Determination Module integrates these two predictions and derives the final result. Extensive experiments on two publicly available datasets show the efficacy of our proposed method, particularly in low-resource settings.
Submitted 11 July, 2024;
originally announced July 2024.
-
Giant graviton expansion from eigenvalue instantons
Authors:
Yiming Chen,
Raghu Mahajan,
Haifeng Tang
Abstract:
Recently, S. Murthy has proposed a convergent expansion of free partition functions and superconformal indices of finite-$N$ purely adjoint gauge theories based on a Fredholm determinant expansion. This expansion has been dubbed the giant graviton expansion and takes the form of an infinite series of corrections to the $N=\infty$ result, with the $m^\text{th}$ correction being of order $e^{-mN}$. We show that this expansion can be reproduced using eigenvalue instantons in unitary matrix integrals. This perspective allows us to obtain the giant graviton expansion proposed by S. Murthy without the intermediate step of the Hubbard-Stratonovich transformation.
Submitted 10 July, 2024;
originally announced July 2024.
-
MSC-LIO: An MSCKF-Based LiDAR-Inertial Odometry with Same-Plane-Point Tracking
Authors:
Tisheng Zhang,
Man Yuan,
Linfu Wei,
Hailiang Tang,
Xiaoji Niu
Abstract:
The multi-state constraint Kalman filter (MSCKF) has been proven to be more efficient than graph optimization for visual-based odometry while achieving similar accuracy. However, it has not yet been properly considered and studied for LiDAR-based odometry. In this paper, we propose a novel tightly coupled LiDAR-inertial odometry based on the MSCKF framework, named MSC-LIO. An efficient LiDAR same-plane-point (LSPP) tracking method, without explicit feature extraction, is presented for frame-to-frame data associations. The tracked LSPPs are employed to build an LSPP measurement model, which constructs a multi-state constraint. In addition, we propose an effective point-velocity-based LiDAR-IMU time-delay (LITD) estimation method, which is derived from the proposed LSPP tracking method. Extensive experiments were conducted on both public and private datasets. The results demonstrate that the proposed MSC-LIO yields higher accuracy and efficiency than the state-of-the-art methods. The ablation experiment results indicate that the data-association efficiency is improved by nearly 3 times using the LSPP tracking method. In addition, the proposed LITD estimation method can effectively and accurately estimate the LITD.
Submitted 10 July, 2024;
originally announced July 2024.
-
CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection
Authors:
Shuang Hao,
Chunlin Zhong,
He Tang
Abstract:
Depth/thermal information is beneficial for detecting salient objects with conventional RGB images. However, in dual-modal salient object detection (SOD) models, robustness against noisy inputs and missing modalities is crucial but rarely studied. To tackle this problem, we introduce the Conditional Dropout and LAnguage-driven (CoLA) framework, comprising two core components. 1) Language-driven Quality Assessment (LQA): Leveraging a pretrained vision-language model with a prompt learner, the LQA recalibrates image contributions without requiring additional quality annotations. This approach effectively mitigates the impact of noisy inputs. 2) Conditional Dropout (CD): A learning method to strengthen the model's adaptability in scenarios with missing modalities, while preserving its performance under complete modalities. The CD serves as a plug-in training scheme that treats modality-missing as conditions, strengthening the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models under both modality-complete and modality-missing conditions. We will release source code upon acceptance.
Submitted 9 July, 2024;
originally announced July 2024.
-
High-order accurate entropy stable schemes for compressible Euler equations with van der Waals equation of state on adaptive moving meshes
Authors:
Shangting Li,
Huazhong Tang
Abstract:
This paper develops high-order entropy stable (ES) finite difference schemes for multi-dimensional compressible Euler equations with the van der Waals equation of state (EOS) on adaptive moving meshes. Semi-discrete schemes are first nontrivially constructed, built on the newly derived high-order entropy conservative (EC) fluxes in curvilinear coordinates and scaled eigenvector matrices as well as the multi-resolution WENO reconstruction, and then the fully-discrete schemes are given by using the high-order explicit strong-stability-preserving Runge-Kutta time discretizations. The high-order EC fluxes in curvilinear coordinates are derived by using the discrete geometric conservation laws and the linear combination of the two-point symmetric EC fluxes, while the two-point EC fluxes are delicately selected by using their sufficient condition, the thermodynamic entropy and the technically selected parameter vector. The adaptive moving meshes are iteratively generated by solving the mesh redistribution equations, in which the fundamental derivative related to the occurrence of non-classical waves is involved to produce high-quality meshes. Several numerical tests on a parallel computer system with MPI programming are conducted to validate the accuracy, the ability to capture the classical and non-classical waves, and the high efficiency of our schemes in comparison with their counterparts on the uniform mesh.
Submitted 7 July, 2024;
originally announced July 2024.
-
GMC: A General Framework of Multi-stage Context Learning and Utilization for Visual Detection Tasks
Authors:
Xuan Wang,
Hao Tang,
Zhigang Zhu
Abstract:
Various contextual information has been employed by many approaches for visual detection tasks. However, most of the existing approaches only focus on specific context for specific tasks. In this paper, GMC, a general framework, is proposed for multistage context learning and utilization, with various deep network architectures for various visual detection tasks. The GMC framework encompasses three stages: preprocessing, training, and post-processing. In the preprocessing stage, the representation of local context is enhanced by utilizing commonly used labeling standards. During the training stage, semantic context information is fused with visual information, leveraging prior knowledge from the training dataset to capture semantic relationships. In the post-processing stage, general topological relations and semantic masks for stuff are incorporated to enable spatial context reasoning between objects. The proposed framework provides a comprehensive and adaptable solution for context learning and utilization in visual detection scenarios. The framework offers flexibility with user-defined configurations and provides adaptability to diverse network architectures and visual detection tasks, offering an automated and streamlined solution that minimizes user effort and inference time in context learning and reasoning. Experimental results on the visual detection tasks of storefront object detection, pedestrian detection and COCO object detection demonstrate that our framework outperforms previous state-of-the-art detectors and transformer architectures. The experiments also demonstrate that the three contextual learning components can be applied not only individually and in combination but also to various network architectures, showing the framework's flexibility and effectiveness in various detection scenarios.
Submitted 7 July, 2024;
originally announced July 2024.
-
Non-contact excitation of multi-GHz lithium niobate electromechanical resonators
Authors:
Danqing Wang,
Jiacheng Xie,
Yu Guo,
Mohan Shen,
Hong X. Tang
Abstract:
The demand for high-performance electromechanical resonators is ever-growing across diverse applications, ranging from sensing and time-keeping to advanced communication devices. Among the electromechanical materials being explored, thin-film lithium niobate stands out for its strong piezoelectric properties and low acoustic loss. However, in nearly all existing lithium niobate electromechanical devices, the configuration is such that the electrodes are in direct contact with the mechanical resonator. This configuration introduces an undesirable mass-loading effect, giving rise to spurious modes and additional damping. Here, we present an electromechanical platform that mitigates this challenge by leveraging a flip-chip bonding technique to separate the electrodes from the mechanical resonator. By offloading the electrodes from the resonator, our approach yields a substantial increase in the quality factor of these resonators, paving the way for enhanced performance and reliability for their device applications.
Submitted 7 July, 2024;
originally announced July 2024.
-
MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning
Authors:
Min Zhang,
Jianye Hao,
Xian Fu,
Peilong Han,
Hao Zhang,
Lei Shi,
Hongyao Tang,
Yan Zheng
Abstract:
In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning, aiming to shed light on their capabilities and limitations in this domain. To this end, based on the characteristics of embodied task planning, we first develop a systematic evaluation framework, which encapsulates four crucial capabilities of MFMs: object understanding, spatio-temporal perception, task understanding, and embodied reasoning. Following this, we propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios, typical yet diverse task types, task instances of varying difficulties, and rich test case types ranging from multiple embodied question answering to embodied task reasoning. Finally, we offer a simple and easy-to-use automatic evaluation platform that enables the automated testing of multiple MFMs on the proposed benchmark. Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance. The MFE-ETP is a high-quality, large-scale, and challenging benchmark relevant to real-world tasks.
Submitted 6 July, 2024;
originally announced July 2024.
-
Exploring control of the emergent exciton insulator state in 1T-TiSe$_2$ monolayer by state-of-the-art theory models
Authors:
Hong Tang,
Li Yin,
Gábor I. Csonka,
Adrienn Ruzsinszky
Abstract:
The layered transition metal dichalcogenide 1T-TiSe$_2$ is of great research interest, having intriguing properties of charge density waves (CDW) and superconductivity under doping or pressure. The monolayer form of 1T-TiSe$_2$ also shows a CDW with a higher transition temperature $T_c$ than the bulk, indicating a stronger CDW interaction. By using the meta-generalized gradient approximation (metaGGA)-based model Bethe-Salpeter Equation (BSE) and many-body perturbation GW+BSE methods, we calculate the exciton binding energies and electron energy loss spectrum (EELS) for the 1T-TiSe$_2$ monolayer under different in-plane biaxial strains. We find that even without strain the 1T-TiSe$_2$ monolayer can have negative exciton energies at the Brillouin zone boundary point M, with a binding energy larger than the gap. The calculated EELS reinforces this picture, indicating EI (exciton insulator) states in the 1T-TiSe$_2$ monolayer even without strain. The Wannier-Mott formula calculations of exciton binding energy corroborate results from GW+BSE. Small compressive strains enhance the EI state, and for tensile strains slightly less than 3%, the EI state in this monolayer persists. At large tensile strains, the material makes a transition to a normal semiconductor. Our results provide important information for understanding the quantum nature of this two-dimensional (2D) material. Our results from the standard G0W0@PBE+SOC+U+BSE approach are not qualitatively different from those of a more computationally efficient metaGGA-based SCAN+SOC+U+mBSE+$f_{xc}^{loc}$ approach that employs a model BSE.
Submitted 4 July, 2024;
originally announced July 2024.
-
A second-order direct Eulerian GRP scheme for ten-moment Gaussian closure equations with source terms
Authors:
Jiangfu Wang,
Huazhong Tang
Abstract:
This paper proposes a second-order accurate direct Eulerian generalized Riemann problem (GRP) scheme for the ten-moment Gaussian closure equations with source terms. The generalized Riemann invariants associated with the rarefaction waves, the contact discontinuity and the shear waves are given, and the 1D exact Riemann solver is obtained. After that, the generalized Riemann invariants and the Rankine-Hugoniot jump conditions are directly used to resolve the left and right nonlinear waves (rarefaction wave and shock wave) of the local GRP in Eulerian formulation, and then the 1D direct Eulerian GRP scheme is derived. They are much more complicated, technical and nontrivial due to more physical variables and elementary waves. Some 1D and 2D numerical experiments are presented to check the accuracy and high resolution of the proposed GRP schemes, where the 2D direct Eulerian GRP scheme is given by using the Strang splitting method for simplicity. It should be emphasized that several examples of 2D Riemann problems are constructed for the first time.
Submitted 4 July, 2024;
originally announced July 2024.
-
A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)
Authors:
Lam Pham,
Phat Lam,
Tin Nguyen,
Hieu Tang,
Alexander Schindler
Abstract:
In this paper, we present a toolchain for comprehensive audio/video analysis that leverages a deep-learning-based multimodal approach. To this end, the specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining individual tasks and analyzing both audio and visual data extracted from the input video, the toolchain offers various audio/video-based applications: two general applications, audio/video clustering and comprehensive audio/video summary, and a specific application of riot or violent context detection. Furthermore, the toolchain presents a flexible and adaptable architecture into which new models can be effectively integrated for further audio/video-based applications.
Submitted 2 May, 2024;
originally announced July 2024.
-
The USTC-NERCSLIP Systems for The ICMC-ASR Challenge
Authors:
Minghui Wu,
Luzhen Xu,
Jie Zhang,
Haitao Tang,
Yanyan Yue,
Ruizhi Liao,
Jintao Zhao,
Zhengzhe Zhang,
Yichi Wang,
Haoyin Yan,
Hongliang Yu,
Tongle Ma,
Jiachen Liu,
Chongliang Wu,
Yongchao Li,
Yanyong Zhang,
Xin Fang,
Yue Zhang
Abstract:
This report describes the system submitted to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position, respectively. For ASR, we employ an iterative pseudo-label generation method based on a fusion model to obtain text labels of unsupervised data. To mitigate the impact of accent, an Accent-ASR framework is proposed, which captures pronunciation-related accent features at a fine-grained level and linguistic information at a coarse-grained level. On the ICMC-ASR eval set, the proposed system achieves a CER of 13.16% on track 1 and a cpCER of 21.48% on track 2, which significantly outperforms the official baseline system and obtains the first rank on both tracks.
Submitted 2 July, 2024;
originally announced July 2024.
-
Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions
Authors:
Xiang Li,
Haoran Tang,
Siyu Chen,
Ziwei Wang,
Ryan Chen,
Marcin Abram
Abstract:
We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions. For that purpose, we created a novel benchmark consisting of hard scientific questions, each paired with a context of various relevancy. We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context. This effect is especially visible for open questions and questions of high difficulty or novelty. This result reveals a fundamental difference between the treatment of closed-form and open-form questions by large language models and shows a need for a more robust evaluation of in-context learning on a variety of different types of questions. It also poses a new question of how to optimally select a context for large language models, especially in the context of Retrieval Augmented Generation (RAG) systems. Our results suggest that the answer to this question can be highly application-dependent and might be contingent on factors including the format of the question, the perceived difficulty level of the questions, and the novelty or popularity of the information we seek.
Submitted 2 July, 2024;
originally announced July 2024.
-
Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders
Authors:
Phat Lam,
Lam Pham,
Tin Nguyen,
Hieu Tang,
Thinh Pham,
Loi Khanh Nguyen,
Alexander Schindler
Abstract:
Existing speaker diarization systems heavily rely on large amounts of manually annotated data, which is labor-intensive and challenging to collect in real-world scenarios. Additionally, the language-specific constraint in speaker diarization systems significantly hinders their applicability and scalability in multilingual settings. In this paper, we therefore propose a cluster-based speaker diarization system for multilingual telephone call applications. The proposed system supports multiple languages and does not require large-scale annotated data for training, because it leverages the multilingual Whisper model to extract speaker embeddings and uses a novel Mixture of Sparse Autoencoders (Mix-SAE) network architecture for unsupervised speaker clustering. Experimental results on the evaluation dataset derived from two-speaker subsets of the CALLHOME and CALLFRIEND telephonic speech corpora demonstrate superior efficiency of the proposed Mix-SAE network over other autoencoder-based clustering methods. The overall performance of our proposed system also indicates the promising potential of our approach for developing unsupervised multilingual speaker diarization applications when annotated data is limited, and for integration into comprehensive multi-task speech analysis systems (i.e., multiple tasks of speech-to-text, language detection, and speaker diarization integrated in a low-complexity system).
Submitted 7 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory
Authors:
Wenliang Zhong,
Haoyu Tang,
Qinghai Zheng,
Mingzhu Xu,
Yupeng Hu,
Liqiang Nie
Abstract:
The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT leverages insights from the linearized dynamics of Neural Tangent Kernel methods to create a convex combination of expert trajectories, guiding the student network to converge rapidly and stably. This trajectory is not only easier to store, but also enables a continuous sampling strategy during distillation, ensuring thorough learning and fitting of the entire expert trajectory. Comprehensive experiments across three public datasets validate the superiority of MCT over traditional MTT methods.
Submitted 28 June, 2024;
originally announced June 2024.
-
Transient spin modes from relaxational axial kinetic theory
Authors:
Shu Lin,
Haiqin Tang
Abstract:
We study the dynamics of spin modes by solving the axial kinetic equations under the relaxation time approximation in the presence of dissipative sources. We find transient spin modes in response to electric field with spacetime inhomogeneity, fluid acceleration and shear. To the lowest order in spatial gradient $k$, we find the responses to electric field and acceleration can be interpreted as retarded responses to time variations of magnetic field and vorticity respectively. The response to shear can lead to a global spin polarization suppressed by powers of $k$. Beyond lowest order, the responses to all three sources are non-local with branch cuts in the dispersions. We argue that the non-locality is a consequence of the quasi-particle picture underlying the kinetic description. We also analyze the mixing between spin modes and shear modes alone using the response we have obtained, finding the spin modes split into three with two of them developing oscillatory behavior. The correction to damping dispersions occurs at $O(k^{4/3})$, which is parametrically larger than the existing one due to mixing of spin modes with shear and vorticity modes. It also indicates possible breakdown of gradient expansion.
Submitted 25 June, 2024;
originally announced June 2024.
-
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Authors:
Yuang Peng,
Yuxin Cui,
Haomiao Tang,
Zekun Qi,
Runpei Dong,
Jing Bai,
Chunrui Han,
Zheng Ge,
Xiangyu Zhang,
Shu-Tao Xia
Abstract:
Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.
Submitted 24 June, 2024;
originally announced June 2024.
-
On estimation and order selection for multivariate extremes via clustering
Authors:
Shiyuan Deng,
He Tang,
Shuyang Bai
Abstract:
We investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution involves devising a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and enjoys intuitive implementation in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimations of heavy-tailed factor models. We also present simulations and real-data studies that demonstrate order selection and factor model estimation.
Submitted 20 June, 2024;
originally announced June 2024.
-
Toward Structure Fairness in Dynamic Graph Embedding: A Trend-aware Dual Debiasing Approach
Authors:
Yicong Li,
Yu Yang,
Jiannong Cao,
Shuaiqi Liu,
Haoran Tang,
Guandong Xu
Abstract:
Recent studies successfully learned static graph embeddings that are structurally fair by preventing the effectiveness disparity of high- and low-degree vertex groups in downstream graph mining tasks. However, achieving structure fairness in dynamic graph embedding remains an open problem. Neglecting degree changes in dynamic graphs will significantly impair embedding effectiveness without notably improving structure fairness. This is because the embedding performance of high-degree and low-to-high-degree vertices will significantly drop close to the generally poorer embedding performance of most slightly changed vertices in the long-tail part of the power-law distribution. We first identify biased structural evolutions in a dynamic graph based on the evolving trend of vertex degree and then propose FairDGE, the first structurally Fair Dynamic Graph Embedding algorithm. FairDGE learns biased structural evolutions by jointly embedding the connection changes among vertices and the long-short-term evolutionary trend of vertex degrees. Furthermore, a novel dual debiasing approach is devised to encode fair embeddings contrastively, customizing debiasing strategies for different biased structural evolutions. This innovative debiasing strategy breaks the effectiveness bottleneck of embeddings without notable fairness loss. Extensive experiments demonstrate that FairDGE achieves simultaneous improvement in the effectiveness and fairness of embeddings.
Submitted 19 June, 2024;
originally announced June 2024.
-
Knowledge Fusion By Evolving Weights of Language Models
Authors:
Guodong Du,
Jing Li,
Hanting Liu,
Runhua Jiang,
Shuyang Yu,
Yifei Guo,
Sim Kuan Goh,
Ho-Kin Tang
Abstract:
Fine-tuning pre-trained language models, particularly large language models, demands extensive computing resources and can result in varying performance outcomes across different domains and datasets. This paper examines the approach of integrating multiple models from diverse training scenarios into a unified model. This unified model excels across various data domains and exhibits the ability to generalize well on out-of-domain data. We propose a knowledge fusion method named Evolver, inspired by evolutionary algorithms, which does not need further training or additional training data. Specifically, our method involves aggregating the weights of different language models into a population and subsequently generating offspring models through mutation and crossover operations. These offspring models are then evaluated against their parents, allowing for the preservation of those models that show enhanced performance on development datasets. Importantly, our model evolving strategy can be seamlessly integrated with existing model merging frameworks, offering a versatile tool for model enhancement. Experimental results on mainstream language models (i.e., encoder-only, decoder-only, encoder-decoder) reveal that Evolver outperforms previous state-of-the-art models by large margins. The code is publicly available at https://github.com/duguodong7/model-evolution.
Submitted 17 June, 2024;
originally announced June 2024.
-
SFedCA: Credit Assignment-Based Active Client Selection Strategy for Spiking Federated Learning
Authors:
Qiugang Zhan,
Jinbo Cao,
Xiurui Xie,
Malu Zhang,
Huajin Tang,
Guisong Liu
Abstract:
Spiking federated learning is an emerging distributed learning paradigm that allows resource-constrained devices to train collaboratively at low power consumption without exchanging local data. It takes advantage of both the privacy computation property in federated learning (FL) and the energy efficiency in spiking neural networks (SNN). Thus, it is highly promising to revolutionize the efficient processing of multimedia data. However, existing spiking federated learning methods employ a random selection approach for client aggregation, assuming unbiased client participation. This neglect of statistical heterogeneity affects the convergence and accuracy of the global model significantly. In our work, we propose a credit assignment-based active client selection strategy, the SFedCA, to judiciously aggregate clients that contribute to the global sample distribution balance. Specifically, the client credits are assigned by the firing intensity state before and after local model training, which reflects the local data distribution difference from the global model. Comprehensive experiments are conducted on various non-identical and independent distribution (non-IID) scenarios. The experimental results demonstrate that the SFedCA outperforms the existing state-of-the-art spiking federated learning methods, and requires fewer communication rounds.
Submitted 17 June, 2024;
originally announced June 2024.
-
Inside the Working Mechanism of Meta-generalized Gradient Density Functional Approximations: The Example of Quantum Spin-Hall Insulator 1T`-WTe2
Authors:
Li Yin,
Hong Tang,
Adrienn Ruzsinszky
Abstract:
Quantum spin Hall (QSH) insulators have attracted intensive experimental and theoretical studies due to their beneficial applications in spintronic devices. Density functional theory (DFT) meets challenges when describing the electronic structure of QSH materials. Only the Heyd-Scuseria-Ernzerhof (HSE06) with spin-orbit coupling (SOC) is effective in revealing the band opening in the typical QSH 1T`-WTe2, but with increased computational demands. Here, using DFT, Wannier function simulations, the screened hybrid HSE06 functional, and first-principles-based many body perturbation theory GW, we investigate the sensitive electronic structure in monolayer 1T`-WTe2, with advanced meta-generalized gradient (meta-GGA) density functional approximations. The success of the recent SCAN and r2SCAN meta-GGAs left their predecessor meta-GGA made very simple (MVS) ignored by the scientific community. Largely unnoticed were the increased band gaps of MVS compared to any semilocal approximation including SCAN. We find that the non-empirical MVS approximation yields a positive fundamental band gap, without any help from exact exchange, Hubbard U, or SOC correction. We explain the success of the meta-GGA MVS for the band gap in 1T`-WTe2 by presenting two working mechanisms in meta-GGA approximations. Besides, we point out the difficulty of using G0W0 for 1T`-WTe2. Although the single shot GW correction with an MVS reference yields a smaller band gap than GW with PBE, the G0W0@MVS is still not suitable for simulating 1T`-WTe2, due to its negative band gap. These DFT and beyond DFT results highlight the importance of meta-GGAs and novel construction schemes with enhanced kinetic energy density dependence. The MVS approximation re-appears as an appealing alternative for accurately describing 1T`-WTe2, paving an efficient way for exploring other two-dimensional QSH materials in high-throughput calculations.
Submitted 17 June, 2024;
originally announced June 2024.
-
Brownian Gaussian Unitary Ensemble: non-equilibrium dynamics, efficient $k$-design and application in classical shadow tomography
Authors:
Haifeng Tang
Abstract:
We construct and extensively study a Brownian generalization of the Gaussian Unitary Ensemble (BGUE). Our analysis begins with the non-equilibrium dynamics of BGUE, where we derive explicit analytical expressions for various one-replica and two-replica variables at finite $N$ and $t$. These variables include the spectral form factor and its fluctuation, the two-point function and its fluctuation, out-of-time-order correlators (OTOC), the second Rényi entropy, and the frame potential for unitary 2-designs. We discuss the implications of these results for hyperfast scrambling, emergence of tomperature, and replica-wormhole-like contributions in BGUE. Next, we investigate the low-energy physics of the effective Hamiltonian for an arbitrary number of replicas, deriving long-time results for the frame potential. We conclude that the time required for the BGUE ensemble to reach $k$-design is linear in $k$, consistent with previous findings in Brownian SYK models. Finally, we apply the BGUE model to the task of classical shadow tomography, deriving analytical results for the shadow norm and identifying an optimal time that minimizes the shadow norm, analogous to the optimal circuit depth in shallow-circuit shadow tomography.
Submitted 17 June, 2024;
originally announced June 2024.
-
Simulation of chiral motion of excitation within the ground-state manifolds of neutral atoms
Authors:
Hao-Yuan Tang,
Xiao-Xuan Li,
Jia-Bin You,
Xiao-Qiang Shao
Abstract:
Laser-induced gauge fields in neutral atoms serve as a means of mimicking the effects of a magnetic field, providing researchers with a platform to explore behaviors analogous to those observed in condensed matter systems under real magnetic fields. Here, we propose a method to generate chiral motion in atomic excitations within the neutral atomic ground-state manifolds. This is achieved through the application of polychromatic driving fields coupled to the ground-Rydberg transition, along with unconventional Rydberg pumping. The scheme offers the advantage of arbitrary adjustment of the effective magnetic flux by setting the relative phases between different external laser fields. Additionally, the effective interaction strength between the atomic ground states can be maintained at 10 kHz, surpassing the capabilities of the previous approach utilizing Floquet modulation. Notably, the proposed method can be readily extended to implement a hexagonal neutral atom lattice, serving as the fundamental unit in realizing the Haldane model.
Submitted 17 June, 2024;
originally announced June 2024.
-
Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations
Authors:
Mukhtar Mohamed,
Oli Danyi Liu,
Hao Tang,
Sharon Goldwater
Abstract:
Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlate with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
Submitted 13 June, 2024;
originally announced June 2024.
-
dx2-y2-wave Bose Metal induced by the next-nearest-neighbor hopping t'
Authors:
Zhangkai Cao,
Jianyu Li,
Jiahao Su,
Tao Ying,
Ho-Kin Tang
Abstract:
Superconductivity arises when electrons form Cooper pairs with phase coherence. In contrast, a lack of phase coherence in Cooper pairs can lead to an uncondensed metallic ground state known as the Bose metal state. In this study, we investigate an attractively interacting fermionic system with nearest-neighbor (NN) hopping (t) and next-nearest-neighbor (NNN) hopping (t') anisotropy between two species of spins in a two-dimensional (2D) lattice. Utilizing the constrained path quantum Monte Carlo (CPQMC) method, we demonstrate the existence of a dx2-y2-wave Cooper pair Bose metal (CPBM) phase with t'/t > 0.2. The CPBM phase exhibits a dome-like structure in the phase diagram of filling n~0.65, with the maximal region around an optimal t'/t ~ 0.7, suggesting that an appropriate value of t' facilitates the formation of the Bose metal. Furthermore, we find that a Bose metal formed by fermions with a closed Fermi surface confirms that the crucial condition for this exotic phenomenon is primarily the anisotropy of the Fermi surface, rather than its topology. Our finding of the dx2-y2-wave CPBM demonstrates the same pairing symmetry as the pseudogap behavior in cuprates, and its experimental realization in ultracold atom systems is also feasible.
Submitted 12 June, 2024;
originally announced June 2024.
-
From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models
Authors:
Xiaofeng Zhang,
Chen Shen,
Xiaosong Yuan,
Shaotian Yan,
Liang Xie,
Wenxiao Wang,
Chaochen Gu,
Hao Tang,
Jieping Ye
Abstract:
Recently, multimodal large language models have exploded in an endless variety. Most of the popular Large Vision Language Models (LVLMs) depend on sequential visual representation, where images are converted into hundreds or thousands of tokens before being input into the Large Language Model (LLM) along with language prompts. The black-box design hinders the interpretability of visual-language models, especially regarding more complex reasoning tasks. To explore the interaction process between image and text in complex reasoning tasks, we introduce the information flow method to visualize the interaction mechanism. By analyzing the dynamics of the information flow, we find that it appears to converge in the shallow layers. Further investigation revealed a redundancy of image tokens in the shallow layers. Consequently, a truncation strategy was introduced to aggregate image tokens within these shallow layers. This approach has been validated through experiments across multiple models, yielding consistent improvements.
Submitted 13 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
Effect of Strain on the Band Gap of Monolayer MoS$_2$
Authors:
Raj K. Sah,
Hong Tang,
Chandra Shahi,
Adrienn Ruzsinszky,
John P. Perdew
Abstract:
Monolayer $\mathrm{MoS_2}$ under strain has many interesting properties and possible applications in technology. A recent experimental study examined the effect of strain on the bandgap of monolayer $\mathrm{MoS_2}$ on a mildly curved graphite surface, reporting that under biaxial strain with a Poisson's ratio of 0.44, the bandgap decreases at a rate of 400 meV/% strain. In this work, we performed density functional theory (DFT) calculations for a free-standing $\mathrm{MoS_2}$ monolayer, using the generalized gradient approximation (GGA) PBE, the hybrid functional HSE06, and many-body perturbation theory with the GW approximation using PBE wavefunctions (G0W0@PBE). We found that under biaxial strain with the experimental Poisson's ratio, the bandgap decreases at rates of 63 meV/% strain (PBE), 73 meV/% strain (HSE06), and 43 meV/% strain (G0W0@PBE), which are significantly smaller than the experimental rate. We also found that PBE predicts a similarly smaller rate (90 meV/% strain) for a different Poisson's ratio of 0.25. Spin-orbit correction (SOC) has little effect on the gap or its strain dependence. Additionally, we observed a semiconductor-to-metal transition under an equal tensile biaxial strain of 10% and a transition from a direct to an indirect bandgap, consistent with previous theoretical work.
Submitted 11 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
Authors:
Tzu-Quan Lin,
Hung-yi Lee,
Hao Tang
Abstract:
Self-supervised speech models have been shown to be useful for various tasks, but their large size limits their use in devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward process of a network early. Most approaches to early exit need a separate early exit model for each task, with some even requiring fine-tuning of the entire pretrained model. We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple rounds of training and fine-tuning. DAISY matches the performance of HuBERT on the MiniSUPERB benchmark, but with much faster inference times. Our analysis of the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data and exits late (using more layers) on noisy data, dynamically adjusting the computational cost of inference based on the noise level of each sample.
Submitted 8 June, 2024;
originally announced June 2024.
-
Efficient backward x-ray emission in a finite-length plasma irradiated by a laser pulse of ps duration
Authors:
I-Lin Yeh,
Kavin Tangtartharakul,
Hongmei Tang,
Louise Willingale,
Alexey Arefiev
Abstract:
Motivated by experiments employing ps-long, kilojoule laser pulses, we examined x-ray emission in a finite-length underdense plasma irradiated by such a pulse using two-dimensional particle-in-cell simulations. We found that, in addition to the expected forward emission, the plasma also efficiently emits in the backward direction. Our simulations reveal that the backward emission occurs when the laser exits the plasma. The longitudinal plasma electric field generated by the laser at the density down-ramp turns around some of the laser-accelerated electrons and re-accelerates them in the backward direction. As the electrons collide with the laser, they emit hard x-rays. The energy conversion efficiency is comparable to that for the forward emission, but the effective source size is smaller. We show that the ps laser duration is required for achieving a spatial overlap between the laser and the backward energetic electrons. At a peak laser intensity of $1.4\times 10^{20}~\rm{W/cm^2}$, backward emitted photons (energies above 100~keV and $10^{\circ}$ divergence angle) account for $2 \times 10^{-5}$ of the incident laser energy. This conversion efficiency is three times higher than that for similarly selected forward emitted photons. The source size of the backward photons ($5~\mu\rm{m}$) is three times smaller than the source size of the forward photons.
Submitted 6 June, 2024;
originally announced June 2024.
-
CADE: Cosine Annealing Differential Evolution for Spiking Neural Network
Authors:
Runhua Jiang,
Guodong Du,
Shuyang Yu,
Yifei Guo,
Sim Kuan Goh,
Ho-Kin Tang
Abstract:
Spiking neural networks (SNNs) have gained prominence for their potential in neuromorphic computing and energy-efficient artificial intelligence, yet optimizing them remains a formidable challenge for gradient-based methods due to their discrete, spike-based computation. This paper attempts to tackle these challenges by introducing Cosine Annealing Differential Evolution (CADE), designed to modulate the mutation factor (F) and crossover rate (CR) of differential evolution (DE) for the SNN model, i.e., Spiking Element Wise (SEW) ResNet. Extensive empirical evaluations were conducted to analyze CADE. CADE showed a balance in exploring and exploiting the search space, resulting in accelerated convergence and improved accuracy compared to existing gradient-based and DE-based methods. Moreover, an initialization method based on a transfer learning setting was developed, pretraining on a source dataset (i.e., CIFAR-10) and fine-tuning on the target dataset (i.e., CIFAR-100), to improve population diversity. It was found to further enhance CADE for SNN. Remarkably, CADE elevates the performance of the highest-accuracy SEW model by an additional 0.52 percentage points, underscoring its effectiveness in fine-tuning and enhancing SNNs. These findings emphasize the pivotal role of a scheduler for F and CR adjustment, especially for DE-based SNN. Source code on GitHub: https://github.com/Tank-Jiang/CADE4SNN.
Submitted 4 June, 2024;
originally announced June 2024.
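The CADE abstract above describes scheduling the mutation factor F and crossover rate CR of differential evolution with cosine annealing. A minimal sketch of that idea follows. It is not the authors' implementation: the schedule bounds (0.9 down to 0.1), the DE/rand/1/bin variant, and the toy `sphere` objective standing in for an SNN loss are all assumptions made for illustration.

```python
import math
import random


def cosine_anneal(step, total_steps, high, low):
    """Standard cosine annealing from `high` (first step) to `low` (last step)."""
    frac = step / max(total_steps - 1, 1)
    return low + 0.5 * (high - low) * (1.0 + math.cos(math.pi * frac))


def differential_evolution(loss, dim, pop_size=20, generations=100, seed=0):
    """DE/rand/1/bin where F and CR follow a cosine schedule (assumed bounds)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    fitness = [loss(x) for x in pop]
    for gen in range(generations):
        f = cosine_anneal(gen, generations, high=0.9, low=0.1)   # mutation factor F
        cr = cosine_anneal(gen, generations, high=0.9, low=0.1)  # crossover rate CR
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = [
                pop[a][k] + f * (pop[b][k] - pop[c][k])
                if (rng.random() < cr or k == forced) else pop[i][k]
                for k in range(dim)
            ]
            trial_fitness = loss(trial)
            if trial_fitness <= fitness[i]:  # greedy one-to-one selection
                pop[i], fitness[i] = trial, trial_fitness
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


def sphere(x):
    """Toy objective (hypothetical stand-in for a network loss)."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    best_x, best_fit = differential_evolution(sphere, dim=3)
    print(best_x, best_fit)
```

Large F and CR early on favour exploration, and the decay toward small values favours exploitation late in the run, which is one plausible reading of the exploration-exploitation balance the abstract reports.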
-
Demonstration of superior communication through thermodynamically free channels in an optical quantum switch
Authors:
Hao Tang,
Yu Guo,
Xiao-Min Hu,
Yun-Feng Huang,
Bi-Heng Liu,
Chuan-Feng Li,
Guang-Can Guo
Abstract:
The release of causal structure of physical events from a well-defined order to an indefinite one stimulates remarkable enhancements in various quantum information tasks. Some of these advantages, however, are questioned because of the ambiguous role of the control system in the quantum switch, an experimentally realized process with indefinite causal structure. In communications, for example, not only the superposition of alternative causal orders, but also the superposition of alternative trajectories can accelerate information transmission. Here, we follow the proposal of Liu et al. [Phys. Rev. Lett. 129, 230604 (2022)], and examine the information enhancement effect of indefinite causal orders with the toolkit of thermodynamics in a photonic platform. Specifically, we simulate the thermal interaction between a system qubit and two heat baths embedded in a quantum switch by implementing the corresponding switched thermal channels. Although its action on the system qubit only is thermally free, our results suggest that the quantum switch should be seen as a resource when the control qubit is also considered. Moreover, we characterize the non-Markovian property in this scenario by measuring the information backflows from the heat baths to the system qubit.
Submitted 4 June, 2024;
originally announced June 2024.
-
Context Gating in Spiking Neural Networks: Achieving Lifelong Learning through Integration of Local and Global Plasticity
Authors:
Jiangrong Shen,
Wenyao Ni,
Qi Xu,
Gang Pan,
Huajin Tang
Abstract:
Humans learn multiple tasks in succession with minimal mutual interference, through the context gating mechanism in the prefrontal cortex (PFC). The brain-inspired models of spiking neural networks (SNN) have drawn massive attention for their energy efficiency and biological plausibility. To overcome catastrophic forgetting when learning multiple tasks in sequence, current SNN models for lifelong learning focus on memory reserving or regularization-based modification, while SNNs that replicate human experimental behavior are lacking. Inspired by biological context-dependent gating mechanisms found in PFC, we propose an SNN with context gating trained by the local plasticity rule (CG-SNN) for lifelong learning. The iterative training between global and local plasticity for task units is designed to strengthen the connections between task neurons and hidden neurons and preserve the multi-task relevant information. The experiments show that the proposed model is effective in maintaining the past learning experience and has better task-selectivity than other methods during lifelong learning. Our results provide new insights that the CG-SNN model can extend context gating with good scalability on different SNN architectures with different spike-firing mechanisms. Thus, our models have good potential for parallel implementation on neuromorphic hardware and for modeling human behavior.
Submitted 3 June, 2024;
originally announced June 2024.
-
Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation
Authors:
Jingchang Chen,
Hongxuan Tang,
Zheng Chu,
Qianglong Chen,
Zekun Wang,
Ming Liu,
Bing Qin
Abstract:
Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverages self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: with FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.
Submitted 30 May, 2024;
originally announced May 2024.
-
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Authors:
Meng Cao,
Haoran Tang,
Jinfa Huang,
Peng Jin,
Can Zhang,
Ruyang Liu,
Long Chen,
Xiaodan Liang,
Li Yuan,
Ge Li
Abstract:
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods perform image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.
Submitted 29 May, 2024;
originally announced May 2024.
-
Expert-Guided Extinction of Toxic Tokens for Debiased Generation
Authors:
Xueyao Sun,
Kaize Shi,
Haoran Tang,
Guandong Xu,
Qing Li
Abstract:
Large language models (LLMs) can elicit social bias during generation, especially when performing inference with toxic prompts. Controlling the sensitive attributes in generation encounters challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand an extensive unbiased corpus, while direct prompting requires meticulously curated instructions for correcting the output in multiple rounds of thoughts but poses challenges on memory and inference latency. In this work, we propose the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate the undesired harmful outputs for LLMs without the aforementioned requirements. EXPOSED constructs a debiasing expert based on the abundant toxic corpus to expose and elicit the potentially dangerous tokens. It then processes the output to the LLMs and constructs a fair distribution by suppressing and attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that compared with other baselines, the proposed EXPOSED significantly reduces the potential social bias while balancing fairness and generation performance.
Submitted 29 May, 2024;
originally announced May 2024.
-
Dataset Growth
Authors:
Ziheng Qin,
Zhaopan Xu,
Yukun Zhou,
Zangwei Zheng,
Zebang Cheng,
Hao Tang,
Lei Shang,
Baigui Sun,
Xiaojiang Peng,
Radu Timofte,
Hongxun Yao,
Kai Wang,
Yang You
Abstract:
Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Data publicly available are from different sources with various qualities, and it is impractical to do manual cleaning against noise and redundancy given today's data scale. There are existing techniques for cleaning/selecting the collected data. However, these methods are mainly proposed for offline settings that target one of the cleanness and redundancy problems. In practice, data are growing exponentially with both problems. This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that keeps up to date with awareness of cleanliness and diversity. InfoGrowth can improve data quality/efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.
Submitted 28 May, 2024;
originally announced May 2024.
-
Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion
Authors:
Hongze Sun,
Rui Liu,
Wuque Cai,
Jun Wang,
Yue Wang,
Huajin Tang,
Yan Cui,
Dezhong Yao,
Daqing Guo
Abstract:
Visual object tracking, which is primarily based on visible light image sequences, encounters numerous challenges in complicated scenarios, such as low light conditions, high dynamic ranges, and background clutter. To address these challenges, incorporating the advantages of multiple visual modalities is a promising solution for achieving reliable object tracking. However, the existing approaches usually integrate multimodal inputs through adaptive local feature interactions, which cannot leverage the full potential of visual cues, thus resulting in insufficient feature modeling. In this study, we propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking. The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities and then uses a unified encoder to align the features across different domains. Moreover, we propose an enhanced transformer-based module to fuse multimodal features using attention mechanisms. With these methods, the MMHT model can effectively construct a multiscale and multidimensional visual feature space and achieve discriminative feature modeling. Extensive experiments demonstrate that the MMHT model exhibits competitive performance in comparison with that of other state-of-the-art methods. Overall, our results highlight the effectiveness of the MMHT model in terms of addressing the challenges faced in visual object tracking tasks.
Submitted 28 May, 2024;
originally announced May 2024.
-
Enhanced dissipation and temporal decay in the Euler-Poisson-Navier-Stokes equations
Authors:
Young-Pil Choi,
Houzhi Tang,
Weiyuan Zou
Abstract:
This paper investigates the global well-posedness and large-time behavior of solutions for a coupled fluid model in $\mathbb{R}^3$ consisting of the isothermal compressible Euler-Poisson system and incompressible Navier-Stokes equations coupled through the drag force. Notably, we exploit the dissipation effects inherent in the Poisson equation to achieve a faster decay of fluid density compared to velocities. This strategic utilization of dissipation, together with the influence of the electric field and the damping structure induced by the drag force, leads to a remarkable decay behavior: the fluid density converges to equilibrium at a rate of $(1+t)^{-11/4}$, significantly faster than the decay rates of velocity differences $(1+t)^{-7/4}$ and velocities themselves $(1+t)^{-3/4}$ in the $L^2$ norm. Furthermore, under the condition of vanishing coupled incompressible flow, we demonstrate an exponential decay to a constant state for the solution of the corresponding system, the damped Euler-Poisson system.
Submitted 28 May, 2024;
originally announced May 2024.
-
Optimal stability of Hardy-Littlewood-Sobolev and Sobolev inequalities of arbitrary orders with dimension-dependent constants
Authors:
Lu Chen,
Guozhen Lu,
Hanli Tang
Abstract:
Dolbeault-Esteban-Figalli-Frank-Loss [19] and Chen-Lu-Tang [17] established the optimal asymptotic lower bound for stability of the first-order Sobolev inequality and the fractional Sobolev inequality of order $s$ for $0<s<1$, respectively. However, this left the problem of the optimal lower bound for stability of the high-order Sobolev inequality and the high-order fractional Sobolev inequality unsolved. The purpose of this paper is to solve this problem.
The main difficulty lies in establishing the optimal asymptotic behavior for the local stability of the Sobolev inequality for all $0<s<n/2$. The proof of the local stability when $0<s\leq 1$ relies on ``cuttings'' at various heights, and this helps to split the $L^2$ integral of the first-order or fractional-order derivative of order $0<s<1$. However, this approach does not seem to work for $1<s<n/2$. In order to overcome this difficulty, we directly establish the local stability for the HLS inequality with the optimal asymptotic lower bounds.
To achieve our goal, we develop a new strategy based on the $H^{-s}$-decomposition instead of the $L^{\frac{2n}{n+2s}}$-decomposition to obtain the local stability of the HLS inequality with $L^{\frac{2n}{n+2s}}$-distance. This kind of ``new local stability'' also brings more difficulties to using the rearrangement flow to deduce the global stability from local stability because of the non-uniqueness of $\|r\|_{\frac{2n}{n+2s}}$ and non-continuity of the $\|r\|_{\frac{2n}{n+2s}}$ norm for the rearrangement flow. We establish the norm comparison theorem for $\|r\|_{\frac{2n}{n+2s}}$ and a ``new continuity'' theorem for the rearrangement flow to overcome this difficulty (see Lemma 3.1, Lemma 3.3 and Lemma 3.5).
Submitted 27 May, 2024;
originally announced May 2024.
-
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff
Authors:
Hao Tang,
Keya Hu,
Jin Peng Zhou,
Sicheng Zhong,
Wei-Long Zheng,
Xujie Si,
Kevin Ellis
Abstract:
Iteratively improving and repairing source code with large language models (LLMs), known as refinement, has emerged as a popular way of generating programs that would be too complex to construct in one shot. Given a bank of test cases, together with a candidate program, an LLM can improve that program by being prompted with failed test cases. But it remains an open question how to best iteratively refine code, with prior work employing simple greedy or breadth-first strategies. We show here that refinement exposes an explore-exploit tradeoff: exploit by refining the program that passes the most test cases, or explore by refining a less-considered program. We frame this as an arm-acquiring bandit problem, which we solve with Thompson Sampling. The resulting LLM-based program synthesis algorithm is broadly applicable: across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls.
Submitted 30 May, 2024; v1 submitted 26 May, 2024;
originally announced May 2024.
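The abstract above frames refinement as an arm-acquiring bandit solved with Thompson Sampling. The sketch below illustrates that framing on a toy problem and should not be read as the paper's algorithm: the Beta prior built from the test-pass fraction, the `prior_strength` parameter, and the `score` and `refine` callables (a stand-in for an LLM call) are all hypothetical.

```python
import random


def thompson_refine(seed_program, score, refine, budget, prior_strength=2.0, seed=0):
    """Pick which candidate to refine by sampling from per-candidate Beta posteriors.

    score(program)       -> fraction of test cases passed, in [0, 1] (assumed interface)
    refine(program, rng) -> a new candidate program (stand-in for an LLM call)
    """
    rng = random.Random(seed)

    def new_arm(program):
        s = score(program)
        # Assumed prior: candidates that pass more tests start out more optimistic.
        return {"program": program, "score": s,
                "alpha": 1.0 + prior_strength * s,
                "beta": 1.0 + prior_strength * (1.0 - s)}

    arms = [new_arm(seed_program)]
    for _ in range(budget):
        # Thompson Sampling: draw one sample per arm, refine the arm with the largest draw.
        chosen = max(arms, key=lambda a: rng.betavariate(a["alpha"], a["beta"]))
        child = new_arm(refine(chosen["program"], rng))
        arms.append(child)  # arm-acquiring: every refinement adds a new arm
        if child["score"] >= 1.0:
            return child["program"]
        chosen["beta"] += 1.0  # this refinement of `chosen` did not yield a solution
    return max(arms, key=lambda a: a["score"])["program"]


# Toy problem: "programs" are integers and the only fully passing program is 7.
def toy_score(p):
    return 1.0 - min(abs(p - 7), 10) / 10.0


def toy_refine(p, rng):
    return p + rng.choice([-1, 1])


if __name__ == "__main__":
    print(thompson_refine(0, toy_score, toy_refine, budget=200))
```

Incrementing `beta` after each unsuccessful refinement lowers the chance of revisiting a heavily refined candidate, while the score-based prior keeps high-scoring candidates attractive; that is the explore-exploit tension the abstract describes.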
-
Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum"
Authors:
MHD Anas Alsakkal,
Runze Wang,
Jayawan Wijekoon,
Huajin Tang
Abstract:
Spike-based encoders represent information as sequences of spikes or pulses, which are transmitted between neurons. A prevailing consensus suggests that spike-based approaches demonstrate exceptional capabilities in capturing the temporal dynamics of neural activity and have the potential to provide energy-efficient solutions for low-power applications. The Spiketrum encoder efficiently compresses input data using spike trains or code sets (for non-spiking applications) and is adaptable to both hardware and software implementations, with lossless signal reconstruction capability. The paper proposes and assesses Spiketrum's hardware, evaluating its output under varying spike rates and its classification performance with popular spiking and non-spiking classifiers, and also assessing the quality of information compression and hardware resource utilization. The paper extensively benchmarks both Spiketrum hardware and its software counterpart against state-of-the-art, biologically-plausible encoders. The evaluations encompass benchmarking criteria, including classification accuracy, training speed, and sparsity when using encoder outputs in pattern recognition and classification with both spiking and non-spiking classifiers. Additionally, they consider encoded output entropy and hardware resource utilization and power consumption of the hardware version of the encoders. Results demonstrate Spiketrum's superiority in most benchmarking criteria, making it a promising choice for various applications. It efficiently utilizes hardware resources with low power consumption, achieving high classification accuracy. This work also emphasizes the potential of encoders in spike-based processing to improve the efficiency and performance of neural computing systems.
Submitted 31 May, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Exotic d-wave Bose Metal in two dimensions
Authors:
Zhangkai Cao,
Jiahao Su,
Jianyu Li,
Tao Ying,
WanSheng Wang,
Jin-Hua Sun,
Ho-Kin Tang,
Haiqing Lin
Abstract:
The Landau Fermi liquid theory, a cornerstone in condensed matter physics, encounters limitations in explaining certain phenomena, like the peculiar behavior of strange metals in high-temperature superconductors. Non-Fermi liquids, like Bose metals with an uncondensed bosonic ground state, offer potential explanations, yet constructing an elusive Bose metal phase in two dimensions (2D) remains a formidable challenge. Utilizing constraint path quantum Monte Carlo and functional renormalization group methods on a fermionic system with spin anisotropy in a 2D lattice, we reveal the emergence of a Cooper pair Bose metal in a highly anisotropic regime (a < 0.30) over a wide range of fillings, most notably at a filling fraction of n~0.8. Our findings exhibit a visible nonzero momentum Bose surface in the Cooper-pair distribution function, accompanied by a distinct signal of dxy correlation between pairs. Our results highlight that spin-dependent anisotropy in the Fermi surface leads to versatile pairing forms. Platforms such as ultracold atoms in optical lattices and recently proposed altermagnets hold promise for realizing this intriguing phase.
Submitted 24 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Collaboration of Teachers for Semi-supervised Object Detection
Authors:
Liyu Chen,
Huaao Tang,
Yi Wen,
Hanting Chen,
Wei Li,
Junchao Liu,
Jie Hu
Abstract:
Recent semi-supervised object detection (SSOD) has achieved remarkable progress by leveraging unlabeled data for training. Mainstream SSOD methods rely on Consistency Regularization methods and Exponential Moving Average (EMA), which form a cyclic data flow. However, the EMA updating training approach leads to weight coupling between the teacher and student models. This coupling in a cyclic data flow reduces the utilization of unlabeled data information and causes confirmation bias on low-quality or erroneous pseudo-labels. To address these issues, we propose the Collaboration of Teachers Framework (CTF), which consists of multiple pairs of teacher and student models for training. In the learning process of CTF, the Data Performance Consistency Optimization module (DPCO) informs the best pair of teacher models possessing the optimal pseudo-labels during the past training process, and these most reliable pseudo-labels generated by the best performing teacher would guide the other student models. As a consequence, this framework greatly improves the utilization of unlabeled data and prevents the positive feedback cycle of unreliable pseudo-labels. The CTF achieves outstanding results on numerous SSOD datasets, including a 0.71% mAP improvement on the 10% annotated COCO dataset and a 0.89% mAP improvement on the VOC dataset compared to LabelMatch, and it converges significantly faster. Moreover, the CTF is plug-and-play and can be integrated with other mainstream SSOD methods.
Submitted 22 May, 2024;
originally announced May 2024.
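The abstract above attributes teacher-student weight coupling to the EMA update used by mainstream SSOD methods. As a small illustration of that standard update only (not of CTF or its DPCO module), the sketch below applies teacher <- m * teacher + (1 - m) * student to plain lists of floats. The function name and the flat-list weight representation are assumptions for illustration.

```python
def ema_update(teacher, student, momentum=0.999):
    """Standard EMA step: each teacher weight moves a fraction (1 - momentum) toward the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [1.0, -2.0]
    student = [0.0, 0.0]
    # With the student held fixed, the teacher decays geometrically toward it,
    # which shows how closely the two sets of weights are tied together.
    for _ in range(100):
        teacher = ema_update(teacher, student, momentum=0.5)
    print(teacher)
```

Because the teacher is a running average of the student, errors the student absorbs from pseudo-labels feed back into the teacher that produced those labels, which is the cyclic flow the abstract refers to.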
-
Interpretable Spatio-Temporal Embedding for Brain Structural-Effective Network with Ordinary Differential Equation
Authors:
Haoteng Tang,
Guodong Liu,
Siyuan Dai,
Kai Ye,
Kun Zhao,
Wenlu Wang,
Carl Yang,
Lifang He,
Alex Leow,
Paul Thompson,
Heng Huang,
Liang Zhan
Abstract:
The MRI-derived brain network serves as a pivotal instrument in elucidating both the structural and functional aspects of the brain, encompassing the ramifications of diseases and developmental processes. However, prevailing methodologies, often focusing on synchronous BOLD signals from functional MRI (fMRI), may not capture directional influences among brain regions and rarely tackle temporal functional dynamics. In this study, we first construct the brain-effective network via the dynamic causal model. Subsequently, we introduce an interpretable graph learning framework termed Spatio-Temporal Embedding ODE (STE-ODE). This framework incorporates specifically designed directed node embedding layers, aiming at capturing the dynamic interplay between structural and effective networks via an ordinary differential equation (ODE) model, which characterizes spatial-temporal brain dynamics. Our framework is validated on several clinical phenotype prediction tasks using two independent publicly available datasets (HCP and OASIS). The experimental results clearly demonstrate the advantages of our model compared to several state-of-the-art methods.
Submitted 21 May, 2024;
originally announced May 2024.