Effective Motion Modeling for UAV-platform Multiple Object Tracking with Re-Margin Loss

Authors: Mufeng Yao, Jinlong Peng, Qingdong He, Bo Peng, Hao Chen, Mingmin Chi, Chao Liu, Jon Atli Benediktsson

Abstract: Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling. This is because UAV-MOT faces tracking difficulties caused by large and irregular motion, and insufficient training due to the motion long-tailed distribution of current UAV-MOT datasets. Previous UAV-MOT methods either extract motion and detection features redundantly or supervise motion model in a sparse scheme, which limited their tracking performance and speed. To this end, we propose a flowing-by-detection module to realize accurate motion modeling with a minimum cost. Focusing on the motion long-tailed problem that were ignored by previous works, the flow-guided margin loss is designed to enable more complete training of large moving objects. Experiments on two widely open-source datasets show that our proposed model can successfully track objects with large and irregular motion and outperform existing state-of-the-art methods in UAV-MOT tasks.

Submitted 15 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2308.07207

arXiv:2407.10481 [pdf, other]

doi 10.1145/3641519.3657492

SuperPADL: Scaling Language-Directed Physics-Based Control with Progressive Supervised Distillation

Authors: Jordan Juravsky, Yunrong Guo, Sanja Fidler, Xue Bin Peng

Abstract: Physically-simulated models for human motion can generate high-quality responsive character animations, often in real-time. Natural language serves as a flexible interface for controlling these models, allowing expert and non-expert users to quickly create and edit their animations. Many recent physics-based animation methods, including those that use text interfaces, train control policies using reinforcement learning (RL). However, scaling these methods beyond several hundred motions has remained challenging. Meanwhile, kinematic animation models are able to successfully learn from thousands of diverse motions by leveraging supervised learning methods. Inspired by these successes, in this work we introduce SuperPADL, a scalable framework for physics-based text-to-motion that leverages both RL and supervised learning to train controllers on thousands of diverse motion clips. SuperPADL is trained in stages using progressive distillation, starting with a large number of specialized experts using RL. These experts are then iteratively distilled into larger, more robust policies using a combination of reinforcement learning and supervised learning. Our final SuperPADL controller is trained on a dataset containing over 5000 skills and runs in real time on a consumer GPU. Moreover, our policy can naturally transition between skills, allowing for users to interactively craft multi-stage animations. We experimentally demonstrate that SuperPADL significantly outperforms RL-based baselines at this large data scale.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.06584 [pdf, other]

HiLMa-Res: A General Hierarchical Framework via Residual RL for Combining Quadrupedal Locomotion and Manipulation

Authors: Xiaoyu Huang, Qiayuan Liao, Yiming Ni, Zhongyu Li, Laura Smith, Sergey Levine, Xue Bin Peng, Koushil Sreenath

Abstract: This work presents HiLMa-Res, a hierarchical framework leveraging reinforcement learning to tackle manipulation tasks while performing continuous locomotion using quadrupedal robots. Unlike most previous efforts that focus on solving a specific task, HiLMa-Res is designed to be general for various loco-manipulation tasks that require quadrupedal robots to maintain sustained mobility. The novel design of this framework tackles the challenges of integrating continuous locomotion control and manipulation using legs. It develops an operational space locomotion controller that can track arbitrary robot end-effector (toe) trajectories while walking at different velocities. This controller is designed to be general to different downstream tasks, and therefore, can be utilized in high-level manipulation planning policy to address specific tasks. To demonstrate the versatility of this framework, we utilize HiLMa-Res to tackle several challenging loco-manipulation tasks using a quadrupedal robot in the real world. These tasks span from leveraging state-based policy to vision-based policy, from training purely from the simulation data to learning from real-world data. In these tasks, HiLMa-Res shows better performance than other methods.

Submitted 9 July, 2024; originally announced July 2024.

Comments: IROS 2024

arXiv:2407.05324 [pdf, other]

PICA: Physics-Integrated Clothed Avatar

Authors: Bo Peng, Yunfan Tao, Haoyu Zhan, Yudong Guo, Juyong Zhang

Abstract: We introduce PICA, a novel representation for high-fidelity animatable clothed human avatars with physics-accurate dynamics, even for loose clothing. Previous neural rendering-based representations of animatable clothed humans typically employ a single model to represent both the clothing and the underlying body. While efficient, these approaches often fail to accurately represent complex garment dynamics, leading to incorrect deformations and noticeable rendering artifacts, especially for sliding or loose garments. Furthermore, previous works represent garment dynamics as pose-dependent deformations and facilitate novel pose animations in a data-driven manner. This often results in outcomes that do not faithfully represent the mechanics of motion and are prone to generating artifacts in out-of-distribution poses. To address these issues, we adopt two individual 3D Gaussian Splatting (3DGS) models with different deformation characteristics, modeling the human body and clothing separately. This distinction allows for better handling of their respective motion characteristics. With this representation, we integrate a graph neural network (GNN)-based clothed body physics simulation module to ensure an accurate representation of clothing dynamics. Our method, through its carefully designed features, achieves high-fidelity rendering of clothed human bodies in complex and novel driving poses, significantly outperforming previous methods under the same settings.

Submitted 7 July, 2024; originally announced July 2024.

Comments: Project page: https://ustc3dv.github.io/PICA/

arXiv:2407.00617 [pdf, other]

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu

Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
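
Note on the Bradley-Terry (BT) assumption mentioned in the abstract above: as a reminder of the standard textbook form (not taken from the paper itself), the BT model posits a scalar reward and scores a pairwise preference with a logistic function of the reward difference. The symbols $x$ (prompt), $y_1, y_2$ (responses), $r$ (reward) and $\sigma$ (logistic sigmoid) are notation assumed here for illustration.

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\exp\bigl(r(x, y_1)\bigr)}{\exp\bigl(r(x, y_1)\bigr) + \exp\bigl(r(x, y_2)\bigr)}
  = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)
```

Because every preference is derived from a single scalar reward, this form cannot express intransitive (cyclic) preferences, which is one common reading of why a general preference framework is of interest.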

Submitted 7 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00320 [pdf, other]

LiteSearch: Efficacious Tree Search for LLM

Authors: Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Dian Yu, Haitao Mi, Jinsong Su, Dong Yu

Abstract: Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19131 [pdf, other]

CELLO: Causal Evaluation of Large Vision-Language Models

Authors: Meiqi Chen, Bo Peng, Yan Zhang, Chaochao Lu

Abstract: Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments. Despite recent advancements in large vision-language models (LVLMs), their ability to comprehend causality remains unclear. Previous work typically focuses on commonsense causality between events and/or actions, which is insufficient for applications like embodied agents and lacks the explicitly defined causal graphs required for formal causal reasoning. To overcome these limitations, we introduce a fine-grained and unified definition of causality involving interactions between humans and/or objects. Building on the definition, we construct a novel dataset, CELLO, consisting of 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. This dataset surpasses traditional commonsense causality by including explicit causal graphs that detail the interactions between humans and objects. Extensive experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks, but they can benefit significantly from our proposed CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Both quantitative and qualitative analyses from this study provide valuable insights for future research. Our project page is at https://github.com/OpenCausaLab/CELLO.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.17338 [pdf, other]

Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

Authors: Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Xi Wu, Xin Wang

Abstract: Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propose an innovative deep learning framework that combines feature decoupling and adaptive adversarial training. Firstly, we employ two iteratively compressed decouplers to supervised decouple common features and specific features related to fatty liver in abdominal ultrasound images. Subsequently, the decoupled features are concatenated with the original image after transforming the color space and are fed into the classifier. During adversarial training, we adaptively adjust the perturbation and balance the adversarial strength by the accuracy of each class. The model will eliminate recognition weaknesses by correctly classifying adversarial samples, thus improving recognition robustness. Finally, the accuracy of our method improved by 4.16%, achieving 82.95%. As demonstrated by extensive experiments, our method is a generalized learning framework that can be directly used to eliminate the recognition weaknesses of any classifier while improving its average performance. Code is available at https://github.com/HP-ML/MICCAI2024.

Submitted 25 June, 2024; originally announced June 2024.

Comments: MICCAI 2024

arXiv:2406.11937 [pdf, other]

Using graph neural networks to reconstruct charged pion showers in the CMS High Granularity Calorimeter

Authors: M. Aamir, B. Acar, G. Adamov, T. Adams, C. Adloff, S. Afanasiev, C. Agrawal, C. Agrawal, A. Ahmad, H. A. Ahmed, S. Akbar, N. Akchurin, B. Akgul, B. Akgun, R. O. Akpinar, E. Aktas, A. AlKadhim, V. Alexakhin, J. Alimena, J. Alison, A. Alpana, W. Alshehri, P. Alvarez Dominguez, M. Alyari, C. Amendola, et al. (550 additional authors not shown)

Abstract: A novel method to reconstruct the energy of hadronic showers in the CMS High Granularity Calorimeter (HGCAL) is presented. The HGCAL is a sampling calorimeter with very fine transverse and longitudinal granularity. The active media are silicon sensors and scintillator tiles readout by SiPMs and the absorbers are a combination of lead and Cu/CuW in the electromagnetic section, and steel in the hadronic section. The shower reconstruction method is based on graph neural networks and it makes use of a dynamic reduction network architecture. It is shown that the algorithm is able to capture and mitigate the main effects that normally hinder the reconstruction of hadronic showers using classical reconstruction methods, by compensating for fluctuations in the multiplicity, energy, and spatial distributions of the shower's constituents. The performance of the algorithm is evaluated using test beam data collected in 2018 prototype of the CMS HGCAL accompanied by a section of the CALICE AHCAL prototype. The capability of the method to mitigate the impact of energy leakage from the calorimeter is also demonstrated.

Submitted 30 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Prepared for submission to JINST

arXiv:2406.11528 [pdf, other]

Optimal Robust Contract Design

Authors: Bo Peng, Zhihao Gavin Tang

Abstract: We consider the robust contract design problem when the principal only has limited information about the actions the agent can take. The principal evaluates a contract according to its worst-case performance caused by the uncertain action space. Carroll (AER 2015) showed that a linear contract is optimal among deterministic contracts. Recently, Kambhampati (JET 2023) showed that the principal's payoff can be strictly increased via randomization over linear contracts. In this paper, we characterize the optimal randomized contract, which remains linear and admits a closed form of its cumulative density function. The advantage of randomized contracts over deterministic contracts can be arbitrarily large even when the principal knows only one non-trivial action of the agent. Furthermore, our result generalizes to the model of contracting with teams, by Dai and Toikka (Econometrica 2022).

Submitted 17 June, 2024; originally announced June 2024.

Comments: Full version of EC 2024 paper

arXiv:2406.10938 [pdf, other]

doi 10.14778/3665844.3665854

DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search

Authors: Jiuqi Wei, Botao Peng, Xiaodong Lee, Themis Palpanas

Abstract: Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search in high-dimensional spaces due to its robust theoretical guarantee on query accuracy. Traditional LSH-based methods mainly focus on improving the efficiency and accuracy of the query phase by designing different query strategies, but pay little attention to improving the efficiency of the indexing phase. They typically fine-tune existing data-oriented partitioning trees to index data points and support their query strategies. However, their strategy to directly partition the multi-dimensional space is time-consuming, and performance degrades as the space dimensionality increases. In this paper, we design an encoding-based tree called Dynamic Encoding Tree (DE-Tree) to improve the indexing efficiency and support efficient range queries based on Euclidean distance. Based on DE-Tree, we propose a novel LSH scheme called DET-LSH. DET-LSH adopts a novel query strategy, which performs range queries in multiple independent index DE-Trees to reduce the probability of missing exact NN points, thereby improving the query accuracy. Our theoretical studies show that DET-LSH enjoys probabilistic guarantees on query accuracy. Extensive experiments on real-world datasets demonstrate the superiority of DET-LSH over the state-of-the-art LSH-based methods on both efficiency and accuracy. While achieving better query accuracy than competitors, DET-LSH achieves up to 6x speedup in indexing time and 2x speedup in query time over the state-of-the-art LSH-based methods. This paper was published in PVLDB 2024.
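
For readers unfamiliar with the LSH background the abstract above builds on, here is a minimal sketch of the classic random-projection hash family for Euclidean distance, h(x) = floor((a·x + b) / w). It is generic background only and is not DET-LSH or the DE-Tree; the function names, the bucket width `width`, and the sample points are all assumptions chosen for illustration.

```python
import math
import random


def make_hash(dim, width, rng):
    """Build one hash h(x) = floor((a.x + b) / width) with Gaussian a."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, width)                    # random offset in [0, width)

    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / width)

    return h


def make_signature(dim, k, width, seed=0):
    """Concatenate k independent hashes into one bucket key (a tuple of ints)."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, width, rng) for _ in range(k)]
    return lambda x: tuple(h(x) for h in hashes)


# Hypothetical usage: points that are close in Euclidean distance are more
# likely (not guaranteed) to share a bucket key than points that are far apart.
sig = make_signature(dim=4, k=3, width=4.0, seed=42)
p = [1.0, 2.0, 3.0, 4.0]
print(sig(p))
```

A larger `width` makes collisions more likely for any pair of points, while a larger `k` makes bucket keys more selective; practical schemes balance the two and query several independent tables.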

Submitted 16 June, 2024; originally announced June 2024.

Journal ref: PVLDB, 17(9): 2241 - 2254, 2024

arXiv:2406.09399 [pdf, other]

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Authors: Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang

Abstract: Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method. Code is available at https://github.com/FoundationVision/OmniTokenizer.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08580 [pdf, other]

Anomalous Enhancement of the Electrocatalytic Hydrogen Evolution Reaction in AuPt Nanoclusters

Authors: Jiahui Kang, Jan Kloppenburg, Jiali Sheng, Zhenyu Xu, Kristoffer Meinander, Hua Jiang, Zhong-Peng Lv, Esko I. Kauppinen, Qiang Zhang, Xi Chen, Olli Ikkala, Miguel A. Caro, Bo Peng

Abstract: Energy- and resource-efficient electrocatalytic water splitting is of paramount importance to enable sustainable hydrogen production. The best bulk catalyst for the hydrogen evolution reaction (HER), i.e., platinum, is one of the scarcest elements on Earth. The use of raw material for HER can be dramatically reduced by utilizing nanoclusters. In addition, nanoalloying can further improve the performance of these nanoclusters. In this paper, we present results for HER on nanometer-sized ligand-free AuPt nanoclusters grafted on carbon nanotubes. These results demonstrate excellent monodispersity and a significant reduction of the overpotential for the electrocatalytic HER. We utilize atomistic machine learning techniques to elucidate the atomic-scale origin of the synergistic effect between Pt and Au. We show that the presence of surface Au atoms, known to be poor HER catalysts, in a Pt(core)/AuPt(shell) nanocluster structure, drives an anomalous enhancement of the inherently high catalytic activity of Pt atoms.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.06615 [pdf, other]

Language Guided Skill Discovery

Authors: Seungeun Rho, Laura Smith, Tianyu Li, Sergey Levine, Xue Bin Peng, Sehoon Ha

Abstract: Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for unknown downstream tasks, obtaining a semantically diverse repertoire of skills is essential. While some approaches introduce a discriminator to distinguish skills and others aim to increase state coverage, no existing work directly addresses the "semantic diversity" of skills. We hypothesize that leveraging the semantic knowledge of large language models (LLMs) can lead us to improve semantic diversity of resulting behaviors. In this sense, we introduce Language Guided Skill Discovery (LGSD), a skill discovery framework that aims to directly maximize the semantic diversity between skills. LGSD takes user prompts as input and outputs a set of semantically distinctive skills. The prompts serve as a means to constrain the search space into a semantically desired subspace, and the generated LLM outputs guide the agent to visit semantically diverse states within the subspace. We demonstrate that LGSD enables legged robots to visit different user-intended areas on a plane by simply changing the prompt. Furthermore, we show that language guidance aids in discovering more diverse skills compared to five existing skill discovery methods in robot-arm manipulation environments. Lastly, LGSD provides a simple way of utilizing learned skills via natural language.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.06525 [pdf, other]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Authors: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

Abstract: We introduce LlamaGen, a new family of image generation models that apply original "next-token prediction" paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Codes and models: https://github.com/FoundationVision/LlamaGen

arXiv:2406.06326 [pdf, other]

Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

Authors: Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng

Abstract: Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM's ability to effectively acquire new knowledge from raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. In addition, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM's knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.

Submitted 15 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: 30 pages

arXiv:2406.05652 [pdf, other]

Distributed Combinatorial Optimization of Downlink User Assignment in mmWave Cell-free Massive MIMO Using Graph Neural Networks

Authors: Bile Peng, Bihan Guo, Karl-Ludwig Besser, Luca Kunz, Ramprasad Raghunath, Anke Schmeink, Eduard A Jorswieck, Giuseppe Caire, H. Vincent Poor

Abstract: Millimeter wave (mmWave) cell-free massive MIMO (CF mMIMO) is a promising solution for future wireless communications. However, its optimization is non-trivial due to the challenging channel characteristics. We show that mmWave CF mMIMO optimization is largely an assignment problem between access points (APs) and users due to the high path loss of mmWave channels, the limited output power of the amplifier, and the almost orthogonal channels between users given a large number of AP antennas. The combinatorial nature of the assignment problem, the requirement for scalability, and the distributed implementation of CF mMIMO make this problem difficult. In this work, we propose an unsupervised machine learning (ML) enabled solution. In particular, a graph neural network (GNN) customized for scalability and distributed implementation is introduced. Moreover, the customized GNN architecture is hierarchically permutation-equivariant (HPE), i.e., if the APs or users of an AP are permuted, the output assignment is automatically permuted in the same way. To address the combinatorial problem, we relax it to a continuous problem, and introduce an information entropy-inspired penalty term. The training objective is then formulated using the augmented Lagrangian method (ALM). The test results show that the realized sum-rate outperforms that of the generalized serial dictatorship (GSD) algorithm and is very close to the upper bound in a small network scenario, while the upper bound is impossible to obtain in a large network scenario.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.04316 [pdf, other]

Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

Authors: Jiyao Zhang, Weiyao Huang, Bo Peng, Mingdong Wu, Fei Hu, Zijian Chen, Bo Zhao, Hao Dong

Abstract: 6D Object Pose Estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 332K images annotated with over 1.5M annotations across 581 instances in 149 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 475K images created in a mixed reality setting with depth simulation, annotated with over 5M annotations across 4162 instances in the same 149 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to the substantial variations and ambiguities. To address this challenge, we introduce GenPose++, an enhanced version of the SOTA category-level pose estimation framework, incorporating two pivotal improvements: Semantic-aware feature extraction and Clustering-based aggregation. Moreover, we provide a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset in the realms of 6D object pose estimation and pose tracking.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.02357 [pdf, ps, other]

The complexity of approximate (coarse) correlated equilibrium for incomplete information games

Authors: Binghui Peng, Aviad Rubinstein

Abstract: We study the iteration complexity of decentralized learning of approximate correlated equilibria in incomplete information games. On the negative side, we prove that in $\mathit{extensive}$-$\mathit{form}$ $\mathit{games}$, assuming $\mathsf{PPAD} \not\subset \mathsf{TIME}(n^{\mathsf{polylog}(n)})$, any polynomial-time learning algorithm must take at least $2^{\log_2^{1-o(1)}(|\mathcal{I}|)}$ iterations to converge to the set of $ε$-approximate correlated equilibria, where $|\mathcal{I}|$ is the number of nodes in the game and $ε > 0$ is an absolute constant. This nearly matches, up to the $o(1)$ term, the algorithms of [PR'24, DDFG'24] for learning $ε$-approximate correlated equilibrium, and resolves an open question of Anagnostides, Kalavasis, Sandholm, and Zampetakis [AKSZ'24]. Our lower bound holds even for the easier solution concept of $ε$-approximate $\mathit{coarse}$ correlated equilibrium. On the positive side, we give uncoupled dynamics that reach $ε$-approximate correlated equilibria of a $\mathit{Bayesian}$ $\mathit{game}$ in polylogarithmic iterations, without any dependence on the number of types. This demonstrates a separation between Bayesian games and extensive-form games.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01239 [pdf, other]

Tighter yet more tractable relaxations and nontrivial instance generation for sparse standard quadratic optimization

Authors: Immanuel Bomze, Bo Peng, Yuzhou Qiu, E. Alper Yildirim

Abstract: The Standard Quadratic optimization Problem (StQP), arguably the simplest among all classes of NP-hard optimization problems, consists of extremizing a quadratic form (the simplest nonlinear polynomial) over the standard simplex (the simplest polytope/compact feasible set). As a problem class, StQPs may be nonconvex with an exponential number of inefficient local solutions. StQPs arise in a multitude of applications, among them mathematical finance, machine learning (clustering), and modeling in biosciences (e.g., selection and ecology). This paper deals with such StQPs under an additional sparsity or cardinality constraint, which, even for convex objectives, renders NP-hard problems. One motivation to study StQPs under such sparsity restrictions is the high-dimensional portfolio selection problem with too many assets to handle, in particular, in the presence of transaction costs. Here, relying on modern conic optimization techniques, we present tractable convex relaxations for this relevant but difficult problem. We propose novel equivalent reformulations of these relaxations with significant dimensional reduction, which is essential for the tractability of these relaxations when the problem size grows. Moreover, we propose an instance generation procedure which systematically avoids too easy instances. Our extensive computational results illustrate the high quality of the relaxation bounds in a significant number of instances.
Furthermore, in contrast with exact mixed-integer quadratic programming models, the solution time of the relaxations is very robust to the choices of the problem parameters. In particular, the reduced formulations achieve significant improvements in terms of the solution time over their counterparts.

Submitted 3 June, 2024; originally announced June 2024.

Comments: Technical Report, School of Mathematics, The University of Edinburgh, Edinburgh, EH9 3FD, Scotland, United Kingdom

MSC Class: 90C11; 90C20; 90C22

arXiv:2406.01238 [pdf, other]

EffiQA: Efficient Question-Answering with Strategic Multi-Model Collaboration on Knowledge Graphs

Authors: Zixuan Dong, Baoyun Peng, Yufei Wang, Jia Fu, Xiaodong Wang, Yongxue Shan, Xin Zhou

Abstract: While large language models (LLMs) have shown remarkable capabilities in natural language processing, they struggle with complex, multi-step reasoning tasks involving knowledge graphs (KGs). Existing approaches that integrate LLMs and KGs either underutilize the reasoning abilities of LLMs or suffer from prohibitive computational costs due to tight coupling. To address these limitations, we propose a novel collaborative framework named EffiQA that can strike a balance between performance and efficiency via an iterative paradigm. EffiQA consists of three stages: global planning, efficient KG exploration, and self-reflection. Specifically, EffiQA leverages the commonsense capability of LLMs to explore potential reasoning pathways through global planning. Then, it offloads semantic pruning to a small plug-in model for efficient KG exploration. Finally, the exploration results are fed to LLMs for self-reflection to further improve the global planning and efficient KG exploration. Empirical evidence on multiple KBQA benchmarks shows EffiQA's effectiveness, achieving an optimal balance between reasoning accuracy and computational costs. We hope the proposed new framework will pave the way for efficient, knowledge-intensive querying by redefining the integration of LLMs and KGs, fostering future research on knowledge-based question answering.

Submitted 7 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: 10 pages, 4 figures, 3 tables

arXiv:2406.00989 [pdf, other]

On the exact limit of the time-dependent coupled cluster ansatz and its approximations in the real-time equation-of-motion coupled cluster cumulant Green's function approach

Authors: Bo Peng, Himadri Pathak, Ajay Panyala, Fernando D. Vila, John J. Rehr, Karol Kowalski

Abstract: In this paper, we analyze the properties of the recently proposed real-time equation-of-motion coupled-cluster (RT-EOM-CC) cumulant Green's function approach [J. Chem. Phys. 2020, 152, 174113]. We specifically focus on identifying the limitations of the original time-dependent coupled cluster (TDCC) ansatz and propose an enhanced extended TDCC ansatz ensuring the exactness in the expansion limit. Additionally, we introduce a practical cluster-analysis-based approach for characterizing the peaks in the computed spectral function from the RT-EOM-CC cumulant Green's function approach, which is particularly useful for the assignments of satellite peaks when many-body effects dominate the spectra. Our preliminary numerical tests focus on reproducing, approximating, and characterizing the exact impurity Green's function of the three-site and four-site single impurity Anderson models using the RT-EOM-CC cumulant Green's function approach. The numerical tests allow us to have a direct comparison between the RT-EOM-CC cumulant Green's function approach and other Green's function approaches in the numerical exact limit.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00941 [pdf, ps, other]

A Robust Residual-Based Test for Structural Changes in Factor Models

Authors: Bin Peng, Liangjun Su, Yayi Yan

Abstract: In this paper, we propose an easy-to-implement residual-based specification testing procedure for detecting structural changes in factor models, which is powerful against both smooth and abrupt structural changes with unknown break dates. The proposed test is robust against the over-specified number of factors, and serially and cross-sectionally correlated error processes. A new central limit theorem is given for the quadratic forms of panel data with dependence over both dimensions, thereby filling a gap in the literature. We establish the asymptotic properties of the proposed test statistic, and accordingly develop a simulation-based scheme to select critical value in order to improve finite sample performance. Through extensive simulations and a real-world application, we confirm our theoretical results and demonstrate that the proposed test exhibits desirable size and power in practice.

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.15526 [pdf]

Syngas conversion to higher alcohols via wood-framed Cu/Co-carbon catalyst

Authors: Guihua Yan, Paulina Pršlja, Gaofeng Chen, Jiahui Kang, Yongde Liu, Miguel A. Caro, Xi Chen, Xianhai Zeng, Bo Peng

Abstract: Syngas conversion into higher alcohols represents a promising avenue for transforming coal or biomass into liquid fuels. However, the commercialization of this process has been hindered by the high cost, low activity, and inadequate C$_{2+}$OH selectivity of catalysts. Herein, we have developed Cu/Co carbon wood catalysts, offering a cost-effective and stable alternative with exceptional selectivity for catalytic conversion. The formation of Cu/Co nanoparticles was found to be influenced by the water-1,2-propylene glycol ratio in the solution, resulting in bidisperse nanoparticles. The catalyst exhibited a remarkable CO conversion rate of 74.8% and a selectivity of 58.7% for C$_{2+}$OH, primarily comprising linear primary alcohols. This catalyst demonstrated enduring stability and selectivity under industrial conditions, maintaining its efficacy for up to 350 h of operation. We also employed density functional theory (DFT) to analyze selectivity, particularly focusing on the binding strength of CO, a crucial precursor for subsequent reactions leading to the formation of CH$_3$OH. DFT identified the pathway of CH$_x$ and CO coupling, ultimately yielding C$_2$H$_5$OH. This computational understanding, coupled with the high performance of the Cu/Co-carbon wood catalyst, paves the way for the development of catalytically selective materials tailored for higher alcohols production, thereby ushering in new possibilities in this field.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.11126 [pdf, other]

Flexible Motion In-betweening with Diffusion Models

Authors: Setareh Cohan, Guy Tevet, Daniele Reda, Xue Bin Peng, Michiel van de Panne

Abstract: Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous inbetweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI) which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.

Submitted 23 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

Comments: SIGGRAPH 2024. For project page and code, see https://setarehc.github.io/CondMDI/

arXiv:2405.08998 [pdf, other]

Puffy Venuses: the Mass-Radius Impact of Carbon-Rich Atmospheres on Lava Worlds

Authors: Bo Peng, Diana Valencia

Abstract: The recent advancements in exoplanet observations enable the potential detection of exo-Venuses, rocky planets with carbon-rich atmospheres. How extended these atmospheres can be, given high carbon abundances, has not been studied. To answer this, we present a model for a theoretical class of exoplanets - puffy Venuses - characterized by thick, carbon-dominated atmospheres in equilibrium with global magma oceans. Our model accounts for carbon and hydrogen partition between the atmosphere and the magma ocean, as well as the C-H-O equilibrium chemistry throughout a semi-grey, radiative-convective atmosphere. We find that radius inflation by puffy Venus atmospheres is significant on small and irradiated planets: carbon content of 1200 ppm (or that of ordinary chondrites) can generate an atmosphere of ~0.16 - 0.3 $R_{\oplus}$ for an Earth-mass planet with equilibrium temperatures of 1500 to 2000 K. We identify TOI-561 b as an especially promising puffy Venus candidate, whose under-density could be attributed to a thick C-rich atmosphere. We also advocate for a puffy Venus interpretation of 55 Cancri e, where recent JWST observation indicates the presence of a CO/CO2 atmosphere. Puffy Venuses may thus constitute a testable alternative interpretation for the interior structure of underdense low-mass exoplanets.

Submitted 14 May, 2024; originally announced May 2024.

Comments: V3, under review in ApJL. We welcome & appreciate your comments

arXiv:2405.07420 [pdf, other]

Robust Inference for High-Dimensional Panel Data Models

Authors: Jiti Gao, Bin Peng, Yayi Yan

Abstract: In this paper, we propose a robust estimation and inferential method for high-dimensional panel data models. Specifically, (1) we investigate the case where the number of regressors can grow faster than the sample size, (2) we pay particular attention to non-Gaussian, serially and cross-sectionally correlated and heteroskedastic error processes, and (3) we develop an estimation method for the high-dimensional long-run covariance matrix using a thresholded estimator. Methodologically and technically, we develop two Nagaev-type concentration inequalities: one for a partial sum and the other for a quadratic form, subject to a set of easily verifiable conditions. Leveraging these two inequalities, we also derive a non-asymptotic bound for the LASSO estimator, achieve asymptotic normality via the node-wise LASSO regression, and establish a sharp convergence rate for the thresholded heteroskedasticity and autocorrelation consistent (HAC) estimator. Our study thus provides the relevant literature with a complete toolkit for conducting inference about the parameters of interest involved in a high-dimensional panel data framework. We also demonstrate the practical relevance of these theoretical results by investigating a high-dimensional panel data model with interactive fixed effects. Moreover, we conduct extensive numerical studies using simulated and real data examples.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.00622 [pdf, other]

Causal Evaluation of Language Models

Authors: Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu

Abstract: Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development.
Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. Project website is at https://opencausalab.github.io/CaLM.

Submitted 1 May, 2024; originally announced May 2024.

Comments: 315 pages, 230 figures, 21 tables. Project website: https://opencausalab.github.io/CaLM

arXiv:2404.19264 [pdf, other]

DiffuseLoco: Real-Time Legged Locomotion Control with Diffusion from Offline Datasets

Authors: Xiaoyu Huang, Yufeng Chi, Ruofeng Wang, Zhongyu Li, Xue Bin Peng, Sophia Shao, Borivoje Nikolic, Koushil Sreenath

Abstract: This work introduces DiffuseLoco, a framework for training multi-skill diffusion-based policies for dynamic legged locomotion from offline datasets, enabling real-time control of diverse skills on robots in the real world. Offline learning at scale has led to breakthroughs in computer vision, natural language processing, and robotic manipulation domains. However, scaling up learning for legged robot locomotion, especially with multiple skills in a single policy, presents significant challenges for prior online reinforcement learning methods. To address this challenge, we propose a novel, scalable framework that leverages diffusion models to directly learn from offline multimodal datasets with a diverse set of locomotion skills. With design choices tailored for real-time control in dynamical systems, including receding horizon control and delayed inputs, DiffuseLoco is capable of reproducing multimodality in performing various locomotion skills, transfers zero-shot to real quadrupedal robots, and can be deployed on edge computing devices. Furthermore, DiffuseLoco demonstrates free transitions between skills and robustness against environmental variations. Through extensive benchmarking in real-world experiments, DiffuseLoco exhibits better stability and velocity tracking performance compared to prior reinforcement learning and non-diffusion-based behavior cloning baselines. The design choices are validated via comprehensive ablation studies.
This work opens new possibilities for scaling up learning-based legged locomotion controllers through the scaling of large, expressive models and diverse offline datasets.

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.18246 [pdf, other]

AdaFSNet: Time Series Classification Based on Convolutional Network with a Adaptive and Effective Kernel Size Configuration

Authors: Haoxiao Wang, Bo Peng, Jianhua Zhang, Xu Cheng

Abstract: Time series classification is one of the most critical and challenging problems in data mining, existing widely in various fields and holding significant research importance. Despite extensive research and notable achievements with successful real-world applications, addressing the challenge of capturing the appropriate receptive field (RF) size from one-dimensional or multi-dimensional time series of varying lengths remains a persistent issue, which greatly impacts performance and varies considerably across different datasets. In this paper, we propose an Adaptive and Effective Full-Scope Convolutional Neural Network (AdaFSNet) to enhance the accuracy of time series classification. This network includes two Dense Blocks. Particularly, it can dynamically choose a range of kernel sizes that effectively encompass the optimal RF size for various datasets by incorporating multiple prime numbers corresponding to the time series length. We also design a TargetDrop block, which can reduce redundancy while extracting a more effective RF. To assess the effectiveness of the AdaFSNet network, comprehensive experiments were conducted using the UCR and UEA datasets, which include one-dimensional and multi-dimensional time series data, respectively. Our model surpassed baseline models in terms of classification accuracy, underscoring the AdaFSNet network's efficiency and effectiveness in handling time series classification tasks.

Submitted 28 April, 2024; originally announced April 2024.

Comments: Accepted by IJCNN 2024

arXiv:2404.16807 [pdf, other]

Improving Diversity of Commonsense Generation by Large Language Models via In-Context Learning

Authors: Tianhui Zhang, Bei Peng, Danushka Bollegala

Abstract: Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge, while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing the generation quality across various tasks through in-context learning (ICL) using given examples without the need for any fine-tuning. However, the diversity aspect in LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies the LLM generations, while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between the quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 16 pages, 6 figures

arXiv:2404.16522 [pdf, other]

A Deep Learning-Driven Pipeline for Differentiating Hypertrophic Cardiomyopathy from Cardiac Amyloidosis Using 2D Multi-View Echocardiography

Authors: Bo Peng, Xiaofeng Li, Xinyu Li, Zhenghan Wang, Hui Deng, Xiaoxian Luo, Lixue Yin, Hongmei Zhang

Abstract: Hypertrophic cardiomyopathy (HCM) and cardiac amyloidosis (CA) are both heart conditions that can progress to heart failure if untreated. They exhibit similar echocardiographic characteristics, often leading to diagnostic challenges. This paper introduces a novel multi-view deep learning approach that utilizes 2D echocardiography for differentiating between HCM and CA. The method begins by classifying 2D echocardiography data into five distinct echocardiographic views: apical 4-chamber, parasternal long axis of left ventricle, parasternal short axis at levels of the mitral valve, papillary muscle, and apex. It then extracts features of each view separately and combines five features for disease classification. A total of 212 patients diagnosed with HCM, and 30 patients diagnosed with CA, along with 200 individuals with normal cardiac function (Normal), were enrolled in this study from 2018 to 2022. This approach achieved a precision, recall of 0.905, and micro-F1 score of 0.904, demonstrating its effectiveness in accurately identifying HCM and CA using a multi-view analysis.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.12253 [pdf, other]

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, Dong Yu

Abstract: Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work proposed advanced prompting techniques and the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining their responses, particularly in complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vastness of the search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM is comprised of a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results in mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.11054 [pdf, other]

Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection

Authors: Ying Zhang, Yuezun Li, Bo Peng, Jiaran Zhou, Huiyu Zhou, Junyu Dong

Abstract: The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, limiting their applicability in different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that collaborates spatial-temporal clues flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging and large-scale Video Inpainting dataset based on the YouTube-VOS dataset, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.

Submitted 6 May, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.10685 [pdf, other]

Generating Human Interaction Motions in Scenes with Text Control

Authors: Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe

Abstract: We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at https://research.nvidia.com/labs/toronto-ai/tesmo.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Project Page: https://research.nvidia.com/labs/toronto-ai/tesmo/

arXiv:2404.10099 [pdf, other]

Feature selection in linear SVMs via hard cardinality constraint: a scalable SDP decomposition approach

Authors: Immanuel Bomze, Federico D'Onofrio, Laura Palagi, Bo Peng

Abstract: In this paper, we study the embedded feature selection problem in linear Support Vector Machines (SVMs), in which a cardinality constraint is employed, leading to a fully explainable selection model. The problem is NP-hard due to the presence of the cardinality constraint, even though the original linear SVM amounts to a problem solvable in polynomial time. To handle the hard problem, we first introduce two mixed-integer formulations for which novel SDP relaxations are proposed. Exploiting the sparsity pattern of the relaxations, we decompose the problems and obtain equivalent relaxations in a much smaller cone, making the conic approaches scalable. To make the best usage of the decomposed relaxations, we propose heuristics using the information of its optimal solution. Moreover, an exact procedure is proposed by solving a sequence of mixed-integer decomposed SDPs. Numerical results on classical benchmarking datasets are reported, showing the efficiency and effectiveness of our approach.

Submitted 15 April, 2024; originally announced April 2024.

Comments: Submitted to European Journal of Operational Research. arXiv admin note: text overlap with arXiv:1808.02435 by other authors

MSC Class: 90C22; 90C11 ACM Class: I.5.1; I.2.0

arXiv:2404.09338 [pdf, other]

Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models

Authors: Souvik Das, Lifeng Jin, Linfeng Song, Haitao Mi, Baolin Peng, Dong Yu

Abstract: Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination -- generating content ungrounded in the realities of training data. Recent work has focused on decoding techniques to improve factuality during inference by leveraging LLMs' hierarchical representation of factual knowledge, manipulating the predicted distributions at inference time. Current state-of-the-art approaches refine decoding by contrasting early-exit distributions from a lower layer with the final layer to exploit information related to factuality within the model forward procedure. However, such methods often assume the final layer is the most reliable and the lower layer selection process depends on it. In this work, we first propose extrapolation of critical token probabilities beyond the last layer for more accurate contrasting. We additionally employ layer-wise entropy-guided lower layer selection, decoupling the selection process from the final layer. Experiments demonstrate strong performance - surpassing state-of-the-art on multiple different datasets by large margins. Analyses show different kinds of prompts respond to different selection strategies.

Submitted 14 April, 2024; originally announced April 2024.

Comments: Work in Progress

arXiv:2404.08549 [pdf]

Benchmarking the Cell Image Segmentation Models Robustness under the Microscope Optical Aberrations

Authors: Boyuan Peng, Jiaju Chen, Qihui Ye, Minjiang Chen, Peiwu Qin, Chenggang Yan, Dongmei Yu, Zhenglin Chen

Abstract: Cell segmentation is essential in biomedical research for analyzing cellular morphology and behavior. Deep learning methods, particularly convolutional neural networks (CNNs), have revolutionized cell segmentation by extracting intricate features from images. However, the robustness of these methods under microscope optical aberrations remains a critical challenge. This study comprehensively evaluates the performance of cell instance segmentation models under simulated aberration conditions using the DynamicNuclearNet (DNN) and LIVECell datasets. Aberrations, including Astigmatism, Coma, Spherical, and Trefoil, were simulated using Zernike polynomial equations. Various segmentation models, such as Mask R-CNN with different network heads (FPN, C3) and backbones (ResNet, VGG19, SwinS), were trained and tested under aberrated conditions. Results indicate that FPN combined with SwinS demonstrates superior robustness in handling simple cell images affected by minor aberrations. Conversely, Cellpose2.0 proves effective for complex cell images under similar conditions. Our findings provide insights into selecting appropriate segmentation models based on cell morphology and aberration severity, enhancing the reliability of cell segmentation in biomedical applications. Further research is warranted to validate these methods with diverse aberration types and emerging segmentation models. Overall, this research aims to guide researchers in effectively utilizing cell segmentation models in the presence of minor optical aberrations.

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2404.08365 [pdf, other]

Estimation and Inference for Three-Dimensional Panel Data Models

Authors: Guohua Feng, Jiti Gao, Fei Liu, Bin Peng

Abstract: Hierarchical panel data models have recently garnered significant attention. This study contributes to the relevant literature by introducing a novel three-dimensional (3D) hierarchical panel data model, which integrates panel regression with three sets of latent factor structures: one set of global factors and two sets of local factors. Instead of aggregating latent factors from various nodes, as seen in the literature of distributed principal component analysis (PCA), we propose an estimation approach capable of recovering the parameters of interest and disentangling latent factors at different levels and across different dimensions. We establish an asymptotic theory and provide a bootstrap procedure to obtain inference for the parameters of interest while accommodating various types of cross-sectional dependence and time series autocorrelation. Finally, we demonstrate the applicability of our framework by examining productivity convergence in manufacturing industries worldwide.

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2404.08341 [pdf, other]

Counterfactual Explanations for Face Forgery Detection via Adversarial Removal of Artifacts

Authors: Yang Li, Songlin Yang, Wei Wang, Ziwen He, Bo Peng, Jing Dong

Abstract: Highly realistic AI-generated face forgeries known as deepfakes have raised serious social concerns. Although DNN-based face forgery detection models have achieved good performance, they are vulnerable to the latest generative methods that have fewer forgery traces and to adversarial attacks. This limitation of generalization and robustness hinders the credibility of detection results and requires more explanations. In this work, we provide counterfactual explanations for face forgery detection from an artifact removal perspective. Specifically, we first invert the forgery images into the StyleGAN latent space, and then adversarially optimize their latent representations with the discrimination supervision from the target detection model. We verify the effectiveness of the proposed explanations from two aspects: (1) Counterfactual Trace Visualization: the enhanced forgery images are useful to reveal artifacts by visually contrasting the original images and two different visualization methods; (2) Transferable Adversarial Attacks: the adversarial forgery images generated by attacking the detection model are able to mislead other detection models, implying the removed artifacts are general. Extensive experiments demonstrate that our method achieves over 90% attack success rate and superior attack transferability. Compared with naive adversarial noise methods, our method adopts both generative and discriminative model priors, and optimizes the latent representations in a synthesis-by-analysis way, which forces the search of counterfactual explanations on the natural face manifold. Thus, more general counterfactual traces can be found and better adversarial attack transferability can be achieved.

Submitted 12 April, 2024; originally announced April 2024.

Comments: Accepted to ICME2024

arXiv:2404.07470 [pdf, other]

Scalable Language Model with Generalized Continual Learning

Authors: Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, Jiaya Jia

Abstract: Continual learning has gained increasing importance as it facilitates the acquisition and refinement of scalable knowledge and skills in language models. However, existing methods typically encounter strict limitations and challenges in real-world scenarios, such as reliance on experience replay, optimization constraints, and inference task-ID. In this study, we introduce the Scalable Language Model (SLM) to overcome these limitations within a more challenging and generalized setting, representing a significant advancement toward practical applications for continual learning. Specifically, we propose the Joint Adaptive Re-Parameterization (JARe), integrated with Dynamic Task-related Knowledge Retrieval (DTKR), to enable adaptive adjustment of language models based on specific downstream tasks. This approach leverages the task distribution within the vector space, aiming to achieve a smooth and effortless continual learning process. Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting. Moreover, while prior research primarily focused on a single task type such as classification, our study goes beyond, with the large language model, i.e., LLaMA-2, to explore the effects across diverse domains and task types, such that a single language model can be decently scaled to broader applications.

Submitted 11 April, 2024; originally announced April 2024.

Comments: The Twelfth International Conference on Learning Representations

arXiv:2404.05892 [pdf, other]

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Authors: Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, Kranthi Kiran GV, Jan Kocoń, Bartłomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Stanisław Woźniak, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, et al. (3 additional authors not shown)

Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: https://huggingface.co/RWKV Training code at: https://github.com/RWKV/RWKV-LM Inference code at: https://github.com/RWKV/ChatRWKV Time-parallel training code at: https://github.com/RWKV/RWKV-infctx-trainer

Submitted 10 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.04875 [pdf, other]

NeRF2Points: Large-Scale Point Cloud Generation From Street Views' Radiance Field Optimization

Authors: Peng Tu, Xun Zhou, Mingming Wang, Xiaojun Yang, Bo Peng, Ping Chen, Xiu Su, Yawen Huang, Yefeng Zheng, Chang Xu

Abstract: Neural Radiance Fields (NeRF) have emerged as a paradigm-shifting methodology for the photorealistic rendering of objects and environments, enabling the synthesis of novel viewpoints with remarkable fidelity. This is accomplished through the strategic utilization of object-centric camera poses characterized by significant inter-frame overlap. This paper explores a compelling, alternative utility of NeRF: the derivation of point clouds from aggregated urban landscape imagery. The transmutation of street-view data into point clouds is fraught with complexities, attributable to a nexus of interdependent variables. First, high-quality point cloud generation hinges on precise camera poses, yet many datasets suffer from inaccuracies in pose metadata. Also, the standard approach of NeRF is ill-suited for the distinct characteristics of street-view data from autonomous vehicles in vast, open settings. Autonomous vehicle cameras often record with limited overlap, leading to blurring, artifacts, and compromised pavement representation in NeRF-based point clouds. In this paper, we present NeRF2Points, a tailored NeRF variant for urban point cloud synthesis, notable for its high-quality output from RGB inputs alone. Our paper is supported by a bespoke, high-resolution 20-kilometer urban street dataset, designed for point cloud generation and evaluation. NeRF2Points adeptly navigates the inherent challenges of NeRF-based point cloud synthesis through the implementation of the following strategic innovations: (1) Integration of Weighted Iterative Geometric Optimization (WIGO) and Structure from Motion (SfM) for enhanced camera pose accuracy, elevating street-view data precision. (2) Layered Perception and Integrated Modeling (LPiM) is designed for distinct radiance field modeling in urban environments, resulting in coherent point cloud representations.

Submitted 7 April, 2024; originally announced April 2024.

Comments: 18 pages

arXiv:2404.04062 [pdf, other]

Derivative-free tree optimization for complex systems

Authors: Ye Wei, Bo Peng, Ruiwen Xie, Yangtao Chen, Yu Qin, Peng Wen, Stefan Bauer, Po-Yen Tung

Abstract: A tremendous range of design tasks in materials, physics, and biology can be formulated as finding the optimum of an objective function depending on many parameters without knowing its closed-form expression or the derivative. Traditional derivative-free optimization techniques often rely on strong assumptions about objective functions, thereby failing at optimizing non-convex systems beyond 100 dimensions. Here, we present a tree search method for derivative-free optimization that enables accelerated optimal design of high-dimensional complex systems. Specifically, we introduce stochastic tree expansion, dynamic upper confidence bound, and short-range backpropagation mechanism to evade local optima, iteratively approximating the global optimum using machine learning models. This development effectively confronts the dimensionally challenging problems, achieving convergence to global optima across various benchmark functions up to 2,000 dimensions, surpassing the existing methods by 10- to 20-fold. Our method demonstrates applicability to a wide range of real-world complex systems spanning materials, physics, and biology, considerably outperforming state-of-the-art algorithms. This enables efficient autonomous knowledge discovery and facilitates self-driving virtual laboratories. Although we focus on problems within the realm of natural science, the advancements in optimization techniques achieved herein are applicable to a broader spectrum of challenges across all quantitative disciplines.

Submitted 5 April, 2024; originally announced April 2024.

Comments: 39 pages, 3 figures

arXiv:2404.02905 [pdf, other]

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang

Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline by improving Frechet inception distance (FID) from 18.65 to 1.73, inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.

Submitted 10 June, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

Comments: Demo website: https://var.vision/

arXiv:2404.00230 [pdf, other]

Latent Watermark: Inject and Detect Watermarks in Latent Diffusion Space

Authors: Zheling Meng, Bo Peng, Jing Dong

Abstract: Watermarking is a tool for actively identifying and attributing the images generated by latent diffusion models. Existing methods face the dilemma of image quality and watermark robustness. Watermarks with superior image quality usually have inferior robustness against attacks such as blurring and JPEG compression, while watermarks with superior robustness usually significantly damage image quality. This dilemma stems from the traditional paradigm where watermarks are injected and detected in pixel space, relying on pixel perturbation for watermark detection and resilience against attacks. In this paper, we highlight that an effective solution to the problem is to both inject and detect watermarks in the latent diffusion space, and propose Latent Watermark (LW) with a progressive training strategy. It weakens the direct connection between quality and robustness and thus alleviates their contradiction. We conduct evaluations on two datasets and against 10 watermark attacks. Six metrics measure the image quality and watermark robustness. Results show that compared to the recently proposed methods such as StegaStamp, StableSignature, RoSteALS, and TreeRing, LW not only surpasses them in terms of robustness but also offers superior image quality. Our code will be available at https://github.com/RichardSunnyMeng/LatentWatermark.

Submitted 11 July, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

arXiv:2404.00205 [pdf, other]

Conceptual and Unbiased Reasoning in Language Models

Authors: Ben Zhou, Hongming Zhang, Sihao Chen, Dian Yu, Hongwei Wang, Baolin Peng, Dan Roth, Dong Yu

Abstract: Conceptual reasoning, the ability to reason in abstract and high-level perspectives, is key to generalization in human cognition. However, limited study has been done on large language models' capability to perform conceptual reasoning. In this work, we bridge this gap and propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions and generate solutions in a verifiable symbolic space. Using this framework as an analytical tool, we show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks compared to direct inference methods. We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making. We propose two techniques to add trustworthy induction signals by generating familiar questions with similar underlying reasoning paths and asking models to perform self-refinement. Experiments show that our proposed techniques improve models' conceptual reasoning performance by 8% to 11%, achieving a more robust reasoning system that relies less on inductive biases.

Submitted 29 March, 2024; originally announced April 2024.

Comments: Preprint under review

arXiv:2403.17326 [pdf]

Unveiling the origin of unconventional moire ferroelectricity

Authors: Ruirui Niu, Zhuoxian Li, Xiangyan Han, Qianling Liu, Zhuangzhuang Qu, Zhiyu Wang, Chunrui Han, Kenji Watanabe, Takashi Taniguchi, Kaihui Liu, Jinhai Mao, Wu Shi, Bo Peng, Zheng Vitto Han, Zizhao Gan, Jianming Lu

Abstract: Interfacial ferroelectricity emerges in heterostructures consisting of nonpolar van der Waals (vdW) layers, greatly expanding the scope of two dimensional ferroelectrics. In particular, the unconventional moire ferroelectricity observed in bilayer graphene/boron nitride (BN) heterostructures, exhibits promising functionalities with topological current, superconductivity and synaptic responses. However, the debate about its mechanism - correlation driven charge transfer between two graphene layers - limits device reproducibility and hence large-scale production. Here by designing a single-layer graphene encapsulated by lattice-mismatched WSe2, we identify the ferroelectricity as stemming from - instead of graphene moire bands - the particular BN, where interfacial sliding ferroelectricity must play a role. With similar structures, multilayer twisted MoS2 is found to reproduce the ferroelectricity. The key is a conductive moire ferroelectric, where the screened gate and the pinned domain wall together result in unchanged electronic states, i.e. anomalous screening. The intimate connection to interfacial sliding ferroelectricity thus provides advantages of diverse choices of constituent materials and robust polarization switching while preserving the unique anomalous screening, paving the way to reproducible and reliable memory-based devices in artificial intelligence.

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.17137 [pdf, other]

Superlattice induced electron percolation within a single Landau level

Authors: Nilanjan Roy, Bo Peng, Bo Yang

Abstract: We investigate the quantum Hall effect in a single Landau level in the presence of a square superlattice of $\delta$-function potentials. The interplay between the superlattice spacing $a_s$ and the magnetic length $\ell_B$ in clean system leads to three interesting characteristic regimes corresponding to $a_s < \ell_B$, $a_s \gg \ell_B$ and the intermediate one where $a_s \sim \ell_B$. In the intermediate regime, the continuous magnetic translation symmetry breaks down to discrete lattice symmetry. In contrast, we show that in the other two regimes, the same is hardly broken in the topological band despite the presence of the superlattice. In the presence of weak disorder (white-noise) one typically expects a tiny fraction of extended states due to topological protection of the Landau level. Interestingly, we obtain a large fraction of extended states throughout the intermediate regime which maximizes at the special point $a_s = \sqrt{2\pi} \ell_B$. We argue the superlattice induced percolation phenomenon requires both the breaking of the time reversal symmetry and the continuous magnetic translational symmetry. It could have a direct implication on the integer plateau transitions in both continuous quantum Hall systems and the lattice based anomalous quantum Hall effect.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 5 pages, 4 figures and supplementary materials

arXiv:2403.14418 [pdf, other]

OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation

Authors: Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Hengshuang Zhao, Zhuotao Tian, Jiaya Jia

Abstract: The booming of 3D recognition in the 2020s began with the introduction of point cloud transformers. They quickly overwhelmed sparse CNNs and became state-of-the-art models, especially in 3D semantic segmentation. However, sparse CNNs are still valuable networks, due to their efficiency treasure, and ease of application. In this work, we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. We discover that the key credit to the performance difference is adaptivity. Specifically, we propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap. This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost. Without any self-attention modules, OA-CNNs favorably surpass point transformers in terms of accuracy in both indoor and outdoor scenes, with much less latency and memory cost. Notably, it achieves 76.1%, 78.9%, and 70.6% mIoU on ScanNet v2, nuScenes, and SemanticKITTI validation benchmarks respectively, while maintaining at most 5x better speed than transformer counterparts. This revelation highlights the potential of pure sparse CNNs to outperform transformer-related networks.

Submitted 21 March, 2024; originally announced March 2024.

Comments: CVPR 2024

Showing 1–50 of 545 results for author: Peng, B