Showing 1–50 of 179 results for author: Zeng, A

  1. arXiv:2406.12793  [pdf, other]

    cs.CL

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Authors: Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, et al. (32 additional authors not shown)

    Abstract: We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained…

    Submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2406.07221  [pdf, other]

    cs.CV

    Open-World Human-Object Interaction Detection via Multi-modal Prompts

    Authors: Jie Yang, Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang

    Abstract: In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textu…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: CVPR24. arXiv admin note: text overlap with arXiv:2305.12252

  3. arXiv:2406.01900  [pdf, other]

    cs.CV

    Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

    Authors: Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen

    Abstract: We present Follow-Your-Emoji, a diffusion-based framework for portrait animation, which animates a reference portrait with target landmark sequences. The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity. To address these challenges, Follow-Your-Emoji equ…

    Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Project Page: https://follow-your-emoji.github.io/

  4. arXiv:2405.20340  [pdf, other]

    cs.CV

    MotionLLM: Understanding Human Behaviors from Human Motions and Videos

    Authors: Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang

    Abstract: This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: MotionLLM version 1.0, project page see https://lhchen.top/MotionLLM

  5. arXiv:2405.16114  [pdf, other]

    cs.AI cs.CV cs.LG

    Multi-scale Quaternion CNN and BiGRU with Cross Self-attention Feature Fusion for Fault Diagnosis of Bearing

    Authors: Huanbai Liu, Fanlong Zhang, Yin Tan, Lian Huang, Yan Li, Guoheng Huang, Shenghong Luo, An Zeng

    Abstract: In recent years, deep learning has led to significant advances in bearing fault diagnosis (FD). Most techniques aim to achieve greater accuracy. However, they are sensitive to noise and lack robustness, resulting in insufficient domain adaptation and anti-noise ability. The comparison of studies reveals that giving equal attention to all features does not differentiate their significance. In this…

    Submitted 25 May, 2024; originally announced May 2024.

  6. arXiv:2404.13527  [pdf, other]

    cs.GT math.CO

    On the structure of envy-free orientations on graphs

    Authors: Jinghan A Zeng, Ruta Mehta

    Abstract: Fair division is the problem of allocating a set of items among agents in a fair manner. One of the most sought-after fairness notions is envy-freeness (EF), requiring that no agent envies another's allocation. When items are indivisible, it ceases to exist, and envy-freeness up to any good (EFX) emerged as one of its strongest relaxations. The existence of EFX allocations is arguably the biggest…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: 12 pages, 4 figures

  7. arXiv:2404.03570  [pdf, other]

    cs.RO

    Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity

    Authors: Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, Vikas Sindhwani

    Abstract: We present an embodied AI system which receives open-ended natural language instructions from a human, and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state-of-the-art Large Language Models for task planning, Vision-Language models for semantic perception, and Point Cloud transformers for grasping. With sem…

    Submitted 4 April, 2024; originally announced April 2024.

  8. arXiv:2404.02893  [pdf, other]

    cs.CL

    ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

    Authors: Yifan Xu, Xiao Liu, Xinghan Liu, Zhenyu Hou, Yueyan Li, Xiaohan Zhang, Zihan Wang, Aohan Zeng, Zhengxiao Du, Wenyi Zhao, Jie Tang, Yuxiao Dong

    Abstract: Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets to enhance LLMs' mathematics are developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor…

    Submitted 3 April, 2024; originally announced April 2024.

  9. arXiv:2404.00934  [pdf, other]

    cs.CL

    ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback

    Authors: Zhenyu Hou, Yilin Niu, Zhengxiao Du, Xiaohan Zhang, Xiao Liu, Aohan Zeng, Qinkai Zheng, Minlie Huang, Hongning Wang, Jie Tang, Yuxiao Dong

    Abstract: ChatGLM is a free-to-use AI service powered by the ChatGLM family of large language models (LLMs). In this paper, we present the ChatGLM-RLHF pipeline -- a reinforcement learning from human feedback (RLHF) system -- designed to enhance ChatGLM's alignment with human preferences. ChatGLM-RLHF encompasses three major components: the collection of human preference data, the training of the reward mod…

    Submitted 3 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  10. arXiv:2403.17934  [pdf, other]

    cs.CV

    AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

    Authors: Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, Zhongang Cai

    Abstract: Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves the human body, hand, and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Homepage: https://ttxskk.github.io/AiOS/

  11. arXiv:2403.15796  [pdf, other]

    cs.CL cs.AI cs.LG

    Understanding Emergent Abilities of Language Models from the Loss Perspective

    Authors: Zhengxiao Du, Aohan Zeng, Yuxiao Dong, Jie Tang

    Abstract: Recent studies have put into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities and 2) there is doubt on the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities in the lens of pre-…

    Submitted 30 March, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

    Comments: 18 pages, 6 figures

  12. arXiv:2403.11626  [pdf, other]

    cs.GR cs.AI cs.CV cs.MM cs.SD eess.AS

    QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

    Authors: Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li

    Abstract: The study of music-generated dance is a novel and challenging Image generation task. It aims to input a piece of music and seed motions, then generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks related to human movements and music due to their struggle in capturing the nonlinear relationship and temporal aspects. This…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted by The Visual Computer Journal

  13. arXiv:2402.11450  [pdf, other]

    cs.RO

    Learning to Learn Faster from Human Feedback with Language Model Predictive Control

    Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore, et al. (25 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for o…

    Submitted 31 May, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

  14. arXiv:2402.07872  [pdf, other]

    cs.RO cs.CL cs.CV cs.LG

    PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

    Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter

    Abstract: Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we ena…

    Submitted 12 February, 2024; originally announced February 2024.

  15. arXiv:2402.05741  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Real-World Robot Applications of Foundation Models: A Review

    Authors: Kento Kawaharazuka, Tatsuya Matsushima, Andrew Gambardella, Jiaxian Guo, Chris Paxton, Andy Zeng

    Abstract: Recent developments in foundation models, like Large Language Models (LLMs) and Vision-Language Models (VLMs), trained on extensive data, facilitate flexible application across different tasks and modalities. Their impact spans various fields, including healthcare, education, and robotics. This paper provides an overview of the practical application of foundation models in real-world robotics, wit…

    Submitted 8 February, 2024; originally announced February 2024.

  16. Generative Expressive Robot Behaviors using Large Language Models

    Authors: Karthik Mahadevan, Jonathan Chien, Noah Brown, Zhuo Xu, Carolina Parada, Fei Xia, Andy Zeng, Leila Takayama, Dorsa Sadigh

    Abstract: People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying "excuse me" to pass people in a busy corridor. We would like robots to also demonstrate expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalitie…

    Submitted 30 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

  17. arXiv:2401.14159  [pdf, other]

    cs.CV

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Authors: Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang

    Abstract: We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline.…

    Submitted 25 January, 2024; originally announced January 2024.

  18. arXiv:2401.10215  [pdf, other]

    cs.CV

    GPAvatar: Generalizable and Precise Head Avatar from Image(s)

    Authors: Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, Tatsuya Harada

    Abstract: Head avatar reconstruction, crucial for applications in virtual reality, online meetings, gaming, and film industries, has garnered substantial attention within the computer vision community. The fundamental objective of this field is to faithfully recreate the head avatar and precisely control expressions and postures. Existing methods, categorized into 2D-based warping, mesh-based, and neural re…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: ICLR 2024, code is available at https://github.com/xg-chu/GPAvatar

  19. arXiv:2401.07937  [pdf, other]

    q-bio.GN cs.LG q-bio.QM

    Integrate Any Omics: Towards genome-wide data integration for patient stratification

    Authors: Shihao Ma, Andy G. X. Zeng, Benjamin Haibe-Kains, Anna Goldenberg, John E Dick, Bo Wang

    Abstract: High-throughput omics profiling advancements have greatly enhanced cancer patient stratification. However, incomplete data in multi-omics integration presents a significant challenge, as traditional methods like sample exclusion or imputation often compromise biological diversity and dependencies. Furthermore, the critical task of accurately classifying new patients with partial omics data into ex…

    Submitted 15 January, 2024; originally announced January 2024.

  20. arXiv:2401.06761  [pdf, other]

    cs.CL

    APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

    Authors: Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, Yuxiao Dong

    Abstract: The massive adoption of large language models (LLMs) demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges to achieve efficient serving. In this work, we introduce a parallel auto-regressive generation method. By instruct-tuning on general domain data that contains hierarchical structures, we en…

    Submitted 12 January, 2024; originally announced January 2024.

    Comments: 14 pages

  21. arXiv:2401.06199  [pdf, other]

    q-bio.QM cs.AI cs.LG

    xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

    Authors: Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song

    Abstract: Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of…

    Submitted 11 January, 2024; originally announced January 2024.

  22. arXiv:2401.04747  [pdf, other]

    cs.SD cs.AI cs.CV cs.GR eess.AS

    DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

    Authors: Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, Qifeng Chen

    Abstract: We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-…

    Submitted 6 April, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

    Comments: Accepted by CVPR 2024. Project page: https://jeremycjm.github.io/proj/DiffSHEG

  23. arXiv:2312.05541  [pdf, other]

    cs.CV

    DPoser: Diffusion Model as Robust 3D Human Pose Prior

    Authors: Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Yulun Zhang, Haoqian Wang

    Abstract: This work targets to construct a robust human pose prior. However, it remains a persistent challenge due to biomechanical constraints and diverse human movements. Traditional priors like VAEs and NDFs often exhibit shortcomings in realism and generalization, notably with unseen noisy poses. To address these issues, we introduce DPoser, a robust and versatile human pose prior built upon diffusion m…

    Submitted 23 March, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

    Comments: Project Page: https://dposer.github.io; Code Released: https://github.com/moonbow721/DPoser

  24. arXiv:2312.04474  [pdf, other]

    cs.CL cs.AI cs.LG cs.RO

    Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

    Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter

    Abstract: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an…

    Submitted 7 December, 2023; v1 submitted 7 December, 2023; originally announced December 2023.

  25. arXiv:2312.04393  [pdf, other]

    cs.CV cs.GR cs.RO

    PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction

    Authors: Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, Lei Zhang

    Abstract: Humans interact with objects all the time. Enabling a humanoid to learn human-object interaction (HOI) is a key step for future smart animation and intelligent robotics systems. However, recent progress in physics-based HOI requires carefully designed task-specific rewards, making the system unscalable and labor-intensive. This work focuses on dynamic HOI imitation: teaching humanoid dynamic inter…

    Submitted 7 December, 2023; originally announced December 2023.

  26. arXiv:2311.18702  [pdf, other]

    cs.CL cs.AI

    CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

    Authors: Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

    Abstract: Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise c…

    Submitted 26 June, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: Accepted by ACL 2024 (Main Conference)

  27. arXiv:2311.17954  [pdf, other]

    cs.CV

    Transformer-empowered Multi-modal Item Embedding for Enhanced Image Search in E-Commerce

    Authors: Chang Liu, Peng Hou, Anxiang Zeng, Han Yu

    Abstract: Over the past decade, significant advances have been made in the field of image search for e-commerce applications. Traditional image-to-image retrieval models, which focus solely on image details such as texture, tend to overlook useful semantic information contained within the images. As a result, the retrieved products might possess similar image details, but fail to fulfil the user's search go…

    Submitted 8 February, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: Accepted by IAAI 2024

  28. arXiv:2311.10678  [pdf, other]

    cs.RO cs.AI cs.LG

    Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections

    Authors: Lihan Zha, Yuchen Cui, Li-Heng Lin, Minae Kwon, Montserrat Gonzalez Arenas, Andy Zeng, Fei Xia, Dorsa Sadigh

    Abstract: Today's robot policies exhibit subpar performance when faced with the challenge of generalizing to novel environments. Human corrective feedback is a crucial form of guidance to enable such generalization. However, adapting to and learning from online human corrections is a non-trivial endeavor: not only do robots need to remember human feedback over time to retrieve the right information in new s…

    Submitted 21 March, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: 8 pages, 4 figures, videos and code links on website https://sites.google.com/stanford.edu/droc

  29. arXiv:2310.12978  [pdf, other]

    cs.CV

    HumanTOMATO: Text-aligned Whole-body Motion Generation

    Authors: Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, Heung-Yeung Shum

    Abstract: This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation tasks mainly have two limitations: they ignore the key role of fine-grained hand and face controlling i…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: 31 pages, 15 figures, 16 tables. Project page: https://lhchen.top/HumanTOMATO

  30. arXiv:2310.12823  [pdf, other]

    cs.CL cs.AI cs.LG

    AgentTuning: Enabling Generalized Agent Abilities for LLMs

    Authors: Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, Jie Tang

    Abstract: Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitatin…

    Submitted 22 October, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: 31 pages

  31. arXiv:2310.10625  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Video Language Planning

    Authors: Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson

    Abstract: We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: https://video-language-planning.github.io/

  32. arXiv:2310.08530  [pdf, other]

    cs.CV

    UniPose: Detecting Any Keypoints

    Authors: Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang

    Abstract: This work proposes a unified framework called UniPose to detect keypoints of any articulated (e.g., human and animal), rigid, and soft objects via visual or textual prompts for fine-grained vision understanding and manipulation. Keypoint is a structure-aware, pixel-level, and compact representation of any object, especially articulated objects. Existing fine-grained promptable tasks mainly focus o…

    Submitted 12 October, 2023; originally announced October 2023.

  33. arXiv:2310.04189  [pdf, other]

    cs.CV

    Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases

    Authors: Xinpeng Liu, Yong-Lu Li, Ailing Zeng, Zizheng Zhou, Yang You, Cewu Lu

    Abstract: The goal of motion understanding is to establish a reliable mapping between motion and action semantics, while it is a challenging many-to-many problem. An abstract action semantic (i.e., walk forwards) could be conveyed by perceptually diverse motions (walk with arms up or swinging), while a motion could carry different semantics w.r.t. its context and intention. This makes an elegant mapping bet…

    Submitted 11 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Yong-Lu Li and Cewu Lu are the corresponding authors. Project page is available at https://foruck.github.io/KP/

  34. arXiv:2310.01506  [pdf, other]

    cs.CV

    Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

    Authors: Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu

    Abstract: Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separa…

    Submitted 19 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

  35. arXiv:2309.17448  [pdf, other]

    cs.CV

    SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

    Authors: Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu

    Abstract: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and tra…

    Submitted 30 October, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: Homepage: https://caizhongang.github.io/projects/SMPLer-X/

  36. arXiv:2309.12694  [pdf, other]

    cs.LG cs.SI

    Recurrent Temporal Revision Graph Networks

    Authors: Yizhou Chen, Anxiang Zeng, Guangda Huzhang, Qingtao Yu, Kerui Zhang, Cao Yuanpeng, Kangle Wu, Han Yu, Zhiming Zhou

    Abstract: Temporal graphs offer more accurate modeling of many real-world scenarios than static graphs. However, neighbor aggregation, a critical building block of graph networks, for temporal graphs, is currently straightforwardly extended from that of static graphs. It can be computationally expensive when involving all historical neighbors during such aggregation. In practice, typically only a subset of…

    Submitted 25 September, 2023; v1 submitted 22 September, 2023; originally announced September 2023.

  37. arXiv:2309.05073  [pdf, other]

    cs.CV

    FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

    Authors: Jiong Wang, Fengyu Yang, Wenbo Gou, Bingliang Li, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Yanqing Jing, Ruimao Zhang

    Abstract: Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. 3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction, serving as a crucial technique for understanding and interacting with human actions in real-world settings. However, the current datasets, often collected under single laboratory condition… ▽ More

    Submitted 3 April, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

    Comments: CVPR2024 camera ready version. 19 pages, 16 figures. Project page: https://wangjiongw.github.io/freeman/ ; API: https://github.com/wangjiongw/FreeMan_API

  38. arXiv:2308.14508  [pdf, other]

    cs.CL

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Authors: Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

    Abstract: Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanis…

    Submitted 19 June, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: ACL 2024

  39. arXiv:2308.10174  [pdf, other]

    cs.CV

    Neural Interactive Keypoint Detection

    Authors: Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, Lei Zhang

    Abstract: This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which can significantly reduce more than 10 times labeling costs of 2D keypoint annotation compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: Accepted to ICCV 2023

  40. arXiv:2308.03688  [pdf, other]

    cs.AI cs.CL cs.LG

    AgentBench: Evaluating LLMs as Agents

    Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

    Abstract: Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Age…

    Submitted 25 October, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: 55 pages

  41. arXiv:2307.15880  [pdf, other]

    cs.CV

    Effective Whole-body Pose Estimation with Two-stages Distillation

    Authors: Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li

    Abstract: Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image. This task is challenging due to multi-scale body parts, fine-grained localization for low-resolution regions, and data scarcity. Meanwhile, applying a highly efficient and accurate pose estimator to widely human-centric understanding and generation tasks is urgent. In this work, we present a two-stage…

    Submitted 24 August, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023, CV4Metaverse Workshop

  42. arXiv:2307.04721  [pdf, other]

    cs.AI cs.CL cs.RO

    Large Language Models as General Pattern Machines

    Authors: Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng

    Abstract: We observe that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences -- from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFG), to more rich spatial patterns found in the Abstraction and Reasoning Corpus (ARC), a general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern completion profici…

    Submitted 25 October, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: 21 pages, 25 figures. To appear at Conference on Robot Learning (CoRL) 2023

  43. arXiv:2307.03756  [pdf, other]

    cs.LG

    FITS: Modeling Time Series with $10k$ Parameters

    Authors: Zhijian Xu, Ailing Zeng, Qiang Xu

    Abstract: In this paper, we introduce FITS, a lightweight yet powerful model for time series analysis. Unlike existing models that directly process raw time-domain data, FITS operates on the principle that time series can be manipulated through interpolation in the complex frequency domain. By discarding high-frequency components with negligible impact on time series data, FITS achieves performance comparab…

    Submitted 5 January, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

  44. arXiv:2307.01928  [pdf, other]

    cs.RO cs.AI stat.AP

    Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners

    Authors: Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, Anirudha Majumdar

    Abstract: Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. In this work, we present KnowNo, which is a framework for measuring and aligning the uncertainty of LLM-based planners such that they know when they don't know and ask for…

    Submitted 4 September, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

    Comments: Conference on Robot Learning (CoRL) 2023, Oral Presentation

  45. arXiv:2307.00818  [pdf, other]

    cs.CV

    Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

    Authors: Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang

    Abstract: In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome…

    Submitted 26 January, 2024; v1 submitted 3 July, 2023; originally announced July 2023.

    Comments: Accepted by NeurIPS 2023; A large-scale 3D whole-body human motion-text dataset; GitHub: https://github.com/IDEA-Research/Motion-X

  46. arXiv:2307.00206  [pdf, other]

    cs.RO cs.AI

    Rearrangement Planning for General Part Assembly

    Authors: Yulong Li, Andy Zeng, Shuran Song

    Abstract: Most successes in autonomous robotic assembly have been restricted to single target or category. We propose to investigate general part assembly, the task of creating novel target assemblies with unseen part shapes. As a fundamental step to a general part assembly system, we tackle the task of determining the precise poses of the parts in the target assembly, which we term ``rearrangement plann…

    Submitted 2 September, 2023; v1 submitted 30 June, 2023; originally announced July 2023.

    Comments: Project website: https://general-part-assembly.github.io/

  47. arXiv:2306.08647  [pdf, other]

    cs.RO cs.AI cs.LG

    Language to Rewards for Robotic Skill Synthesis

    Authors: Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, Yuval Tassa, Fei Xia

    Abstract: Large language models (LLMs) have demonstrated exciting progress in acquiring diverse new capabilities through in-context learning, ranging from logical reasoning to code-writing. Robotics researchers have also explored using LLMs to advance the capabilities of robotic control. However, since low-level robot actions are hardware-dependent and underrepresented in LLM training corpora, existing effo…

    Submitted 16 June, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: https://language-to-reward.github.io/

  48. arXiv:2306.07906  [pdf, other]

    cs.CL cs.AI

    WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences

    Authors: Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, Jie Tang

    Abstract: We present WebGLM, a web-enhanced question-answering system based on the General Language Model (GLM). Its goal is to augment a pre-trained large language model (LLM) with web search and retrieval capabilities while being efficient for real-world deployments. To achieve this, we develop WebGLM with strategies for the LLM-augmented retriever, bootstrapped generator, and human preference-aware score…

    Submitted 13 June, 2023; originally announced June 2023.

    Comments: Accepted to KDD 2023

  49. arXiv:2306.07265  [pdf, other]

    cs.CV

    detrex: Benchmarking Detection Transformers

    Authors: Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang

    Abstract: The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the current field lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight co…

    Submitted 13 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: project link: https://github.com/IDEA-Research/detrex

  50. arXiv:2306.05392  [pdf, other]

    cs.CL

    Modular Visual Question Answering via Code Generation

    Authors: Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

    Abstract: We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the o…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: ACL 2023