subscribe to arXiv mailings

RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Authors: Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang

Abstract: This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundan… ▽ More This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data, human-object interaction (HOI) data, and custom data to construct a comprehensive affordance memory. Then given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers such out-of-domain 2D affordance to in-domain 3D executable affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that our RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please check our website at https://yxkryptonite.github.io/RAM/. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.02263 [pdf, other]

FreeCG: Free the Design Space of Clebsch-Gordan Transform for Machine Learning Force Field

Authors: Shihao Shao, Haoran Geng, Qinghua Cui

Abstract: The Clebsch-Gordan Transform (CG transform) effectively encodes many-body interactions. Many studies have proven its accuracy in depicting atomic environments, although this comes with high computational needs. The computational burden of this challenge is hard to reduce due to the need for permutation equivariance, which limits the design space of the CG transform layer. We show that, implementin… ▽ More The Clebsch-Gordan Transform (CG transform) effectively encodes many-body interactions. Many studies have proven its accuracy in depicting atomic environments, although this comes with high computational needs. The computational burden of this challenge is hard to reduce due to the need for permutation equivariance, which limits the design space of the CG transform layer. We show that, implementing the CG transform layer on permutation-invariant inputs allows complete freedom in the design of this layer without affecting symmetry. Developing further on this premise, our idea is to create a CG transform layer that operates on permutation-invariant abstract edges generated from real edge information. We bring in group CG transform with sparse path, abstract edges shuffling, and attention enhancer to form a powerful and efficient CG transform layer. Our method, known as FreeCG, achieves State-of-The-Art (SoTA) results in force prediction for MD17, rMD17, MD22, and property prediction in QM9 datasets with notable enhancement. It introduces a novel paradigm for carrying out efficient and expressive CG transform in future geometric neural network designs. △ Less

Submitted 14 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

arXiv:2406.19776 [pdf, other]

MDF: A Dynamic Fusion Model for Multi-modal Fake News Detection

Authors: Hongzhen Lv, Wenzhong Yang, Fuyuan Wei, Jiaren Peng, Haokun Geng

Abstract: Fake news detection has received increasing attention from researchers in recent years, especially multi-modal fake news detection containing both text and images. However, many previous works have fed two modal features, text and image, into a binary classifier after a simple concatenation or attention mechanism, in which the features contain a large amount of noise inherent in the data,which in… ▽ More Fake news detection has received increasing attention from researchers in recent years, especially multi-modal fake news detection containing both text and images. However, many previous works have fed two modal features, text and image, into a binary classifier after a simple concatenation or attention mechanism, in which the features contain a large amount of noise inherent in the data,which in turn leads to intra- and inter-modal uncertainty. In addition, although many methods based on simply splicing two modalities have achieved more prominent results, these methods ignore the drawback of holding fixed weights across modalities, which would lead to some features with higher impact factors being ignored. To alleviate the above problems, we propose a new dynamic fusion framework dubbed MDF for fake news detection. As far as we know, it is the first attempt of dynamic fusion framework in the field of fake news detection. Specifically, our model consists of two main components:(1) UEM as an uncertainty modeling module employing a multi-head attention mechanism to model intra-modal uncertainty; and (2) DFN is a dynamic fusion module based on D-S evidence theory for dynamically fusing the weights of two modalities, text and image. In order to present better results for the dynamic fusion framework, we use GAT for inter-modal uncertainty and weight modeling before DFN. Extensive experiments on two benchmark datasets demonstrate the effectiveness and superior performance of the MDF framework. We also conducted a systematic ablation study to gain insight into our motivation and architectural design. We make our model publicly available to:https://github.com/CoisiniStar/MDF △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2404.17521 [pdf, other]

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Authors: Puhao Li, Tengyu Liu, Yuyang Li, Muzhi Han, Haoran Geng, Shu Wang, Yixin Zhu, Song-Chun Zhu, Siyuan Huang

Abstract: Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task repre… ▽ More Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip's empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments. △ Less

Submitted 26 April, 2024; originally announced April 2024.

Comments: Project website and open-source code: https://xiaoyao-li.github.io/research/ag2manip

arXiv:2402.17766 [pdf, other]

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Authors: Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma

Abstract: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point clo… ▽ More This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/ △ Less

Submitted 12 July, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted at ECCV 2024

arXiv:2402.15824 [pdf, other]

A New Secure Memory System for Efficient Data Protection and Access Pattern Obfuscation

Authors: Haoran Geng, Yuezhi Che, Aaron Dingler, Michael Niemier, Xiaobo Sharon Hu

Abstract: As the reliance on secure memory environments permeates across applications, memory encryption is used to ensure memory security. However, most effective encryption schemes, such as the widely used AES-CTR, inherently introduce extra overheads, including those associated with counter storage and version number integrity checks. Moreover, encryption only protects data content, and it does not fully… ▽ More As the reliance on secure memory environments permeates across applications, memory encryption is used to ensure memory security. However, most effective encryption schemes, such as the widely used AES-CTR, inherently introduce extra overheads, including those associated with counter storage and version number integrity checks. Moreover, encryption only protects data content, and it does not fully address the memory access pattern leakage. While Oblivious RAM (ORAM) aims to obscure these patterns, its high performance costs hinder practical applications. We introduce Secure Scattered Memory (SSM), an efficient scheme provides a comprehensive security solution that preserves the confidentiality of data content without traditional encryption, protects access patterns, and enables efficient integrity verification. Moving away from traditional encryption-centric methods, SSM offers a fresh approach to protecting data content while eliminating counter-induced overheads. Moreover, SSM is designed to inherently obscure memory access patterns, thereby significantly enhancing the confidentiality of memory data. In addition, SSM incorporates lightweight, thus integrated mechanisms for integrity assurance, protecting against data tampering. We also introduce SSM+, an extension that adapts Path ORAM to offer even greater security guarantees for both data content and memory access patterns, demonstrating its flexibility and efficiency. Experimental results show that SSM incurs only a 10% performance overhead compared to non-protected memory and offers a 15% improvement over AES-CTR mode memory protection. Notably, SSM+ provides an 20% improvement against Path ORAM integrated with Intel SGX under the highest security guarantees. △ Less

Submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.09553 [pdf, other]

doi 10.1109/ACCESS.2024.3390089

Statistical and Machine Learning Models for Predicting Fire and Other Emergency Events

Authors: Dilli Prasad Sharma, Nasim Beigi-Mohammadi, Hongxiang Geng, Dawn Dixon, Rob Madro, Phil Emmenegger, Carlos Tobar, Jeff Li, Alberto Leon-Garcia

Abstract: Emergency events in a city cause considerable economic loss to individuals, their families, and the community. Accurate and timely prediction of events can help the emergency fire and rescue services in preparing for and mitigating the consequences of emergency events. In this paper, we present a systematic development of predictive models for various types of emergency events in the City of Edmon… ▽ More Emergency events in a city cause considerable economic loss to individuals, their families, and the community. Accurate and timely prediction of events can help the emergency fire and rescue services in preparing for and mitigating the consequences of emergency events. In this paper, we present a systematic development of predictive models for various types of emergency events in the City of Edmonton, Canada. We present methods for (i) data collection and dataset development; (ii) descriptive analysis of each event type and its characteristics at different spatiotemporal levels; (iii) feature analysis and selection based on correlation coefficient analysis and feature importance analysis; and (iv) development of prediction models for the likelihood of occurrence of each event type at different temporal and spatial resolutions. We analyze the association of event types with socioeconomic and demographic data at the neighborhood level, identify a set of predictors for each event type, and develop predictive models with negative binomial regression. We conduct evaluations at neighborhood and fire station service area levels. Our results show that the models perform well for most of the event types with acceptable prediction errors for weekly and monthly periods. The evaluation shows that the prediction accuracy is consistent at the level of the fire station, so the predictions can be used in management by fire rescue service departments for planning resource allocation for these time periods. We also examine the impact of the COVID-19 pandemic on the occurrence of events and on the accuracy of event predictor models. Our findings show that COVID-19 had a significant impact on the performance of the event prediction models. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Journal ref: IEEE Access 12(2024) 56880-56909

arXiv:2401.05459 [pdf, other]

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

Authors: Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu

Abstract: Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of smartphones and IoT, computing and sensing de… ▽ More Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of smartphones and IoT, computing and sensing devices have become ubiquitous, greatly expanding the boundaries of IPAs. However, due to the lack of capabilities such as user intent understanding, task planning, tool using, and personal data management etc., existing IPAs still have limited practicality and scalability. Recently, the emergence of foundation models, represented by large language models (LLMs), brings new opportunities for the development of IPAs. With the powerful semantic understanding and reasoning capabilities, LLM can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, which are LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step to discuss several important questions about Personal LLM Agents, including their architecture, capability, efficiency and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges to achieve intelligent, efficient and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges. △ Less

Submitted 8 May, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: https://github.com/MobileLLM/Personal_LLM_Agents_Survey

arXiv:2312.16217 [pdf, other]

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Authors: Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, Hao Dong

Abstract: Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited category within a simulator, often struggles to achieve generalizability, especially when confronted with extensive categories. Therefore, we introduce an innovative approach for robot manipulation that levera… ▽ More Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited category within a simulator, often struggles to achieve generalizability, especially when confronted with extensive categories. Therefore, we introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation. The fundamental insight lies in the introduced fine-tuning paradigm, encompassing object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the reasoning ability of MLLM in manipulation. During inference, our approach utilizes an RGB image and text prompt to predict the end effector's pose in chain of thoughts. After the initial contact is established, an active impedance adaptation policy is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover, in real world, we design a test-time adaptation (TTA) strategy for manipulation to enable the model better adapt to the current real-world scene configuration. Experiments in simulator and real-world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm. △ Less

Submitted 24 December, 2023; originally announced December 2023.

arXiv:2312.01307 [pdf, other]

SAGE: Bridging Semantic and Actionable Parts for GEneralizable Manipulation of Articulated Objects

Authors: Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, Leonidas Guibas

Abstract: To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and physics functionalities of the parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a… ▽ More To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and physics functionalities of the parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all the semantic parts on it, conditioned on which an instruction interpreter proposes possible action programs that concretize the natural language instruction. Then, a part-grounding module maps the semantic parts into so-called Generalizable Actionable Parts (GAParts), which inherently carry information about part motion. End-effector trajectories are predicted on the GAParts, which, together with the action program, form an executable policy. Additionally, an interactive feedback module is incorporated to respond to failures, which closes the loop and increases the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter serving as expert facts. Both simulation and real-robot experiments show our effectiveness in handling a large variety of articulated objects with diverse language-instructed goals. △ Less

Submitted 30 March, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2311.09376 [pdf, other]

DISTA: Denoising Spiking Transformer with intrinsic plasticity and spatiotemporal attention

Authors: Boxun Xu, Hejia Geng, Yuxuan Yin, Peng Li

Abstract: Among the array of neural network architectures, the Vision Transformer (ViT) stands out as a prominent choice, acclaimed for its exceptional expressiveness and consistent high performance in various vision applications. Recently, the emerging Spiking ViT approach has endeavored to harness spiking neurons, paving the way for a more brain-inspired transformer architecture that thrives in ultra-low… ▽ More Among the array of neural network architectures, the Vision Transformer (ViT) stands out as a prominent choice, acclaimed for its exceptional expressiveness and consistent high performance in various vision applications. Recently, the emerging Spiking ViT approach has endeavored to harness spiking neurons, paving the way for a more brain-inspired transformer architecture that thrives in ultra-low power operations on dedicated neuromorphic hardware. Nevertheless, this approach remains confined to spatial self-attention and doesn't fully unlock the potential of spiking neural networks. We introduce DISTA, a Denoising Spiking Transformer with Intrinsic Plasticity and SpatioTemporal Attention, designed to maximize the spatiotemporal computational prowess of spiking neurons, particularly for vision applications. DISTA explores two types of spatiotemporal attentions: intrinsic neuron-level attention and network-level attention with explicit memory. Additionally, DISTA incorporates an efficient nonlinear denoising mechanism to quell the noise inherent in computed spatiotemporal attention maps, thereby resulting in further performance gains. Our DISTA transformer undergoes joint training involving synaptic plasticity (i.e., weight tuning) and intrinsic plasticity (i.e., membrane time constant tuning) and delivers state-of-the-art performances across several static image and dynamic neuromorphic datasets. With only 6 time steps, DISTA achieves remarkable top-1 accuracy on CIFAR10 (96.26%) and CIFAR100 (79.15%), as well as 79.1% on CIFAR10-DVS using 10 time steps. △ Less

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.07633 [pdf, other]

Rethinking and Benchmarking Predict-then-Optimize Paradigm for Combinatorial Optimization Problems

Authors: Haoyu Geng, Hang Ruan, Runzhong Wang, Yang Li, Yang Wang, Lei Chen, Junchi Yan

Abstract: Numerous web applications rely on solving combinatorial optimization problems, such as energy cost-aware scheduling, budget allocation on web advertising, and graph matching on social networks. However, many optimization problems involve unknown coefficients, and improper predictions of these factors may lead to inferior decisions which may cause energy wastage, inefficient resource allocation, in… ▽ More Numerous web applications rely on solving combinatorial optimization problems, such as energy cost-aware scheduling, budget allocation on web advertising, and graph matching on social networks. However, many optimization problems involve unknown coefficients, and improper predictions of these factors may lead to inferior decisions which may cause energy wastage, inefficient resource allocation, inappropriate matching in social networks, etc. Such a research topic is referred to as "Predict-Then-Optimize (PTO)" which considers the performance of prediction and decision-making in a unified system. A noteworthy recent development is the end-to-end methods by directly optimizing the ultimate decision quality which claims to yield better results in contrast to the traditional two-stage approach. However, the evaluation benchmarks in this field are fragmented and the effectiveness of various models in different scenarios remains unclear, hindering the comprehensive assessment and fast deployment of these methods. To address these issues, we provide a comprehensive categorization of current approaches and integrate existing experimental scenarios to establish a unified benchmark, elucidating the circumstances under which end-to-end training yields improvements, as well as the contexts in which it performs ineffectively. We also introduce a new dataset for the industrial combinatorial advertising problem for inclusive finance to open-source. We hope the rethinking and benchmarking of PTO could facilitate more convenient evaluation and deployment, and inspire further improvements both in the academy and industry within this field. △ Less

Submitted 19 November, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.02787 [pdf, other]

Make a Donut: Hierarchical EMD-Space Planning for Zero-Shot Deformable Manipulation with Tools

Authors: Yang You, Bokui Shen, Congyue Deng, Haoran Geng, Songlin Wei, He Wang, Leonidas Guibas

Abstract: Deformable object manipulation stands as one of the most captivating yet formidable challenges in robotics. While previous techniques have predominantly relied on learning latent dynamics through demonstrations, typically represented as either particles or images, there exists a pertinent limitation: acquiring suitable demonstrations, especially for long-horizon tasks, can be elusive. Moreover, ba… ▽ More Deformable object manipulation stands as one of the most captivating yet formidable challenges in robotics. While previous techniques have predominantly relied on learning latent dynamics through demonstrations, typically represented as either particles or images, there exists a pertinent limitation: acquiring suitable demonstrations, especially for long-horizon tasks, can be elusive. Moreover, basing learning entirely on demonstrations can hamper the model's ability to generalize beyond the demonstrated tasks. In this work, we introduce a demonstration-free hierarchical planning approach capable of tackling intricate long-horizon tasks without necessitating any training. We employ large language models (LLMs) to articulate a high-level, stage-by-stage plan corresponding to a specified task. For every individual stage, the LLM provides both the tool's name and the Python code to craft intermediate subgoal point clouds. With the tool and subgoal for a particular stage at our disposal, we present a granular closed-loop model predictive control strategy. This leverages Differentiable Physics with Point-to-Point correspondence (DiffPhysics-P2P) loss in the earth mover distance (EMD) space, applied iteratively. Experimental findings affirm that our technique surpasses multiple benchmarks in dough manipulation, spanning both short and long horizons. Remarkably, our model demonstrates robust generalization capabilities to novel and previously unencountered complex tasks without any preliminary demonstrations. We further substantiate our approach with experimental trials on real-world robotic platforms. Our project page: https://qq456cvb.github.io/projects/donut. △ Less

Submitted 24 March, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

Comments: 8 pages

arXiv:2310.01441 [pdf, other]

UPAR: A Kantian-Inspired Prompting Framework for Enhancing Large Language Model Capabilities

Authors: Hejia Geng, Boxun Xu, Peng Li

Abstract: Large Language Models (LLMs) have demonstrated impressive inferential capabilities, with numerous research endeavors devoted to enhancing this capacity through prompting. Despite these efforts, a unified epistemological foundation is still conspicuously absent. Drawing inspiration from Kant's a priori philosophy, we propose the UPAR prompting framework, designed to emulate the structure of human c… ▽ More Large Language Models (LLMs) have demonstrated impressive inferential capabilities, with numerous research endeavors devoted to enhancing this capacity through prompting. Despite these efforts, a unified epistemological foundation is still conspicuously absent. Drawing inspiration from Kant's a priori philosophy, we propose the UPAR prompting framework, designed to emulate the structure of human cognition within LLMs. The UPAR framework is delineated into four phases: "Understand", "Plan", "Act", and "Reflect", enabling the extraction of structured information from complex contexts, prior planning of solutions, execution according to plan, and self-reflection. This structure significantly augments the explainability and accuracy of LLM inference, producing a human-understandable and inspectable inferential trajectory. Furthermore, our work offers an epistemological foundation for existing prompting techniques, allowing for a possible systematic integration of these methods. With GPT-4, our approach elevates the accuracy from COT baseline of 22.92% to 58.33% in a challenging subset of GSM8K, and from 67.91% to 75.40% in the causal judgment task. Without using few-shot examples or external tools, UPAR significantly outperforms existing prompting methods on SCIBENCH, a challenging dataset containing collegiate-level mathematics, chemistry, and physics scientific problems. △ Less

Submitted 6 December, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

arXiv:2308.10373 [pdf, other]

HoSNN: Adversarially-Robust Homeostatic Spiking Neural Networks with Adaptive Firing Thresholds

Authors: Hejia Geng, Peng Li

Abstract: While spiking neural networks (SNNs) offer a promising neurally-inspired model of computation, they are vulnerable to adversarial attacks. We present the first study that draws inspiration from neural homeostasis to design a threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model and utilize TA-LIF neurons to construct the adversarially robust homeostatic SNNs (HoSNNs) for improved robus… ▽ More While spiking neural networks (SNNs) offer a promising neurally-inspired model of computation, they are vulnerable to adversarial attacks. We present the first study that draws inspiration from neural homeostasis to design a threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model and utilize TA-LIF neurons to construct the adversarially robust homeostatic SNNs (HoSNNs) for improved robustness. The TA-LIF model incorporates a self-stabilizing dynamic thresholding mechanism, offering a local feedback control solution to the minimization of each neuron's membrane potential error caused by adversarial disturbance. Theoretical analysis demonstrates favorable dynamic properties of TA-LIF neurons in terms of the bounded-input bounded-output stability and suppressed time growth of membrane potential error, underscoring their superior robustness compared with the standard LIF neurons. When trained with weak FGSM attacks (attack budget = 2/255) and tested with much stronger PGD attacks (attack budget = 8/255), our HoSNNs significantly improve model accuracy on several datasets: from 30.54% to 74.91% on FashionMNIST, from 0.44% to 35.06% on SVHN, from 0.56% to 42.63% on CIFAR10, from 0.04% to 16.66% on CIFAR100, over the conventional LIF-based SNNs. △ Less

Submitted 31 May, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

arXiv:2308.02648 [pdf, other]

Privacy Preserving In-memory Computing Engine

Authors: Haoran Geng, Jianqiao Mo, Dayane Reis, Jonathan Takeshita, Taeho Jung, Brandon Reagen, Michael Niemier, Xiaobo Sharon Hu

Abstract: Privacy has rapidly become a major concern/design consideration. Homomorphic Encryption (HE) and Garbled Circuits (GC) are privacy-preserving techniques that support computations on encrypted data. HE and GC can complement each other, as HE is more efficient for linear operations, while GC is more effective for non-linear operations. Together, they enable complex computing tasks, such as machine l… ▽ More Privacy has rapidly become a major concern/design consideration. Homomorphic Encryption (HE) and Garbled Circuits (GC) are privacy-preserving techniques that support computations on encrypted data. HE and GC can complement each other, as HE is more efficient for linear operations, while GC is more effective for non-linear operations. Together, they enable complex computing tasks, such as machine learning, to be performed exactly on ciphertexts. However, HE and GC introduce two major bottlenecks: an elevated computational overhead and high data transfer costs. This paper presents PPIMCE, an in-memory computing (IMC) fabric designed to mitigate both computational overhead and data transfer issues. Through the use of multiple IMC cores for high parallelism, and by leveraging in-SRAM IMC for data management, PPIMCE offers a compact, energy-efficient solution for accelerating HE and GC. PPIMCE achieves a 107X speedup against a CPU implementation of GC. Additionally, PPIMCE achieves a 1,500X and 800X speedup compared to CPU and GPU implementations of CKKS-based HE multiplications. For privacy-preserving machine learning inference, PPIMCE attains a 1,000X speedup compared to CPU and a 12X speedup against CraterLake, the state-of-art privacy preserving computation accelerator. △ Less

Submitted 10 August, 2023; v1 submitted 4 August, 2023; originally announced August 2023.

arXiv:2307.14557 [pdf, other]

Accelerating Polynomial Modular Multiplication with Crossbar-Based Compute-in-Memory

Authors: Mengyuan Li, Haoran Geng, Michael Niemier, Xiaobo Sharon Hu

Abstract: Lattice-based cryptographic algorithms built on ring learning with error theory are gaining importance due to their potential for providing post-quantum security. However, these algorithms involve complex polynomial operations, such as polynomial modular multiplication (PMM), which is the most time-consuming part of these algorithms. Accelerating PMM is crucial to make lattice-based cryptographic… ▽ More Lattice-based cryptographic algorithms built on ring learning with error theory are gaining importance due to their potential for providing post-quantum security. However, these algorithms involve complex polynomial operations, such as polynomial modular multiplication (PMM), which is the most time-consuming part of these algorithms. Accelerating PMM is crucial to make lattice-based cryptographic algorithms widely adopted by more applications. This work introduces a novel high-throughput and compact PMM accelerator, X-Poly, based on the crossbar (XB)-type compute-in-memory (CIM). We identify the most appropriate PMM algorithm for XB-CIM. We then propose a novel bit-mapping technique to reduce the area and energy of the XB-CIM fabric, and conduct processing engine (PE)-level optimization to increase memory utilization and support different problem sizes with a fixed number of XB arrays. X-Poly design achieves 3.1X10^6 PMM operations/s throughput and offers 200X latency improvement compared to the CPU-based implementation. It also achieves 3.9X throughput per area improvement compared with the state-of-the-art CIM accelerators. △ Less

Submitted 26 July, 2023; originally announced July 2023.

Comments: Accepted by 42nd International Conference on Computer-Aided Design (ICCAD)

arXiv:2305.17417 [pdf, other]

Modeling Dynamic Heterogeneous Graph and Node Importance for Future Citation Prediction

Authors: Hao Geng, Deqing Wang, Fuzhen Zhuang, Xuehua Ming, Chenguang Du, Ting Jiang, Haolong Guo, Rui Liu

Abstract: Accurate citation count prediction of newly published papers could help editors and readers rapidly figure out the influential papers in the future. Though many approaches are proposed to predict a paper's future citation, most ignore the dynamic heterogeneous graph structure or node importance in academic networks. To cope with this problem, we propose a Dynamic heterogeneous Graph and Node Impor… ▽ More Accurate citation count prediction of newly published papers could help editors and readers rapidly figure out the influential papers in the future. Though many approaches are proposed to predict a paper's future citation, most ignore the dynamic heterogeneous graph structure or node importance in academic networks. To cope with this problem, we propose a Dynamic heterogeneous Graph and Node Importance network (DGNI) learning framework, which fully leverages the dynamic heterogeneous graph and node importance information to predict future citation trends of newly published papers. First, a dynamic heterogeneous network embedding module is provided to capture the dynamic evolutionary trends of the whole academic network. Then, a node importance embedding module is proposed to capture the global consistency relationship to figure out each paper's node importance. Finally, the dynamic evolutionary trend embeddings and node importance embeddings calculated above are combined to jointly predict the future citation counts of each paper, by a log-normal distribution model according to multi-faced paper node representations. Extensive experiments on two large-scale datasets demonstrate that our model significantly improves all indicators compared to the SOTA models. △ Less

Submitted 27 May, 2023; originally announced May 2023.

Comments: Accepted by CIKM'2022

arXiv:2304.04321 [pdf, other]

ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

Authors: Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

Abstract: Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for the learning of complex tasks and transferring learned policy from simulated environments to the real world. Furthermore, state discretization limits a robot's abil… ▽ More Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for the learning of complex tasks and transferring learned policy from simulated environments to the real world. Furthermore, state discretization limits a robot's ability to follow human instructions based on the grounding of actions and states. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulations continue to experience significant challenges in novel goal-state generalizations, scene generalizations, and object generalizations. These findings highlight the need to develop new algorithms that address this gap and underscore the potential for further research in this area. Project website: https://arnold-benchmark.github.io. △ Less

Submitted 11 September, 2023; v1 submitted 9 April, 2023; originally announced April 2023.

Comments: The first two authors contributed equally; 20 pages; 17 figures; project availalbe: https://arnold-benchmark.github.io/ ICCV 2023

arXiv:2304.00464 [pdf, other]

UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-aware Curriculum and Iterative Generalist-Specialist Learning

Authors: Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, He Wang

Abstract: We propose a novel, object-agnostic method for learning a universal policy for dexterous object grasping from realistic point cloud observations and proprioceptive information under a table-top setting, namely UniDexGrasp++. To address the challenge of learning the vision-based policy across thousands of object instances, we propose Geometry-aware Curriculum Learning (GeoCurriculum) and Geometry-a… ▽ More We propose a novel, object-agnostic method for learning a universal policy for dexterous object grasping from realistic point cloud observations and proprioceptive information under a table-top setting, namely UniDexGrasp++. To address the challenge of learning the vision-based policy across thousands of object instances, we propose Geometry-aware Curriculum Learning (GeoCurriculum) and Geometry-aware iterative Generalist-Specialist Learning (GiGSL) which leverage the geometry feature of the task and significantly improve the generalizability. With our proposed techniques, our final policy shows universal dexterous grasping on thousands of object instances with 85.4% and 78.2% success rate on the train set and test set which outperforms the state-of-the-art baseline UniDexGrasp by 11.7% and 11.3%, respectively. △ Less

Submitted 3 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

arXiv:2303.17464 [pdf, other]

HMES: A Scalable Human Mobility and Epidemic Simulation System with Fast Intervention Modeling

Authors: Haoyu Geng, Guanjie Zheng, Zhengqing Han, Hua Wei, Zhenhui Li

Abstract: Recently, the world has witnessed the most severe pandemic (COVID-19) in this century. Studies on epidemic prediction and simulation have received increasing attention. However, the current methods suffer from three issues. First, most of the current studies focus on epidemic prediction, which can not provide adequate support for intervention policy making. Second, most of the current intervention… ▽ More Recently, the world has witnessed the most severe pandemic (COVID-19) in this century. Studies on epidemic prediction and simulation have received increasing attention. However, the current methods suffer from three issues. First, most of the current studies focus on epidemic prediction, which can not provide adequate support for intervention policy making. Second, most of the current interventions are based on population groups rather than fine-grained individuals, which can not make the measures towards the infected people and may cause waste of medical resources. Third, current simulations are not efficient and flexible enough for large-scale complex systems. In this paper, we propose a new epidemic simulation framework called HMES to address the above three challenges. The proposed framework covers a full pipeline of epidemic simulation and enables comprehensive fine-grained control in a large scale. In addition, we conduct experiments on real COVID-19 data. HMES demonstrates more accurate modeling of disease transmission up to 300 million people and up to 3 times acceleration compared to the state-of-the-art methods. △ Less

Submitted 27 March, 2023; originally announced March 2023.

Comments: UIC-2022, code available at https://github.com/hygeng/humanflow

arXiv:2303.16958 [pdf, other]

PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations

Authors: Haoran Geng, Ziming Li, Yiran Geng, Jiayi Chen, Hao Dong, He Wang

Abstract: Learning a generalizable object manipulation policy is vital for an embodied agent to work in complex real-world scenes. Parts, as the shared components in different object categories, have the potential to increase the generalization ability of the manipulation policy and achieve cross-category object manipulation. In this work, we build the first large-scale, part-based cross-category object man… ▽ More Learning a generalizable object manipulation policy is vital for an embodied agent to work in complex real-world scenes. Parts, as the shared components in different object categories, have the potential to increase the generalization ability of the manipulation policy and achieve cross-category object manipulation. In this work, we build the first large-scale, part-based cross-category object manipulation benchmark, PartManip, which is composed of 11 object categories, 494 objects, and 1432 tasks in 6 task classes. Compared to previous work, our benchmark is also more diverse and realistic, i.e., having more objects and using sparse-view point cloud as input without oracle information like part segmentation. To tackle the difficulties of vision-based policy learning, we first train a state-based expert with our proposed part-based canonicalization and part-aware rewards, and then distill the knowledge to a vision-based student. We also find an expressive backbone is essential to overcome the large diversity of different objects. For cross-category generalization, we introduce domain adversarial learning for domain-invariant feature extraction. Extensive experiments in simulation show that our learned policy can outperform other methods by a large margin, especially on unseen object categories. We also demonstrate our method can successfully manipulate novel objects in the real world. △ Less

Submitted 29 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR2023

arXiv:2303.12341 [pdf, other]

EasyDGL: Encode, Train and Interpret for Continuous-time Dynamic Graph Learning

Authors: Chao Chen, Haoyu Geng, Nianzu Yang, Xiaokang Yang, Junchi Yan

Abstract: Dynamic graphs arise in various real-world applications, and it is often welcomed to model the dynamics directly in continuous time domain for its flexibility. This paper aims to design an easy-to-use pipeline (termed as EasyDGL which is also due to its implementation by DGL toolkit) composed of three key modules with both strong fitting ability and interpretability. Specifically the proposed pipe… ▽ More Dynamic graphs arise in various real-world applications, and it is often welcomed to model the dynamics directly in continuous time domain for its flexibility. This paper aims to design an easy-to-use pipeline (termed as EasyDGL which is also due to its implementation by DGL toolkit) composed of three key modules with both strong fitting ability and interpretability. Specifically the proposed pipeline which involves encoding, training and interpreting: i) a temporal point process (TPP) modulated attention architecture to endow the continuous-time resolution with the coupled spatiotemporal dynamics of the observed graph with edge-addition events; ii) a principled loss composed of task-agnostic TPP posterior maximization based on observed events on the graph, and a task-aware loss with a masking strategy over dynamic graph, where the covered tasks include dynamic link prediction, dynamic node classification and node traffic forecasting; iii) interpretation of the model outputs (e.g., representations and predictions) with scalable perturbation-based quantitative analysis in the graph Fourier domain, which could more comprehensively reflect the behavior of the learned model. Extensive experimental results on public benchmarks show the superior performance of our EasyDGL for time-conditioned predictive tasks, and in particular demonstrate that EasyDGL can effectively quantify the predictive power of frequency content that a model learn from the evolving graph data. △ Less

Submitted 22 March, 2023; originally announced March 2023.

Comments: 9 figures, 7 tables

arXiv:2303.00938 [pdf, other]

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

Authors: Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, Tengyu Liu, Li Yi, He Wang

Abstract: In this work, we tackle the problem of learning universal robotic dexterous grasping from a point cloud observation under a table-top setting. The goal is to grasp and lift up objects in high-quality and diverse ways and generalize across hundreds of categories and even the unseen. Inspired by successful pipelines used in parallel gripper grasping, we split the task into two stages: 1) grasp propo… ▽ More In this work, we tackle the problem of learning universal robotic dexterous grasping from a point cloud observation under a table-top setting. The goal is to grasp and lift up objects in high-quality and diverse ways and generalize across hundreds of categories and even the unseen. Inspired by successful pipelines used in parallel gripper grasping, we split the task into two stages: 1) grasp proposal (pose) generation and 2) goal-conditioned grasp execution. For the first stage, we propose a novel probabilistic model of grasp pose conditioned on the point cloud observation that factorizes rotation from translation and articulation. Trained on our synthesized large-scale dexterous grasp dataset, this model enables us to sample diverse and high-quality dexterous grasp poses for the object point cloud.For the second stage, we propose to replace the motion planning used in parallel gripper grasping with a goal-conditioned grasp policy, due to the complexity involved in dexterous grasping execution. Note that it is very challenging to learn this highly generalizable grasp policy that only takes realistic inputs without oracle states. We thus propose several important innovations, including state canonicalization, object curriculum, and teacher-student distillation. Integrating the two stages, our final pipeline becomes the first to achieve universal generalization for dexterous grasping, demonstrating an average success rate of more than 60\% on thousands of object instances, which significantly outperforms all baselines, meanwhile showing only a minimal generalization gap. △ Less

Submitted 25 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023

arXiv:2302.03933 [pdf, other]

Graph Signal Sampling for Inductive One-Bit Matrix Completion: a Closed-form Solution

Authors: Chao Chen, Haoyu Geng, Gang Zeng, Zhaobing Han, Hua Chai, Xiaokang Yang, Junchi Yan

Abstract: Inductive one-bit matrix completion is motivated by modern applications such as recommender systems, where new users would appear at test stage with the ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items to a function (sign… ▽ More Inductive one-bit matrix completion is motivated by modern applications such as recommender systems, where new users would appear at test stage with the ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items to a function (signal) on the vertices of an item-item graph, then learn structural graph properties to recover the function from its values on certain vertices -- the problem of graph signal sampling. We propose a class of regularization functionals that takes into account discrete random label noise in the graph vertex domain, then develop the GS-IMC approach which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. Theoretical result shows that accurate reconstructions can be achieved under mild conditions. For the online setting, we develop a Bayesian extension, i.e., BGS-IMC which considers continuous random Gaussian noise in the graph Fourier domain and builds upon a prediction-correction update algorithm to obtain the unbiased and minimum-variance reconstruction. Both GS-IMC and BGS-IMC have closed-form solutions and thus are highly scalable in large data. Experiments show that our methods achieve state-of-the-art performance on public benchmarks. △ Less

Submitted 8 February, 2023; originally announced February 2023.

Comments: Published in ICLR 2023

arXiv:2301.12399 [pdf, other]

Learning Analytics from Spoken Discussion Dialogs in Flipped Classroom

Authors: Hang Su, Borislav Dzodzo, Changlun Li, Danyang Zhao, Hao Geng, Yunxiang Li, Sidharth Jaggi, Helen Meng

Abstract: The flipped classroom is a new pedagogical strategy that has been gaining increasing importance recently. Spoken discussion dialog commonly occurs in flipped classroom, which embeds rich information indicating processes and progression of students' learning. This study focuses on learning analytics from spoken discussion dialog in the flipped classroom, which aims to collect and analyze the discus… ▽ More The flipped classroom is a new pedagogical strategy that has been gaining increasing importance recently. Spoken discussion dialog commonly occurs in flipped classroom, which embeds rich information indicating processes and progression of students' learning. This study focuses on learning analytics from spoken discussion dialog in the flipped classroom, which aims to collect and analyze the discussion dialogs in flipped classroom in order to get to know group learning processes and outcomes. We have recently transformed a course using the flipped classroom strategy, where students watched video-recorded lectures at home prior to group-based problem-solving discussions in class. The in-class group discussions were recorded throughout the semester and then transcribed manually. After features are extracted from the dialogs by multiple tools and customized processing techniques, we performed statistical analyses to explore the indicators that are related to the group learning outcomes from face-to-face discussion dialogs in the flipped classroom. Then, machine learning algorithms are applied to the indicators in order to predict the group learning outcome as High, Mid or Low. The best prediction accuracy reaches 78.9%, which demonstrates the feasibility of achieving automatic learning outcome prediction from group discussion dialog in flipped classroom. △ Less

Submitted 29 January, 2023; originally announced January 2023.

arXiv:2211.05272 [pdf, other]

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

Authors: Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, He Wang

Abstract: For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale p… ▽ More For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, where we provide rich, part-level annotations (semantics, poses) for 8,489 part instances on 1,166 objects. Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gaps between seen and unseen object categories, we propose a robust 3D segmentation method from the perspective of domain generalization by integrating adversarial learning techniques. Our method outperforms all existing methods by a large margin, no matter on seen or unseen categories. Furthermore, with part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that can generalize well to unseen object categories in both the simulator and the real world. Our dataset, code, and demos are available on our project page. △ Less

Submitted 26 March, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: To appear in CVPR 2023 (Highlight)

arXiv:2209.12941 [pdf, other]

End-to-End Affordance Learning for Robotic Manipulation

Authors: Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, Hao Dong

Abstract: Learning to manipulate 3D objects in an interactive environment has been a challenging problem in Reinforcement Learning (RL). In particular, it is hard to train a policy that can generalize over objects with different semantic categories, diverse shape geometry and versatile functionality. Recently, the technique of visual affordance has shown great prospects in providing object-centric informati… ▽ More Learning to manipulate 3D objects in an interactive environment has been a challenging problem in Reinforcement Learning (RL). In particular, it is hard to train a policy that can generalize over objects with different semantic categories, diverse shape geometry and versatile functionality. Recently, the technique of visual affordance has shown great prospects in providing object-centric information priors with effective actionable semantics. As such, an effective policy can be trained to open a door by knowing how to exert force on the handle. However, to learn the affordance, it often requires human-defined action primitives, which limits the range of applicable tasks. In this study, we take advantage of visual affordance by using the contact information generated during the RL training process to predict contact maps of interest. Such contact prediction process then leads to an end-to-end affordance learning framework that can generalize over different types of manipulation tasks. Surprisingly, the effectiveness of such framework holds even under the multi-stage and the multi-agent scenarios. We tested our method on eight types of manipulation tasks. Results showed that our methods outperform baseline algorithms, including visual-based affordance methods and RL methods, by a large margin on the success rate. The demonstration can be found at https://sites.google.com/view/rlafford/. △ Less

Submitted 26 September, 2022; originally announced September 2022.

Comments: 8 pages, 3 figures

arXiv:2204.06517 [pdf, other]

Learning Self-Modulating Attention in Continuous Time Space with Applications to Sequential Recommendation

Authors: Chao Chen, Haoyu Geng, Nianzu Yang, Junchi Yan, Daiyue Xue, Jianping Yu, Xiaokang Yang

Abstract: User interests are usually dynamic in the real world, which poses both theoretical and practical challenges for learning accurate preferences from rich behavior data. Among existing user behavior modeling solutions, attention networks are widely adopted for its effectiveness and relative simplicity. Despite being extensively studied, existing attentions still suffer from two limitations: i) conven… ▽ More User interests are usually dynamic in the real world, which poses both theoretical and practical challenges for learning accurate preferences from rich behavior data. Among existing user behavior modeling solutions, attention networks are widely adopted for its effectiveness and relative simplicity. Despite being extensively studied, existing attentions still suffer from two limitations: i) conventional attentions mainly take into account the spatial correlation between user behaviors, regardless the distance between those behaviors in the continuous time space; and ii) these attentions mostly provide a dense and undistinguished distribution over all past behaviors then attentively encode them into the output latent representations. This is however not suitable in practical scenarios where a user's future actions are relevant to a small subset of her/his historical behaviors. In this paper, we propose a novel attention network, named self-modulating attention, that models the complex and non-linearly evolving dynamic user preferences. We empirically demonstrate the effectiveness of our method on top-N sequential recommendation tasks, and the results on three large-scale real-world datasets show that our model can achieve state-of-the-art performance. △ Less

Submitted 29 March, 2022; originally announced April 2022.

Comments: Published in ICML 2021

arXiv:2112.02231 [pdf, other]

IMCRYPTO: An In-Memory Computing Fabric for AES Encryption and Decryption

Authors: Dayane Reis, Haoran Geng, Michael Niemier, Xiaobo Sharon Hu

Abstract: This paper proposes IMCRYPTO, an in-memory computing (IMC) fabric for accelerating AES encryption and decryption. IMCRYPTO employs a unified structure to implement encryption and decryption in a single hardware architecture, with combined (Inv)SubBytes and (Inv)MixColumns steps. Because of this step-combination, as well as the high parallelism achieved by multiple units of random-access memory (RA… ▽ More This paper proposes IMCRYPTO, an in-memory computing (IMC) fabric for accelerating AES encryption and decryption. IMCRYPTO employs a unified structure to implement encryption and decryption in a single hardware architecture, with combined (Inv)SubBytes and (Inv)MixColumns steps. Because of this step-combination, as well as the high parallelism achieved by multiple units of random-access memory (RAM) and random-access/content-addressable memory (RA/CAM) arrays, IMCRYPTO achieves high throughput encryption and decryption without sacrificing area and power consumption. Additionally, due to the integration of a RISC-V core, IMCRYPTO offers programmability and flexibility. IMCRYPTO improves the throughput per area by a minimum (maximum) of 3.3x (223.1x) when compared to previous ASICs/IMC architectures for AES-128 encryption. Projections show added benefit from emerging technologies of up to 5.3x to the area-delay-power product of IMCRYPTO. △ Less

Submitted 3 December, 2021; originally announced December 2021.

arXiv:2108.01793 [pdf, other]

doi 10.3389/fnins.2024.1412559

Composing Recurrent Spiking Neural Networks using Locally-Recurrent Motifs and Risk-Mitigating Architectural Optimization

Authors: Wenrui Zhang, Hejia Geng, Peng Li

Abstract: In neural circuits, recurrent connectivity plays a crucial role in network function and stability. However, existing recurrent spiking neural networks (RSNNs) are often constructed by random connections without optimization. While RSNNs can produce rich dynamics that are critical for memory formation and learning, systemic architectural optimization of RSNNs is still an open challenge. We aim to e… ▽ More In neural circuits, recurrent connectivity plays a crucial role in network function and stability. However, existing recurrent spiking neural networks (RSNNs) are often constructed by random connections without optimization. While RSNNs can produce rich dynamics that are critical for memory formation and learning, systemic architectural optimization of RSNNs is still an open challenge. We aim to enable systematic design of large RSNNs via a new scalable RSNN architecture and automated architectural optimization. We compose RSNNs based on a layer architecture called Sparsely-Connected Recurrent Motif Layer (SC-ML) that consists of multiple small recurrent motifs wired together by sparse lateral connections. The small size of the motifs and sparse inter-motif connectivity leads to an RSNN architecture scalable to large network sizes. We further propose a method called Hybrid Risk-Mitigating Architectural Search (HRMAS) to systematically optimize the topology of the proposed recurrent motifs and SC-ML layer architecture. HRMAS is an alternating two-step optimization process by which we mitigate the risk of network instability and performance degradation caused by architectural change by introducing a novel biologically-inspired "self-repairing" mechanism through intrinsic plasticity. The intrinsic plasticity is introduced to the second step of each HRMAS iteration and acts as unsupervised fast self-adaptation to structural and synaptic weight modifications introduced by the first step during the RSNN architectural "evolution". To the best of the authors' knowledge, this is the first work that performs systematic architectural optimization of RSNNs. Using one speech and three neuromorphic datasets, we demonstrate the significant performance improvement brought by the proposed automated architecture optimization over existing manually-designed RSNNs. △ Less

Submitted 15 September, 2023; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: 20 pages, 7 figures

Journal ref: Frontiers in Neuroscience, 2024, Vol. 18, Art. e1412559

arXiv:1912.07254 [pdf, other]

VLSI Mask Optimization: From Shallow To Deep Learning

Authors: Haoyu Yang, Wei Zhong, Yuzhe Ma, Hao Geng, Ran Chen, Wanli Chen, Bei Yu

Abstract: VLSI mask optimization is one of the most critical stages in manufacturability aware design, which is costly due to the complicated mask optimization and lithography simulation. Recent researches have shown prominent advantages of machine learning techniques dealing with complicated and big data problems, which bring potential of dedicated machine learning solution for DFM problems and facilitate… ▽ More VLSI mask optimization is one of the most critical stages in manufacturability aware design, which is costly due to the complicated mask optimization and lithography simulation. Recent researches have shown prominent advantages of machine learning techniques dealing with complicated and big data problems, which bring potential of dedicated machine learning solution for DFM problems and facilitate the VLSI design cycle. In this paper, we focus on a heterogeneous OPC framework that assists mask layout optimization. Preliminary results show the efficiency and effectiveness of proposed frameworks that have the potential to be alternatives to existing EDA solutions. △ Less

Submitted 16 December, 2019; originally announced December 2019.

Comments: 6 pages; accepted by 25th Asia and South Pacific Design Automation Conference (ASP-DAC 2020)

Showing 1–32 of 32 results for author: Geng, H