-
Token-Mol 1.0: Tokenized drug design with large language model
Authors:
Jike Wang,
Rui Qin,
Mingyang Wang,
Meijing Fang,
Yangyang Zhang,
Yuchen Zhu,
Qun Su,
Qiaolin Gou,
Chao Shen,
Odin Zhang,
Zhenxing Wu,
Dejun Jiang,
Xujun Zhang,
Huifeng Zhao,
Xiaozhe Wan,
Zhourui Wu,
Liwei Liu,
Yu Kang,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
Significant interest has recently arisen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts.
Submitted 10 July, 2024;
originally announced July 2024.
-
Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow
Authors:
Zhichao Chen,
Haoxuan Li,
Fangyikang Wang,
Odin Zhang,
Hu Xu,
Xiaoyu Jiang,
Zhihuan Song,
Eric H. Wang
Abstract:
Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but two long-neglected issues remain to be addressed: (1) inaccurate imputation, which arises from the inherently sample-diversification-pursuing generative process of DMs, and (2) difficult training, which stems from the intricate design required for the mask matrix in the model training stage. To address these concerns within the realm of numerical tabular datasets, we introduce a novel principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, based on the Wasserstein gradient flow (WGF) framework, we first prove that issue (1) arises because the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI's objective plus diversification-promoting non-negative terms. Based on this, we then design a novel cost functional with diversification-discouraging negative entropy and derive our KnewImp approach within the WGF framework and a reproducing kernel Hilbert space. After that, we prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, eliminating the need for the mask matrix and hence naturally addressing issue (2). Extensive experiments demonstrate that our proposed KnewImp approach significantly outperforms existing state-of-the-art methods.
Submitted 22 June, 2024;
originally announced June 2024.
-
CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
Authors:
Haitao Lin,
Guojiang Zhao,
Odin Zhang,
Yufei Huang,
Lirong Wu,
Zicheng Liu,
Siyuan Li,
Cheng Tan,
Zhifeng Gao,
Stan Z. Li
Abstract:
Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at https://github.com/Edapinenut/CBGBench.
Submitted 16 June, 2024;
originally announced June 2024.
-
Length-scale study in deep learning prediction for non-small cell lung cancer brain metastasis
Authors:
Haowen Zhou,
Steven Lin,
Mark Watson,
Cory T. Bernadt,
Oumeng Zhang,
Ramaswamy Govindan,
Richard J. Cote,
Changhuei Yang
Abstract:
Deep learning assisted digital pathology has the potential to impact clinical practice in significant ways. In recent studies, deep neural network (DNN) enabled analysis outperforms human pathologists. Increasing sizes and complexity of the DNN architecture generally improves performance at the cost of DNN's explainability. For pathology, this lack of DNN explainability is particularly problematic as it hinders the broader clinical interpretation of the pathology features that may provide physiological disease insights. To better assess the features that DNN uses in developing predictive algorithms to interpret digital microscopic images, we sought to understand the role of resolution and tissue scale and here describe a novel method for studying the predictive feature length-scale that underpins a DNN's predictive power. We applied the method to study a DNN's predictive capability in the case example of brain metastasis prediction from early-stage non-small-cell lung cancer biopsy slides. The study highlights the DNN attention in the brain metastasis prediction targeting both cellular scale (resolution) and tissue scale features on H&E-stained histological whole slide images. At the cellular scale, we see that DNN's predictive power is progressively increased at higher resolution (i.e., lower resolvable feature length) and is largely lost when the resolvable feature length is longer than 5 microns. In addition, DNN uses more macro-scale features (maximal feature length) associated with tissue organization/architecture and is optimized when assessing visual fields larger than 41 microns. This study for the first time demonstrates the length-scale requirements necessary for optimal DNN learning on digital whole slide images.
Submitted 1 June, 2024;
originally announced June 2024.
-
Single-shot volumetric fluorescence imaging with neural fields
Authors:
Oumeng Zhang,
Haowen Zhou,
Brandon Y. Feng,
Elin M. Larsson,
Reinaldo E. Alcalde,
Siyuan Yin,
Catherine Deng,
Changhuei Yang
Abstract:
Single-shot volumetric fluorescence (SVF) imaging offers a significant advantage over traditional imaging methods that require scanning across multiple axial planes as it can capture biological processes with high temporal resolution across a large field of view. The key challenges in SVF imaging include requiring sparsity constraints to meet the multiplexing requirements of compressed sensing, eliminating depth ambiguity in the reconstruction, and maintaining high resolution across a large field of view. In this paper, we introduce the QuadraPol point spread function (PSF) combined with neural fields, a novel approach for SVF imaging. This method utilizes a custom polarizer at the back focal plane and a polarization camera to detect fluorescence, effectively encoding the 3D scene within a compact PSF without depth ambiguity. Additionally, we propose a reconstruction algorithm based on the neural fields technique that provides improved reconstruction quality and addresses the inaccuracies of phase retrieval methods used to correct imaging system aberrations. This algorithm combines the accuracy of experimental PSFs with the long depth of field of computationally generated retrieved PSFs. QuadraPol PSF, combined with neural fields, significantly reduces the acquisition time of a conventional fluorescence microscope by approximately 20 times and captures a 100 mm$^3$ cubic volume in one shot. We validate the effectiveness of both our hardware and algorithm through all-in-focus imaging of bacterial colonies on sand surfaces and visualization of plant root morphology. Our approach offers a powerful tool for advancing biological research and ecological studies.
Submitted 4 June, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
PPFlow: Target-aware Peptide Design with Torsional Flow Matching
Authors:
Haitao Lin,
Odin Zhang,
Huifeng Zhao,
Dejun Jiang,
Lirong Wu,
Zicheng Liu,
Yufei Huang,
Stan Z. Li
Abstract:
Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods of AI-assisted peptide drug discovery are not fully explored. To fill the gap, we propose a target-aware peptide design method called PPFlow, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for the peptide structure design. Besides, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design and to allow the training of deep learning methods. Extensive experiments show that PPFlow reaches state-of-the-art performance in tasks of peptide drug generation and optimization in comparison with baseline models, and can be generalized to other tasks including docking and side-chain packing.
Submitted 16 June, 2024; v1 submitted 5 March, 2024;
originally announced May 2024.
-
A Curated Rotamer Library for Common Post-Translational Modifications of Proteins
Authors:
Oufan Zhang,
Shubhankar A. Naik,
Zi Hao Liu,
Julie Forman-Kay,
Teresa Head-Gordon
Abstract:
Sidechain rotamer libraries of the common amino acids of a protein are useful for folded protein structure determination and for generating ensembles of intrinsically disordered proteins (IDPs). However, much of protein function is modulated beyond the translated sequence through the introduction of post-translational modifications (PTMs). In this work we have provided a curated set of side chain rotamers for the most common PTMs derived from the RCSB PDB database, including phosphorylated, methylated, and acetylated sidechains. Our rotamer libraries improve upon existing methods such as SIDEpro and Rosetta in predicting the experimental structures for PTMs in folded proteins. In addition, we showcase our PTM libraries in full use by generating ensembles with the Monte Carlo Side Chain Entropy (MCSCE) for folded proteins, and combining MCSCE with the Local Disordered Region Sampling algorithms within IDPConformerGenerator for proteins with intrinsically disordered regions.
Submitted 5 May, 2024;
originally announced May 2024.
-
The Order of the (123, 132)-Avoiding Stack Sort
Authors:
Owen Zhang
Abstract:
Let $s$ be West's deterministic stack-sorting map. A well-known result (West) is that any length $n$ permutation can be sorted with $n-1$ iterations of $s.$ In 2020, Defant introduced the notion of highly-sorted permutations -- permutations in $s^t(S_n)$ for $t \lessapprox n-1.$ In 2023, Choi and Choi extended this notion to generalized stack-sorting maps $s_σ,$ where we relax the condition of becoming sorted to the analogous condition of becoming periodic with respect to $s_σ.$ In this work, we introduce the notion of minimally-sorted permutations $\mathfrak{M}_n$ as an antithesis to Defant's highly-sorted permutations, and show that $\text{ord}_{s_{123, 132}}(S_n) = 2 \lfloor \frac{n-1}{2} \rfloor,$ strengthening Berlow's 2021 classification of periodic points.
Submitted 3 May, 2024;
originally announced May 2024.
-
Deep Lead Optimization: Leveraging Generative AI for Structural Modification
Authors:
Odin Zhang,
Haitao Lin,
Hui Zhang,
Huifeng Zhao,
Yufei Huang,
Yuansheng Huang,
Dejun Jiang,
Chang-yu Hsieh,
Peichen Pan,
Tingjun Hou
Abstract:
The idea of using deep-learning-based molecular generation to accelerate discovery of drug candidates has attracted extraordinary attention, and many deep generative models have been developed for automated drug design, termed molecular generation. In general, molecular generation encompasses two main strategies: de novo design, which generates novel molecular structures from scratch, and lead optimization, which refines existing molecules into drug candidates. Among them, lead optimization plays an important role in real-world drug design. For example, it can enable the development of me-better drugs that are chemically distinct yet more effective than the original drugs. It can also facilitate fragment-based drug design, transforming virtual-screened small ligands with low affinity into first-in-class medicines. Despite its importance, automated lead optimization remains underexplored compared to the well-established de novo generative models, due to its reliance on complex biological and chemical knowledge. To bridge this gap, we conduct a systematic review of traditional computational methods for lead optimization, organizing these strategies into four principal sub-tasks with defined inputs and outputs. This review delves into the basic concepts, goals, conventional CADD techniques, and recent advancements in AIDD. Additionally, we introduce a unified perspective based on constrained subgraph generation to harmonize the methodologies of de novo design and lead optimization. Through this lens, de novo design can incorporate strategies from lead optimization to address the challenge of generating hard-to-synthesize molecules; inversely, lead optimization can benefit from the innovations in de novo design by approaching it as a task of generating molecules conditioned on certain substructures.
Submitted 29 April, 2024;
originally announced April 2024.
-
Combining transition path sampling with data-driven collective variables through a reactivity-biased shooting algorithm
Authors:
Jintu Zhang,
Odin Zhang,
Luigi Bonati,
TingJun Hou
Abstract:
Rare event sampling is a central problem in modern computational chemistry research. Among the existing methods, transition path sampling (TPS) can generate unbiased representations of reaction processes. However, its efficiency depends on the ability to generate reactive trial paths, which in turn depends on the quality of the shooting algorithm used. We propose a new algorithm based on the shooting success rate, i.e. reactivity, measured as a function of a reduced set of collective variables (CVs). These variables are extracted with a machine learning approach directly from TPS simulations, using a multi-task objective function. Iteratively, this workflow significantly improves shooting efficiency without any prior knowledge of the process. In addition, the optimized CVs can be used with biased enhanced sampling methodologies to accurately reconstruct the free energy profiles. We tested the method on three different systems: a two-dimensional toy model, conformational transitions of alanine dipeptide, and hydrolysis of acetyl chloride in bulk water. In the latter, we integrated our workflow with an active learning scheme to learn a reactive machine learning-based potential, which allowed us to study the mechanism and free energy profile with an ab initio-like accuracy.
Submitted 3 April, 2024;
originally announced April 2024.
-
Deep Geometry Handling and Fragment-wise Molecular 3D Graph Generation
Authors:
Odin Zhang,
Yufei Huang,
Shichen Cheng,
Mengyao Yu,
Xujun Zhang,
Haitao Lin,
Yundian Zeng,
Mingyang Wang,
Zhenxing Wu,
Huifeng Zhao,
Zaixi Zhang,
Chenqing Hua,
Yu Kang,
Sunliang Cui,
Peichen Pan,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.
Submitted 15 March, 2024;
originally announced April 2024.
-
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Authors:
Nathaniel Li,
Alexander Pan,
Anjali Gopal,
Summer Yue,
Daniel Berrios,
Alice Gatti,
Justin D. Li,
Ann-Kathrin Dombrowski,
Shashwat Goel,
Long Phan,
Gabriel Mukobi,
Nathan Helm-Burger,
Rassin Lababidi,
Lennart Justen,
Andrew B. Liu,
Michael Chen,
Isabelle Barrass,
Oliver Zhang,
Xiaoyuan Zhu,
Rishub Tamirisa,
Bhrugu Bharathi,
Adam Khoja,
Zhenqi Zhao,
Ariel Herbert-Voss,
Cort B. Breuer
, et al. (32 additional authors not shown)
Abstract:
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
Submitted 15 May, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Mapping the Landscape of Independent Food Delivery Platforms in the United States
Authors:
Yuhan Liu,
Amna Liaqat,
Owen Xingjian Zhang,
Mariana Consuelo Fernández Espinosa,
Ankhitha Manjunatha,
Alexander Yang,
Orestis Papakyriakopoulos,
Andrés Monroy-Hernández
Abstract:
Beyond the well-known giants like Uber Eats and DoorDash, there are hundreds of independent food delivery platforms in the United States. However, little is known about the sociotechnical landscape of these "indie" platforms. In this paper, we analyzed these platforms to understand why they were created, how they operate, and what technologies they use. We collected data on 495 indie platforms and detailed survey responses from 29 platforms. We found that personalized, timely service is a central value of indie platforms, as is a sense of responsibility to the local community they serve. Indie platforms are motivated to provide fair rates for restaurants and couriers. These alternative business practices differentiate them from mainstream platforms. Though indie platforms have plans to expand, a lack of customizability in off-the-shelf software prevents independent platforms from personalizing services for their local communities. We show that these platforms are a widespread and longstanding fixture of the food delivery market. We illustrate the diversity of motivations and values to explain why a one-size-fits-all support is insufficient, and we discuss the siloing of technology that inhibits platforms' growth. Through these insights, we aim to promote future HCI research into the potential development of public-interest technologies for local food delivery.
Submitted 25 March, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Re-Dock: Towards Flexible and Realistic Molecular Docking with Diffusion Bridge
Authors:
Yufei Huang,
Odin Zhang,
Lirong Wu,
Cheng Tan,
Haitao Lin,
Zhangyang Gao,
Siyuan Li,
Stan. Z. Li
Abstract:
Accurate prediction of protein-ligand binding structures, a task known as molecular docking, is crucial for drug design but remains challenging. While deep learning has shown promise, existing methods often depend on holo-protein structures (docked, and not accessible in realistic tasks) or neglect pocket sidechain conformations, leading to limited practical utility and unrealistic conformation predictions. To fill these gaps, we introduce an under-explored task, named flexible docking, to predict poses of ligand and pocket sidechains simultaneously, and introduce Re-Dock, a novel diffusion bridge generative model extended to geometric manifolds. Specifically, we propose energy-to-geometry mapping inspired by the Newton-Euler equation to co-model the binding energy and conformations for reflecting the energy-constrained docking generative process. Comprehensive experiments on designed benchmark datasets including apo-dock and cross-dock demonstrate our model's superior effectiveness and efficiency over current methods.
Submitted 21 February, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction
Authors:
Yufei Huang,
Siyuan Li,
Jin Su,
Lirong Wu,
Odin Zhang,
Haitao Lin,
Jingqi Qi,
Zihan Liu,
Zhangyang Gao,
Yuyang Liu,
Jiangbin Zheng,
Stan. ZQ. Li
Abstract:
Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods highly rely on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.
Submitted 19 October, 2023; v1 submitted 14 October, 2023;
originally announced October 2023.
-
Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction
Authors:
Jie Li,
Xingyi Guan,
Oufan Zhang,
Kunyang Sun,
Yingze Wang,
Dorian Bagni,
Teresa Head-Gordon
Abstract:
Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it is controversial as to whether new SFs are actually improving since the general, refined, and core datasets of PDBBind are cross-contaminated with proteins and ligands with high similarity, and hence they may not perform comparably well in binding prediction of new protein-ligand complexes. In this work, we have carefully prepared a cleaned PDBBind dataset of non-covalent binders that is split into training, validation, and test datasets to control for data leakage, defined as proteins and ligands with high sequence and structural similarity. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs: AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to new protein-ligand complexes. In particular, we have formulated a new independent dataset, BDB2020+, by matching high quality binding free energies from BindingDB with co-crystallized ligand-protein complexes from the PDB that have been deposited since 2020. Based on all the benchmark results, the retrained models using LP-PDBBind consistently perform better, with IGN especially being recommended for scoring and ranking applications for new protein-ligand systems.
Submitted 2 May, 2024; v1 submitted 18 August, 2023;
originally announced August 2023.
-
On a conjecture on pattern-avoiding machines
Authors:
Christopher Bao,
Giulio Cerbai,
Yunseo Choi,
Katelyn Gan,
Owen Zhang
Abstract:
Let $s$ be West's stack-sorting map, and let $s_{T}$ be the generalized stack-sorting map, where instead of being required to increase, the stack avoids subpermutations that are order-isomorphic to any permutation in the set $T$. In 2020, Cerbai, Claesson, and Ferrari introduced the $σ$-machine $s \circ s_σ$ as a generalization of West's $2$-stack-sorting-map $s \circ s$. As a further generalization, in 2021, Baril, Cerbai, Khalil, and Vajnovski introduced the $(σ, τ)$-machine $s \circ s_{σ, τ}$ and enumerated $|\Sort_{n}(σ,τ)|$ -- the number of permutations in $S_n$ that are mapped to the identity by the $(σ, τ)$-machine -- for six pairs of length $3$ permutations $(σ, τ)$. In this work, we settle a conjecture by Baril, Cerbai, Khalil, and Vajnovski on the only remaining pair of length $3$ patterns $(σ, τ) = (132, 321)$ for which $|\Sort_{n}(σ, τ)|$ appears in the OEIS. In addition, we enumerate $|\Sort_n(123, 321)|$, which does not appear in the OEIS, but has a simple closed form.
Submitted 12 September, 2023; v1 submitted 18 August, 2023;
originally announced August 2023.
-
Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation and Elaboration
Authors:
Haitao Lin,
Yufei Huang,
Odin Zhang,
Lirong Wu,
Siyuan Li,
Zhiyuan Chen,
Stan Z. Li
Abstract:
In recent years, AI-assisted drug design methods have been proposed to generate molecules given the pocket structures of target proteins. Most of them are atom-level-based methods, which consider atoms as basic components and generate atom positions and types. In this way, however, it is hard to generate realistic fragments with complicated structures. To solve this, we propose D3FG, a functional-group-based diffusion model for pocket-specific molecule generation and elaboration. D3FG decomposes molecules into two categories of components: functional groups defined as rigid bodies and linkers as mass points. Together, the two kinds of components can form complicated fragments that enhance ligand-protein interactions.
To be specific, in the diffusion process, D3FG diffuses the data distribution of the positions, orientations, and types of the components into a prior distribution; in the generative process, the noise is gradually removed from the three variables by denoisers parameterized with designed equivariant graph neural networks. In the experiments, our method can generate molecules with more realistic 3D structures, competitive affinities toward the protein targets, and better drug properties. In addition, D3FG, as a solution to a new task of molecule elaboration, can generate molecules with high affinities based on existing ligands and the hotspots of target proteins.
Submitted 18 March, 2024; v1 submitted 30 May, 2023;
originally announced June 2023.
-
Highly accurate and efficient deep learning paradigm for full-atom protein loop modeling with KarmaLoop
Authors:
Tianyue Wang,
Xujun Zhang,
Odin Zhang,
Peichen Pan,
Guangyong Chen,
Yu Kang,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
Protein loop modeling is the most challenging yet highly non-trivial task in protein structure prediction. Despite recent progress, existing methods including knowledge-based, ab initio, hybrid and deep learning (DL) methods fall significantly short of either atomic accuracy or computational efficiency. Moreover, an overarching focus on backbone atoms has resulted in a dearth of attention given to side-chain conformation, a critical aspect in a host of downstream applications including ligand docking, molecular dynamics simulation and drug design. To overcome these limitations, we present KarmaLoop, a novel paradigm that distinguishes itself as the first DL method centered on full-atom (encompassing both backbone and side-chain heavy atoms) protein loop modeling. Our results demonstrate that KarmaLoop considerably outperforms conventional and DL-based methods of loop modeling in terms of both accuracy and efficiency, with the average RMSD improved by over two-fold compared to the second-best baseline method across different tasks, and manifests at least two orders of magnitude speedup in general. Consequently, our comprehensive evaluations indicate that KarmaLoop provides a state-of-the-art DL solution for protein loop modeling, with the potential to hasten the advancement of protein engineering, antibody-antigen recognition, and drug design.
Submitted 22 June, 2023;
originally announced June 2023.
-
ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis
Authors:
Zhiling Zheng,
Oufan Zhang,
Christian Borgs,
Jennifer T. Chayes,
Omar M. Yaghi
Abstract:
We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions from diverse formats and styles of the scientific literature. This effectively mitigates ChatGPT's tendency to hallucinate information -- an issue that previously made the use of Large Language Models (LLMs) in scientific fields challenging. Our approach involves the development of a workflow implementing three different processes for text mining, programmed by ChatGPT itself. All of them enable parsing, searching, filtering, classification, summarization, and data unification with different tradeoffs between labor, speed, and accuracy. We deploy this system to extract 26,257 distinct synthesis parameters pertaining to approximately 800 MOFs sourced from peer-reviewed research articles. This process incorporates our ChemPrompt Engineering strategy to instruct ChatGPT in text mining, resulting in impressive precision, recall, and F1 scores of 90-99%. Furthermore, with the dataset built by text mining, we constructed a machine-learning model with over 86% accuracy in predicting MOF experimental crystallization outcomes and preliminarily identifying important factors in MOF crystallization. We also developed a reliable data-grounded MOF chatbot to answer questions on chemical reactions and synthesis procedures. Given that the process of using ChatGPT reliably mines and tabulates diverse MOF synthesis information in a unified format, while using only narrative language requiring no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be very useful across various other chemistry sub-disciplines.
Submitted 19 July, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes
Authors:
Shruthi Chari,
Prasant Acharya,
Daniel M. Gruen,
Olivia Zhang,
Elif K. Eyigoz,
Mohamed Ghalwash,
Oshani Seneviratne,
Fernando Suarez Saiz,
Pablo Meyer,
Prithwish Chakraborty,
Deborah L. McGuinness
Abstract:
Medical experts may use Artificial Intelligence (AI) systems with greater trust if these are supported by contextual explanations that let the practitioner connect system inferences to their context of use. However, their importance in improving model usage and understanding has not been extensively studied. Hence, we consider a comorbidity risk prediction scenario and focus on contexts regarding the patient's clinical state, AI predictions about their risk of complications, and algorithmic explanations supporting the predictions. We explore how relevant information for such dimensions can be extracted from medical guidelines to answer typical questions from clinical practitioners. We identify this as a question answering (QA) task and employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability. Finally, we study the benefits of contextual explanations by building an end-to-end AI pipeline including data cohorting, AI risk modeling, post-hoc model explanations, and a prototype visual dashboard to present the combined insights from different context dimensions and data sources, while predicting and identifying the drivers of risk of Chronic Kidney Disease - a common type-2 diabetes comorbidity. All of these steps were performed in engagement with medical experts, including a final evaluation of the dashboard results by an expert medical panel. We show that LLMs, in particular BERT and SciBERT, can be readily deployed to extract some relevant explanations to support clinical usage. To understand the value-add of the contextual explanations, the expert panel evaluated these regarding actionable insights in the relevant clinical setting. Overall, our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
Submitted 11 February, 2023;
originally announced February 2023.
-
MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding
Authors:
Steven H. Wang,
Antoine Scardigli,
Leonard Tang,
Wei Chen,
Dimitry Levkin,
Anya Chen,
Spencer Ball,
Thomas Woodside,
Oliver Zhang,
Dan Hendrycks
Abstract:
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
Submitted 24 November, 2023; v1 submitted 2 January, 2023;
originally announced January 2023.
-
DiffBP: Generative Diffusion of 3D Molecules for Target Protein Binding
Authors:
Haitao Lin,
Yufei Huang,
Odin Zhang,
Siqi Ma,
Meng Liu,
Xuanjing Li,
Lirong Wu,
Jishui Wang,
Tingjun Hou,
Stan Z. Li
Abstract:
Generating molecules that bind to specific proteins is an important but challenging task in drug discovery. Previous works usually generate atoms in an auto-regressive way, where element types and 3D coordinates of atoms are generated one by one. However, in real-world molecular systems, the interactions among atoms in an entire molecule are global, so the energy function is pair-coupled among atoms. From this energy-based perspective, the probability should be modeled with joint distributions rather than sequential conditional ones. The unnatural sequential auto-regressive modeling of molecule generation is therefore likely to violate physical rules, resulting in poor properties of the generated molecules. In this work, a generative diffusion model for molecular 3D structures based on target proteins as contextual constraints is established, at a full-atom level in a non-autoregressive way. Given a designated 3D protein binding site, our model learns the generative process that denoises both element types and 3D coordinates of an entire molecule, with an equivariant network. Experimentally, the proposed method shows competitive performance compared with prevailing works in terms of high affinity with proteins and appropriate molecule sizes as well as other drug properties such as drug-likeness of the generated molecules.
Submitted 14 July, 2024; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Learning to Evolve Structural Ensembles of Unfolded and Disordered Proteins Using Experimental Solution Data
Authors:
Oufan Zhang,
Mojtaba Haghighatlari,
Jie Li,
Joao Miguel Correia Teixeira,
Ashley Namini,
Zi-Hao Liu,
Julie D Forman-Kay,
Teresa Head-Gordon
Abstract:
We have developed a Generative Recurrent Neural Network (GRNN) that learns the probability of the next residue torsions $X_{i+1}=\ [φ_{i+1},ψ_{i+1},ω_{i+1}, χ_{i+1}]$ from the previous residue in the sequence $X_i$ to generate new IDP conformations. In addition, we couple the GRNN with a Bayesian model, X-EISD, in a reinforcement learning step that biases the probability distributions of torsions to take advantage of experimental data types such as J-couplings, NOEs and PREs. We show that updating the generative model parameters according to the reward feedback on the basis of the agreement between structures and data improves upon existing approaches that simply reweight static structural pools for disordered proteins. Instead, the GRNN "DynamICE" model learns to physically change the conformations of the underlying pool to those that better agree with experiment.
Submitted 24 July, 2022; v1 submitted 25 June, 2022;
originally announced June 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Learning Correlations between Internal Coordinates to improve 3D Cartesian Coordinates for Proteins
Authors:
Jie Li,
Oufan Zhang,
Seokyoung Lee,
Ashley Namini,
Zi Hao Liu,
João Miguel Correia Teixeira,
Julie D Forman-Kay,
Teresa Head-Gordon
Abstract:
We consider a generic representation problem of internal coordinates (bond lengths, valence angles, and dihedral angles) and their transformation to 3-dimensional Cartesian coordinates of a biomolecule. We show that the internal-to-Cartesian process relies on correctly predicting chemically subtle correlations among the internal coordinates themselves, and learning these correlations increases the fidelity of the Cartesian representation. This general problem has been solved with machine learning for proteins, but with appropriately formulated data is extensible to any type of chain biomolecule including RNA, DNA, and lipids. We developed a machine learning algorithm, Int2Cart, that predicts bond lengths and bond angles from backbone torsion angles and residue types of a protein, and allows reconstruction of protein structures better than using fixed bond lengths and bond angles, or a static library method that relies on backbone torsion angles and residue types on a single residue. The Int2Cart algorithm has been implemented as a standalone Python package at https://github.com/THGLab/int2cart.
Submitted 11 May, 2022; v1 submitted 10 May, 2022;
originally announced May 2022.
-
Temperature as Uncertainty in Contrastive Learning
Authors:
Oliver Zhang,
Mike Wu,
Jasmine Bayrooti,
Noah Goodman
Abstract:
Contrastive learning has demonstrated great capability to learn representations without annotations, even outperforming supervised baselines. However, it still lacks important properties useful for real-world application, one of which is uncertainty. In this paper, we propose a simple way to generate uncertainty scores for many contrastive methods by re-purposing temperature, a mysterious hyperparameter used for scaling. By observing that temperature controls how sensitive the objective is to specific embedding locations, we aim to learn temperature as an input-dependent variable, treating it as a measure of embedding confidence. We call this approach "Temperature as Uncertainty", or TaU. Through experiments, we demonstrate that TaU is useful for out-of-distribution detection, while remaining competitive with benchmarks on linear evaluation. Moreover, we show that TaU can be learned on top of pretrained models, enabling uncertainty scores to be generated post-hoc with popular off-the-shelf models. In summary, TaU is a simple yet versatile method for generating uncertainties for contrastive learning. Open source code can be found at: https://github.com/mhw32/temperature-as-uncertainty-public.
Submitted 8 October, 2021;
originally announced October 2021.
-
Mining for Potent Inhibitors through Artificial Intelligence and Physics: A Unified Methodology for Ligand Based and Structure Based Drug Design
Authors:
Jie Li,
Oufan Zhang,
Yingze Wang,
Kunyang Sun,
Xingyi Guan,
Dorian Bagni,
Mojtaba Haghighatlari,
Fiona L. Kearns,
Conor Parks,
Rommie E. Amaro,
Teresa Head-Gordon
Abstract:
Assessing the viability of a new drug molecule is a time- and resource-intensive task that makes computer-aided assessments a vital approach to rapid drug discovery. Here we develop a machine learning algorithm, iMiner, that generates novel inhibitor molecules for target proteins by combining deep reinforcement learning with real-time 3D molecular docking using AutoDock Vina, thereby simultaneously creating chemical novelty while constraining molecules for shape and molecular compatibility with target active sites. Moreover, through the use of various types of reward functions, we can generate new molecules that are chemically similar to a target ligand, which can be grown from known protein-bound fragments, as well as molecules that enforce interactions with target residues in the protein active site. The iMiner algorithm is embedded in a composite workflow that filters out pan-assay interference compounds, Lipinski rule violations, and poor synthetic accessibility, with options for cross-validation against other docking scoring functions and automation of a molecular dynamics simulation to measure pose stability. Because our approach only relies on the structure of the target protein, iMiner can be easily adapted for future development of other inhibitors or small molecule therapeutics of any target protein.
Submitted 10 January, 2024; v1 submitted 4 October, 2021;
originally announced October 2021.
-
NewtonNet: A Newtonian message passing network for deep learning of interatomic potentials and forces
Authors:
Mojtaba Haghighatlari,
Jie Li,
Xingyi Guan,
Oufan Zhang,
Akshaya Das,
Christopher J. Stein,
Farnaz Heidar-Zadeh,
Meili Liu,
Martin Head-Gordon,
Luke Bertels,
Hongxia Hao,
Itai Leven,
Teresa Head-Gordon
Abstract:
We report a new deep learning message passing network that takes inspiration from Newton's equations of motion to learn interatomic potentials and forces. With the advantage of directional information from trainable latent force vectors, and physics-infused operators that are inspired by Newtonian physics, the entire model remains rotationally equivariant, and many-body interactions are inferred by more interpretable physical features. We test NewtonNet on the prediction of several reactive and non-reactive high quality ab initio data sets including single small molecule dynamics, a large set of chemically diverse molecules, and methane and hydrogen combustion reactions, achieving state-of-the-art test performance on energies and forces with far greater data and computational efficiency than other deep learning models.
Submitted 5 August, 2021;
originally announced August 2021.
-
Single-molecule orientation localization microscopy II: a performance comparison
Authors:
Oumeng Zhang,
Matthew D. Lew
Abstract:
Various techniques have been developed to measure the 2D and 3D positions and 2D and 3D orientations of fluorescent molecules with improved precision over standard epifluorescence microscopes. Due to the challenging signal-to-background ratio in typical single-molecule experiments, it is essential to choose an imaging system optimized for the specific target sample. In this work, we compare the performance of multiple state-of-the-art and commonly used methods for orientation localization microscopy against the fundamental limits of measurement precision. Our analysis reveals optimal imaging methods for various experimental conditions and sample geometries. Interestingly, simple modifications to the standard fluorescence microscope exhibit superior performance in many imaging scenarios.
Submitted 31 January, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Single-molecule orientation localization microscopy I: fundamental limits
Authors:
Oumeng Zhang,
Matthew D. Lew
Abstract:
Precisely measuring the three-dimensional position and orientation of individual fluorophores is challenging due to the substantial photon shot noise in single-molecule experiments. Facing this limited photon budget, numerous techniques have been developed to encode 2D and 3D position and 2D and 3D orientation information into fluorescence images. In this work, we adapt classical and quantum estimation theory and propose a mathematical framework to derive the best possible precision for measuring the position and orientation of dipole-like emitters for any fixed imaging system. We find that it is impossible to design an instrument that achieves the maximum sensitivity limit for measuring all possible rotational motions. Further, our vectorial dipole imaging model shows that the best quantum-limited localization precision is ~4-8% worse than that suggested by a scalar monopole model. Overall, we conclude that no single instrument can be optimized for maximum precision across all possible 2D and 3D localization and orientation measurement tasks.
Submitted 31 January, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Quantum limits for precisely estimating the orientation and wobble of dipole emitters
Authors:
Oumeng Zhang,
Matthew D. Lew
Abstract:
Precisely measuring molecular orientation is key to understanding how molecules organize and interact in soft matter, but the maximum theoretical limit of measurement precision has yet to be quantified. We use quantum estimation theory and Fisher information (QFI) to derive a fundamental bound on the precision of estimating the orientations of rotationally fixed molecules. While direct imaging of the microscope pupil achieves the quantum bound, it is not compatible with widefield imaging, so we propose an interferometric imaging system that also achieves QFI-limited measurement precision. Extending our analysis to rotationally diffusing molecules, we derive conditions that enable a subset of second-order dipole orientation moments to be measured with quantum-limited precision. Interestingly, we find that no existing techniques can measure all second moments simultaneously with QFI-limited precision; there exists a fundamental trade-off between precisely measuring the mean orientation of a molecule versus its wobble. This theoretical analysis provides crucial insight for optimizing the design of orientation-sensitive imaging systems.
Submitted 24 June, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks
Authors:
Jeffrey O Zhang,
Alexander Sax,
Amir Zamir,
Leonidas Guibas,
Jitendra Malik
Abstract:
When training a neural network for a desired task, one may prefer to adapt a pre-trained network rather than starting from randomly initialized weights. Adaptation can be useful in cases when training data is scarce, when a single learner needs to perform multiple tasks, or when one wishes to encode priors in the network. The most commonly employed approaches for network adaptation are fine-tuning and using the pre-trained network as a fixed feature extractor, among others.
In this paper, we propose a straightforward alternative: side-tuning. Side-tuning adapts a pre-trained network by training a lightweight "side" network that is fused with the (unchanged) pre-trained network via summation. This simple method works as well as or better than existing solutions and it resolves some of the basic issues with fine-tuning, fixed features, and other common approaches. In particular, side-tuning is less prone to overfitting, is asymptotically consistent, and does not suffer from catastrophic forgetting in incremental learning. We demonstrate the performance of side-tuning under a diverse set of scenarios, including incremental learning (iCIFAR, iTaskonomy), reinforcement learning, imitation learning (visual navigation in Habitat), NLP question-answering (SQuAD v2), and single-task transfer learning (Taskonomy), with consistently promising results.
Submitted 30 July, 2020; v1 submitted 31 December, 2019;
originally announced December 2019.
-
Learning to Navigate Using Mid-Level Visual Priors
Authors:
Alexander Sax,
Jeffrey O. Zhang,
Bradley Emi,
Amir Zamir,
Silvio Savarese,
Leonidas Guibas,
Jitendra Malik
Abstract:
How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Fig. 1). This skill set ("mid-level vision") provides the policy with a more processed state of the world compared to raw images.
Our large-scale study demonstrates that using mid-level vision results in policies that learn faster, generalize better, and achieve higher final performance, when compared to learning from scratch and/or using state-of-the-art visual and non-visual representation learning methods. We show that conventional computer vision objectives are particularly effective in this regard and can be conveniently integrated into reinforcement learning frameworks. Finally, we found that no single visual representation was universally useful for all downstream tasks, hence we computationally derive a task-agnostic set of representations optimized to support arbitrary downstream tasks.
Submitted 23 December, 2019;
originally announced December 2019.
-
Optimal Transport Based Generative Autoencoders
Authors:
Oliver Zhang,
Ruei-Sung Lin,
Yuchuan Gou
Abstract:
The field of deep generative modeling is dominated by generative adversarial networks (GANs). However, the training of GANs often lacks stability, fails to converge, and suffers from mode collapse. It takes an assortment of tricks to solve these problems, which may be difficult to understand for those seeking to apply generative modeling. Instead, we propose two novel generative autoencoders, AE-OTtrans and AE-OTgen, which rely on optimal transport instead of adversarial training. AE-OTtrans and AE-OTgen, unlike VAE and WAE, preserve the manifold of the data; they do not force the latent distribution to match a normal distribution, resulting in higher-quality images. AE-OTtrans and AE-OTgen also produce images of higher diversity compared to their predecessor, AE-OT. We show that AE-OTtrans and AE-OTgen surpass GANs on the MNIST and FashionMNIST datasets. Furthermore, we show that AE-OTtrans and AE-OTgen achieve state-of-the-art results on the MNIST, FashionMNIST, and CelebA image sets compared to other non-adversarial generative models.
Submitted 16 October, 2019;
originally announced October 2019.
-
Dense Super-Resolution Imaging of Molecular Orientation via Joint Sparse Basis Deconvolution and Spatial Pooling
Authors:
Hesam Mazidi,
Eshan S. King,
Oumeng Zhang,
Arye Nehorai,
Matthew D. Lew
Abstract:
In single-molecule super-resolution microscopy, engineered point-spread functions (PSFs) are designed to efficiently encode new molecular properties, such as 3D orientation, into complex spatial features captured by a camera. To fully benefit from their optimality, algorithms must estimate multi-dimensional parameters such as molecular position and orientation in the presence of PSF overlap and model-experiment mismatches. Here, we present a novel joint sparse deconvolution algorithm based on the decomposition of fluorescence images into six basis images that characterize molecular orientation. The proposed algorithm exploits a group-sparsity structure across these basis images and applies a pooling strategy on corresponding spatial features for robust simultaneous estimates of the number, brightness, 2D position, and 3D orientation of fluorescent molecules. We demonstrate this method by imaging DNA transiently labeled with the intercalating dye YOYO-1. Imaging the position and orientation of each molecule reveals orientational order and disorder within DNA with nanoscale spatial precision.
Submitted 12 January, 2019;
originally announced January 2019.
-
Fundamental Limits on Measuring the Rotational Constraint of Single Molecules using Fluorescence Microscopy
Authors:
Oumeng Zhang,
Matthew D. Lew
Abstract:
Optical fluorescence imaging is capable of measuring both the spatial and rotational dynamics of single molecules. However, unavoidable measurement noise will result in inaccurate estimates of rotational dynamics, causing a molecule to appear to be more rotationally constrained than it actually is. We report a mathematical framework to compute the fundamental limit of accuracy in measuring the rotational mobility of dipole-like emitters. By applying our framework to both in-plane and three-dimensional methods, we provide a means to choose the optimal orientation-measurement technique based on experimental conditions.
Submitted 19 April, 2019; v1 submitted 21 November, 2018;
originally announced November 2018.
-
Modular Architecture for StarCraft II with Deep Reinforcement Learning
Authors:
Dennis Lee,
Haoran Tang,
Jeffrey O Zhang,
Huazhe Xu,
Trevor Darrell,
Pieter Abbeel
Abstract:
We present a novel modular architecture for StarCraft II AI. The architecture splits responsibilities between multiple modules that each control one aspect of the game, such as build-order selection or tactics. A centralized scheduler reviews macros suggested by all modules and decides their order of execution. An updater keeps track of environment changes and instantiates macros into series of executable actions. Modules in this framework can be optimized independently or jointly via human design, planning, or reinforcement learning. We apply deep reinforcement learning techniques to training two out of six modules of a modular agent with self-play, achieving 94% or 87% win rates against the "Harder" (level 5) built-in Blizzard bot in Zerg vs. Zerg matches, with or without fog-of-war.
Submitted 8 November, 2018;
originally announced November 2018.
-
Dyloc: Dynamic and Collaborative User-controlled AOA based Localizing System with your laptops
Authors:
Ouyang Zhang,
Kannan Srinivasan
Abstract:
Accurate localization systems based on commodity WiFi devices are not yet broadly available. In the literature, solutions either rely on network infrastructure such as WiFi routers, which have at least three antennas, or sacrifice accuracy by using coarse-grained information such as RSSI. In this work, we design a new localization system that is accurate, being based on AOA estimation, and instantly deployable on users' devices.
Dyloc is designed to be dynamically constructed with users' devices as network nodes, without any network infrastructure. On laptops, our system achieves localization accuracy comparable to state-of-the-art work despite having fewer and more widely separated antennas. We design multi-stage signal processing to resolve the ambiguity that arises in this scenario. To enable dynamic and collaborative construction, our system accurately performs self-localization and eliminates the need for infrastructure anchors, thanks to a dedicated two-layer algorithm design.
Submitted 22 March, 2018;
originally announced March 2018.
-
Balanced complexes and effective divisors on $\overline{M}_{0,n}$
Authors:
José Luis González,
Elijah Gunther,
Olivia Zhang
Abstract:
Doran, Jensen and Giansiracusa showed a bijection between homogeneous elements in the Cox ring of $\overline{M}_{0,n}$ not divisible by any exceptional divisor section, and weighted pure-dimensional simplicial complexes satisfying a zero-tension condition. Motivated by the study of the monoid of effective divisors, the pseudoeffective cone and the Cox ring of $\overline{M}_{0,n}$, we point out a simplification of the zero-tension condition and study the space of balanced complexes. We give examples of irreducible elements in the monoid of effective divisors of $\overline{M}_{0,n}$ for large $n$. In the case of $\overline{M}_{0,7}$, we classify all such irreducible elements arising from nonsingular complexes and give an example of how irreducibility can be shown in the singular case.
Submitted 28 September, 2017;
originally announced September 2017.
-
Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification
Authors:
Yoon Kim,
Owen Zhang
Abstract:
We provide a simple but novel supervised weighting scheme for adjusting term frequency in tf-idf for sentiment analysis and text classification. We compare our method to baseline weighting schemes and find that it outperforms them on multiple benchmarks. The method is robust and works well on both snippets and longer documents.
Submitted 28 June, 2014; v1 submitted 14 May, 2014;
originally announced May 2014.
-
Positivity constraints on LECs of $χ$PT lagrangian at $\mathcal{O}(p^6)$ level
Authors:
Zhi-Hui Guo,
Ou Zhang,
H. Q. Zheng
Abstract:
Positivity constraints on the LECs of $\mathcal{O}(p^6)$ $χ$PT lagrangian are discussed. We demonstrate that the constraints are automatically satisfied inside the Mandelstam triangle for $ππ$ scatterings, when $N_C$ is large. Numerical tests are made in the $N_C=3$ case, and it is found that these constraints are also well respected.
Submitted 23 November, 2009;
originally announced November 2009.
-
Investigations on the Property of $f_0(600)$ and $f_0(980)$ Resonances in $γγ\to ππ$ Process
Authors:
X. G. Wang,
O. Zhang,
L. C. Jin,
H. Q. Zheng,
Z. Y. Zhou
Abstract:
Using dispersion relation technique and experimental data, a coupled channel analysis on $γγ\toππ$ process is made. Di-photon coupling of $f_0(600)$ and $f_0(980)$ resonances are extracted and their dynamical properties are discussed. Especially we study the physical meaning of the coupling constant $g^2_{σππ}$, which maintains a negative real part as determined through dispersive analyses.
Submitted 2 October, 2009;
originally announced October 2009.
-
A Dispersive Analysis on the $f_0(600)$ and $f_0(980)$ Resonances in $γγ\toπ^+π^-, π^0π^0$ Processes
Authors:
Yu Mao,
Xuan-Gong Wang,
Ou Zhang,
H. Q. Zheng,
Z. Y. Zhou
Abstract:
We estimate the di-photon coupling of $f_0(600)$, $f_0(980)$ and $f_2(1270)$ resonances in a coupled channel dispersive approach. The $f_0(600)$ di-photon coupling is also reinvestigated using a single channel $T$ matrix for $ππ$ scattering with better analyticity property, and it is found to be significantly smaller than that of a $\bar qq$ state. Especially we also estimate the di-photon coupling of the third sheet pole located near $\bar KK$ threshold, denoted as $f_0^{III}(980)$.
It is argued that this third sheet pole may be originated from a coupled channel Breit-Wigner description of the $f_0(980)$ resonance.
Submitted 10 June, 2009; v1 submitted 8 April, 2009;
originally announced April 2009.
-
Ambiversion of X(3872)
Authors:
Ou Zhang,
C. Meng,
H. Q. Zheng
Abstract:
An analysis including most recent Belle data on X(3872) is performed, using coupled channel Flatté formula. A third sheet pole close to but \textit{below} $D^0D^{*0}$ threshold is found, besides the bound state/virtual state pole discussed in previous literature. The co-existence of two poles near the $D^0D^{*0}$ threshold indicates that the X(3872) may be of ordinary $c\bar c$ $2 ^3P_1$ state origin, distorted by strong coupled channel effects. The latter manifests itself as a molecular bound state (or a virtual state).
Submitted 6 September, 2009; v1 submitted 12 January, 2009;
originally announced January 2009.