BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning (2024)

Artem Zholus (Insilico Medicine Canada Inc.; Mila – Quebec AI Institute; Polytechnique Montréal), Maksim Kuznetsov (Insilico Medicine Canada Inc.), Roman Schutski (Insilico Medicine AI Limited), Rim Shayakhmetov (Insilico Medicine Canada Inc.), Daniil Polykovskiy (Insilico Medicine Canada Inc.), Sarath Chandar (Mila – Quebec AI Institute; Polytechnique Montréal; CIFAR AI Chair), Alex Zhavoronkov (Insilico Medicine AI Limited)

Abstract

Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and finetune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve simultaneously as a 3D molecular generative model, a conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such a conceptually simple approach, combined with pretraining and scaling, can perform on par with or better than the current best specialized diffusion models, language models, and graph neural networks, while being two orders of magnitude cheaper to sample from.

1 Introduction

The landscape of drug discovery presents immense challenges and risks, demanding substantial investments of time and resources to design, test, and deliver new medicines to the market. Within this context, Computer-Aided Drug Design (CADD) (Yu & MacKerell, 2017) stands as a pivotal methodology, harnessing software screenings and physical simulations to facilitate a more efficient exploration of the vast space of drug-like molecules, estimated to be around $10^{60}$ in size (Polishchuk et al., 2013; Gómez-Bombarelli et al., 2018). Deep learning advancements have revolutionized this exploration by leveraging neural generative models trained on extensive compound datasets. Notably, the textual representation of molecular structures using SMILES (Weininger, 1988) and SELFIES (Krenn et al., 2020) has enabled the use of language models for the generation of novel, drug-like molecular compounds (Segler et al., 2018; Bagal et al., 2022).

Correspondence to: artem.zholus@mila.quebec, r.schutski@insilicomedicine.com

Recent research has demonstrated the capability of deep generative models to generate novel molecular compounds directly in 3D, with the flexibility to incorporate protein pocket and ligand subfragment conditions. Among these, diffusion models such as EDM (Hoogeboom etal., 2022) and DiffDock (Corso etal., 2023) initiate the generation process with an arbitrary spatial distribution of atoms and progressively refine their positions to yield physically viable molecular structures. Meanwhile, autoregressive models like Pocket2Mol (Peng etal., 2022) sequentially predict the type and location of each successive atom, building upon the existing molecular framework. Additionally, work by Flam-Shepherd & Aspuru-Guzik (2023) has highlighted the proficiency of language models in handling spatial representations of molecular and protein structures through formats like XYZ, CIF, and PDB. However, it’s noteworthy that most spatial molecular generators focus exclusively on atom types and locations. They depend on supplementary tools, such as OpenBabel (O’Boyle etal., 2011), for the critical task of bond reconstruction. This reliance can introduce vulnerabilities, as the precision required for atom placement means that minor positional adjustments can significantly alter reconstructed molecular bonds or even make the molecular graph disconnected.

In this work, we introduce a novel framework that applies language modeling to 3D molecular data represented as textual tokens. This entirely data-driven approach, devoid of any inductive biases at both the model and representation levels, capitalizes on the established GPT paradigm and integrates modern techniques that enhance the scalability of model training and inference. By adopting the language model pretraining paradigm, our framework yields a powerful causal language model adept at navigating the complex space of 3D molecules. This proficiency is demonstrated through successful applications to downstream tasks, including learning the distribution of 3D molecules, generating 3D conformations, and generating molecules with targeted binding affinity to specific proteins.

Our main contributions are the following:

  • We introduce BindGPT, a Language Model for handling spatial molecular structures in text format. It uses structural SMILES and spatial XYZ formats to describe molecular graphs and atom locations, eliminating the dependency on external software for graph reconstruction.

  • We propose a scalable pretraining-finetuning method for drug discovery in 3D that covers several 3D molecular generation tasks in a single paradigm.

  • We show how BindGPT can create accurate and realistic 3D molecular structures both zero-shot and after finetuning, with the option to include molecular graphs or protein pocket descriptions as prompts. The method offers generation quality comparable to leading approaches with a speedup of up to 100x.

  • Finally, we demonstrate the effectiveness of the reinforcement learning framework for finetuning BindGPT with external feedback from docking software. We show that, as a result of RL finetuning, the model can find structures with high binding scores for any given protein.

2 Background

Molecule Generation. Small drug-like molecules can be represented as 2D or 3D graphs with node and edge attributes. However, one of the most popular molecular representations in the machine learning community is SMILES (Weininger, 1988), which can be seen as a compressed textual encoding of a depth-first search applied to the molecular graph. Its simplicity and expressivity make it work very well with language models: even a simple LSTM (Hochreiter & Schmidhuber, 1997) can outperform graph neural networks on the molecule generation task (Flam-Shepherd et al., 2022). In addition, SELFIES (Krenn et al., 2022) is a robust modification of SMILES in which every token sequence corresponds to a valid molecule and vice versa.
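As a minimal illustration (the RDKit usage below is ours, not part of the paper's pipeline), the same molecule admits many SMILES strings, each corresponding to a different depth-first traversal of its graph:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
canonical = Chem.MolToSmiles(mol)                    # canonical DFS linearization
randomized = Chem.MolToSmiles(mol, doRandom=True)    # different DFS root/ordering, same molecule
print(canonical)
print(randomized)
```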

The biological function of small molecules arises through their binding to specific protein pockets. The spatial structure of the protein pocket is essential domain knowledge for increasing the efficiency of molecular generation in drug design tasks. With the growth of molecular structure dataset sizes (Francoeur et al., 2020; Hu et al., 2005), a plethora of pocket-conditioned generators has emerged (Peng et al., 2022; Luo et al., 2021; Lin et al., 2022; Corso et al., 2023). The challenge of pocket-conditioned molecular generation arises from the relatively small size of existing 3D binding pose datasets, which has motivated heavy use of specialized architectures, such as SE(3)-equivariant neural networks (Hoogeboom et al., 2022).

Molecular Generative Models in 3D. Apart from language models, other types of generative models approach molecule generation, including 3D-aware ones. The first group of works uses diffusion models, which employ the denoising diffusion process (Ho et al., 2020; Song et al., 2021) to learn to recover data from noise. The second group relies on graph neural networks (GNNs) to autoregressively build 2D or 3D molecular graphs. These approaches can be combined, since GNNs can serve as efficient backbones for the diffusion process once they are node-equivariant (Niu et al., 2020) (to generate 2D graphs) or SE(3)-equivariant (Peng et al., 2023b) (to generate 3D graphs). The SBDD model (Luo et al., 2021) uses autoregressive graph generation for pocket-conditioned molecule generation. TargetDiff (Guan et al., 2023) generalizes this model by using a diffusion GNN for the same task. Pocket2Mol (Peng et al., 2022) uses an informed autoregressive sampling mechanism for efficient pocket-conditioned molecule generation. Another group of works uses the aforementioned methods for unconditional molecule generation. EDM (Hoogeboom et al., 2022) proposes an E(3)-equivariant diffusion model for molecule generation. MolDiff (Peng et al., 2023b) is a diffusion model that addresses the inconsistency problem between generated atoms and the bonds connecting them.

Language Models for Drug Discovery. Language models show outstanding results in the drug discovery domain. Molecular structures can be easily represented in textual formats such as SMILES (Segler et al., 2018) or SELFIES (Flam-Shepherd et al., 2022), enabling effective training of well-known language model architectures on large datasets of chemical entities. Recent studies reveal the potential of language models for various challenges in drug discovery. For instance, LigGPT (Bagal et al., 2022) leverages the GPT (Radford et al., 2018) architecture to generate molecular structures conditioned on molecular descriptors. MoLFormer (Katharopoulos et al., 2020) performs large-scale pretraining on a billion-scale database of chemical entities and is further finetuned to predict molecular properties. BARTSmiles (Chilingaryan et al., 2022) is built atop the BART (Lewis et al., 2019) architecture, training a meaningful chemical representation and refining it for tasks such as chemical property prediction, chemical reaction prediction, and retrosynthesis.

However, the challenge of 3D molecule generation has received limited attention in the language modeling literature. Three notable studies in this area are the XYZ-transformer (Flam-Shepherd & Aspuru-Guzik, 2023), Uni-Mol (Zhou et al., 2023), and Lingo3DMol (Feng et al., 2024). The XYZ-transformer leverages GPT to generate atom-wise descriptions of molecular and protein structures. Uni-Mol modifies BERT for large-scale pretraining on a large dataset of 3D structures; in particular, it formulates molecular tasks as a coordinates-to-coordinates mapping given the molecular graph. To the best of our knowledge, ours is the first approach that applies the modern decoder-only language modeling paradigm to the 3D drug discovery problem with several downstream applications.

3 Method

The key idea of our method is to use an autoregressive token generation model, in the style of GPT-based models, to solve several 3D small-molecule generation tasks in one simple yet flexible paradigm. The main principle of our approach is to formulate each 3D molecular design task as prompted text generation. To achieve that, we lay out the tokens of the condition before the tokens of the object to generate. For instance, the prompt can be the protein pocket for the pocket-conditioned generation task or the 2D molecular structure for the conformation generation task.

3.1 Architecture and text format

In our work, we follow the decoder-only paradigm in language models and use the GPT-NeoX architecture (Black et al., 2022), which utilizes rotary position embeddings (Su et al., 2021). (Although other architectures exist today, such as LLaMa 2 (Touvron et al., 2023), our main focus was building a general framework for working with 3D molecules that can work with any language model architecture.) Rotary embeddings allow for length generalization, which is required since sequence lengths may vary significantly between the pretraining and finetuning stages.

Similarly to Flam-Shepherd & Aspuru-Guzik (2023), we use the XYZ representation as the base format to describe spatial atom placement. The idea of the XYZ format is to represent an atom type and its 3D coordinates on every line of text. The main drawback of this format is the lack of charge and connectivity information: one has to use external software such as RDKit (Landrum et al., 2024) or OpenBabel (O'Boyle et al., 2011) to reconstruct the molecular graph. This introduces instability, since even small noise in atom positions can drastically change the reconstructed graph or even break it apart (Peng et al., 2023a). To alleviate that, we propose to couple the XYZ format with the SMILES format: the latter efficiently represents the molecular structure, while the former describes atom positions. To align the two formats, we enforce the same atom ordering in both. We also remove the atom symbol from the XYZ representation, as it is already given in the SMILES. For proteins, there is no need to describe connectivity, so we simply write atom names grouped by amino acids.
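A hedged sketch of this coupling (our reading of the format, not the authors' code; the special tokens <LIGAND> and <XYZ> come from the paper, while the coordinate precision and RDKit-based conformer generation are our assumptions) writes the SMILES string followed by coordinates in the SMILES atom output order, so no element symbols need to be repeated:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def to_smiles_xyz(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, randomSeed=0)          # generate one 3D conformer
    mol = Chem.RemoveHs(mol)                          # hydrogen-free variant of the format
    smi = Chem.MolToSmiles(mol)                       # canonical SMILES; records atom output order
    order = [int(i) for i in mol.GetProp("_smilesAtomOutputOrder").strip("[]").split(",") if i]
    conf = mol.GetConformer()
    xyz = " ".join(
        f"{conf.GetAtomPosition(i).x:.2f} {conf.GetAtomPosition(i).y:.2f} {conf.GetAtomPosition(i).z:.2f}"
        for i in order
    )
    return f"<LIGAND> {smi} <XYZ> {xyz}"

print(to_smiles_xyz("CCO"))
```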

A schematic example of the two kinds of model input is shown in Figure 2, and a detailed example of concrete input sequences (including their tokenization) is shown in Figure 6 in Appendix A. In particular, the sequence starts with the <LIGAND> token followed by a SMILES string tokenized at the character level. Next comes the <XYZ> special token, marking the end of the SMILES and the beginning of the coordinate part of the string. The tokenization strategy uses six tokens per 3D position: one token for the integer part and one token for the fractional part of each number. When working with protein pockets, we use a similar strategy. Specifically, the sequence begins with the <POCKET> token followed by the sequence of atoms, where each atom is a separate token. Since pockets can be hundreds of atoms large, we follow AlphaFold's (Jumper et al., 2021) approach and retain only the 3D coordinates of the alpha-carbon atoms of the corresponding amino acids. An example of the final representation of pockets is shown in Figure 8.
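The following is a hedged sketch of the coordinate tokenization described above; the exact token inventory (sign handling, number of decimals) is our assumption, the six-tokens-per-position scheme is from the paper:

```python
def tokenize_position(x: float, y: float, z: float, decimals: int = 2) -> list[str]:
    """Split each of x, y, z into an integer-part token and a fractional-part token."""
    tokens = []
    for v in (x, y, z):
        sign = "-" if v < 0 else ""
        integer, frac = divmod(round(abs(v) * 10**decimals), 10**decimals)
        tokens.append(f"{sign}{integer}")           # integer-part token, keeps the sign
        tokens.append(f".{frac:0{decimals}d}")      # fractional-part token
    return tokens

print(tokenize_position(-1.53, 0.07, 12.40))
# ['-1', '.53', '0', '.07', '12', '.40']  -> six tokens per atom position
```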

3.2 Pretraining

In this work, we aim to leverage insights accumulated by the NLP community within the paradigm of large language models: pretraining-finetuning, prompting, scaling, finetuning with reinforcement learning, tool use, etc. (Kaplan et al., 2020; Hoffmann et al., 2022; Radford et al., 2019). Since our model covers only a specialized domain of molecular tasks, it does not require trillion-scale diverse datasets for good performance, as NLP tasks do. Thus, we use a large-scale but specialized dataset of 3D molecules and protein pockets. During pretraining, we use a model with 108M parameters consisting of 15 layers, 12 heads, and a hidden size of 768. We found this model size to be sufficient for the tasks we care about, i.e. generating molecules in 3D (see Appendix C for a justification of the size). Every sequence in a training batch is either a ligand token sequence or a pocket token sequence following the scheme described earlier. Since the dataset has far fewer pockets than ligands, for one epoch of training on ligands we do five epochs of training on proteins, so that around 8% of all tokens seen by the model are pocket tokens. To speed up and stabilize pretraining, we use large-batch training (Keskar et al., 2017) with 1.6M tokens per training step. We found this number of tokens per batch to be important for stable training on this task, even with smaller learning rates. A detailed description of the training implementation is provided in Appendix G. Despite the wide use of transformers in drug discovery, the majority of current works in this space do not use recent advancements in efficient language model pretraining: neither technical ones, such as Flash-Attention (Dao, 2023) or DeepSpeed (Rasley et al., 2020), nor algorithmic ones, such as learning rate scaling. Our work aims to fill this gap by demonstrating the effectiveness of pretraining for 3D drug discovery.
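A hedged configuration sketch matching the reported model size (15 layers, 12 heads, hidden size 768, roughly 108M parameters) can be written with the Hugging Face GPT-NeoX implementation; the vocabulary size and context length below are placeholders we chose, not values reported by the authors:

```python
from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

config = GPTNeoXConfig(
    vocab_size=512,                # assumption: small chemistry-specific vocabulary
    hidden_size=768,
    num_hidden_layers=15,
    num_attention_heads=12,
    intermediate_size=4 * 768,
    max_position_embeddings=2048,  # assumption
    rotary_pct=1.0,                # rotary position embeddings (Su et al., 2021)
)
model = GPTNeoXForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```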

3.3 Finetuning

3.3.1 Supervised finetuning

As a result of pretraining, BindGPT gains an understanding of a broad chemical space. This comprehensive understanding enables us to efficiently narrow it down through supervised finetuning on a specialized dataset. During the supervised finetuning phase, we continue model training on CrossDocked 2020 (Francoeur et al., 2020), a high-quality dataset containing aligned pocket-ligand pairs. Most prior methods subsample less than 1% of the best pocket-ligand pairs and thus do not benefit from its diversity and scale. To obtain a bigger version of CrossDocked, we extract all intermediate ligand poses (with respect to the docking process), including the lower-quality ones. Despite its fairly large size, CrossDocked was created by docking 14k unique molecules into 3k pockets (Francoeur et al., 2020). This is why we observed dramatic overfitting when training on the 1% version of CrossDocked and even on the full one. To alleviate that, we resort to two standard augmentation techniques used in drug discovery. First, we employ SMILES randomization (Bjerrum, 2017), which can heavily randomize one molecule by yielding 100-1000 different SMILES strings (all corresponding to that molecule). Second, we randomly rotate the 3D coordinates of the protein pocket and of the ligand (with the same rotation matrix). This way, our model learns structural and spatial properties of molecular binding beyond mere token sequences.
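A hedged sketch of these two augmentations (our implementation, assuming NumPy coordinate arrays; the re-centering on the ligand's centroid follows the description in Appendix B.2):

```python
import numpy as np
from rdkit import Chem
from scipy.spatial.transform import Rotation

def randomize_smiles(smiles: str) -> str:
    # A non-canonical SMILES obtained from a random DFS ordering of the same molecule.
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles), doRandom=True, canonical=False)

def rotate_pair(pocket_xyz: np.ndarray, ligand_xyz: np.ndarray):
    # One shared random rotation for pocket and ligand, coordinates re-centered on the ligand.
    R = Rotation.random().as_matrix()
    center = ligand_xyz.mean(axis=0)
    return (pocket_xyz - center) @ R.T, (ligand_xyz - center) @ R.T
```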

Since the pretrained BindGPT is trained on both ligands (starting from the <LIGAND> token) and pockets (starting from the <POCKET> token), the model learns the structure of both. In our finetuning setup, we represent each pocket-ligand pair as a string starting with the pocket string representation followed by the string representation of the ligand (see Section 3.1 for their description). Therefore, having learned them separately during pretraining, the finetuning exploits the independent knowledge of pockets and ligands to learn a conditional dependency between them. In addition, since our version of CrossDocked contains both high- and low-score conformations, we test another version of the context in which we condition on the pocket and the binding energy score taken from the CrossDocked dataset (originally computed with docking software (Trott & Olson, 2010; Eberhardt et al., 2021)). This lets us perform a variant of contrastive learning by learning the structure of both good and bad examples. At evaluation time, we can sample molecules conditioned on some desired value of the binding affinity. The input layout for both versions is shown in Figure 3.

3.3.2 Reinforcement Learning

Despite the ubiquitous use of reinforcement learning (RL) for language models in drug discovery (see Section 2 and Appendix G), we did not find it being used within the pretraining paradigm of modern LLMs (Hoffmann et al., 2022; Kaplan et al., 2020; Ouyang et al., 2022). Our main motivation for using RL after the pretraining/finetuning stages is to exploit the knowledge distilled into the model from massive amounts of less structured data. We believe this is the first work performing reinforcement learning on molecules that utilizes knowledge from pretraining and supervised finetuning. Although there are dozens of works doing RL with language models on molecules, none of them do so within the LLM paradigm, and none consider the target-conditioned RL problem. In our opinion, the latter is primarily because pocket-conditioned generation is not possible without large-scale pretraining, as we show in the experimental section.

We apply the REINFORCE algorithm (Williams, 1992) for further model finetuning. It allows using feedback (the reward) from an external oracle to train the model to generate even better structures than those it produces after the SFT stage. The resulting RL-finetuned model generalizes and produces high-affinity molecules even for new pockets. In our procedure, at each training step we generate 3D structures of ligands for a batch of random protein pockets. Then we compute the reward using external docking software that estimates the binding energy between the pocket and the generated ligand. The final step updates the language model with the batch of prompts (pockets), responses (ligands), and rewards (binding energies). We initially tested PPO (Schulman et al., 2017) and REINVENT (Olivecrona et al., 2017) but found REINFORCE to be more stable for our setting, which aligns with recent findings in the field of RL applied to language models in NLP (Ahmadian et al., 2024). It is also important to mention that we apply a KL penalty between the model's initial and current state to stabilize the procedure. Further details, such as hyperparameters, can be found in Appendix B.
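A high-level sketch of one RL step under these descriptions is given below; `dock_score` and `reinforce_loss` are placeholders (not APIs from the paper), the reward sign convention and `max_new_tokens` are our assumptions, and the surrogate loss itself is spelled out in Appendix B.3:

```python
import torch

def rl_step(model, tokenizer, pockets, optimizer):
    # Tokenize pocket prompts for a decoder-only model.
    prompts = tokenizer(pockets, return_tensors="pt", padding=True).to(model.device)
    # Sample ligand strings (SMILES + XYZ tokens) from the current policy.
    responses = model.generate(**prompts, do_sample=True, max_new_tokens=350)
    ligands = tokenizer.batch_decode(responses[:, prompts["input_ids"].shape[1]:])
    # `dock_score` stands in for the docking software (e.g. QVina); lower binding
    # energy is better, so its negative serves as the reward.
    rewards = torch.tensor([-dock_score(p, l) for p, l in zip(pockets, ligands)])
    # `reinforce_loss` stands in for the surrogate loss from Appendix B.3.
    loss = reinforce_loss(model, prompts, responses, rewards)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
```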

4 Results


In this section, we describe our experimental results. We start with a brief description of the data, followed by the three 3D molecular generative tasks: 3D generative modeling of molecules and conformation generation given the molecular graph (Section 4.1), and pocket-conditioned generation (Section 4.2).

For pretraining, we use the large 3D molecular dataset proposed by the authors of the Uni-Mol model (Zhou et al., 2023). The dataset contains 208M conformations for 12M molecules and 3.2M spatial structures of protein pockets. For finetuning on the pocket-conditioned generation task, we use the aforementioned CrossDocked dataset, which contains aligned pocket-molecule pairs. Our filtration of the dataset yields around 27M pocket-ligand pairs covering a cross product of 14k molecules with 3k pockets (not all pairs are present, and some have more than one pose, each with a different score). We also hold out a set of 100 pockets from the training data for evaluating model performance. For the tasks of 3D molecule and 3D conformer generation (Section 4.1), to make comparisons with baselines fairer, we also finetune the model on the GEOM-DRUGS dataset (Axelrod & Gómez-Bombarelli, 2022), which contains drug-like molecules with high-quality 3D conformations. This dataset provides 27M conformations for 300k molecules and serves as a standard benchmark for machine learning-based 3D molecular generators. Finally, we use the Platinum dataset (Friedrich et al., 2017) as a hold-out evaluation set to test our model and baselines on zero-shot conformer generation; it contains best-in-class, experimentally validated conformations for testing conformer generation software.

4.1 Generative Modeling of 3D molecules

| Method | Valid (↑) | SA (↑) | QED (↑) | Lipinski (↑) | RMSD (↓) | Time, s (↓) |
|---|---|---|---|---|---|---|
| XYZ-TF | 12.87% | 0.21 | 0.30 | 4.79 | - | 165 |
| BindGPT (Ours) | 98.58% | 0.77 | 0.59 | 4.86 | 0.89 | 13 |
| XYZ-TF (H) | 17.86% | 0.54 | 0.37 | 4.82 | - | 394 |
| BindGPT (H) (Ours) | 77.33% | 0.78 | 0.61 | 4.91 | 3.44 | 156 |

| Group | Metric | EDM | MolDiff | BindGPT (Ours) |
|---|---|---|---|---|
| Druglikeness | QED (↑) | 0.558 | 0.668 | 0.616 |
| | SA (↑) | 0.568 | 0.874 | 0.826 |
| | Lipinski (↑) | 4.923 | 4.986 | 4.896 |
| 3D structures | JS. bond lengths (↓) | 0.246 | 0.365 | 0.029 |
| | JS. bond angles (↓) | 0.282 | 0.155 | 0.075 |
| | JS. dihedral angles (↓) | 0.328 | 0.162 | 0.098 |
| Bonds | JS. num. bonds per atom (↓) | 0.139 | 0.115 | 0.160 |
| | JS. freq. bond types (↓) | 0.378 | 0.163 | 0.045 |
| | JS. freq. bond pairs (↓) | 0.396 | 0.136 | 0.043 |
| | JS. freq. bond triplets (↓) | 0.449 | 0.125 | 0.042 |
| Rings | JS. num. rings (↓) | 0.106 | 0.062 | 0.094 |
| | JS. num. n-sized rings (↓) | 0.107 | 0.092 | 0.023 |
| | Num. intersecting rings (↑) | 3.667 | 8.000 | 9.000 |
| | Time for 1000 valid molecules, s (↓) | 1.4×10^6 | 7500 | 200 |

Metrics. We report the validity (↑) of generated molecules and druglikeness metrics, SA (↑), QED (↑), and Lipinski (↑), which are agnostic to 3D but measure how likely the molecule is to be a drug. We also adopt a range of distribution metrics used for the MolDiff method (Peng et al., 2023a). These metrics measure the discrepancy between the true and modelled molecular distributions by computing Jensen–Shannon divergences over distributions of molecular properties and features. We compute RMSD (root-mean-square deviation) (↓), which measures the quality of 3D structures by aligning a generated conformer with one regenerated by RDKit for the same graph and computing the atomwise distance. Finally, we measure the time needed to generate 1K valid 3D molecules on one GPU. Note that this choice of metrics is standard for this task (see Peng et al. (2023a) for a more detailed description). For the task of 3D conformation generation given a molecule, we compute the RMSD-coverage (↑) metric, a standard performance metric for 3D conformer generation models (see, e.g., Jing et al. (2022)). It is the cumulative distribution function of the RMSD between generated and reference conformers, viewed as a function of the threshold x: P(RMSD < x). An ideal model should reach as high a metric value as possible at as low a threshold as possible.
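A minimal sketch of the coverage metric P(RMSD < x), assuming per-conformer RMSD values have already been computed:

```python
import numpy as np

def coverage(rmsds: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Fraction of conformers whose RMSD to the reference falls below each threshold."""
    return (rmsds[None, :] < thresholds[:, None]).mean(axis=1)

rmsds = np.array([0.3, 0.7, 1.1, 2.5])                 # toy RMSD values (Angstrom)
print(coverage(rmsds, np.array([0.5, 1.0, 2.0])))      # -> [0.25 0.5  0.75]
```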

Baselines. For the molecule generation task, we consider the current best 3D generative models. EDM (Hoogeboom et al., 2022) and MolDiff (Peng et al., 2023a) are task-specialized diffusion models for 3D molecule generation. The XYZ-Transformer (Flam-Shepherd & Aspuru-Guzik, 2023) is another 3D molecular transformer, proposed for small-scale data. Note that XYZ-TF is the only model besides ours capable of large-scale pretraining, so we pretrain only XYZ-TF and BindGPT on the Uni-Mol data. We also run the GEOM-DRUGS evaluation, where we report MolDiff and EDM trained on the full dataset and BindGPT finetuned on the same version of it. For conformer generation, we compare BindGPT with the current state-of-the-art methods, Torsional Diffusion (Jing et al., 2022) and the Uni-Mol model (Zhou et al., 2023). The former is a specialized SE(3)-equivariant diffusion model capable only of conformation generation. The latter is a modified BERT (Devlin et al., 2019); as a coordinate-level encoder LM, Uni-Mol needs input coordinates to generate a conformation, which is why it uses RDKit as a tool for initializing coordinates.

Results. The molecular generative modeling results are shown in Tables 2 and 1. First, the pretrained BindGPT model consistently outperforms the XYZ-TF baseline both without and with explicit hydrogens. The latter is a much more challenging task that almost no baseline method handles (except EDM, which is not scalable): reconstructing hydrogens is usually left to a post-processing step, since modeling them explicitly makes the molecule several times larger. BindGPT is the first model capable of modeling hydrogens explicitly at such a large scale. Also, XYZ-TF has a very low validity rate due to its need for graph reconstruction. Next, among the methods trained on the GEOM-DRUGS dataset, BindGPT (finetuned on this data) shows state-of-the-art scores on nearly all distributional evaluation metrics. Although BindGPT does not outperform MolDiff in druglikeness, this can be explained by the smaller vocabulary of MolDiff (Peng et al., 2023a), which contains only frequent atom types. For the conformation generation task, the current best baseline is Torsional Diffusion (TD) (Jing et al., 2022). We use the Platinum dataset to compare TD trained on GEOM-DRUGS with Uni-Mol-BERT and BindGPT, both of which are pretrained and finetuned on the same data. Figure 4 shows the results for zero-shot evaluation on Platinum. Surprisingly, Uni-Mol fails to generalize to this new dataset (even when assisted by RDKit), which we attribute to the dataset's structural diversity. BindGPT, in contrast, matches the performance of TD when assisted by the RDKit tool and shows only a small gap when not. All the above results demonstrate the generalizability of our model: none of the baselines can solve this wide range of tasks at this level of quality.

4.2 Pocket-conditioned Molecule Generation


Metrics. The main metrics for this task measure ligand-pocket affinity and druglikeness of the ligand. The former is represented by the binding energy (↓) computed with the QVina (Alhossary et al., 2015) docking software, while the latter comprises the aforementioned druglikeness metrics (SA (↑), QED (↑), and Lipinski (↑)). For each baseline, we report the time required to generate 100 valid molecules for one pocket.

Baselines. Apart from the BindGPT model, we include baselines such as a 3D diffusion model (TargetDiff (Guan et al., 2023)) and an autoregressive graph neural network (Pocket2Mol (Peng et al., 2022)). Note that none of the baselines performs large-scale pretraining; instead, they resort to heavy inductive biases to learn efficiently from small-scale data.

Results. The performance of our approach is summarized in Table 3. We report three versions of BindGPT. First, BindGPT-FT is the model finetuned on the complete CrossDocked data (with the data layout described in Figure 3, top), i.e. on both good and bad binding pairs. This model serves as the initialization for the reinforcement learning model. Second, BindGPT-RFT is the model finetuned on CrossDocked with the reward in the context. To get higher-affinity molecules from this model, we condition it on random binding energy values within [-12, -10], which are the best scores observed by the model (in around 0.1% of examples). Finally, the BindGPT-RL model is trained with RL (see Section 3.3.2 and Appendix B.3) from the BindGPT-FT initialization. Our main conclusion is that the RL-finetuned model learns to search the space of binding molecules much more efficiently and significantly outperforms all previous best baselines in terms of binding energy.

| Method | Vina score (↓) | SA (↑) | QED (↑) | Lipinski (↑) |
|---|---|---|---|---|
| Pocket2Mol | -7.15 ± 4.89 | 0.75 ± 0.12 | 0.57 ± 0.15 | 4.88 ± 0.37 |
| TargetDiff | -7.80 ± 3.61 | 0.58 ± 0.12 | 0.48 ± 0.19 | 4.51 ± 0.85 |
| BindGPT-FT (Ours) | -5.44 ± 2.09 | 0.78 ± 0.10 | 0.50 ± 0.17 | 4.72 ± 0.70 |
| BindGPT-RFT (Ours) | -7.24 ± 1.68 | 0.74 ± 0.11 | 0.48 ± 0.22 | 4.32 ± 1.25 |
| BindGPT-RL (Ours) | -8.60 ± 1.90 | 0.84 ± 0.05 | 0.43 ± 0.17 | 4.81 ± 0.52 |

5 Discussion and Conclusion

In this work, we presented BindGPT, a scalable framework for training capable language models that generate 3D molecules as text. Through a series of studies on a range of 3D molecular generative tasks, we demonstrated the generality of our approach: it solves each task while matching or surpassing the baselines. Notably, unlike all the baselines, which rely on strong inductive biases, our method makes no such assumptions about the generative domain and acts as a general, data-driven approach. Of particular interest is pocket-based molecule generation, where our model outperforms all baselines by a large margin. We show that the large-scale pretraining paradigm can be efficiently transferred from NLP to 3D drug discovery.

Acknowledgments

Sarath Chandar is supported by the Canada CIFAR AI Chairs program, the Canada Research Chair in Lifelong Machine Learning, and the NSERC Discovery Grant.

References

  • Ahmadian etal. (2024)Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer,Olivier Pietquin, Ahmet Üstün, and Sara Hooker.Back to basics: Revisiting reinforce style optimization for learningfrom human feedback in llms, 2024.
  • Alhossary etal. (2015)Amr Alhossary, StephanusDaniel Handoko, Yuguang Mu, and Chee-Keong Kwoh.Fast, accurate, and reliable molecular docking with QuickVina 2.Bioinformatics, 31(13):2214–2216, 022015.ISSN 1367-4803.doi: 10.1093/bioinformatics/btv082.URL https://doi.org/10.1093/bioinformatics/btv082.
  • Axelrod & Gómez-Bombarelli (2022)Simon Axelrod and Rafael Gómez-Bombarelli.Geom, energy-annotated molecular conformations for propertyprediction and molecular generation.Scientific Data, 9(1):185, Apr 2022.ISSN 2052-4463.doi: 10.1038/s41597-022-01288-4.URL https://doi.org/10.1038/s41597-022-01288-4.
  • Bagal etal. (2022)Viraj Bagal, Rishal Aggarwal, P.K. Vinod, and U.Deva Priyakumar.Molgpt: Molecular generation using a transformer-decoder model.Journal of Chemical Information and Modeling, 62(9):2064–2076, May 2022.ISSN 1549-9596.doi: 10.1021/acs.jcim.1c00600.URL https://doi.org/10.1021/acs.jcim.1c00600.
  • Bakan etal. (2011)Ahmet Bakan, LidioM. Meireles, and Ivet Bahar.ProDy: Protein Dynamics Inferred from Theory and Experiments.Bioinformatics, 27(11):1575–1577, 042011.ISSN 1367-4803.doi: 10.1093/bioinformatics/btr168.URL https://doi.org/10.1093/bioinformatics/btr168.
  • Bjerrum (2017)EsbenJannik Bjerrum.SMILES enumeration as data augmentation for neural network modelingof molecules.CoRR, abs/1703.07076, 2017.URL http://arxiv.org/abs/1703.07076.
  • Black etal. (2022)Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, LaurenceGolding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler,USVSNSai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, BenWang, and Samuel Weinbach.GPT-NeoX-20B: An open-source autoregressive language model.In Proceedings of the ACL Workshop on Challenges &Perspectives in Creating Large Language Models, 2022.URL https://arxiv.org/abs/2204.06745.
  • Chilingaryan etal. (2022)Gayane Chilingaryan, Hovhannes Tamoyan, Ani Tevosyan, Nelly Babayan, LusineKhondkaryan, Karen Hambardzumyan, Zaven Navoyan, Hrant Khachatrian, and ArmenAghajanyan.Bartsmiles: Generative masked language models for molecularrepresentations, 2022.
  • Corso etal. (2023)Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and TommiS.Jaakkola.DiffDock: Diffusion steps, twists, and turns for molecular docking.In The Eleventh International Conference on LearningRepresentations, 2023.URL https://openreview.net/forum?id=kKF8_K-mBbS.
  • Dao (2023)Tri Dao.FlashAttention-2: Faster attention with better parallelism and workpartitioning.2023.
  • Dao etal. (2022)Tri Dao, DanielY. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.FlashAttention: Fast and memory-efficient exact attention withIO-awareness.In Advances in Neural Information Processing Systems, 2022.
  • Devlin etal. (2019)Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.BERT: Pre-training of deep bidirectional transformers for languageunderstanding.In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter ofthe Association for Computational Linguistics: Human Language Technologies,Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota,June 2019. Association for Computational Linguistics.doi: 10.18653/v1/N19-1423.URL https://aclanthology.org/N19-1423.
  • Eberhardt etal. (2021)Jerome Eberhardt, Diogo Santos-Martins, AndreasF. Tillack, and Stefano Forli.Autodock vina 1.2.0: New docking methods, expanded force field, andpython bindings.Journal of Chemical Information and Modeling, 61(8):3891–3898, Aug 2021.ISSN 1549-9596.doi: 10.1021/acs.jcim.1c00203.URL https://doi.org/10.1021/acs.jcim.1c00203.
  • Feng etal. (2024)Wei Feng, Lvwei Wang, Zaiyun Lin, Yanhao Zhu, Han Wang, Jianqiang Dong, RongBai, Huting Wang, Jielong Zhou, Wei Peng, BoHuang, and Wenbiao Zhou.Generation of 3d molecules in pockets via a language model.Nature Machine Intelligence, Jan 2024.ISSN 2522-5839.doi: 10.1038/s42256-023-00775-6.URL https://doi.org/10.1038/s42256-023-00775-6.
  • Flam-Shepherd & Aspuru-Guzik (2023)Daniel Flam-Shepherd and Alán Aspuru-Guzik.Language models can generate molecules, materials, and proteinbinding sites directly in three dimensions as xyz, cif, and pdb files, 2023.
  • Flam-Shepherd etal. (2022)Daniel Flam-Shepherd, Kevin Zhu, and Alán Aspuru-Guzik.Language models can learn complex molecular distributions.Nature Communications, 13(1), June 2022.ISSN 2041-1723.doi: 10.1038/s41467-022-30839-x.URL http://dx.doi.org/10.1038/s41467-022-30839-x.
  • Francoeur etal. (2020)PaulG. Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, RichardB.Iovanisci, Ian Snyder, and DavidR. Koes.Three-dimensional convolutional neural networks and a cross-dockeddata set for structure-based drug design.Journal of Chemical Information and Modeling, 60(9):4200–4215, Sep 2020.ISSN 1549-9596.doi: 10.1021/acs.jcim.0c00411.URL https://doi.org/10.1021/acs.jcim.0c00411.
  • Friedrich etal. (2017)Nils-Ole Friedrich, Agnes Meyder, Christina deBruynKops, Kai Sommer, FlorianFlachsenberg, Matthias Rarey, and Johannes Kirchmair.High-quality dataset of protein-bound ligand conformations and itsapplication to benchmarking conformer ensemble generators.Journal of Chemical Information and Modeling, 57(3):529–539, Mar 2017.ISSN 1549-9596.doi: 10.1021/acs.jcim.6b00613.URL https://doi.org/10.1021/acs.jcim.6b00613.
  • Gómez-Bombarelli etal. (2018)Rafael Gómez-Bombarelli, JenniferN. Wei, David Duvenaud, JoséMiguelHernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla,Jorge Aguilera-Iparraguirre, TimothyD. Hirzel, RyanP. Adams, and AlánAspuru-Guzik.Automatic chemical design using a data-driven continuousrepresentation of molecules.ACS Central Science, 4(2):268–276, Feb2018.ISSN 2374-7943.doi: 10.1021/acscentsci.7b00572.URL https://doi.org/10.1021/acscentsci.7b00572.
  • Guan etal. (2023)Jiaqi Guan, WesleyWei Qian, Xingang Peng, Yufeng Su, Jian Peng, and JianzhuMa.3D equivariant diffusion for target-aware molecule generation andaffinity prediction.In The Eleventh International Conference on LearningRepresentations, 2023.URL https://openreview.net/forum?id=kJqXEPXMsE0.
  • Halgren (1996)ThomasA. Halgren.Merck molecular force field. i. basis, form, scope, parameterization,and performance of mmff94.J. Comput. Chem., 17(5-6):490–519, 1996.URLhttp://dblp.uni-trier.de/db/journals/jcc/jcc17.html#Halgren96.
  • Ho etal. (2020)Jonathan Ho, Ajay Jain, and Pieter Abbeel.Denoising diffusion probabilistic models.In H.Larochelle, M.Ranzato, R.Hadsell, M.F. Balcan, and H.Lin(eds.), Advances in Neural Information Processing Systems, volume33,pp. 6840–6851. Curran Associates, Inc., 2020.URLhttps://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
  • Hochreiter & Schmidhuber (1997)Sepp Hochreiter and Jürgen Schmidhuber.Long short-term memory.Neural computation, 9(8):1735–1780, 1997.
  • Hoffmann etal. (2022)Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, TrevorCai, Eliza Rutherford, Diego deLasCasas, LisaAnne Hendricks, JohannesWelbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George vandenDriessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, ErichElsen, JackW. Rae, Oriol Vinyals, and Laurent Sifre.Training compute-optimal large language models, 2022.
  • Hoogeboom etal. (2022)Emiel Hoogeboom, VíctorGarcia Satorras, Clément Vignac, and MaxWelling.Equivariant diffusion for molecule generation in 3D.In Kamalika Chaudhuri, Stefanie Jegelka, LeSong, Csaba Szepesvari,Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39thInternational Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pp. 8867–8887. PMLR,17–23 Jul 2022.URL https://proceedings.mlr.press/v162/hoogeboom22a.html.
  • Hu etal. (2005)Liegi Hu, MarkL Benson, RichardD Smith, MichaelG Lerner, and HeatherACarlson.Binding moad (mother of all databases).Proteins: Structure, Function, and Bioinformatics, 60(3):333–340, 2005.
  • Jing etal. (2022)Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and TommiS.Jaakkola.Torsional diffusion for molecular conformer generation.In AliceH. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho(eds.), Advances in Neural Information Processing Systems, 2022.URL https://openreview.net/forum?id=w6fj2r62r_H.
  • Jumper etal. (2021)John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov,Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, AugustinŽídek, Anna Potapenko, etal.Highly accurate protein structure prediction with alphafold.Nature, 596(7873):583–589, 2021.
  • Kaplan etal. (2020)Jared Kaplan, Sam McCandlish, Tom Henighan, TomB. Brown, Benjamin Chess, RewonChild, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.Scaling laws for neural language models, 2020.
  • Katharopoulos etal. (2020)A.Katharopoulos, A.Vyas, N.Pappas, and F.Fleuret.Transformers are rnns: Fast autoregressive transformers with linearattention.In Proceedings of the International Conference on MachineLearning (ICML), 2020.
  • Keskar etal. (2017)NitishShirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy,and Ping TakPeter Tang.On large-batch training for deep learning: Generalization gap andsharp minima.In International Conference on Learning Representations, 2017.URL https://openreview.net/forum?id=H1oyRlYgg.
  • Krenn etal. (2020)Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and AlanAspuru-Guzik.Self-referencing embedded strings (SELFIES): A 100% robustmolecular string representation.Machine Learning: Science and Technology, 1(4):045024, oct 2020.doi: 10.1088/2632-2153/aba947.URL https://dx.doi.org/10.1088/2632-2153/aba947.
  • Krenn etal. (2022)Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, NathanC.Frey, Pascal Friederich, Théophile Gaudin, AlbertoAlexander Gayle,KevinMaik Jablonka, RafaelF. Lameiro, Dominik Lemm, Alston Lo,SeyedMohamad Moosavi, JoséManuel Nápoles-Duarte, AkshatKumar Nigam,Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller,Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, Gary Tom, GuidoFalkvon Rudorff, Andrew Wang, AndrewD. White, Adamo Young, Rose Yu, andAlán Aspuru-Guzik.Selfies and the future of molecular string representations.Patterns, 3(10):100588, October 2022.ISSN 2666-3899.doi: 10.1016/j.patter.2022.100588.URL http://dx.doi.org/10.1016/j.patter.2022.100588.
  • Landrum etal. (2024)Greg Landrum, Paolo Tosco, Brian Kelley, Ric, David Cosgrove, sriniker, gedeck,Riccardo Vianello, NadineSchneider, Eisuke Kawashima, Gareth Jones, Dan N,Andrew Dalke, Brian Cole, Matt Swain, Samo Turk, AlexanderSavelyev, AlainVaucher, Maciej Wójcikowski, Ichiru Take, VincentF. Scalfani, DanielProbst, Kazuya Ujihara, guillaume godin, Axel Pahl, Rachel Walker, JuusoLehtivarjo, Francois Berenger, jasondbiggs, and strets123.rdkit/rdkit: 2023_09_4 (q3 2023) release, January 2024.URL https://doi.org/10.5281/zenodo.10460537.
  • Lewis etal. (2019)Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed,Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.Bart: Denoising sequence-to-sequence pre-training for naturallanguage generation, translation, and comprehension, 2019.
  • Lin etal. (2022)Haitao Lin, Yufei Huang, Meng Liu, Xuanjing Li, Shuiwang Ji, and StanZ. Li.DiffBP: Generative Diffusion of 3D Molecules for TargetProtein Binding, December 2022.
  • Loshchilov & Hutter (2019)Ilya Loshchilov and Frank Hutter.Decoupled weight decay regularization, 2019.
  • Luo et al. (2021) Shitong Luo, Jiaqi Guan, Jianzhu Ma, and Jian Peng. A 3D generative model for structure-based drug design. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6229–6239. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/314450613369e0ee72d0da7f6fee773c-Paper.pdf.
  • McNutt et al. (2021) Andrew T. McNutt, Paul Francoeur, Rishal Aggarwal, Tomohide Masuda, Rocco Meli, Matthew Ragoza, Jocelyn Sunseri, and David Ryan Koes. Gnina 1.0: molecular docking with deep learning. Journal of Cheminformatics, 13(1):43, Jun 2021. ISSN 1758-2946. doi: 10.1186/s13321-021-00522-2. URL https://doi.org/10.1186/s13321-021-00522-2.
  • Niu etal. (2020)Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and StefanoErmon.Permutation invariant graph generation via score-based generativemodeling, 2020.
  • O’Boyle etal. (2011)NoelM. O’Boyle, Michael Banck, CraigA. James, Chris Morley, TimVandermeersch, and GeoffreyR. Hutchison.Open Babel: An open chemical toolbox.Journal of Cheminformatics, 3(1):33, Oct2011.ISSN 1758-2946.doi: 10.1186/1758-2946-3-33.URL https://doi.org/10.1186/1758-2946-3-33.
  • Olivecrona etal. (2017)Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen.Molecular de novo design through deep reinforcement learning, 2017.
  • Ouyang etal. (2022)Long Ouyang, Jeff Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, PamelaMishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, JohnSchulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, AmandaAskell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe.Training language models to follow instructions with human feedback,2022.
  • Paszke etal. (2019)Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, GregoryChanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, AlbanDesmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, AlykhanTejani, Sasank Chilamkurthy, Benoit Steiner, LuFang, Junjie Bai, and SoumithChintala.Pytorch: An imperative style, high-performance deep learning library,2019.
  • Peng et al. (2022) Xingang Peng, Shitong Luo, Jiaqi Guan, Qi Xie, Jian Peng, and Jianzhu Ma. Pocket2Mol: Efficient molecular sampling based on 3D protein pockets. In International Conference on Machine Learning, 2022.
  • Peng etal. (2023a)Xingang Peng, Jiaqi Guan, Qiang Liu, and Jianzhu Ma.MolDiff: Addressing the atom-bond inconsistency problem in 3Dmolecule diffusion generation.In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt,Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40thInternational Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pp. 27611–27629. PMLR,23–29 Jul 2023a.URL https://proceedings.mlr.press/v202/peng23b.html.
  • Peng etal. (2023b)Xingang Peng, Jiaqi Guan, Qiang Liu, and Jianzhu Ma.MolDiff: Addressing the atom-bond inconsistency problem in 3Dmolecule diffusion generation.In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt,Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40thInternational Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pp. 27611–27629. PMLR,23–29 Jul 2023b.URL https://proceedings.mlr.press/v202/peng23b.html.
  • Polishchuk etal. (2013)PG Polishchuk, TI Madzhidov, and AVarnek.Estimation of the size of drug-like chemical space based on GDB-17data.J Comput Aided Mol Des, 27(8):675–679,August 2013.
  • Radford etal. (2018)Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.Improving language understanding by generative pre-training.2018.
  • Radford etal. (2019)Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and IlyaSutskever.Language models are unsupervised multitask learners.2019.
  • Rajbhandari etal. (2020)Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He.Zero: Memory optimizations toward training trillion parameter models,2020.
  • Rasley etal. (2020)Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He.Deepspeed: System optimizations enable training deep learning modelswith over 100 billion parameters.In Proceedings of the 26th ACM SIGKDD International Conferenceon Knowledge Discovery & Data Mining, KDD ’20, pp. 3505–3506, New York,NY, USA, 2020. Association for Computing Machinery.ISBN 9781450379984.doi: 10.1145/3394486.3406703.URL https://doi.org/10.1145/3394486.3406703.
  • Schneuing etal. (2023)Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, WeitaoDu, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein,and Bruno Correia.Structure-based drug design with equivariant diffusion models, 2023.
  • Schulman etal. (2017)John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.Proximal policy optimization algorithms, 2017.
  • Segler etal. (2018)Marwin H.S. Segler, Thierry Kogej, Christian Tyrchan, and MarkP. Waller.Generating focused molecule libraries for drug discovery withrecurrent neural networks.ACS Central Science, 4(1):120–131, Jan2018.ISSN 2374-7943.doi: 10.1021/acscentsci.7b00512.URL https://doi.org/10.1021/acscentsci.7b00512.
  • Song etal. (2021)Yang Song, Jascha Sohl-Dickstein, DiederikP. Kingma, Abhishek Kumar, StefanoErmon, and Ben Poole.Score-based generative modeling through stochastic differentialequations, 2021.
  • Su etal. (2021)Jianlin Su, YuLu, Shengfeng Pan, BoWen, and Yunfeng Liu.Roformer: Enhanced transformer with rotary position embedding.CoRR, abs/2104.09864, 2021.URL https://arxiv.org/abs/2104.09864.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, YasmineBabaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale,Dan Bikel, Lukas Blecher, CristianCanton Ferrer, Moya Chen, GuillemCucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller,Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, SagharHosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux,Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, XavierMartinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, AndrewPoulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, RuanSilva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang,Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, IliyanZarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, AurelienRodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom.Llama 2: Open foundation and fine-tuned chat models, 2023.
  • Trott & Olson (2010)Oleg Trott and ArthurJ. Olson.AutoDock Vina: Improving the speed and accuracy of docking witha new scoring function, efficient optimization, and multithreading.Journal of Computational Chemistry, 31(2):455–461, 2010.doi: https://doi.org/10.1002/jcc.21334.URL https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.21334.
  • von Werra etal. (2020)Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, TristanThrush, Nathan Lambert, and Shengyi Huang.Trl: Transformer reinforcement learning.https://github.com/huggingface/trl, 2020.
  • Weininger (1988)David Weininger.Smiles, a chemical language and information system. 1. introductionto methodology and encoding rules.Journal of Chemical Information and Computer Sciences,28(1):31–36, Feb 1988.ISSN 0095-2338.doi: 10.1021/ci00057a005.URL https://doi.org/10.1021/ci00057a005.
  • Williams (1992)R.J. Williams.Simple statistical gradient-following algorithms for connectionistreinforcement learning.Machine Learning, 8:229–256, 1992.
  • Wolf etal. (2020)Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, JoeDavison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, JulienPlu, Canwen Xu, TevenLe Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest,and AlexanderM. Rush.Transformers: State-of-the-art natural language processing.In Proceedings of the 2020 Conference on Empirical Methods inNatural Language Processing: System Demonstrations, pp. 38–45, Online,October 2020. Association for Computational Linguistics.URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  • Yu & MacKerell (2017)Wenbo Yu and AlexanderD. MacKerell.Computer-Aided Drug Design Methods, pp. 85–106.Springer New York, New York, NY, 2017.ISBN 978-1-4939-6634-9.doi: 10.1007/978-1-4939-6634-9˙5.URL https://doi.org/10.1007/978-1-4939-6634-9_5.
  • Zhou etal. (2023)Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei,Linfeng Zhang, and Guolin Ke.Uni-mol: A universal 3d molecular representation learning framework.In The Eleventh International Conference on LearningRepresentations, 2023.URL https://openreview.net/forum?id=6K2RM6wVqKu.

Appendix A Tokenization and Data Representation

Appendix B Technical Description of the Training Pipeline

B.1 Pretraining


To achieve efficient pretraining, we use large-batch training (Keskar et al., 2017) with 1.6M tokens per batch. We set the microbatch size to the maximum that fits into GPU memory and use gradient accumulation to reach the target batch size of 1.6M tokens. Since training sequences have variable length (molecules have different sizes), only a fraction of the tokens contribute to the loss, so we make sure there are at least 1.6M such "enabled" tokens per batch. We use a learning rate warmup of 2000 steps followed by cosine annealing of the learning rate. The maximal learning rate during pretraining is $10^{-3}$ regardless of the model size. We found this number of tokens per batch to be important for stable training on this task, even with smaller learning rates, especially for models larger than 100M parameters. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay factor of $10^{-2}$ and gradient clipping with a maximal gradient norm of 1.0. Pretraining takes around 55k optimization steps over 36 hours on one compute node with 8 A6000 GPUs. We employ Flash-Attention 2 (Dao, 2023) and the DeepSpeed optimization accelerator. To use the more performant tensor cores, we train with mixed precision, performing computation in the bfloat16 data type. As the distributed optimizer, we use the DeepSpeed ZeRO Stage-3 optimizer (Rajbhandari et al., 2020). We train the model for 1 epoch only. The dataset contains 42B tokens in the version without explicit hydrogens and 90B tokens in the version with explicit hydrogens. The total size of the Uni-Mol pretraining dataset is around 150GB.
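A hedged sketch of this optimization setup (AdamW, 2000 warmup steps, cosine decay, peak learning rate 1e-3, weight decay 1e-2, gradient clipping at 1.0, bfloat16 compute); `model` and `loader` are assumed to be defined elsewhere with batches containing `labels`, and gradient accumulation to 1.6M tokens, DeepSpeed, and distributed training are omitted for brevity:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=55_000  # ~55k steps reported
)

for batch in loader:
    with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed-precision compute
        loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```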

B.2 Supervised Finetuning

We use the public CrossDocked version v1.3, and for each molecule (except those optimized by the Gnina model (McNutt et al., 2021), as it yields too many bad intermediate samples) we take its "minimized" and "docked" formats and extract all intermediate molecules from their files. For each such molecule, we cut out the pocket with the ProDy (Bakan et al., 2011) tool. As a result of this process, we obtain around 27M pocket-ligand pairs. The size of the CrossDocked data that we use is around 50GB.

We use the same recipe for finetuning as for pretraining, with a few changes in hyperparameters. In particular, we use a maximal learning rate of $5\times10^{-4}$ and only 100 warmup steps. The learning rate schedule, weight decay, optimizer, and maximal gradient norm are the same as for pretraining. The only substantial difference from the pretraining stage is the weighted loss that we use for the CrossDocked finetuning. Specifically, we weight tokens corresponding to different parts of the output differently: SMILES tokens have a weight of 1, tokens corresponding to the XYZ coordinates placed after the SMILES have a weight of 5, and tokens corresponding to the pocket have a weight of 0, since they are used as context only and we do not intend to generate them. As described in Section 3.3.1, we apply SMILES randomization (see Bjerrum (2017) for implementation details) and rotate the pocket and its ligand randomly: we first sample a random 3D rotation vector, convert it to a rotation matrix, and apply it to the coordinates of both. We also enforce a common coordinate origin, namely the coordinate center of the ligand (i.e., we guarantee that the model generates coordinates around the origin). We train the model on the CrossDocked dataset for 1 epoch. As mentioned in Section 3.3.1, we extract the full version of the CrossDocked data.
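A minimal sketch of the weighted cross-entropy described above, assuming per-token weight tensors have already been built from the sequence layout (pocket tokens 0, SMILES tokens 1, coordinate tokens 5):

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits: torch.Tensor, labels: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Next-token prediction with per-token weights.

    logits: (B, T, V); labels, weights: (B, T).
    """
    # Shift so that position t predicts token t+1.
    logits, labels, weights = logits[:, :-1], labels[:, 1:], weights[:, 1:]
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    )
    nll = nll * weights.reshape(-1)
    return nll.sum() / weights.sum().clamp(min=1)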

For the finetuning on the GEOM-DRUGS dataset, we use the same hyperparameters as in the SFT stage for CrossDocked, with only two differences. First, we weight the loss for all tokens with the same weight of 1. Second, we do not rotate the 3D coordinates of the molecule but only apply SMILES randomization.

B.3 Reinforcement Learning Finetuning

The last stage of our pipeline is Reinforcement Learning. We use a distributed Reinforcement Learning algorithm based on the TRL (von Werra et al., 2020) training loop. That is, we launch multiple GPU workers, each of which repeatedly samples experiences, computes rewards, computes the update for the policy (i.e., the transformer language model), synchronizes the gradients, and then performs the gradient update. We use 8 GPU workers, each with a local batch size of 16. At every step, we sample a batch of pockets, sample molecules for them, compute the rewards via the docking tool, and perform only one gradient update. We found this to be crucial for our task, as otherwise training might diverge. Even algorithms that are believed to be more powerful, such as PPO (Schulman et al., 2017), experience instabilities when the policy lag is larger. Our surrogate loss for Reinforcement Learning has the following form:

$$L(\theta) = \mathbb{E}_{s\sim\mathcal{D},\, a\sim p_{\theta}(a\mid s)}\, L(\theta, s, a)$$

$$L(\theta, s, a) = -R(s,a)\,\frac{1}{|a|}\log p_{\theta}(a\mid s) + \alpha\,\mathrm{KL}\big(p_{\theta_0}(\cdot\mid s)\,\|\,p_{\theta}(\cdot\mid s)\big)$$

Here s𝑠sitalic_s is the tokenized representation of the pocket and a𝑎aitalic_a is the tokenized representation of the generated molecule. pθsubscript𝑝𝜃p_{\theta}italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is the current version of the language model being finetuned while pθ0subscript𝑝subscript𝜃0p_{\theta_{0}}italic_p start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the result of the SFT stage. logpθ(as)subscript𝑝𝜃conditional𝑎𝑠\log p_{\theta}(a\mid s)roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a ∣ italic_s ) is the sum of generated token log-probabilities. 𝒟𝒟\mathcal{D}caligraphic_D is the dataset of prompts (i.e. the pockets-only subset of CrossDocked). R(s,a)𝑅𝑠𝑎R(s,a)italic_R ( italic_s , italic_a ) is the vina score computed for the corresponding pocket-molecule pair. Finally, we compute the distillation style KL since we want to keep the output distribution of the RL model wide. The KL weight α𝛼\alphaitalic_α in our experiments is α=0.05𝛼0.05\alpha=0.05italic_α = 0.05. We use a flat learning rate of 1.4×1051.4superscript1051.4\times 10^{-5}1.4 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT and no weight decay. Like before, we clip the gradient norm at 1.01.01.01.0. In our surrogate loss function, the loglikelihood of the token sequence is averaged (instead of being summed). We found this crucial for training stability.

Appendix C Evaluating Pretraining at Different Scales

[Figure 11: hold-out test set perplexity on ZINC-250k for different model sizes.]

As described in Section 3.2, we pretrain the model on 208M 3D conformations of molecules. By experimenting with different model sizes, we observed that the model scales well up to 300M parameters, where its perplexity shows overfitting. We therefore stick to the 100M model in our later experiments, as we found it to yield the best results. Figure 11 shows the hold-out test set perplexity on ZINC-250k (3D conformations for ZINC-250k were collected with the same procedure as for the original pretraining data) for the sizes 11M, 58M, 108M, and 304M. Note that the high value of the perplexity is dictated by the highly stochastic nature of 3D molecule coordinates. We believe the model quality can be improved further by increasing the amount of pretraining data; given the current pretraining dataset, the 108M model obtains the best performance.

Appendix D RL Finetuning Training Curves

[Figures: training curves of the RL finetuning stage.]

Appendix E More Samples From the BindGPT Model

[Figures: additional samples generated by the BindGPT model.]

Appendix F Augmenting BindGPT with an External Tool for Assisted Generation

For the unconditional molecule generation and conformation generation tasks, we enhance the 3D generative abilities of the BindGPT model through the use of the RDKit tool. However, we use it only as a scoring mechanism, while our model still acts as the proposal distribution. The scoring happens in the following way: first, we generate the SMILES string only (we skip this step for the conformation generation task). Then, we generate $N$ different conformations from our model and score each with the MMFF (Halgren, 1996) energy from RDKit. After that, we select the generated conformation with the minimal energy and return it as a sample. It is important to note that we do not use MMFF to optimize the conformation, and we do not provide the model with any information about the 3D structure. That is, for example, BindGPT finetuned on GEOM-DRUGS will still generate samples within the GEOM conformation distribution (which is different from the distribution produced by MMFF), as can be seen in Table 2. In addition, when comparing with even higher-quality, real-world-like conformations, such as those of the Platinum dataset, the assisted generation boosts the performance of the model (as can be seen in Figure 4), making its distribution closer to the real-world conformations, despite MMFF being just a theoretical approximation. This confirms that the selection process described above does not bias the distribution of the model's outputs, but rather helps to eliminate generation errors (e.g., atom misplacements). Notably, we do not use the assisted generation for the protein-ligand binding task. In our experiments we use $N=10$.
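Below is a minimal sketch of this selection procedure with RDKit, where `generate_conformations` stands in for the (hypothetical) sampling call into BindGPT that returns $N$ candidate conformers of the same molecular graph, each as an RDKit molecule with one embedded conformer.

```python
# Sketch only: score model-generated conformers with MMFF and keep the lowest-energy one.
from rdkit.Chem import AllChem

def select_best_conformation(candidates):
    """Return the candidate with the lowest MMFF94 energy (scoring only, no optimization)."""
    best_mol, best_energy = None, float("inf")
    for mol in candidates:
        props = AllChem.MMFFGetMoleculeProperties(mol)
        if props is None:                       # MMFF cannot parametrize this molecule
            continue
        ff = AllChem.MMFFGetMoleculeForceField(mol, props)
        energy = ff.CalcEnergy()                # coordinates are left untouched
        if energy < best_energy:
            best_mol, best_energy = mol, energy
    return best_mol

# candidates = generate_conformations(model, smiles, n=10)   # hypothetical BindGPT call
# sample = select_best_conformation(candidates)
```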

Appendix G Efficient Training and Inference

Despite the wide use of transformers in drug discovery, the majority of current works in this space do not use recent technical advancements that make Large Language Models efficient (for example, Flash Attention (Dao, 2023)). The reason is simply that the pretraining paradigm is still arriving in drug discovery, with perhaps the only example being the Uni-Mol (Zhou et al., 2023) model, a multi-task pretrained transformer that, unlike this work, uses an encoder-only model following the BERT (Devlin et al., 2019) architecture. Therefore, small models are simply trained directly on downstream datasets, for which training-time optimizations are not crucial. In the case of pretraining, however, even for a small model, training over 90B tokens can take a significant time, and it obtains a speedup of almost 3x just from using a combination of Flash Attention (Dao et al., 2022; Dao, 2023) and DeepSpeed (Rasley et al., 2020).

To facilitate efficient training and inference, we use the transformers (Wolf et al., 2020) library from Hugging Face with the PyTorch framework (Paszke et al., 2019). We use the Flash Attention 2 (Dao, 2023) implementation of self-attention and the DeepSpeed (Rasley et al., 2020) distributed training framework. During autoregressive sampling, we use key-value caching and Flash Attention 2 to speed up decoding. Despite being just implementation optimizations, these two techniques can make a big difference for sampling, as they speed up decoding by two orders of magnitude compared to the naive approach, making sampling with transformer decoders significantly faster than sampling from diffusion models. For example, KV-caching reuses past attention keys and values, resulting in $\mathcal{O}(1)$ MLP forward passes instead of $\mathcal{O}(L)$ at each decoding step, where $L$ is the prefix length. Thus, the total number of forward passes for decoding length $L$ is $\mathcal{O}(L)$ instead of $\mathcal{O}(L^{2})$. We hope that our work will promote a wider use of transformer best practices within the drug discovery ML community.
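For illustration, a minimal sketch of sampling through the Hugging Face transformers API with KV caching and Flash Attention 2 is shown below; the checkpoint path and the prompt token are placeholders, not the actual released artifacts.

```python
# Sketch only: autoregressive sampling with Flash Attention 2 and KV caching.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/bindgpt-checkpoint")   # placeholder path
model = AutoModelForCausalLM.from_pretrained(
    "path/to/bindgpt-checkpoint",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

prompt = tokenizer("<LIGAND>", return_tensors="pt").to("cuda")  # placeholder prompt token
out = model.generate(
    **prompt,
    max_new_tokens=512,
    do_sample=True,
    use_cache=True,        # reuse past keys/values: O(1) forward cost per new token
)
print(tokenizer.decode(out[0]))
```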
